In this work, we consider the contextual multi-armed bandit problem in the presence of strategic agents in the context of sponsored search auction. In this setting, an advertising platform (center) runs a repeated auction to select the best-suited ads relevant to the user’s query. The center aspires to select an ad that has a high expected value that depends on the probability of getting a click and the value it derives from a click of the ad. The probability of getting a click (CTR) is unknown to the center and depends on the user’s profile (context) posting the query. Further, the value derived for a click is the private information to the advertiser and needs to be elicited truthfully.
In the presence of strategic agents, random context-arrivals, and stochastic clicks, our goal is to design a non-exploration-separated, ex-post truthful mechanism that (i) learns CTRs efficiently (minimizes regret), (ii) may not need prior knowledge of T, and (iii) does not have free rounds. We leverage popular algorithms LinUCB and SupLinUCB that perform well in estimating CTRs in the contextual setting to build our randomized mechanisms to avoid manipulations by strategic agents.
We first design an elimination-based algorithm ELinUCB-SB that is ex-post monotone, which is a sufficient condition for truthfulness. Note that the elimination technique is crucial for the ex-post monotonicity property. The intuition behind the algorithm is that after sufficient exploration the confidence interval becomes sufficiently small, and we can distinguish between optimal and sub-optimal agents with high probability. Hence, we can eliminate the sub-optimal arms. Thus, ELinUCB-SB can be naturally extended to ex-post incentive compatible and ex-post individually rational mechanism M-ELinUCB-SB. We show via simulations that M-ELinUCB-SB outperforms the existing mechanism in this setting. Theoretically, however, the mechanism may incur linear regret in non-frequent instances. We then propose a theoretically stronger mechanism, M-SupLinUCB-S. We prove that M-SupLinUCB-S is ex-post incentive compatible, ex-post
individually rational, and achieves sub-linear regret.
Along with SSA, this work can form a baseline for other applications such as
crowdsourcing, smart grids, where similar settings arise to learn the stochastic parameters in the presence of strategic agents and side information (context) about the agents. It will be interesting to extend the work in the setting where the change in valuations and bids are permissible.