Abstract
Multi-armed bandit problems are the predominant theoretical model of exploration-exploitation tradeoffs in learning, and they have countless applications ranging from medical trials, to communication networks, to Web search and advertising. In many of these application domains, the learner may be constrained by one or more supply (or budget) limits, in addition to the customary limitation on the time horizon. The literature lacks a general model encompassing these sorts of problems. We introduce such a model, called bandits with knapsacks, that combines bandit learning with aspects of stochastic integer programming. In particular, a bandit algorithm needs to solve a stochastic version of the well-known knapsack problem, which is concerned with packing items into a limited-size knapsack. A distinctive feature of our problem, in comparison to the existing regret-minimization literature, is that the optimal policy for a given latent distribution may significantly outperform the policy that plays the optimal fixed arm. Consequently, achieving sublinear regret in the bandits-with-knapsacks problem is significantly more challenging than in conventional bandit problems.
We present two algorithms whose reward is close to the information-theoretic optimum: one is based on a novel “balanced exploration” paradigm, while the other is a primal-dual algorithm that uses multiplicative updates. Further, we prove that the regret achieved by both algorithms is optimal up to polylogarithmic factors. We illustrate the generality of the problem by presenting applications in a number of different domains, including electronic commerce, routing, and scheduling. As one example of a concrete application, we consider the problem of dynamic posted pricing with limited supply and obtain the first algorithm whose regret, with respect to the optimal dynamic policy, is sublinear in the supply.
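To make the setting concrete, the following is a minimal, self-contained simulation of a single-resource bandits-with-knapsacks instance. It is an illustrative sketch only, not the paper's "balanced exploration" or primal-dual algorithm: the learner here uses a simple optimistic reward-to-cost ratio index (UCB-style) and stops when the budget or the time horizon runs out. All names (`ucb_bwk`, `mu_reward`, `mu_cost`) are hypothetical.

```python
import math
import random

def ucb_bwk(mu_reward, mu_cost, budget, horizon, seed=0):
    """Toy single-resource bandits-with-knapsacks simulation.

    Each pull of arm i yields a Bernoulli(mu_reward[i]) reward and
    consumes a Bernoulli(mu_cost[i]) unit of budget. The learner
    plays an optimistic reward-to-cost ratio index and stops when
    either the budget or the time horizon is exhausted.
    """
    rng = random.Random(seed)
    k = len(mu_reward)
    pulls = [0] * k
    rew_sum = [0.0] * k
    cost_sum = [0.0] * k
    total_reward = 0.0
    spent = 0.0
    for t in range(1, horizon + 1):
        if spent >= budget:
            break  # resource exhausted: the algorithm must stop early
        if t <= k:
            arm = t - 1  # play each arm once to initialize estimates
        else:
            def index(i):
                bonus = math.sqrt(2.0 * math.log(t) / pulls[i])
                r_hat = rew_sum[i] / pulls[i] + bonus              # optimistic reward
                c_hat = max(cost_sum[i] / pulls[i] - bonus, 1e-6)  # pessimistic cost
                return r_hat / c_hat
            arm = max(range(k), key=index)
        r = 1.0 if rng.random() < mu_reward[arm] else 0.0
        c = 1.0 if rng.random() < mu_cost[arm] else 0.0
        pulls[arm] += 1
        rew_sum[arm] += r
        cost_sum[arm] += c
        total_reward += r
        spent += c
    return total_reward, spent
```

Note how the budget constraint changes the objective: an arm with modest reward but very low cost can be pulled many more times before the budget runs out, which is why the benchmark here is the optimal dynamic policy rather than the best fixed arm.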