
Bandits with Knapsacks

Published: 01 March 2018

Abstract

Multi-armed bandit problems are the predominant theoretical model of exploration-exploitation tradeoffs in learning, and they have countless applications ranging from medical trials, to communication networks, to Web search and advertising. In many of these application domains, the learner may be constrained by one or more supply (or budget) limits, in addition to the customary limitation on the time horizon. The literature lacks a general model encompassing these sorts of problems. We introduce such a model, called bandits with knapsacks, that combines bandit learning with aspects of stochastic integer programming. In particular, a bandit algorithm needs to solve a stochastic version of the well-known knapsack problem, which is concerned with packing items into a limited-size knapsack. A distinctive feature of our problem, in comparison to the existing regret-minimization literature, is that the optimal policy for a given latent distribution may significantly outperform the policy that plays the optimal fixed arm. Consequently, achieving sublinear regret in the bandits-with-knapsacks problem is significantly more challenging than in conventional bandit problems.
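
To make the setting concrete, here is a minimal Python sketch of a bandits-with-knapsacks environment: each pull of an arm yields a stochastic reward and consumes stochastic amounts of several resources, and the process stops when the time horizon or any budget runs out. The arm statistics, budgets, and the uniform placeholder policy below are illustrative assumptions, not taken from the paper.

```python
# Minimal sketch of a bandits-with-knapsacks instance (illustrative numbers only).
import numpy as np

rng = np.random.default_rng(0)

K, d = 3, 2                    # number of arms, number of resources
T = 10_000                     # time horizon
B = np.array([500.0, 800.0])   # budget for each resource

# Hypothetical latent parameters: mean reward and mean consumption per arm.
mean_reward = rng.uniform(0.2, 0.8, size=K)
mean_cost = rng.uniform(0.05, 0.2, size=(K, d))

def pull(arm):
    """Draw a Bernoulli reward and Bernoulli resource consumptions for one pull."""
    reward = rng.binomial(1, mean_reward[arm])
    cost = rng.binomial(1, mean_cost[arm])   # one draw per resource
    return reward, cost

remaining = B.copy()
total_reward = 0.0
for t in range(T):
    arm = rng.integers(K)              # placeholder policy: uniform exploration
    reward, cost = pull(arm)
    if np.any(remaining - cost < 0):   # a budget would be exhausted: stop
        break
    remaining -= cost
    total_reward += reward

print(f"stopped at round {t}, total reward {total_reward:.0f}, remaining budget {remaining}")
```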

We present two algorithms whose reward is close to the information-theoretic optimum: one is based on a novel “balanced exploration” paradigm, while the other is a primal-dual algorithm that uses multiplicative updates. Further, we prove that the regret achieved by both algorithms is optimal up to polylogarithmic factors. We illustrate the generality of the problem by presenting applications in a number of different domains, including electronic commerce, routing, and scheduling. As one example of a concrete application, we consider the problem of dynamic posted pricing with limited supply and obtain the first algorithm whose regret, with respect to the optimal dynamic policy, is sublinear in the supply.
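
The sketch below illustrates the flavor of the primal-dual approach under the same toy setup (it assumes the `pull` interface defined in the sketch above): resource "prices" are maintained by multiplicative updates, and each round the algorithm plays the arm with the best optimistic reward-to-priced-cost ratio. It is a simplified illustration of the idea, not the paper's exact algorithm.

```python
# Simplified primal-dual / multiplicative-weights sketch for bandits with knapsacks.
import numpy as np

def primal_dual_bwk(pull, K, d, B, T, epsilon=0.05):
    prices = np.ones(d) / d            # dual variables ("prices") over resources
    counts = np.zeros(K)
    reward_sum = np.zeros(K)
    cost_sum = np.zeros((K, d))
    remaining = B.astype(float)
    total_reward = 0.0

    for t in range(T):
        if t < K:
            arm = t                    # pull each arm once to initialize estimates
        else:
            conf = np.sqrt(np.log(T) / counts)          # simple confidence radius
            ucb_reward = np.minimum(reward_sum / counts + conf, 1.0)
            lcb_cost = np.maximum(
                cost_sum / counts.reshape(-1, 1) - conf.reshape(-1, 1), 1e-6
            )
            score = ucb_reward / (lcb_cost @ prices)    # optimistic bang-per-buck
            arm = int(np.argmax(score))

        reward, cost = pull(arm)
        if np.any(remaining - cost < 0):   # stop when a budget would be exhausted
            break
        remaining -= cost
        total_reward += reward
        counts[arm] += 1
        reward_sum[arm] += reward
        cost_sum[arm] += cost

        prices *= (1.0 + epsilon) ** cost  # raise prices of consumed resources
        prices /= prices.sum()
    return total_reward

# Usage with the environment from the previous sketch:
#   primal_dual_bwk(pull, K, d, B, T)
```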



Published in

Journal of the ACM, Volume 65, Issue 3 (June 2018), 285 pages
ISSN: 0004-5411
EISSN: 1557-735X
DOI: 10.1145/3191817

              Copyright © 2018 ACM

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

              Publisher

              Association for Computing Machinery

              New York, NY, United States

              Publication History

              • Published: 1 March 2018
              • Accepted: 1 November 2017
              • Revised: 1 September 2017
              • Received: 1 July 2015
Published in Journal of the ACM, Volume 65, Issue 3

              Qualifiers

              • research-article
              • Research
              • Refereed
