Abstract
Multi-armed bandit problems are the predominant theoretical model of exploration-exploitation tradeoffs in learning, and they have countless applications ranging from medical trials, to communication networks, to Web search and advertising. In many of these application domains, the learner may be constrained by one or more supply (or budget) limits, in addition to the customary limitation on the time horizon. The literature lacks a general model encompassing these sorts of problems. We introduce such a model, called bandits with knapsacks, that combines bandit learning with aspects of stochastic integer programming. In particular, a bandit algorithm needs to solve a stochastic version of the well-known knapsack problem, which is concerned with packing items into a limited-size knapsack. A distinctive feature of our problem, in comparison to the existing regret-minimization literature, is that the optimal policy for a given latent distribution may significantly outperform the policy that plays the optimal fixed arm. Consequently, achieving sublinear regret in the bandits-with-knapsacks problem is significantly more challenging than in conventional bandit problems.
We present two algorithms whose reward is close to the information-theoretic optimum: one is based on a novel “balanced exploration” paradigm, while the other is a primal-dual algorithm that uses multiplicative updates. Further, we prove that the regret achieved by both algorithms is optimal up to polylogarithmic factors. We illustrate the generality of the problem by presenting applications in a number of different domains, including electronic commerce, routing, and scheduling. As one example of a concrete application, we consider the problem of dynamic posted pricing with limited supply and obtain the first algorithm whose regret, with respect to the optimal dynamic policy, is sublinear in the supply.
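To make the setting concrete, the following is a minimal, self-contained simulation of a single-resource bandits-with-knapsacks instance. It is an illustrative sketch only, not the paper's "balanced exploration" or primal-dual algorithm: the learner here uses a simple optimistic reward-to-cost ratio index (UCB-style) and stops when the budget or the time horizon runs out. All names (`ucb_bwk`, `mu_reward`, `mu_cost`) are hypothetical.

```python
import math
import random

def ucb_bwk(mu_reward, mu_cost, budget, horizon, seed=0):
    """Toy single-resource bandits-with-knapsacks simulation.

    Each pull of arm i yields a Bernoulli(mu_reward[i]) reward and
    consumes a Bernoulli(mu_cost[i]) unit of budget. The learner
    plays an optimistic reward-to-cost ratio index and stops when
    either the budget or the time horizon is exhausted.
    """
    rng = random.Random(seed)
    k = len(mu_reward)
    pulls = [0] * k
    rew_sum = [0.0] * k
    cost_sum = [0.0] * k
    total_reward = 0.0
    spent = 0.0
    for t in range(1, horizon + 1):
        if spent >= budget:
            break  # resource exhausted: the algorithm must stop early
        if t <= k:
            arm = t - 1  # play each arm once to initialize estimates
        else:
            def index(i):
                bonus = math.sqrt(2.0 * math.log(t) / pulls[i])
                r_hat = rew_sum[i] / pulls[i] + bonus              # optimistic reward
                c_hat = max(cost_sum[i] / pulls[i] - bonus, 1e-6)  # pessimistic cost
                return r_hat / c_hat
            arm = max(range(k), key=index)
        r = 1.0 if rng.random() < mu_reward[arm] else 0.0
        c = 1.0 if rng.random() < mu_cost[arm] else 0.0
        pulls[arm] += 1
        rew_sum[arm] += r
        cost_sum[arm] += c
        total_reward += r
        spent += c
    return total_reward, spent
```

Note how the budget constraint changes the objective: an arm with modest reward but very low cost can be pulled many more times before the budget runs out, which is why the benchmark here is the optimal dynamic policy rather than the best fixed arm.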