ABSTRACT
Precisely evaluating the effect of new policies (e.g. ad-placement models, recommendation functions, ranking functions) is one of the most important problems for improving interactive systems. The conventional policy evaluation methods rely on online A/B tests, but they are usually extremely expensive and may have undesirable impacts. Recently, Inverse Propensity Score (IPS) estimators are proposed as alternatives to evaluate the effect of new policy with offline logged data that was collected from a different policy in the past. They tend to remove the distribution shift induced by past policy. However, they ignore the distribution shift that would be induced by the new policy, which results in imprecise evaluation. Moreover, their performances rely on accurate estimation of propensity score, which can not be guaranteed or validated in practice. In this paper, we propose a non-parametric method, named Focused Context Balancing (FCB) algorithm, to learn sample weights for context balancing, so that the distribution shift induced by the past policy and new policy can be eliminated respectively. To validate the effectiveness of our FCB algorithm, we conduct extensive experiments on both synthetic and real world datasets. The experimental results clearly demonstrate that our FCB algorithm outperforms existing estimators by achieving more precise and robust results for offline policy evaluation.
- Aman Agarwal, Soumya Basu, Tobias Schnabel, and Thorsten Joachims. 2017. Effective Evaluation Using Logged Bandit Feedback from Multiple Loggers. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining . ACM, 687--696. Google ScholarDigital Library
- Alekh Agarwal, Daniel Hsu, Satyen Kale, John Langford, Lihong Li, and Robert Schapire. 2014. Taming the monster: A fast and simple algorithm for contextual bandits. In International Conference on Machine Learning. 1638--1646. Google ScholarDigital Library
- Susan Athey, Guido W Imbens, and Stefan Wager. 2016. Approximate residual balancing: debiased inference of average treatment effects in high dimensions. Journal of the Royal Statistical Society: Series B (Statistical Methodology) (2016).Google Scholar
- Peter C Austin. 2011. An introduction to propensity score methods for reducing the effects of confounding in observational studies. Multivariate behavioral research , Vol. 46, 3 (2011), 399--424.Google Scholar
- Heejung Bang and James M Robins. 2005. Doubly robust estimation in missing data and causal inference models. Biometrics , Vol. 61, 4 (2005), 962--973.Google ScholarCross Ref
- Léon Bottou, Jonas Peters, Joaquin Qui nonero-Candela, Denis X Charles, D Max Chickering, Elon Portugaly, Dipankar Ray, Patrice Simard, and Ed Snelson. 2013. Counterfactual reasoning and learning systems: The example of computational advertising. The Journal of Machine Learning Research , Vol. 14, 1 (2013), 3207--3260. Google ScholarDigital Library
- Victor Chernozhukov, Denis Chetverikov, Mert Demirer, Esther Duflo, Christian Hansen, and Whitney K Newey. 2016. Double machine learning for treatment and causal parameters . Technical Report. cemmap working paper, Centre for Microdata Methods and Practice.Google Scholar
- Miroslav Dudik, John Langford, and Lihong Li. 2011. Doubly robust policy evaluation and learning. In International Conference on International Conference on Machine Learning. 1097--1104. Google ScholarDigital Library
- Max H Farrell. 2015. Robust inference on average treatment effects with possibly more covariates than observations. Journal of Econometrics , Vol. 189, 1 (2015), 1--23.Google ScholarCross Ref
- Jens Hainmueller. 2012. Entropy balancing for causal effects: A multivariate reweighting method to produce balanced samples in observational studies. Political Analysis , Vol. 20, 1 (2012), 25--46.Google ScholarCross Ref
- John Hammersley. 2013. Monte carlo methods .Springer Science & Business Media.Google Scholar
- Daniel G Horvitz and Donovan J Thompson. 1952. A generalization of sampling without replacement from a finite universe. Journal of the American statistical Association , Vol. 47, 260 (1952), 663--685.Google ScholarCross Ref
- Ron Kohavi and Roger Longbotham. 2011. Unexpected results in online controlled experiments. ACM SIGKDD Explorations Newsletter , Vol. 12, 2 (2011), 31--35. Google ScholarDigital Library
- Augustine Kong. 1992. A note on importance sampling using standardized weights. University of Chicago, Dept. of Statistics, Tech. Rep , Vol. 348 (1992).Google Scholar
- Kun Kuang, Peng Cui, Bo Li, Meng Jiang, and Shiqiang Yang. 2017a. Estimating Treatment Effect in the Wild via Differentiated Confounder Balancing. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 265--274. Google ScholarDigital Library
- Kun Kuang, Peng Cui, Bo Li, Meng Jiang, Shiqiang Yang, and Fei Wang. 2017b. Treatment effect estimation with data-driven variable decomposition. In Thirty-First AAAI Conference on Artificial Intelligence . Google ScholarDigital Library
- Kun Kuang, Meng Jiang, Peng Cui, and Shiqiang Yang. 2016. Steering social media promotions with effective strategies. In 2016 IEEE 16th International Conference on Data Mining (ICDM). IEEE, 985--990.Google ScholarCross Ref
- John Langford and Tong Zhang. 2008. The epoch-greedy algorithm for multi-armed bandits with side information. In Advances in neural information processing systems. 817--824. Google ScholarDigital Library
- Randall Lewis and David Reiley. 2009. Retail advertising works! measuring the effects of advertising on sales via a controlled experiment on yahoo! (2009).Google Scholar
- Lihong Li, Shunbao Chen, Jim Kleban, and Ankur Gupta. 2015. Counterfactual estimation and optimization of click metrics in search engines: A case study. In Proceedings of the 24th International Conference on World Wide Web. ACM, 929--934. Google ScholarDigital Library
- Lihong Li, Wei Chu, John Langford, and Xuanhui Wang. 2011. Unbiased offline evaluation of contextual-bandit-based news article recommendation algorithms. In Proceedings of the fourth ACM international conference on Web search and data mining. ACM, 297--306. Google ScholarDigital Library
- Art B Owen. 2013. Monte Carlo theory, methods and examples. Monte Carlo Theory, Methods and Examples. Art Owen (2013).Google Scholar
- Michael JD Powell and J Swann. 1966. Weighted uniform samplinga Monte Carlo technique for reducing variance. IMA Journal of Applied Mathematics , Vol. 2, 3 (1966), 228--236.Google ScholarCross Ref
- Paul R Rosenbaum and Donald B Rubin. 1983. The central role of the propensity score in observational studies for causal effects. Biometrika , Vol. 70, 1 (1983), 41--55.Google ScholarCross Ref
- Reuven Y Rubinstein and Dirk P Kroese. 2016. Simulation and the Monte Carlo method . Vol. 10. John Wiley & Sons. Google ScholarDigital Library
- Tobias Schnabel, Adith Swaminathan, Ashudeep Singh, Navin Chandak, and Thorsten Joachims. 2016. Recommendations as treatments: Debiasing learning and evaluation. arXiv preprint arXiv:1602.05352 (2016). Google ScholarDigital Library
- Alex Strehl, John Langford, Lihong Li, and Sham M Kakade. 2010. Learning from logged implicit exploration data. In Advances in Neural Information Processing Systems. 2217--2225. Google ScholarDigital Library
- Adith Swaminathan and Thorsten Joachims. 2015a. Counterfactual risk minimization: Learning from logged bandit feedback. In International Conference on Machine Learning. 814--823. Google ScholarDigital Library
- Adith Swaminathan and Thorsten Joachims. 2015b. The self-normalized estimator for counterfactual learning. In Advances in Neural Information Processing Systems. 3231--3239. Google ScholarDigital Library
- Daniel Westreich, Justin Lessler, and Michele Jonsson Funk. 2010. Propensity score estimation: neural networks, support vector machines, decision trees (CART), and meta-classifiers as alternatives to logistic regression. Journal of clinical epidemiology , Vol. 63, 8 (2010), 826--833.Google ScholarCross Ref
- José R Zubizarreta. 2015. Stable weights that balance covariates for estimation with incomplete outcome data. J. Amer. Statist. Assoc. , Vol. 110, 511 (2015), 910--922.Google ScholarCross Ref
Index Terms
- Focused Context Balancing for Robust Offline Policy Evaluation
Recommendations
Designing Fast and Scalable XACML Policy Evaluation Engines
Most prior research on policies has focused on correctness. While correctness is an important issue, the adoption of policy-based computing may be limited if the resulting systems are not implemented efficiently and thus perform poorly. To increase the ...
XACBench: a XACML policy benchmark
AbstractXACML standard defines a declarative language to determine access control policies which are critical for deploying security solutions. It is important to evaluate the performance of policies defined by XACML, for applications such as policy ...
Adaptive Reordering and Clustering-Based Framework for Efficient XACML Policy Evaluation
The adoption of XACML as the standard for specifying access control policies for various applications, especially web services is vastly increasing. This calls for high performance XACML policy evaluation engines. A policy evaluation engine can easily ...
Comments