Abstract
Complex machine learning models are now deployed in several critical domains, including healthcare and autonomous vehicles, albeit as functional black boxes. Consequently, there has been a recent surge of interest in interpreting the decisions of such complex models in order to explain their actions to humans. Models that correspond to human interpretation of a task are more desirable in certain contexts and can help attribute liability, build trust, expose biases and, in turn, build better models. It is therefore crucial to understand how and which models conform to human understanding of tasks. In this paper, we present a large-scale crowdsourcing study that reveals and quantifies the dissonance between human and machine understanding through the lens of an image classification task. In particular, we seek to answer the following questions: Which (well-performing) complex ML models are closer to humans in their use of features to make accurate predictions? How does task difficulty affect the feature selection capability of machines in comparison to humans? Are humans consistently better at selecting features that make image recognition more accurate? Our findings have important implications for human-machine collaboration, considering that a long-term goal in the field of artificial intelligence is to make machines capable of learning and reasoning like humans.