ABSTRACT
A growing number of domains, including affect recognition and movement analysis, require a single real-valued ground-truth label capturing some property of a video clip. We term this the provision of continuum labels. Unfortunately, with current tools there is often an unacceptable trade-off between label consistency and the efficiency of the labelling process. We present a novel interaction technique, setwise comparison, which leverages the intrinsic human capability for consistent relative judgements and the TrueSkill algorithm to solve this problem. We describe SorTable, a system demonstrating this technique. We conducted a real-world study where clinicians labelled videos of patients with multiple sclerosis for the ASSESS MS computer vision system. In assessing the efficiency-consistency trade-off of setwise versus pairwise comparison, we demonstrated that not only is setwise comparison more efficient, but it also elicits more consistent labels. We further consider how our findings relate to the interactive machine learning literature.
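To illustrate why sorting a set can be more efficient than comparing pairs, the following minimal sketch expands one setwise ordering into the pairwise outcomes it implies and turns them into continuum scores. It is an illustration under stated assumptions, not the SorTable implementation: the clip names, the starting score of 1000, and the Elo-style update are all hypothetical, and the Elo rule is a simple stand-in for TrueSkill, not the TrueSkill algorithm itself.

```python
from itertools import combinations


def implied_pairs(ranking):
    """Expand one setwise ordering (lowest to highest) into the
    pairwise outcomes it implies, as (lower, higher) tuples."""
    return list(combinations(ranking, 2))


def elo_update(scores, ranking, k=16.0):
    """Update scores from one setwise ordering.

    Assumption: an Elo-style rule stands in for TrueSkill here.
    All expectations are computed from the scores held before
    this ordering was observed.
    """
    updated = dict(scores)
    for lower, higher in implied_pairs(ranking):
        # Probability, under the Elo model, that `higher` outranks `lower`.
        expected = 1.0 / (1.0 + 10 ** ((scores[lower] - scores[higher]) / 400.0))
        delta = k * (1.0 - expected)
        updated[higher] += delta
        updated[lower] -= delta
    return updated


# Hypothetical clips, all starting from the same score.
clips = ["clip_a", "clip_b", "clip_c", "clip_d"]
scores = {clip: 1000.0 for clip in clips}

# One labeller action: sort four clips from lowest to highest.
ranking = ["clip_c", "clip_a", "clip_d", "clip_b"]

pairs = implied_pairs(ranking)
scores = elo_update(scores, ranking)
ordered = sorted(scores, key=scores.get)

print(len(pairs))  # pairwise judgements implied by a single sort
print(ordered)     # clips ordered by their updated scores
```

A single sort of k clips yields k(k-1)/2 implied pairwise outcomes, so one action on four clips stands in for six separate pairwise judgements. A Bayesian rating system such as TrueSkill would additionally track uncertainty for each clip, which this sketch omits.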
Index Terms
- Setwise Comparison: Consistent, Scalable, Continuum Labels for Computer Vision