Abstract
Crowdsourcing is popular for large-scale data collection and labeling, but detecting low-quality submissions remains a major challenge. Recent studies have demonstrated that workers' behavioral features are highly correlated with data quality and can be useful for quality control. However, these studies primarily leveraged coarsely extracted behavioral features and did not explore quality control at a finer granularity, i.e., the annotation unit level. In this paper, we investigate the feasibility and benefits of using fine-grained behavioral features, i.e., behavioral features extracted from a worker's interactions with each individual unit in a subtask, for quality control in crowdsourcing. We design and implement a framework named Fine-grained Behavior-based Quality Control (FBQC) that extracts fine-grained behavioral features to provide three quality control mechanisms: (1) quality prediction for objective tasks, (2) suspicious behavior detection for subjective tasks, and (3) unsupervised worker categorization. Using the FBQC framework, we conduct two real-world crowdsourcing experiments and demonstrate that fine-grained behavioral features are both feasible and beneficial for all three quality control mechanisms. Our work offers insights that can help job requesters and crowdsourcing platforms achieve better quality control.
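The abstract does not spell out which interaction signals FBQC records per annotation unit, so the following is a minimal sketch of what unit-level behavioral feature extraction could look like. The event schema and feature names (dwell_time, click_count, answer_changes) are illustrative assumptions, not the paper's actual implementation.

```python
# Sketch: aggregating raw interaction events into per-(worker, unit)
# behavioral features. Assumed event format (hypothetical, not from the
# paper): (worker_id, unit_id, event_type, timestamp_seconds).

from collections import defaultdict

def extract_unit_features(events):
    """Return {(worker_id, unit_id): feature_dict} from raw interaction events."""
    by_unit = defaultdict(list)
    for worker_id, unit_id, event_type, ts in events:
        by_unit[(worker_id, unit_id)].append((event_type, ts))

    features = {}
    for key, evs in by_unit.items():
        evs.sort(key=lambda e: e[1])          # order events by timestamp
        timestamps = [ts for _, ts in evs]
        features[key] = {
            # Time from first to last interaction with this unit.
            "dwell_time": timestamps[-1] - timestamps[0],
            # How often the worker clicked within this unit.
            "click_count": sum(1 for et, _ in evs if et == "click"),
            # How often the worker revised an answer for this unit.
            "answer_changes": sum(1 for et, _ in evs if et == "change_answer"),
        }
    return features

# Example: one worker on two units; u2 is answered suspiciously fast.
events = [
    ("w1", "u1", "focus", 0.0), ("w1", "u1", "click", 1.2),
    ("w1", "u1", "change_answer", 2.0), ("w1", "u1", "submit", 2.5),
    ("w1", "u2", "focus", 3.0), ("w1", "u2", "click", 3.1),
    ("w1", "u2", "submit", 3.2),
]
print(extract_unit_features(events))
```

Feature vectors of this kind could then feed the three mechanisms the abstract lists, e.g., a supervised classifier for quality prediction, outlier detection for suspicious behavior, and clustering for worker categorization.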