Crowdsourcing Human Annotation on Web Page Structure: Infrastructure Design and Behavior-Based Quality Control

Abstract
Parsing the semantic structure of a web page is a key component of web information extraction. Successful extraction algorithms usually require large-scale training and evaluation datasets, which are difficult to acquire. Recently, crowdsourcing has proven to be an effective method of collecting large-scale training data in domains that require little domain knowledge. For more complex domains, researchers have proposed sophisticated quality control mechanisms that replicate tasks in parallel or in sequence and then aggregate the responses from multiple workers. Conventional annotation integration methods put more trust in workers with strong historical performance and are therefore called performance-based methods. Recently, Rzeszotarski and Kittur demonstrated that behavioral features are also highly correlated with annotation quality in several crowdsourcing applications. In this article, we present a new crowdsourcing system, called Wernicke, that provides annotations for web information extraction. Wernicke collects a wide set of behavioral features and, based on these features, predicts annotation quality for a challenging task domain: annotating web page structure. We evaluate the effectiveness of quality control using behavioral features through a case study in which 32 workers annotate 200 Q&A web pages from five popular websites. In doing so, we find that: (1) Many behavioral features are significant predictors of crowdsourcing quality. (2) The behavioral-feature-based method outperforms performance-based methods in predicting recall and performs comparably in predicting precision. In addition, using behavioral features is less vulnerable to the cold-start problem, and the corresponding prediction model generalizes better for predicting recall than precision in cross-website quality analysis. (3) Workers' behavioral information and historical performance information can be effectively combined to further reduce prediction errors.
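The aggregation step the abstract alludes to can be made concrete. Below is a minimal sketch, not taken from the article, of how per-worker trust weights (whether derived from historical performance or predicted from behavioral features such as dwell time and mouse-event counts) can be folded into a weighted majority vote over one item's labels. All worker IDs, labels, and weight values are hypothetical.

```python
from collections import defaultdict

def weighted_majority_vote(labels, weights):
    """Aggregate one item's labels from multiple workers.

    labels:  dict mapping worker id -> proposed label
    weights: dict mapping worker id -> trust weight, e.g. historical
             accuracy or a quality score predicted from behavioral features
    """
    scores = defaultdict(float)
    for worker, label in labels.items():
        # Unknown workers default to weight 1.0 (uninformative prior).
        scores[label] += weights.get(worker, 1.0)
    return max(scores, key=scores.get)

# Example: three workers annotate the same page region.
labels = {"w1": "question", "w2": "answer", "w3": "question"}
weights = {"w1": 0.9, "w2": 0.5, "w3": 0.8}
print(weighted_majority_vote(labels, weights))  # prints "question" (0.9 + 0.8 > 0.5)
```

A performance-based method would estimate the weights from workers' past accuracy, which fails for new workers (the cold-start problem the abstract mentions); a behavior-based method can score a worker from the current session alone.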
- Michele Banko, Michael J. Cafarella, Stephen Soderland, Matthew Broadhead, and Oren Etzioni. 2007. Open information extraction for the web. In International Joint Conference on Artificial Intelligence, Vol. 7. 2670--2676.
- Michael S. Bernstein, Greg Little, Robert C. Miller, Björn Hartmann, Mark S. Ackerman, David R. Karger, David Crowell, and Katrina Panovich. 2010. Soylent: A word processor with a crowd inside. In Proceedings of the 23rd Annual ACM Symposium on User Interface Software and Technology. ACM, 313--322.
- Chia-Hui Chang, Mohammed Kayed, Moheb R. Girgis, and Khaled F. Shaalan. 2006. A survey of web information extraction systems. IEEE Transactions on Knowledge and Data Engineering 18, 10 (2006), 1411--1428.
- Peng Dai, Christopher H. Lin, and Daniel S. Weld. 2013. POMDP-based control of workflows for crowdsourcing. Artificial Intelligence 202 (2013), 52--85.
- Peng Dai, Jeffrey Rzeszotarski, Praveen Paritosh, and Ed Chi. 2015. And now for something completely different: Improving crowdsourcing workflows with micro-diversions. In Proceedings of the 17th ACM Conference on Computer Supported Cooperative Work & Social Computing (CSCW'15). ACM.
- Alexander Philip Dawid and Allan M. Skene. 1979. Maximum likelihood estimation of observer error-rates using the EM algorithm. Applied Statistics 28, 1 (1979), 20--28.
- Ofer Dekel and Ohad Shamir. 2009. Vox populi: Collecting high-quality labels from a crowd. In Proceedings of the 22nd Annual Conference on Learning Theory.
- Julie S. Downs, Mandy B. Holbrook, Steve Sheng, and Lorrie Faith Cranor. 2010. Are your participants gaming the system? Screening mechanical turk workers. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. ACM, 2399--2402.
- Qi Guo, Haojian Jin, Dmitry Lagun, Shuai Yuan, and Eugene Agichtein. 2013. Mining touch interaction data on mobile devices to predict web search result relevance. In Proceedings of the 36th International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 153--162.
- Shuguang Han, Zhen Yue, and Daqing He. 2015. Understanding and supporting cross-device web search for exploratory tasks with mobile touch interactions. ACM Transactions on Information Systems (TOIS) 33, 4 (2015), 16.
- Andrew Hogue and David Karger. 2005. Thresher: Automating the unwrapping of semantic content from the world wide web. In Proceedings of the 14th International Conference on World Wide Web. ACM, 86--95.
- Jeff Howe. 2006. The rise of crowdsourcing. Wired Magazine 14, 6 (2006), 1--4.
- Yifan Hu, Yehuda Koren, and Chris Volinsky. 2008. Collaborative filtering for implicit feedback datasets. In Proceedings of the 8th IEEE International Conference on Data Mining (ICDM'08). IEEE, 263--272.
- Jeff Huang, Ryen W. White, and Susan Dumais. 2011. No clicks, no problem: Using cursor movements to understand and improve search. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. ACM, 1225--1234.
- David Huynh, Stefano Mazzocchi, and David Karger. 2005. Piggy bank: Experience the semantic web inside your web browser. In The Semantic Web (ISWC'05). Springer, 413--430.
- David F. Huynh, Robert C. Miller, and David R. Karger. 2006. Enabling web browsers to augment web sites' filtering and sorting functionalities. In Proceedings of the 19th Annual ACM Symposium on User Interface Software and Technology. ACM, 125--134.
- Rob J. Hyndman and Anne B. Koehler. 2006. Another look at measures of forecast accuracy. International Journal of Forecasting 22, 4 (2006), 679--688.
- Panagiotis G. Ipeirotis, Foster Provost, and Jing Wang. 2010. Quality management on Amazon mechanical turk. In Proceedings of the ACM SIGKDD Workshop on Human Computation. ACM, 64--67.
- D. N. Joanes and C. A. Gill. 1998. Comparing measures of sample skewness and kurtosis. Journal of the Royal Statistical Society: Series D (The Statistician) 47, 1 (1998), 183--189.
- Aniket Kittur, Ed H. Chi, and Bongwon Suh. 2008. Crowdsourcing user studies with mechanical turk. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. ACM, 453--456.
- Christian Kohlschütter, Peter Fankhauser, and Wolfgang Nejdl. 2010. Boilerplate detection using shallow text features. In Proceedings of the 3rd ACM International Conference on Web Search and Data Mining. ACM, 441--450.
- Hongwei Li and Bin Yu. 2014. Error rate bounds and iterative weighted majority voting for crowdsourcing. arXiv preprint arXiv:1411.4086 (2014).
- Greg Little, Lydia B. Chilton, Max Goldman, and Robert C. Miller. 2009. TurKit: Tools for iterative tasks on mechanical turk. In Proceedings of the ACM SIGKDD Workshop on Human Computation. ACM, 29--30.
- Bryan C. Russell, Antonio Torralba, Kevin P. Murphy, and William T. Freeman. 2008. LabelMe: A database and web-based tool for image annotation. International Journal of Computer Vision 77, 1--3 (2008), 157--173.
- Jeffrey Rzeszotarski and Aniket Kittur. 2012. CrowdScape: Interactively visualizing user behavior and output. In Proceedings of the 25th Annual ACM Symposium on User Interface Software and Technology. ACM, 55--62.
- Jeffrey M. Rzeszotarski and Aniket Kittur. 2011. Instrumenting the crowd: Using implicit behavioral measures to predict task performance. In Proceedings of the 24th Annual ACM Symposium on User Interface Software and Technology. ACM, 13--22.
- Xinying Song, Jing Liu, Yunbo Cao, Chin-Yew Lin, and Hsiao-Wuen Hon. 2010. Automatic extraction of web data records containing user-generated content. In Proceedings of the 19th ACM International Conference on Information and Knowledge Management. ACM, 39--48.
- Pontus Stenetorp, Sampo Pyysalo, Goran Topić, Tomoko Ohta, Sophia Ananiadou, and Jun'ichi Tsujii. 2012. BRAT: A web-based tool for NLP-assisted text annotation. In Proceedings of the Demonstrations at the 13th Conference of the European Chapter of the Association for Computational Linguistics. Association for Computational Linguistics, 102--107.
- Fei Wu, Raphael Hoffmann, and Daniel S. Weld. 2008. Information extraction from Wikipedia: Moving down the long tail. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 731--739.
- Fei Wu and Daniel S. Weld. 2010. Open information extraction using Wikipedia. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, 118--127.
- Xing Yi, Liangjie Hong, Erheng Zhong, Nathan Nan Liu, and Suju Rajan. 2014. Beyond clicks: Dwell time for personalization. In Proceedings of the 8th ACM Conference on Recommender Systems. ACM, 113--120.