Crowdsourcing Human Annotation on Web Page Structure: Infrastructure Design and Behavior-Based Quality Control

Abstract
Parsing the semantic structure of a web page is a key component of web information extraction. Successful extraction algorithms usually require large-scale training and evaluation datasets, which are difficult to acquire. Recently, crowdsourcing has proven to be an effective method of collecting large-scale training data in domains that require little domain knowledge. For more complex domains, researchers have proposed sophisticated quality control mechanisms that replicate tasks in parallel or in sequence and then aggregate the responses from multiple workers. Conventional annotation integration methods put more trust in workers with strong historical performance and are therefore called performance-based methods. Recently, Rzeszotarski and Kittur demonstrated that behavioral features are also highly correlated with annotation quality in several crowdsourcing applications. In this article, we present a new crowdsourcing system, called Wernicke, that provides annotations for web information extraction. Wernicke collects a wide set of behavioral features and, based on these features, predicts annotation quality for a challenging task domain: annotating web page structure. We evaluate the effectiveness of quality control using behavioral features through a case study in which 32 workers annotate 200 Q&A web pages from five popular websites. In doing so, we find that: (1) Many behavioral features are significant predictors of crowdsourcing quality. (2) The behavioral-feature-based method outperforms performance-based methods in predicting recall and performs comparably in predicting precision. In addition, using behavioral features is less vulnerable to the cold-start problem, and the corresponding prediction model generalizes better for predicting recall than precision in cross-website quality analysis. (3) Workers' behavioral information and historical performance information can be effectively combined to further reduce prediction errors.
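The aggregation step the abstract alludes to can be made concrete. Below is a minimal sketch, not taken from the article, of how per-worker trust weights (whether derived from historical performance or predicted from behavioral features such as dwell time and mouse-event counts) can be folded into a weighted majority vote over one item's labels. All worker IDs, labels, and weight values are hypothetical.

```python
from collections import defaultdict

def weighted_majority_vote(labels, weights):
    """Aggregate one item's labels from multiple workers.

    labels:  dict mapping worker id -> proposed label
    weights: dict mapping worker id -> trust weight, e.g. historical
             accuracy or a quality score predicted from behavioral features
    """
    scores = defaultdict(float)
    for worker, label in labels.items():
        # Unknown workers default to weight 1.0 (uninformative prior).
        scores[label] += weights.get(worker, 1.0)
    return max(scores, key=scores.get)

# Example: three workers annotate the same page region.
labels = {"w1": "question", "w2": "answer", "w3": "question"}
weights = {"w1": 0.9, "w2": 0.5, "w3": 0.8}
print(weighted_majority_vote(labels, weights))  # prints "question" (0.9 + 0.8 > 0.5)
```

A performance-based method would estimate the weights from workers' past accuracy, which fails for new workers (the cold-start problem the abstract mentions); a behavior-based method can score a worker from the current session alone.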
- Michele Banko, Michael J. Cafarella, Stephen Soderland, Matthew Broadhead, and Oren Etzioni. 2007. Open information extraction for the web. In International Joint Conference on Artificial Intelligence, Vol. 7. 2670--2676.
- Michael S. Bernstein, Greg Little, Robert C. Miller, Björn Hartmann, Mark S. Ackerman, David R. Karger, David Crowell, and Katrina Panovich. 2010. Soylent: A word processor with a crowd inside. In Proceedings of the 23rd Annual ACM Symposium on User Interface Software and Technology. ACM, 313--322.
- Chia-Hui Chang, Mohammed Kayed, Moheb R. Girgis, and Khaled F. Shaalan. 2006. A survey of web information extraction systems. IEEE Transactions on Knowledge and Data Engineering 18, 10 (2006), 1411--1428.
- Peng Dai, Christopher H. Lin, and Daniel S. Weld. 2013. POMDP-based control of workflows for crowdsourcing. Artificial Intelligence 202 (2013), 52--85.
- Peng Dai, Jeffrey Rzeszotarski, Praveen Paritosh, and Ed Chi. 2015. And now for something completely different: Improving crowdsourcing workflows with micro-diversions. In Proceedings of the 17th ACM Conference on Computer Supported Cooperative Work & Social Computing (CSCW'15). ACM.
- Alexander Philip Dawid and Allan M. Skene. 1979. Maximum likelihood estimation of observer error-rates using the EM algorithm. Applied Statistics 28, 1 (1979), 20--28.
- Ofer Dekel and Ohad Shamir. 2009. Vox populi: Collecting high-quality labels from a crowd. In Proceedings of the 22nd Annual Conference on Learning Theory.
- Julie S. Downs, Mandy B. Holbrook, Steve Sheng, and Lorrie Faith Cranor. 2010. Are your participants gaming the system? Screening mechanical turk workers. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. ACM, 2399--2402.
- Qi Guo, Haojian Jin, Dmitry Lagun, Shuai Yuan, and Eugene Agichtein. 2013. Mining touch interaction data on mobile devices to predict web search result relevance. In Proceedings of the 36th International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 153--162.
- Shuguang Han, Zhen Yue, and Daqing He. 2015. Understanding and supporting cross-device web search for exploratory tasks with mobile touch interactions. ACM Transactions on Information Systems (TOIS) 33, 4 (2015), 16.
- Andrew Hogue and David Karger. 2005. Thresher: Automating the unwrapping of semantic content from the world wide web. In Proceedings of the 14th International Conference on World Wide Web. ACM, 86--95.
- Jeff Howe. 2006. The rise of crowdsourcing. Wired Magazine 14, 6 (2006), 1--4.
- Yifan Hu, Yehuda Koren, and Chris Volinsky. 2008. Collaborative filtering for implicit feedback datasets. In Proceedings of the 8th IEEE International Conference on Data Mining (ICDM'08). IEEE, 263--272.
- Jeff Huang, Ryen W. White, and Susan Dumais. 2011. No clicks, no problem: Using cursor movements to understand and improve search. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. ACM, 1225--1234.
- David Huynh, Stefano Mazzocchi, and David Karger. 2005. Piggy bank: Experience the semantic web inside your web browser. In The Semantic Web (ISWC'05). Springer, 413--430.
- David F. Huynh, Robert C. Miller, and David R. Karger. 2006. Enabling web browsers to augment web sites' filtering and sorting functionalities. In Proceedings of the 19th Annual ACM Symposium on User Interface Software and Technology. ACM, 125--134.
- Rob J. Hyndman and Anne B. Koehler. 2006. Another look at measures of forecast accuracy. International Journal of Forecasting 22, 4 (2006), 679--688.
- Panagiotis G. Ipeirotis, Foster Provost, and Jing Wang. 2010. Quality management on Amazon mechanical turk. In Proceedings of the ACM SIGKDD Workshop on Human Computation. ACM, 64--67.
- D. N. Joanes and C. A. Gill. 1998. Comparing measures of sample skewness and kurtosis. Journal of the Royal Statistical Society: Series D (The Statistician) 47, 1 (1998), 183--189.
- Aniket Kittur, Ed H. Chi, and Bongwon Suh. 2008. Crowdsourcing user studies with mechanical turk. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. ACM, 453--456.
- Christian Kohlschütter, Peter Fankhauser, and Wolfgang Nejdl. 2010. Boilerplate detection using shallow text features. In Proceedings of the 3rd ACM International Conference on Web Search and Data Mining. ACM, 441--450.
- Hongwei Li and Bin Yu. 2014. Error rate bounds and iterative weighted majority voting for crowdsourcing. arXiv preprint arXiv:1411.4086 (2014).
- Greg Little, Lydia B. Chilton, Max Goldman, and Robert C. Miller. 2009. TurKit: Tools for iterative tasks on mechanical turk. In Proceedings of the ACM SIGKDD Workshop on Human Computation. ACM, 29--30.
- Bryan C. Russell, Antonio Torralba, Kevin P. Murphy, and William T. Freeman. 2008. LabelMe: A database and web-based tool for image annotation. International Journal of Computer Vision 77, 1--3 (2008), 157--173.
- Jeffrey Rzeszotarski and Aniket Kittur. 2012. CrowdScape: Interactively visualizing user behavior and output. In Proceedings of the 25th Annual ACM Symposium on User Interface Software and Technology. ACM, 55--62.
- Jeffrey M. Rzeszotarski and Aniket Kittur. 2011. Instrumenting the crowd: Using implicit behavioral measures to predict task performance. In Proceedings of the 24th Annual ACM Symposium on User Interface Software and Technology. ACM, 13--22.
- Xinying Song, Jing Liu, Yunbo Cao, Chin-Yew Lin, and Hsiao-Wuen Hon. 2010. Automatic extraction of web data records containing user-generated content. In Proceedings of the 19th ACM International Conference on Information and Knowledge Management. ACM, 39--48.
- Pontus Stenetorp, Sampo Pyysalo, Goran Topić, Tomoko Ohta, Sophia Ananiadou, and Jun'ichi Tsujii. 2012. BRAT: A web-based tool for NLP-assisted text annotation. In Proceedings of the Demonstrations at the 13th Conference of the European Chapter of the Association for Computational Linguistics. Association for Computational Linguistics, 102--107.
- Fei Wu, Raphael Hoffmann, and Daniel S. Weld. 2008. Information extraction from Wikipedia: Moving down the long tail. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 731--739.
- Fei Wu and Daniel S. Weld. 2010. Open information extraction using Wikipedia. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, 118--127.
- Xing Yi, Liangjie Hong, Erheng Zhong, Nathan Nan Liu, and Suju Rajan. 2014. Beyond clicks: Dwell time for personalization. In Proceedings of the 8th ACM Conference on Recommender Systems. ACM, 113--120.