Abstract
Thanks to the information explosion, data about objects of interest can be collected from an ever-growing number of sources. However, the information that different sources provide about the same object often conflicts. To tackle this challenge, truth discovery, which integrates noisy multi-source information by estimating the reliability of each source, has emerged as an active research topic. Several truth discovery methods have been proposed for various scenarios, and they have been successfully applied in diverse application domains. In this survey, we provide a comprehensive overview of truth discovery methods and summarize them from different aspects. We also discuss some future directions of truth discovery research. We hope that this survey will promote a better understanding of current progress on truth discovery and offer some guidelines on how to apply these approaches in application domains.
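To make the core idea concrete, the following is a minimal sketch of one common pattern: alternating between a reliability-weighted vote for each object's value and a re-estimate of each source's reliability. It is an illustrative assumption, not a specific method from the survey. The function name `discover_truths`, the `(source, object, value)` claim format, the smoothing constants, and the toy data are all hypothetical.

```python
from collections import defaultdict


def discover_truths(claims, iterations=10):
    """Toy iterative truth discovery over categorical claims.

    claims: list of (source, object, value) triples (hypothetical format).
    Returns (truths, weights): estimated value per object and
    estimated reliability per source.
    """
    sources = {s for s, _, _ in claims}
    weights = {s: 1.0 for s in sources}  # start with equal trust in every source
    truths = {}
    for _ in range(iterations):
        # Truth step: each object takes the value with the largest weighted vote.
        votes = defaultdict(lambda: defaultdict(float))
        for s, o, v in claims:
            votes[o][v] += weights[s]
        truths = {o: max(vals, key=vals.get) for o, vals in votes.items()}

        # Weight step: reliability is the smoothed fraction of a source's
        # claims that agree with the current truth estimates.
        correct = defaultdict(int)
        total = defaultdict(int)
        for s, o, v in claims:
            total[s] += 1
            correct[s] += int(truths[o] == v)
        weights = {s: (correct[s] + 1) / (total[s] + 2) for s in sources}
    return truths, weights


# Toy data (made up for illustration): s3 disagrees with s1 and s2 on b and c.
claims = [
    ("s1", "a", 1), ("s1", "b", 2), ("s1", "c", 3),
    ("s2", "a", 1), ("s2", "b", 2), ("s2", "c", 3),
    ("s3", "a", 1), ("s3", "b", 9), ("s3", "c", 8),
]
truths, weights = discover_truths(claims)
print(truths)
print(weights)
```

On this toy input the sources that agree with the consensus end up with higher estimated reliability than the dissenting source. Published methods differ in how they model source reliability, data types, and dependencies between sources.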
Index Terms
- A Survey on Truth Discovery