ABSTRACT
Fact collections are mostly built using semi-supervised relation extraction techniques and wisdom-of-the-crowds methods, rendering them inherently noisy. In this paper, we propose to validate the resulting facts by leveraging global constraints inherent in large fact collections, observing that correct facts tend to match their arguments with other facts more often than incorrect ones do. We model this intuition as a graph-ranking problem over a fact graph and explore novel random walk algorithms. We present an empirical study over a large set of facts extracted from a 500-million-document web crawl, validating the model and showing that it improves fact quality over state-of-the-art methods.
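To make the graph-ranking intuition concrete, the following is a minimal sketch and not the algorithms of this paper: it uses plain PageRank as a stand-in for the random walk. It assumes each fact is a (relation, argument 1, argument 2) triple and links two facts whenever they share an argument. The names build_fact_graph and pagerank, the damping value, and the toy facts are all hypothetical choices made for illustration.

```python
from collections import defaultdict
from itertools import combinations


def build_fact_graph(facts):
    """Link two facts when they share at least one argument (assumed edge rule)."""
    by_arg = defaultdict(set)
    for i, (_relation, arg1, arg2) in enumerate(facts):
        by_arg[arg1].add(i)
        by_arg[arg2].add(i)
    neighbors = defaultdict(set)
    for ids in by_arg.values():
        for a, b in combinations(sorted(ids), 2):
            neighbors[a].add(b)
            neighbors[b].add(a)
    return neighbors


def pagerank(n, neighbors, damping=0.85, iterations=50):
    """Plain PageRank by power iteration over an undirected fact graph."""
    scores = [1.0 / n] * n
    for _ in range(iterations):
        # Facts with no neighbors spread their score uniformly over all facts.
        dangling = sum(scores[i] for i in range(n) if not neighbors[i])
        new = [(1 - damping) / n + damping * dangling / n] * n
        for i in range(n):
            for j in neighbors[i]:
                new[j] += damping * scores[i] / len(neighbors[i])
        scores = new
    return scores


# Toy facts (invented for illustration); the last one shares no argument
# with any other fact, as a noisy extraction might.
facts = [
    ("acted-in", "Brad Pitt", "Troy"),
    ("directed", "Wolfgang Petersen", "Troy"),
    ("acted-in", "Brad Pitt", "Seven"),
    ("acted-in", "Brd Ptt", "Tryo"),
]
scores = pagerank(len(facts), build_fact_graph(facts))
ranked = sorted(range(len(facts)), key=lambda i: scores[i], reverse=True)
for i in ranked:
    print(f"{scores[i]:.3f}  {facts[i]}")
```

In this toy graph, the fact whose arguments recur in two other facts ranks first, and the isolated fact ranks last, which is the behavior the abstract's observation predicts for well-connected versus unsupported facts.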
- FactRank: random walks on a web of facts