FactRank: random walks on a web of facts

Authors:
Alpa Jain (Yahoo! Labs)
Patrick Pantel (Microsoft Research)

COLING '10: Proceedings of the 23rd International Conference on Computational Linguistics, August 2010, Pages 501–509

Published: 23 August 2010

ABSTRACT

Fact collections are mostly built using semi-supervised relation extraction techniques and wisdom-of-the-crowds methods, rendering them inherently noisy. In this paper, we propose to validate the resulting facts by leveraging global constraints inherent in large fact collections, observing that correct facts tend to match their arguments with other facts more often than incorrect facts do. We model this intuition as a graph-ranking problem over a fact graph and explore novel random walk algorithms. We present an empirical study, over a large set of facts extracted from a 500-million-document web crawl, validating the model and showing that it improves fact quality over state-of-the-art methods.
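The graph-ranking intuition can be illustrated with a small sketch. The abstract does not specify how the fact graph is built or which random walk variants the paper explores, so everything below is an assumption made for illustration: facts are `(relation, arg1, arg2)` triples, two facts are linked when they share an argument string, and scoring uses plain PageRank as a stand-in rather than the paper's own algorithms. The function names `build_fact_graph` and `pagerank` and the toy facts are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def build_fact_graph(facts):
    """Link two facts whenever they share an argument (illustrative assumption)."""
    by_arg = defaultdict(set)
    for i, (_relation, arg1, arg2) in enumerate(facts):
        by_arg[arg1].add(i)
        by_arg[arg2].add(i)
    neighbors = {i: set() for i in range(len(facts))}
    for ids in by_arg.values():
        for u, v in combinations(sorted(ids), 2):
            neighbors[u].add(v)
            neighbors[v].add(u)
    return neighbors


def pagerank(neighbors, damping=0.85, iterations=50):
    """Standard PageRank by power iteration on an undirected graph."""
    n = len(neighbors)
    scores = {u: 1.0 / n for u in neighbors}
    for _ in range(iterations):
        # Facts with no neighbors spread their score uniformly over all nodes.
        dangling = sum(scores[u] for u, nb in neighbors.items() if not nb)
        updated = {}
        for u in neighbors:
            incoming = sum(scores[v] / len(neighbors[v]) for v in neighbors[u])
            updated[u] = (1 - damping) / n + damping * (incoming + dangling / n)
        scores = updated
    return scores


# Toy input: four facts that share arguments and one noisy, isolated extraction.
facts = [
    ("acted-in", "Johnny Depp", "Finding Neverland"),
    ("acted-in", "Kate Winslet", "Finding Neverland"),
    ("directed", "Marc Forster", "Finding Neverland"),
    ("acted-in", "Johnny Depp", "Edward Scissorhands"),
    ("acted-in", "the", "movie"),  # deliberately noisy; shares no argument
]
graph = build_fact_graph(facts)
scores = pagerank(graph)
for i in sorted(scores, key=scores.get, reverse=True):
    print(f"{scores[i]:.3f}", facts[i])
```

In this toy setting the isolated extraction receives only the uniform teleport share, so it ranks below every fact whose arguments recur elsewhere, which mirrors the intuition stated in the abstract.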


Published in
COLING '10 proceedings, 1408 pages
General Chair: Aravind K. Joshi (University of Pennsylvania)
Program Chairs: Chu-Ren Huang (The Hong Kong Polytechnic University), Dan Jurafsky (Stanford University)
Publisher: Association for Computational Linguistics, United States
