Integration of heterogeneous databases without common domains using queries based on textual similarity

Author:
William W. Cohen

AT&T Labs-Research, 180 Park Avenue, Florham Park NJ

AT&T Labs-Research, 180 Park Avenue, Florham Park NJ
View Profile

Authors Info & Claims

ACM SIGMOD Record Volume 27 Issue 2June 1998pp 201–212https://doi.org/10.1145/276305.276323

Published:01 June 1998Publication History

ACM SIGMOD Record

Abstract

Most databases contain “name constants” like course numbers, personal names, and place names that correspond to entities in the real world. Previous work in integration of heterogeneous databases has assumed that local name constants can be mapped into an appropriate global domain by normalization. However, in many cases, this assumption does not hold; determining if two name constants should be considered identical can require detailed knowledge of the world, the purpose of the user's query, or both. In this paper, we reject the assumption that global domains can be easily constructed, and assume instead that the names are given in natural language text. We then propose a logic called WHIRL which reasons explicitly about the similarity of local names, as measured using the vector-space model commonly adopted in statistical information retrieval. We describe an efficient implementation of WHIRL and evaluate it experimentally on data extracted from the World Wide Web. We show that WHIRL is much faster than naive inference methods, even for short queries. We also show that inferences made by WHIRL are surprisingly accurate, equaling the accuracy of hand-coded normalization routines on one benchmark problem, and outperforming exact matching with a plausible global domain on a second.

References

1 Serge Abiteboul and Victor Vianu. Regular path queries with constraints. In Proceedings of the Sixteenth A CM SIGA CT- SIGAIOD-SIGART Symposium on Principles of Database Systems (PODS-97), Tucson, AZ, May 1997. Google ScholarDigital Library
2 Yigal Arens, Craig A. Knoblock, and Chun-Nan Hsu. Query processing in the SIMS information mediator. In Austin Tate, editor, Advanced Planning Technology. AAAI Press, Menlo Park, CA, 1996.Google Scholar
3 Paolo Atzeni, Giansalvatore Mecca, and Paolo Merialdo. Semistructured and structured data on the Web: going back and forth. In Dan Suciu, editor, Proceedings of the Workshop on Management of Semistructured Data, Tucson, Arizona, May 1997. Available on-line from http://www.research.at t.com / suciu / workshop-papers.html.Google Scholar
4 Daniel Barbara, Hector Garcia-Molina, and Daryl Porter. The management of probabilistic data. IEEE Transations on knowledge and data engineering, 4(5):487-501, October 1992. Google ScholarDigital Library
5 Brian T. Bartell, Garrison W. Cottrell, and Richard K. Belew. Automatic combination of multiple ranked retrieval systems. In Seventeenth Annual International A CM SIGIR Conference on Research and Development in Information Retrieval, 1994. Google ScholarDigital Library
6 R. J. Bayardo, W. Bohrer, R. Brice, A. Cichocki, J. Fowler, A. Helal, V. Kashyap, T. Ksiezyk, G. Martin, M. Nodine, M. Rashid, M. Rusinkiewicz, R. Shea, C. Unnikrishan, A. Unruh, and D: Woelk. Infosleuth: an agent-based semantic integration of information in open and dynamic environments. In Proceedings of the 1997 A CM SIGMOD, May 1997. Google ScholarDigital Library
7 Justin Boyan, Dane Freitag, and Thorsten Joachims. A machine learning architecture for optimizing web search engines. Technical Report WS-96-05, American Association of Artificial Intelligence, 1994.Google Scholar
8 S. Chaudhuri, U. Dayal, and T. 'fan. Join queries with external text sources: execution and optimization techniques. In Proceedings of the 1995 ACM SIGMOD, May 1995. Google ScholarDigital Library
9 William W. Cohen. Knowledge integration for structured information sources containing text (extended abstract). In The SIGIR-97 Workshop on Networked Information Retrieval, 1997.Google Scholar
10 William W. Cohen. A Web-based information system that reasons with structured collections of text. In Proceedings of Autonomous Agents-98, St. Paul, MN, 1998. Google ScholarDigital Library
11 William W. Cohen, Rob Schapire, and Yoram Singer. Learning to order things. To appear in NIPS-97, 1997. Google ScholarDigital Library
12 William W. Cohen and Yoram Singer. Context-sensitive learning methods for text categorization, in Proceedings of the 19th Annual International A CM Conference on Research and Development in information Retrieval, pages 307-315, Zurich, Switzerland, 1996. ACM Press. Google ScholarDigital Library
13 Oliver M. Duschka and Michael R. Genesereth. Answering recursive queries using views. In Proceedings of the Sixteenth A CM SIGA CT-SIGMOD-SIGART Symposium on Principles of Database Systems (PODS-97), Tucson, AZ, May 1997. Google ScholarDigital Library
14 Oliver M. Duschka and Michael R. Genesereth. Query planning in infomaster. In Proceedings of the Twelfth Annual A CM Symposium on Applied Computing (SA C97), San Jose, CA, February 1997. Google ScholarDigital Library
15 Douglas Fang, Joachim Hammer, and Dennis McLeod. The identification and resolution of semantic heterogeneity in multidatabase systems. In Multidatabase Systems: An Advanced Solution for Global Information Sharing, pages 52- 60. IEEE Computer Society Press, Los Alamitos, California, 1994.Google Scholar
16 I. P. Felligi and A. B. Sunter. A theory for record linkage. Journal of the American Statistical Society, 64:1183-1210, 1969.Google ScholarCross Ref
17 Thorsten Fiebig, Jurgen Weiss, and Guido Moerkotte. RAW: a relational algebra for the Web. In Dan Suciu, editor, Proceedings of the Workshop on Management of Semistructured Data, Tucson, Arizona, May 1997. Available on-line from http://www.research.att.com/suciu/workshop-papers.html.Google Scholar
18 Norbert Fuhr. Probabilistic Datalog~a logic for powerful retrieval methods. In Proceedings of the 1995 A CM SIGIR conference on research in information retrieval, pages 282- 290, New York, 1995. Google ScholarDigital Library
19 H. Garcia-Molina, Y. Papakonstantinou, D. Quass, A. Rajaraman, Y. Sagiv, 3. Ullman, and J. Widom. The TSIM- MIS approach to mediation: Data models and languages (extended abstract). In Next Generation Information Technologies and Systems (NGITS-95), Naharia, Israel, November 1995.Google Scholar
20 M. Hernandez and S. Stolfo. The merge/purge problem for large databases. In Proceedings of the 1995 A CM SIGMOD, May 1995. Google ScholarDigital Library
21 Scott Huffman and David Steier. Heuristic joins to integrate structured heterogeneous data. In Working notes of the AAAI spring symposium on information gathering in heterogeneous distributed environments, Palo Alto, CA, March 1995. AAAI Press.Google Scholar
22 B. Kilss and W. Alvey (ed). Record linkage techniques-- 1985. Statistics of Income Division, Internal Revenue Service Publication 1299-2-96. Available from http: / / www.bt s.gov/fcsm / methodology/, 1985.Google Scholar
23 Donald E. Knuth. The Art of Computer Programming, Volume I: Fundamental Algorithms (second edition). Addison- Wesley, Reading MA, 1975. Google ScholarDigital Library
24 D. Konopnicki and O. Schmueli. W3QS: a query system for the world wide web. In Proceedings of the 21nd International Conference on Very Large Databases (VLDB-96), Zurich, Switzerland, 1995. Google ScholarDigital Library
25 Richard Korf. Linear-space best-first search. Artificial intelligence, 62(1):41-78, July 1993. Google ScholarDigital Library
26 Alon Y. Levy, Anand Rajaraman, and Joann J. Ordille. Query answering algorithms for information agents. In Proceedings of the 13th National Conference on Artificial Intelligence (AAAI-96), Portland, Oregon, august 1996. Google ScholarDigital Library
27 Alon Y. Levy, Anand Rajaraman, and Joann J. Ordille. Querying heterogeneous information sources using source descriptions. In Proceedings of the 22nd International Conference on Very Large Databases (VLDB-96), Bombay, India, September 1996. Google ScholarDigital Library
28 David Lewis. Representation and learning in information retrieval. Technical Report 91-93, Computer Science Dept., University of Massachusetts at Amherst, 1992. PhD Thesis. Google ScholarDigital Library
29 Alberto Mendelzon and Tovo Milo. Formal models of Web queries. In Proceedings of the Sixteenth A CM SIGACT- SIGMOD-SIGART Symposium on Principles of Database Systems (PODS-97), Tucson, AZ, May 1997. Google ScholarDigital Library
30 A. Monte and C. Elkan. The field-matching problem: algorithm and applications. In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, August 1996.Google Scholar
31 A. Monte and C. Elkan. An efficient domain-independent algorithm for detecting approximately duplicate database records. In The proceedings of the SIGMOD 1997 workshop on data mining and knowledge discovery, May 1997.Google Scholar
32 H. B. Newcombe, J. M. Kennedy, S. J. Axford, and A. P. James. Automatic linkage of vital records. Science, 130:954- 959, 1959.Google ScholarCross Ref
33 Nils Nilsson. Principles of Artificial intelligence. Morgan Kaufmann, 1987. Google ScholarDigital Library
34 M. F. Porter. An algorithm for suffix stripping. Program, 14(3):130-137, 1980.Google ScholarCross Ref
35 J. Ross Quinlan. Learning logical definitions from relations. Machine Learning, 5(3), 1990. Google ScholarDigital Library
36 Gerard Salton, editor. Automatic Text Processing. Addison Welsley, Reading, Massachusetts, 1989. Google ScholarDigital Library
37 Peter Sch~4uble. SPIDER: A multiuser information retrieval system for semistructured and dynamic data. In Proceedings of the 1993 A CM SIGIR conference on research in information retrieval, pages 318-327, Pittsburgh, PA, 1993. Google ScholarDigital Library
38 Dan Suciu. Query decomposition and view maintanance for query languages for unstructured data. in Proceedings of the 22nd International Conference on Very Large Databases (VLDB-g6), Bombay, India, 1996. Google ScholarDigital Library
39 Dan Suciu, editor. Proceedings of the Workshop on Management of Semistructured Data. Available on-line from http://ww w. research, at t. corn / suciu /workshop-papers.html, Tucson, Arizona, May 1997.Google Scholar
40 Anthony Tomasic, Remy Amouroux, Philippe Bonnet, and Olga Kapitskaia. The distributed information search component (Disco) and the World Wide Web. In Proceedings of the 1997 ACM SIGMOD, May 1997. Google ScholarDigital Library
41 Howard Turtle and James Flood. Query evaluation: strategies and optimizations. Information processing and management, 31(6):831-850, November 1995. Google ScholarDigital Library

Index Terms

Integration of heterogeneous databases without common domains using queries based on textual similarity

Recommendations

Integration of heterogeneous databases without common domains using queries based on textual similarity
SIGMOD '98: Proceedings of the 1998 ACM SIGMOD international conference on Management of data

Most databases contain “name constants” like course numbers, personal names, and place names that correspond to entities in the real world. Previous work in integration of heterogeneous databases has assumed that local name constants can be mapped into ...
Read More
Providing database-like access to the Web using queries based on textual similarity

Most databases contain “name constants” like course numbers, personal names, and place names that correspond to entities in the real world. Previous work in integration of heterogeneous databases has assumed that local name constants can be mapped into ...
Read More
Providing database-like access to the Web using queries based on textual similarity
SIGMOD '98: Proceedings of the 1998 ACM SIGMOD international conference on Management of data

Most databases contain “name constants” like course numbers, personal names, and place names that correspond to entities in the real world. Previous work in integration of heterogeneous databases has assumed that local name constants can be mapped into ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Article

Published in
ACM SIGMOD Record Volume 27, Issue 2
June 1998
595 pages
ISSN:0163-5808
DOI:10.1145/276305
Chairmen:
Laura Haas
IBM Almaden Research Center, San Jose, CA
,
Pamela Drew
Boeing Co.
,
Editor:
Ashutosh Tiwary
Boeing Co.; and Univ. of Washington, Seattle
Issue’s Table of Contents
SIGMOD '98: Proceedings of the 1998 ACM SIGMOD international conference on Management of data
June 1998
599 pages
ISBN:0897919955
DOI:10.1145/276304
Chairmen:
Laura Haas
IBM AlmadenResearch Center, San Jose, CA
,
Pamela Drew
Boeing Co.
,
Editors:
Ashutosh Tiwary
Boeing Co.; and Univ. of Washington, Seattle
,
Michael Franklin
Univ. of Maryland, College Park
Copyright © 1998 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 1 June 1998
Check for updates
Qualifiers
- article
Conference
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 241
  Total Citations
  View Citations
- 2,000
  Total Downloads
- Downloads (Last 12 months)91
- Downloads (Last 6 weeks)23
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Integration of heterogeneous databases without common domains using queries based on textual similarity

ACM SIGMOD Record

Abstract

References

Cited By

Index Terms

Recommendations

Integration of heterogeneous databases without common domains using queries based on textual similarity

Providing database-like access to the Web using queries based on textual similarity

Providing database-like access to the Web using queries based on textual similarity