Semantic text similarity using corpus-based word similarity and string similarity

Authors:
Aminul Islam, University of Ottawa, ON, Canada
Diana Inkpen, University of Ottawa, ON, Canada

ACM Transactions on Knowledge Discovery from Data, Volume 2, Issue 2, Article No. 10, pp. 1–25
https://doi.org/10.1145/1376815.1376819

Published: 24 July 2008

Abstract

We present a method for measuring the semantic similarity of texts using a corpus-based measure of semantic word similarity and a normalized and modified version of the Longest Common Subsequence (LCS) string matching algorithm. Existing methods for computing text similarity have focused mainly on either large documents or individual words. We focus on computing the similarity between two sentences or two short paragraphs. The proposed method can be exploited in a variety of applications involving textual knowledge representation and knowledge discovery. Evaluation results on two different data sets show that our method outperforms several competing methods.
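
To make the string-similarity component concrete, here is a minimal sketch of a length-normalized Longest Common Subsequence score for two words. The abstract does not state the exact normalization or the modifications the method applies, so the formula used here (squared LCS length divided by the product of the two word lengths) and the function names `lcs_length` and `normalized_lcs` are assumptions for illustration, not the authors' definition.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b (row-by-row DP)."""
    prev = [0] * (len(b) + 1)  # DP row for the previous character of a
    for ch_a in a:
        curr = [0]  # column 0: empty prefix of b
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                # Extend the best subsequence that ends before both characters.
                curr.append(prev[j - 1] + 1)
            else:
                # Skip one character from either a or b, whichever is better.
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def normalized_lcs(a: str, b: str) -> float:
    """Assumed normalization: LCS length squared over the product of lengths.

    The score lies in [0, 1]; identical non-empty strings score 1.0.
    """
    if not a or not b:
        return 0.0
    n = lcs_length(a, b)
    return (n * n) / (len(a) * len(b))
```

Squaring the LCS length keeps the score symmetric in its two arguments and penalizes a short match inside a long word more heavily than dividing by a single length would.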

Index Terms

1. Computing methodologies
  1. Artificial intelligence
    1. Natural language processing
      1. Language resources
2. Human-centered computing
  1. Human computer interaction (HCI)
    1. Interaction paradigms
      1. Natural language interfaces

Published in

ACM Transactions on Knowledge Discovery from Data Volume 2, Issue 2
July 2008
152 pages
ISSN: 1556-4681
EISSN: 1556-472X
DOI: 10.1145/1376815

Copyright © 2008 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 24 July 2008
- Accepted: 1 May 2008
- Revised: 1 April 2008
- Received: 1 May 2007
Author Tags
Semantic similarity of words
corpus-based measures
similarity of short texts
Qualifiers
- research-article
- Research
- Refereed
Article Metrics
- Total Citations: 291
- Total Downloads: 6,125
- Downloads (last 12 months): 257
- Downloads (last 6 weeks): 36