research-article

Semantic text similarity using corpus-based word similarity and string similarity

Published: 24 July 2008

Abstract

We present a method for measuring the semantic similarity of texts using a corpus-based measure of semantic word similarity and a normalized and modified version of the Longest Common Subsequence (LCS) string matching algorithm. Existing methods for computing text similarity have focused mainly on either large documents or individual words. We focus on computing the similarity between two sentences or two short paragraphs. The proposed method can be exploited in a variety of applications involving textual knowledge representation and knowledge discovery. Evaluation results on two different data sets show that our method outperforms several competing methods.
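
To make the string-similarity component concrete, here is a minimal Python sketch of a length-normalized LCS score between two words. The abstract does not give the formula, so the normalization used here (squared LCS length divided by the product of the two string lengths) and the function names `lcs_length` and `normalized_lcs` are assumptions for illustration. The corpus-based word-similarity component is not shown.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b.

    Standard dynamic programming, keeping only the previous row.
    """
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                # Extend the subsequence ending before both characters.
                curr.append(prev[j - 1] + 1)
            else:
                # Drop one character from either string, keep the better result.
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def normalized_lcs(a: str, b: str) -> float:
    """LCS score scaled to [0, 1].

    Assumed normalization: len(LCS)^2 / (len(a) * len(b)).
    Empty input is treated as having no similarity.
    """
    if not a or not b:
        return 0.0
    n = lcs_length(a, b)
    return (n * n) / (len(a) * len(b))
```

Under this assumed normalization, identical strings score 1.0 and strings with no characters in common score 0.0, so the value can be combined with other similarity scores on the same scale.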

      • Published in

        ACM Transactions on Knowledge Discovery from Data, Volume 2, Issue 2
        July 2008
        152 pages
        ISSN: 1556-4681
        EISSN: 1556-472X
        DOI: 10.1145/1376815

        Copyright © 2008 ACM

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        • Published: 24 July 2008
        • Accepted: 1 May 2008
        • Revised: 1 April 2008
        • Received: 1 May 2007

        Qualifiers

        • research-article
        • Research
        • Refereed
