skip to main content
10.1145/1281192.1281285acmconferencesArticle/Chapter ViewAbstractPublication PageskddConference Proceedingsconference-collections
Article

Information distance from a question to an answer

Published:12 August 2007Publication History

ABSTRACT

We provide three key missing pieces of a general theory of information distance [3, 23, 24]. We take bold steps in formulating a revised theory to avoid some pitfalls in practical applications. The new theory is then used to construct a question answering system. Extensive experiments are conducted to justify the new theory.

Skip Supplemental Material Section

Supplemental Material

p874-zhang-200.mov

mov

31.1 MB

p874-zhang-768.mov

mov

103.6 MB

References

  1. C. Ané and M. J. Sanderson, Missing the Forest for the Trees: Phylogenetic Compression and Its Implications for Inferring Complex Evolutionary Histories, Systematic Biology, 54:1(2005), 146--157.Google ScholarGoogle ScholarCross RefCross Ref
  2. T. Arbuchle, A. Balaban, D. K. Peters, M. Lawford, Software documents: comparison and measurement. To appear in SEKE 2007.Google ScholarGoogle Scholar
  3. C. H. Bennett, P. Gacs, M. Li, P. Vitanyi, and W. Zurek, Information Distance, IEEE Trans. Inform. Theory, 44:4(July 1998), 1407--1423. (STOC'93) Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. C. H. Bennett, M. Li and B. Ma, Chain letters and evolutionary histories, Scientific American, 288:6(June 2003) (feature article), 76--81.Google ScholarGoogle ScholarCross RefCross Ref
  5. D. Benedetto, E. Caglioti, V. Loreto, Language trees and zipping, Phys. Rev. Lett., 88:4(2002), 048702.Google ScholarGoogle ScholarCross RefCross Ref
  6. C. C. Chang and C. J. Lin, LIBSVM: a library for support vector machines, 2001, http://www.csie.ntu.edu.tw/cjlin/libsvm.Google ScholarGoogle Scholar
  7. P. Cimiano, S. Staab, Learning by googling, ACM SIGKDD explorations newsletter, 6:2(2004), 24--33. Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. X. Chen, B. Francia, M. Li, B. Mckinnon, A. Seker. Shared information and program plagiarism detection. IEEE Trans. Information Theory, 50:7(July 2004), 1545--1550. Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. A. V. Chernov, An. A. Muchnik, A. E. Romashchenko, A. K. Shen, N. K. Vereshchagin, Upper semi-lattice of binary strings with the relation "x is simple conditional to y", Theoret. Comput. Sci., 271(2002), 69--95. Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. R. Cilibrasi, P. M. B. Vitányi, and R. de Wolf Algorithmic clustring of music based on string compression. Comput. Music J., 28:4(2004), 49--67. Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. R. Cilibrasi and P. M. B. Vitányi, The Google similarity distance, IEEE Trans. Knowledge and Data Engineering, 19:3(2007), 370--383. Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. R. Cilibrasi and P. M. B. Vitányi, Clustering by compression, IEEE Trans. Inform. Theory, 51:4(2005), 1523--1545. Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. C. Clarke, G. V. Cormack, G. Kemkes, M. Laszlo, T. R. Lynam, E. L. Terra, and P. L. Tilker, Statistical selection of exact answers (multitext experiments for TREC 2002). Report, University of Waterloo, 2002.Google ScholarGoogle Scholar
  14. M. Cuturi and J. P. Vert, The context-tree kernel for strings. Neural Networks, 18:4(2005), 1111--1123. Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. K. Emanuel, S. Ravela, E. Vivant, C. Risi, A combined statistical-deterministic approach of hurricane risk assessment, manuscript, Program in Atmospheres, Oceans, and Climate, MIT, 2005.Google ScholarGoogle Scholar
  16. J. R. Finkel, T. Grenager and C. Manning, Incorporating non-local information into information extraction systems by Gibbs sampling, Proc. 43rd Annual Meeting of ACL, 2005, pp. 363--370. Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. R. Fagin and L. Stockmeyer, Relaxing the triangle inequality in pattern matching. Int'l J. Comput. Vision, 28:3(1998), 219--231. Google ScholarGoogle ScholarDigital LibraryDigital Library
  18. E. J. Keogh, S. Lonardi, C. A. Ratanamahatana, Towards parameter-free data mining, in: Proceedings of KDD'2004, 2004, pp. 206--215. Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. S. R. Kirk and S. Jenkins, Information theory-baed software metrics and obfuscation, J. Systems and Software, 72(2004), 179--186.Google ScholarGoogle ScholarCross RefCross Ref
  20. A. Kraskov, H. Stögbauer, R. G. Andrzejak, and P. Grassberger, Hierarchical clustering using mutual information, Europhys. Lett. 70:2(2005), 278--284.Google ScholarGoogle ScholarCross RefCross Ref
  21. A. Kocsor, A. Kertesz-Farkas, L. Kajan, S. Pongor, Application of compression-based distance measures to protein sequence classification: a methodology study, Bioinformatics, 22:4(2006), 407--412. Google ScholarGoogle ScholarDigital LibraryDigital Library
  22. N. Krasnogor, D. A. Pelta, Measuring the similarity of protein structures by means of the universal similarity metric. Bioinformatics 20:7(2004), 1015--1021. Google ScholarGoogle ScholarDigital LibraryDigital Library
  23. M. Li, J. Badger, X. Chen, S. Kwong, P. Kearney, H. Zhang, An nformation-based sequence distance and its application to whole mitochondrial genome phylogeny, Bioinformatics, 17:2(2001), 149--154.Google ScholarGoogle ScholarCross RefCross Ref
  24. M. Li, X. Chen, X. Li, B. Ma, P. Vitányi, The similarity metric, IEEE Trans. Information Theory, 50:12(2004), 3250--3264. Google ScholarGoogle ScholarDigital LibraryDigital Library
  25. M. Li and P. Vitányi, An introduction to Kolmogorov complexity and its applications, Springer-Verlag, 2nd Edition 1997. Google ScholarGoogle ScholarDigital LibraryDigital Library
  26. J. Lin and B. Katz, Question answering from the web using knowledge annotation and knowledge mining techniques, Proc. 12th Int'l Conf. Information and Knowledge Management, 2003, pp. 116--123. Google ScholarGoogle ScholarDigital LibraryDigital Library
  27. J. Lin and B. Katz, Building a reusable test collection for question answering, Journal of the American Society for Information Science and Technology, 57:7(2006), 851--861. Google ScholarGoogle ScholarDigital LibraryDigital Library
  28. J. Lin. The web as a resource for question answering: perspectives and challenges. In 3rd Int'l Conf. Language Resources and Evolution. May, 2002.Google ScholarGoogle Scholar
  29. An. A. Muchnik, Conditional comlexity and codes, Theoretical Computer Science 271:1(2002), 97--109. Google ScholarGoogle ScholarDigital LibraryDigital Library
  30. An. A. Muchnik and N. K. Vereshchagin, Logical operations and Kolmogorov complexity II. Proc. 16th Conf. Comput. Complexity, 2001, pp. 256--265. Google ScholarGoogle ScholarDigital LibraryDigital Library
  31. H .H. Otu and K. Sayood, Bioinformatics 19:6(2003), 2122--2130. A new sequence distance measure for phylogenetic tree construction.Google ScholarGoogle Scholar
  32. H. K. Pao and J. Case, Computing entropy for ortholog detection. Int'l Conf. Comput. Intell. Dec. 17--19, 2004, Istanbul Turkey.Google ScholarGoogle Scholar
  33. D. Parry, Use of Kolmogorov distance identification of web page authorship, topic and domain, Workshop on Open Source Web Inf. Retrieval, 2005.Google ScholarGoogle Scholar
  34. L. Ramshaw and M. Marcus, Text chunking using transformation-based learning, Proc. 3rd Workshop on Very Large Corpora, pp. 82--94, 1995.Google ScholarGoogle Scholar
  35. C. C. Santos, J. Bernardes, P. M. B. Vitányi, L. Antunes, Clustering fetal heart rate tracings by compression, Proc. 19th IEEE Intn'l Symp. Computer-Based Medical Systems, Salt Lake City, Utah, 22-23 June, 2006. Google ScholarGoogle ScholarDigital LibraryDigital Library
  36. A. K. Shen and N. K. Vereshchagin, Logical operations and Kolmogorov complexity, Theoret. Comput. Sci. 271(2002), 125--129. Google ScholarGoogle ScholarDigital LibraryDigital Library
  37. A. Siebes, Z. Struzik, Mining using patterns, in: D. Hand, N. Adams, R. Bolton (Eds.), Proceedings of the ESF Exploratory Workshop, London, 2002, p. 14.Google ScholarGoogle Scholar
  38. W. Taha, S. Crosby, and K. Swadi, A new approach to data mining for software design, manuscript, Rice Univ. 2006.Google ScholarGoogle Scholar
  39. P. N. Tan, V. Kumar, and J. Srivastava, Selecting the right interestingness measure for association patterns. SIGKDD'02, Edmonton, Alberta, Canada. pp. 32--44. Google ScholarGoogle ScholarDigital LibraryDigital Library
  40. Y. Tsuruoka and J. Tsujii, Bidirectional inference with the easiest-first strategy for tagging sequence data, Proc. HLT/EMNLP'05, Vancouver, October 2005, pp. 467--474. Google ScholarGoogle ScholarDigital LibraryDigital Library
  41. R. C. Veltkamp, Shape Matching: Similarity Measures and Algorithms, invited talk, Proc. Int 'l Conf. Shape Modeling Applications 2001, pp. 188--197, Italy. Google ScholarGoogle ScholarDigital LibraryDigital Library
  42. N. K. Vereshchagin and M. V. V'yugin, Independent minimum length programs to translate between given strings, Theoret. Comput. Sci., 271(2002), 131--143. Google ScholarGoogle ScholarDigital LibraryDigital Library
  43. M. V. V'yugin, Information distance and conditional complexities Theoret. Comput. Sci., 271(2002), 145--150. Google ScholarGoogle ScholarDigital LibraryDigital Library

Index Terms

  1. Information distance from a question to an answer

        Recommendations

        Comments

        Login options

        Check if you have access through your login credentials or your institution to get full access on this article.

        Sign in
        • Published in

          cover image ACM Conferences
          KDD '07: Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining
          August 2007
          1080 pages
          ISBN:9781595936097
          DOI:10.1145/1281192

          Copyright © 2007 ACM

          Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

          Publisher

          Association for Computing Machinery

          New York, NY, United States

          Publication History

          • Published: 12 August 2007

          Permissions

          Request permissions about this article.

          Request Permissions

          Check for updates

          Qualifiers

          • Article

          Acceptance Rates

          KDD '07 Paper Acceptance Rate111of573submissions,19%Overall Acceptance Rate1,133of8,635submissions,13%

          Upcoming Conference

          KDD '24

        PDF Format

        View or Download as a PDF file.

        PDF

        eReader

        View online with eReader.

        eReader