ABSTRACT
We provide three key missing pieces of a general theory of information distance [3, 23, 24]. We take bold steps in formulating a revised theory to avoid some pitfalls in practical applications. The new theory is then used to construct a question answering system. Extensive experiments are conducted to justify the new theory.
Supplemental Material
- C. Ané and M. J. Sanderson, Missing the Forest for the Trees: Phylogenetic Compression and Its Implications for Inferring Complex Evolutionary Histories, Systematic Biology, 54:1(2005), 146--157.Google ScholarCross Ref
- T. Arbuchle, A. Balaban, D. K. Peters, M. Lawford, Software documents: comparison and measurement. To appear in SEKE 2007.Google Scholar
- C. H. Bennett, P. Gacs, M. Li, P. Vitanyi, and W. Zurek, Information Distance, IEEE Trans. Inform. Theory, 44:4(July 1998), 1407--1423. (STOC'93) Google ScholarDigital Library
- C. H. Bennett, M. Li and B. Ma, Chain letters and evolutionary histories, Scientific American, 288:6(June 2003) (feature article), 76--81.Google ScholarCross Ref
- D. Benedetto, E. Caglioti, V. Loreto, Language trees and zipping, Phys. Rev. Lett., 88:4(2002), 048702.Google ScholarCross Ref
- C. C. Chang and C. J. Lin, LIBSVM: a library for support vector machines, 2001, http://www.csie.ntu.edu.tw/cjlin/libsvm.Google Scholar
- P. Cimiano, S. Staab, Learning by googling, ACM SIGKDD explorations newsletter, 6:2(2004), 24--33. Google ScholarDigital Library
- X. Chen, B. Francia, M. Li, B. Mckinnon, A. Seker. Shared information and program plagiarism detection. IEEE Trans. Information Theory, 50:7(July 2004), 1545--1550. Google ScholarDigital Library
- A. V. Chernov, An. A. Muchnik, A. E. Romashchenko, A. K. Shen, N. K. Vereshchagin, Upper semi-lattice of binary strings with the relation "x is simple conditional to y", Theoret. Comput. Sci., 271(2002), 69--95. Google ScholarDigital Library
- R. Cilibrasi, P. M. B. Vitányi, and R. de Wolf Algorithmic clustring of music based on string compression. Comput. Music J., 28:4(2004), 49--67. Google ScholarDigital Library
- R. Cilibrasi and P. M. B. Vitányi, The Google similarity distance, IEEE Trans. Knowledge and Data Engineering, 19:3(2007), 370--383. Google ScholarDigital Library
- R. Cilibrasi and P. M. B. Vitányi, Clustering by compression, IEEE Trans. Inform. Theory, 51:4(2005), 1523--1545. Google ScholarDigital Library
- C. Clarke, G. V. Cormack, G. Kemkes, M. Laszlo, T. R. Lynam, E. L. Terra, and P. L. Tilker, Statistical selection of exact answers (multitext experiments for TREC 2002). Report, University of Waterloo, 2002.Google Scholar
- M. Cuturi and J. P. Vert, The context-tree kernel for strings. Neural Networks, 18:4(2005), 1111--1123. Google ScholarDigital Library
- K. Emanuel, S. Ravela, E. Vivant, C. Risi, A combined statistical-deterministic approach of hurricane risk assessment, manuscript, Program in Atmospheres, Oceans, and Climate, MIT, 2005.Google Scholar
- J. R. Finkel, T. Grenager and C. Manning, Incorporating non-local information into information extraction systems by Gibbs sampling, Proc. 43rd Annual Meeting of ACL, 2005, pp. 363--370. Google ScholarDigital Library
- R. Fagin and L. Stockmeyer, Relaxing the triangle inequality in pattern matching. Int'l J. Comput. Vision, 28:3(1998), 219--231. Google ScholarDigital Library
- E. J. Keogh, S. Lonardi, C. A. Ratanamahatana, Towards parameter-free data mining, in: Proceedings of KDD'2004, 2004, pp. 206--215. Google ScholarDigital Library
- S. R. Kirk and S. Jenkins, Information theory-baed software metrics and obfuscation, J. Systems and Software, 72(2004), 179--186.Google ScholarCross Ref
- A. Kraskov, H. Stögbauer, R. G. Andrzejak, and P. Grassberger, Hierarchical clustering using mutual information, Europhys. Lett. 70:2(2005), 278--284.Google ScholarCross Ref
- A. Kocsor, A. Kertesz-Farkas, L. Kajan, S. Pongor, Application of compression-based distance measures to protein sequence classification: a methodology study, Bioinformatics, 22:4(2006), 407--412. Google ScholarDigital Library
- N. Krasnogor, D. A. Pelta, Measuring the similarity of protein structures by means of the universal similarity metric. Bioinformatics 20:7(2004), 1015--1021. Google ScholarDigital Library
- M. Li, J. Badger, X. Chen, S. Kwong, P. Kearney, H. Zhang, An nformation-based sequence distance and its application to whole mitochondrial genome phylogeny, Bioinformatics, 17:2(2001), 149--154.Google ScholarCross Ref
- M. Li, X. Chen, X. Li, B. Ma, P. Vitányi, The similarity metric, IEEE Trans. Information Theory, 50:12(2004), 3250--3264. Google ScholarDigital Library
- M. Li and P. Vitányi, An introduction to Kolmogorov complexity and its applications, Springer-Verlag, 2nd Edition 1997. Google ScholarDigital Library
- J. Lin and B. Katz, Question answering from the web using knowledge annotation and knowledge mining techniques, Proc. 12th Int'l Conf. Information and Knowledge Management, 2003, pp. 116--123. Google ScholarDigital Library
- J. Lin and B. Katz, Building a reusable test collection for question answering, Journal of the American Society for Information Science and Technology, 57:7(2006), 851--861. Google ScholarDigital Library
- J. Lin. The web as a resource for question answering: perspectives and challenges. In 3rd Int'l Conf. Language Resources and Evolution. May, 2002.Google Scholar
- An. A. Muchnik, Conditional comlexity and codes, Theoretical Computer Science 271:1(2002), 97--109. Google ScholarDigital Library
- An. A. Muchnik and N. K. Vereshchagin, Logical operations and Kolmogorov complexity II. Proc. 16th Conf. Comput. Complexity, 2001, pp. 256--265. Google ScholarDigital Library
- H .H. Otu and K. Sayood, Bioinformatics 19:6(2003), 2122--2130. A new sequence distance measure for phylogenetic tree construction.Google Scholar
- H. K. Pao and J. Case, Computing entropy for ortholog detection. Int'l Conf. Comput. Intell. Dec. 17--19, 2004, Istanbul Turkey.Google Scholar
- D. Parry, Use of Kolmogorov distance identification of web page authorship, topic and domain, Workshop on Open Source Web Inf. Retrieval, 2005.Google Scholar
- L. Ramshaw and M. Marcus, Text chunking using transformation-based learning, Proc. 3rd Workshop on Very Large Corpora, pp. 82--94, 1995.Google Scholar
- C. C. Santos, J. Bernardes, P. M. B. Vitányi, L. Antunes, Clustering fetal heart rate tracings by compression, Proc. 19th IEEE Intn'l Symp. Computer-Based Medical Systems, Salt Lake City, Utah, 22-23 June, 2006. Google ScholarDigital Library
- A. K. Shen and N. K. Vereshchagin, Logical operations and Kolmogorov complexity, Theoret. Comput. Sci. 271(2002), 125--129. Google ScholarDigital Library
- A. Siebes, Z. Struzik, Mining using patterns, in: D. Hand, N. Adams, R. Bolton (Eds.), Proceedings of the ESF Exploratory Workshop, London, 2002, p. 14.Google Scholar
- W. Taha, S. Crosby, and K. Swadi, A new approach to data mining for software design, manuscript, Rice Univ. 2006.Google Scholar
- P. N. Tan, V. Kumar, and J. Srivastava, Selecting the right interestingness measure for association patterns. SIGKDD'02, Edmonton, Alberta, Canada. pp. 32--44. Google ScholarDigital Library
- Y. Tsuruoka and J. Tsujii, Bidirectional inference with the easiest-first strategy for tagging sequence data, Proc. HLT/EMNLP'05, Vancouver, October 2005, pp. 467--474. Google ScholarDigital Library
- R. C. Veltkamp, Shape Matching: Similarity Measures and Algorithms, invited talk, Proc. Int 'l Conf. Shape Modeling Applications 2001, pp. 188--197, Italy. Google ScholarDigital Library
- N. K. Vereshchagin and M. V. V'yugin, Independent minimum length programs to translate between given strings, Theoret. Comput. Sci., 271(2002), 131--143. Google ScholarDigital Library
- M. V. V'yugin, Information distance and conditional complexities Theoret. Comput. Sci., 271(2002), 145--150. Google ScholarDigital Library
Index Terms
- Information distance from a question to an answer
Recommendations
New information distance measure and its application in question answering system
In a question answering (QA) system, the fundamental problem is how to measure the distance between a question and an answer, hence ranking different answers. We demonstrate that such a distance can be precisely and mathematically defined. Not only such ...
Nonapproximability of the normalized information distance
Normalized information distance (NID) uses the theoretical notion of Kolmogorov complexity, which for practical purposes is approximated by the length of the compressed version of the file involved, using a real-world compression program. This practical ...
Information Distance in Multiples
Information distance is a parameter-free similarity measure based on compression, used in pattern recognition, data mining, phylogeny, clustering and classification. The notion of information distance is extended from pairs to multiples (finite lists). ...
Comments