Article

Information distance from a question to an answer

Authors:
Xian Zhang

Tsinghua University

Tsinghua University
View Profile

,
Yu Hao

Tsinghua University

Tsinghua University
View Profile

,
Xiaoyan Zhu

Tsinghua University

Tsinghua University
View Profile

,
Ming Li

University of Waterloo

University of Waterloo
View Profile

,
David R. Cheriton

University of Waterloo

University of Waterloo
View Profile

KDD '07: Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data miningAugust 2007Pages 874–883https://doi.org/10.1145/1281192.1281285

Published:12 August 2007Publication History

KDD '07: Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining

Pages 874–883

ABSTRACT

We provide three key missing pieces of a general theory of information distance [3, 23, 24]. We take bold steps in formulating a revised theory to avoid some pitfalls in practical applications. The new theory is then used to construct a question answering system. Extensive experiments are conducted to justify the new theory.

Supplemental Material

p874-zhang-200.mov

mov

31.1 MB

Download

p874-zhang-768.mov

mov

103.6 MB

Download

References

C. Ané and M. J. Sanderson, Missing the Forest for the Trees: Phylogenetic Compression and Its Implications for Inferring Complex Evolutionary Histories, Systematic Biology, 54:1(2005), 146--157.Google ScholarCross Ref
T. Arbuchle, A. Balaban, D. K. Peters, M. Lawford, Software documents: comparison and measurement. To appear in SEKE 2007.Google Scholar
C. H. Bennett, P. Gacs, M. Li, P. Vitanyi, and W. Zurek, Information Distance, IEEE Trans. Inform. Theory, 44:4(July 1998), 1407--1423. (STOC'93) Google ScholarDigital Library
C. H. Bennett, M. Li and B. Ma, Chain letters and evolutionary histories, Scientific American, 288:6(June 2003) (feature article), 76--81.Google ScholarCross Ref
D. Benedetto, E. Caglioti, V. Loreto, Language trees and zipping, Phys. Rev. Lett., 88:4(2002), 048702.Google ScholarCross Ref
C. C. Chang and C. J. Lin, LIBSVM: a library for support vector machines, 2001, http://www.csie.ntu.edu.tw/cjlin/libsvm.Google Scholar
P. Cimiano, S. Staab, Learning by googling, ACM SIGKDD explorations newsletter, 6:2(2004), 24--33. Google ScholarDigital Library
X. Chen, B. Francia, M. Li, B. Mckinnon, A. Seker. Shared information and program plagiarism detection. IEEE Trans. Information Theory, 50:7(July 2004), 1545--1550. Google ScholarDigital Library
A. V. Chernov, An. A. Muchnik, A. E. Romashchenko, A. K. Shen, N. K. Vereshchagin, Upper semi-lattice of binary strings with the relation "x is simple conditional to y", Theoret. Comput. Sci., 271(2002), 69--95. Google ScholarDigital Library
R. Cilibrasi, P. M. B. Vitányi, and R. de Wolf Algorithmic clustring of music based on string compression. Comput. Music J., 28:4(2004), 49--67. Google ScholarDigital Library
R. Cilibrasi and P. M. B. Vitányi, The Google similarity distance, IEEE Trans. Knowledge and Data Engineering, 19:3(2007), 370--383. Google ScholarDigital Library
R. Cilibrasi and P. M. B. Vitányi, Clustering by compression, IEEE Trans. Inform. Theory, 51:4(2005), 1523--1545. Google ScholarDigital Library
C. Clarke, G. V. Cormack, G. Kemkes, M. Laszlo, T. R. Lynam, E. L. Terra, and P. L. Tilker, Statistical selection of exact answers (multitext experiments for TREC 2002). Report, University of Waterloo, 2002.Google Scholar
M. Cuturi and J. P. Vert, The context-tree kernel for strings. Neural Networks, 18:4(2005), 1111--1123. Google ScholarDigital Library
K. Emanuel, S. Ravela, E. Vivant, C. Risi, A combined statistical-deterministic approach of hurricane risk assessment, manuscript, Program in Atmospheres, Oceans, and Climate, MIT, 2005.Google Scholar
J. R. Finkel, T. Grenager and C. Manning, Incorporating non-local information into information extraction systems by Gibbs sampling, Proc. 43rd Annual Meeting of ACL, 2005, pp. 363--370. Google ScholarDigital Library
R. Fagin and L. Stockmeyer, Relaxing the triangle inequality in pattern matching. Int'l J. Comput. Vision, 28:3(1998), 219--231. Google ScholarDigital Library
E. J. Keogh, S. Lonardi, C. A. Ratanamahatana, Towards parameter-free data mining, in: Proceedings of KDD'2004, 2004, pp. 206--215. Google ScholarDigital Library
S. R. Kirk and S. Jenkins, Information theory-baed software metrics and obfuscation, J. Systems and Software, 72(2004), 179--186.Google ScholarCross Ref
A. Kraskov, H. Stögbauer, R. G. Andrzejak, and P. Grassberger, Hierarchical clustering using mutual information, Europhys. Lett. 70:2(2005), 278--284.Google ScholarCross Ref
A. Kocsor, A. Kertesz-Farkas, L. Kajan, S. Pongor, Application of compression-based distance measures to protein sequence classification: a methodology study, Bioinformatics, 22:4(2006), 407--412. Google ScholarDigital Library
N. Krasnogor, D. A. Pelta, Measuring the similarity of protein structures by means of the universal similarity metric. Bioinformatics 20:7(2004), 1015--1021. Google ScholarDigital Library
M. Li, J. Badger, X. Chen, S. Kwong, P. Kearney, H. Zhang, An nformation-based sequence distance and its application to whole mitochondrial genome phylogeny, Bioinformatics, 17:2(2001), 149--154.Google ScholarCross Ref
M. Li, X. Chen, X. Li, B. Ma, P. Vitányi, The similarity metric, IEEE Trans. Information Theory, 50:12(2004), 3250--3264. Google ScholarDigital Library
M. Li and P. Vitányi, An introduction to Kolmogorov complexity and its applications, Springer-Verlag, 2nd Edition 1997. Google ScholarDigital Library
J. Lin and B. Katz, Question answering from the web using knowledge annotation and knowledge mining techniques, Proc. 12th Int'l Conf. Information and Knowledge Management, 2003, pp. 116--123. Google ScholarDigital Library
J. Lin and B. Katz, Building a reusable test collection for question answering, Journal of the American Society for Information Science and Technology, 57:7(2006), 851--861. Google ScholarDigital Library
J. Lin. The web as a resource for question answering: perspectives and challenges. In 3rd Int'l Conf. Language Resources and Evolution. May, 2002.Google Scholar
An. A. Muchnik, Conditional comlexity and codes, Theoretical Computer Science 271:1(2002), 97--109. Google ScholarDigital Library
An. A. Muchnik and N. K. Vereshchagin, Logical operations and Kolmogorov complexity II. Proc. 16th Conf. Comput. Complexity, 2001, pp. 256--265. Google ScholarDigital Library
H .H. Otu and K. Sayood, Bioinformatics 19:6(2003), 2122--2130. A new sequence distance measure for phylogenetic tree construction.Google Scholar
H. K. Pao and J. Case, Computing entropy for ortholog detection. Int'l Conf. Comput. Intell. Dec. 17--19, 2004, Istanbul Turkey.Google Scholar
D. Parry, Use of Kolmogorov distance identification of web page authorship, topic and domain, Workshop on Open Source Web Inf. Retrieval, 2005.Google Scholar
L. Ramshaw and M. Marcus, Text chunking using transformation-based learning, Proc. 3rd Workshop on Very Large Corpora, pp. 82--94, 1995.Google Scholar
C. C. Santos, J. Bernardes, P. M. B. Vitányi, L. Antunes, Clustering fetal heart rate tracings by compression, Proc. 19th IEEE Intn'l Symp. Computer-Based Medical Systems, Salt Lake City, Utah, 22-23 June, 2006. Google ScholarDigital Library
A. K. Shen and N. K. Vereshchagin, Logical operations and Kolmogorov complexity, Theoret. Comput. Sci. 271(2002), 125--129. Google ScholarDigital Library
A. Siebes, Z. Struzik, Mining using patterns, in: D. Hand, N. Adams, R. Bolton (Eds.), Proceedings of the ESF Exploratory Workshop, London, 2002, p. 14.Google Scholar
W. Taha, S. Crosby, and K. Swadi, A new approach to data mining for software design, manuscript, Rice Univ. 2006.Google Scholar
P. N. Tan, V. Kumar, and J. Srivastava, Selecting the right interestingness measure for association patterns. SIGKDD'02, Edmonton, Alberta, Canada. pp. 32--44. Google ScholarDigital Library
Y. Tsuruoka and J. Tsujii, Bidirectional inference with the easiest-first strategy for tagging sequence data, Proc. HLT/EMNLP'05, Vancouver, October 2005, pp. 467--474. Google ScholarDigital Library
R. C. Veltkamp, Shape Matching: Similarity Measures and Algorithms, invited talk, Proc. Int 'l Conf. Shape Modeling Applications 2001, pp. 188--197, Italy. Google ScholarDigital Library
N. K. Vereshchagin and M. V. V'yugin, Independent minimum length programs to translate between given strings, Theoret. Comput. Sci., 271(2002), 131--143. Google ScholarDigital Library
M. V. V'yugin, Information distance and conditional complexities Theoret. Comput. Sci., 271(2002), 145--150. Google ScholarDigital Library

Index Terms

Information distance from a question to an answer
1. Information systems
  1. Information retrieval
    1. Retrieval tasks and goals
      1. Question answering
2. Theory of computation
  1. Logic
  2. Models of computation

Recommendations

New information distance measure and its application in question answering system

In a question answering (QA) system, the fundamental problem is how to measure the distance between a question and an answer, hence ranking different answers. We demonstrate that such a distance can be precisely and mathematically defined. Not only such ...
Read More
Nonapproximability of the normalized information distance

Normalized information distance (NID) uses the theoretical notion of Kolmogorov complexity, which for practical purposes is approximated by the length of the compressed version of the file involved, using a real-world compression program. This practical ...
Read More
Information Distance in Multiples

Information distance is a parameter-free similarity measure based on compression, used in pattern recognition, data mining, phylogeny, clustering and classification. The notion of information distance is extended from pairs to multiples (finite lists). ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

Published in
KDD '07: Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining
August 2007
1080 pages
ISBN:9781595936097
DOI:10.1145/1281192
General Chair:
Pavel Berkhin
Yahoo!, USA
,
Program Chairs:
Rich Caruana
Cornell University, USA
,
Xindong Wu
University of Vermont, USA
Copyright © 2007 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 12 August 2007
Permissions
Request permissions about this article.
Request Permissions

Check for updates
Author Tags
information distance
normalized information distance
question answering system
Qualifiers
- Article
Conference

Acceptance Rates
KDD '07 Paper Acceptance Rate111of573submissions,19%Overall Acceptance Rate1,133of8,635submissions,13%
More
Upcoming Conference
KDD '24

Sponsor:

sigkdd

sigkdd

The 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining

August 25 - 29, 2024

Barcelona , Spain
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 23
  Total Citations
  View Citations
- 1,649
  Total Downloads
- Downloads (Last 12 months)13
- Downloads (Last 6 weeks)2
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Information distance from a question to an answer

KDD '07: Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining

ABSTRACT

Supplemental Material

References

Cited By

Index Terms

Recommendations

New information distance measure and its application in question answering system

Nonapproximability of the normalized information distance

Information Distance in Multiples