skip to main content
article

Probabilistic information retrieval approach for ranking of database query results

Published:01 September 2006Publication History
Skip Abstract Section

Abstract

We investigate the problem of ranking the answers to a database query when many tuples are returned. In particular, we present methodologies to tackle the problem for conjunctive and range queries, by adapting and applying principles of probabilistic models from information retrieval for structured data. Our solution is domain independent and leverages data and workload statistics and correlations. We evaluate the quality of our approach with a user survey on a real database. Furthermore, we present and experimentally evaluate algorithms to efficiently retrieve the top ranked results, which demonstrate the feasibility of our ranking system.

References

  1. Agrawal, S., Chaudhuri, S., and Das, G. 2002. DBXplorer: A system for keyword based search over relational databases. In proceedings of ICDE. Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. Agrawal, S., Chaudhuri, S., Das, G., and Gionis, A. 2003. Automated ranking of database query results. In proceedings of CIDR.Google ScholarGoogle Scholar
  3. Amer-Yahia, S., Case, P., Roelleke, T., Shanmugasundaram, J., and Weikum. G. 2005a. Report on the DB/IR panel at SIGMOD 2005. ACM SIGMOD Rec. 34, 4, 71--74. Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. Amer-Yahia, S., Koudas, N., Marian, A., Srivastava, D., and Toman, D. 2005b. Structure and content scoring for XML. In proceedings of VLDB. Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. Agrawal, R., Mannila, H., Srikant, R., Toivonen, H., and Verkamo, A. I. 1995. Fast discovery of association rules. In proceedings of KDD.Google ScholarGoogle Scholar
  6. Barbara, D., Garcia-Molina, H., and Porter, D. 1992. The management of probabilistic data. IEEE Trans. Knoual. Data Eng. 4, 5, 487--502. Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. Bruno, N., Gravano, L., and Chaudhuri, S. 2002a. Top-k selection queries over relational databases: Mapping strategies and performance evaluation. ACM Trans. Database Syst. Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. Bruno, N., Gravano, L., and Marian, A. 2002b. Evaluating top-k queries over Web-accessible databases. In proceedings of ICDE. Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. Breese, J., Heckerman, D., and Kadie, C. 1998. Empirical analysis of predictive algorithms for collaborative filtering. In proceedings of the 14th Conference on Uncertainty in Artificial Intelligence. Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. Bhalotia, G., Nakhe, C., Hulgeri, A., Chakrabarti, S., and Sudarshan, S. 2002. Keyword searching and browsing in databases using BANKS. In Proceedings of ICDE. Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. Baeza-Yates, R. and Ribeiro-Neto, B. 1999. Modern Information Retrieval, 1st ed. Addison-Wesley, Reading, MA. Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. Cestnik, B. 1990. Estimating probabilities: A crucial task in machine learning. In Proceedings of the European Conference on artificial Intelligence.Google ScholarGoogle Scholar
  13. Cavallo, R. and Pittarelli, M. 1987. The theory of probabilistic databases. In Proceedings of VLDB. Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. Chaudhuri, S., Das, G., Hristidis, V., and Weikum, G. 2004. Probabilistic ranking of database query results. In Proceedings of VLDB. Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. Chinenyanga, T. T. and Kushmerick, N. 2002. An expressive and efficient language for XML information retrieval. J. Amer. Soc. Inform. Sci. Tech. 53, 6, 438--453. Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. Croft, W. B. and Lafferty, J. 2003. Language Modeling for Information Retrieval. Kluwer, Norwell, MA. Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. Carmel, D, Maarek, Y. S., Mandelbrod, M., Mass, Y., and Soffer, A. 2003. Searching XML documents via XML fragments. In Proceedings of SIGIR. Google ScholarGoogle ScholarDigital LibraryDigital Library
  18. Cohen, W. 1998a. Integration of heterogeneous databases without common domains using queries based on textual similarity. In Proceedings of SIGMOD. Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. Cohen, W. 1998b. Providing database-like access to the Web using queries based on textual similarity. In Proceedings of SIGMOD. Google ScholarGoogle ScholarDigital LibraryDigital Library
  20. Chakrabarti, K., Porkaew, K., and Mehrotra, S. 2000. Efficient query references in multimedia databases. In Proceedings of ICDE. Google ScholarGoogle ScholarDigital LibraryDigital Library
  21. Dalvi, N. N. and Suciu, D. 2005. Answering queries from statistics and probabilistic Views. In Proceedings of VLDB. Google ScholarGoogle ScholarDigital LibraryDigital Library
  22. Fagin, R. 1998. Fuzzy queries in multimedia database systems. In Proceedings of PODS. Google ScholarGoogle ScholarDigital LibraryDigital Library
  23. Fagin, R., Lotem, A., and Naor, M. 2001. Optimal aggregation algorithms for middleware. In Proceedings of PODS. Google ScholarGoogle ScholarDigital LibraryDigital Library
  24. Fuhr, N. 1990. A probabilistic framework for vague queries and imprecise information in databases. In Proceedings of VLDB. Google ScholarGoogle ScholarDigital LibraryDigital Library
  25. Fuhr, N. 1993. A probabilistic relational model for the integration of IR and databases. In Proceedings of ACM SIGIR Conference on Research and Development in Information Retrieval. Google ScholarGoogle ScholarDigital LibraryDigital Library
  26. Fuhr, N. and Grossjohann, K. 2004. XIRQL: An XML query language based on information retrieval concepts. ACM Trans. Inform. Syst. 22, 2, 313--356. Google ScholarGoogle ScholarDigital LibraryDigital Library
  27. Fuhr, N. and Roelleke, T. 1997. A probabilistic relational algebra for the integration of information retrieval and database systems. ACM Trans. Inform. Syst. 15, 1, 32--66. Google ScholarGoogle ScholarDigital LibraryDigital Library
  28. Fuhr, N. and Roelleke, T. 1998. HySpirit---a probabilistic inference engine for hypermedia retrieval in large databases. In Proceedings of EDBT. Google ScholarGoogle ScholarDigital LibraryDigital Library
  29. Grossman, D. and Frieder, O. 2004. Information Retrieval---Algorithms and Heuristics. Springer, Berlin, Germany. Google ScholarGoogle ScholarDigital LibraryDigital Library
  30. Güntzer, U., Balke, W.-T., and Kiessling, W. 2000. Optimizing multi-feature queries for image databases. In Proceedings of VLDB. Google ScholarGoogle ScholarDigital LibraryDigital Library
  31. Guo, L., Shao, F., Botev, C., and Shanmugasundaram. J. 2003. XRANK: Ranked keyword search over XML documents. In Proceedings of SIGMOD. Google ScholarGoogle ScholarDigital LibraryDigital Library
  32. Harper, D. and Van Rijsbergen, C. J. 1978. An evaluation of feedback in document retrieval using co-occurrence data. J. Document. 34, 3, 189--216.Google ScholarGoogle ScholarCross RefCross Ref
  33. Hristidis, V. and Papakonstantinou, Y. 2002. DISCOVER: Keyword search in relational databases. In Proceedings of VLDB. Google ScholarGoogle ScholarDigital LibraryDigital Library
  34. Hristidis, V., Gravano, L., and Papakonstantinou, Y. 2003a. Efficient IR-style keyword search over relational databases. In Proceedings of VLDB. Google ScholarGoogle ScholarDigital LibraryDigital Library
  35. Hristidis, V., Papakonstantinou, Y., and Balmin, A. 2003b. Keyword proximity search on XML graphs. In Proceedings of ICDE.Google ScholarGoogle Scholar
  36. Jagadish, H. V., Poosala, V., Koudas, N., Sevcik, K., Muthukrishnan, S., and Suel, T. 1998. Optimal histograms with quality guarantees. In Proceedings of VLDB. Google ScholarGoogle ScholarDigital LibraryDigital Library
  37. Kiessling, W. 2002. Foundations of preferences in database systems. In Proceedings of VLDB. Google ScholarGoogle ScholarDigital LibraryDigital Library
  38. Lakshmanan, L. V. S., Leone, N., Ross, R., and Subrahmanian, V. S. 1997. ProbView: A flexible probabilistic database system. ACM Trans. Database Syst. 22, 3, 419--469. Google ScholarGoogle ScholarDigital LibraryDigital Library
  39. Lalmas, M. and Roelleke, T. 2004. Modeling vague content and structure querying in XML retrieval with a probabilistic object-relational framework. In Proceedings of FQAS.Google ScholarGoogle Scholar
  40. Martinez, W., Martinez, A., and Wegman, E. 2004. Document classification and clustering using weighted text proximity matrices. In Proceedings of Interface.Google ScholarGoogle Scholar
  41. Motro, A. 1988. VAGUE: A user interface to relational databases that permits vague queries. ACM Trans. Informat. Syst. 6, 3 (July), 187--214. Google ScholarGoogle ScholarDigital LibraryDigital Library
  42. Nazeri, Z., Bloedorn, E., and Ostwald, P. 2001. Experiences in mining aviation safety data. In Proceedings of SIGMOD. Google ScholarGoogle ScholarDigital LibraryDigital Library
  43. Nepal, S. and Ramakrishna, M. V. 1999. Query processing issues in image (multimedia) databases. In Proceedings of ICDE. Google ScholarGoogle ScholarDigital LibraryDigital Library
  44. Ortega-Binderberger, M., Chakrabarti, K., and Mehrotra, S. 2002. An approach to integrating query refinement in SQL. In Proceedings of EDBT. 15--33. Google ScholarGoogle ScholarDigital LibraryDigital Library
  45. Poosala, V., Ioannidis, Y. E., Haas, P. J., and Shekita, E. J. 1996. Improved histograms for selectivity estimation of range predicates. In Proceedings of SIGMOD. 294--305. Google ScholarGoogle ScholarDigital LibraryDigital Library
  46. Radlinski, F. and Joachims, T. 2005. Query chains: Learning to rank from implicit feedback. In Proceedings of KDD. Google ScholarGoogle ScholarDigital LibraryDigital Library
  47. Rui, Y., Huang, T. S., and Mehrotra, S. 1997. Content-based image retrieval with relevance feedback in MARS. In Proceedings of the IEEE Conference on Image Processing.Google ScholarGoogle Scholar
  48. Shen, X., Tan, B. and Zhai, C. 2005. Context-sensitive information retrieval using implicit feedback. In Proceedings of SIGIR. Google ScholarGoogle ScholarDigital LibraryDigital Library
  49. Sparck Jones, K., Walker, S., and Robertson, S. E. 2000a. A probabilistic model of information retrieval: Development and comparative experiments---Part 1. Inf. Process. Man. 36, 6, 779--808. Google ScholarGoogle ScholarDigital LibraryDigital Library
  50. Sparck Jones, K., Walker, S., and Robertson, S. E. 2000a. A probabilistic model of information retrieval: Development and comparative experiments---Part 2. Inf. Process. Man. 36, 6, 809--840. Google ScholarGoogle ScholarDigital LibraryDigital Library
  51. Theobald, A. and Weikum, G. 2002. The index-based XXL search engine for querying XML data with relevance ranking. In Proceedings of EDBT. Google ScholarGoogle ScholarDigital LibraryDigital Library
  52. Theobald, M., Schenkel, R., and Weikum, G. 2005. An efficient and versatile query engine for topX search. In Proceedings of VLDB. Google ScholarGoogle ScholarDigital LibraryDigital Library
  53. Wu, L., Faloutsos, C., Sycara, K., and Payne, T. 2000. FALCON: Feedback adaptive loop for content-based retrieval. In Proceedings of VLDB. Google ScholarGoogle ScholarDigital LibraryDigital Library
  54. Whittaker, J. 1990. Graphical Models in Applied Multivariate Statistics. Wiley, New York, NY.Google ScholarGoogle Scholar
  55. Widom, J. 2005. Trio: A system for integrated management of data, accuracy, and lineage. CIDR.Google ScholarGoogle Scholar
  56. Wimmers, L., Haas, L. M., Roth, M. T., and Braendli, C. 1999. Using Fagin's algorithm for merging ranked results in multimedia middleware. In Proceedings of CoopIS. Google ScholarGoogle ScholarDigital LibraryDigital Library
  57. Xu, J. and Croft, W. B. 1996. Query expansion using local and global document analysis, In Proceedings of SIGIR. 4--11. Google ScholarGoogle ScholarDigital LibraryDigital Library
  58. Yu, C.T. and Meng, W. 1998. Principles of Database Query Processing for Advanced Applications. Morgan Kaufmann, San Francisco, CA. Google ScholarGoogle ScholarDigital LibraryDigital Library

Index Terms

  1. Probabilistic information retrieval approach for ranking of database query results

      Recommendations

      Reviews

      Maria Bielikova

      When many tuples are returned to a database query, ranking them can be a problem. This paper focuses on the problem of ranking answers to such queries. The problem is similar to the problem of information retrieval, where we operate on documents that are ranked according to a user query. The authors propose a domain-independent automated ranking approach that leverages the structure of the data, database workload statistics, and correlations present in the structured data. A proposal is evaluated experimentally in terms of the quality of results produced and performance. The proposed ranking function depends on a global score that captures the global importance of unspecified attribute values, and on a conditional score that captures the strengths of correlations between specified and unspecified attribute values. Correlations are automatically estimated via workload as well as data analysis. The ranking functions are based on probabilistic information retrieval ranking models. The novelty of this approach is in the use of probabilistic techniques in a systematic way, and in considering dependencies and correlations between data values. I see space for improvement by combining this approach with methods for the automatic discovery of user characteristics and interests, since the authors currently create profiles (that are, in fact, user stereotypes) manually. The paper is well written and properly organized. The intended audience of the paper is researchers in the area of information retrieval, especially people within the database management community. The paper could also be useful for researchers in general Web information retrieval, since some ideas of the proposed approach seem to be suitable for document retrievals that are less structured than those associated with a relational database.

      Access critical reviews of Computing literature here

      Become a reviewer for Computing Reviews.

      Comments

      Login options

      Check if you have access through your login credentials or your institution to get full access on this article.

      Sign in

      Full Access

      PDF Format

      View or Download as a PDF file.

      PDF

      eReader

      View online with eReader.

      eReader