skip to main content
10.1145/3077136.3080789acmconferencesArticle/Chapter ViewAbstractPublication PagesirConference Proceedingsconference-collections
research-article
Open Access
Best Paper

BitFunnel: Revisiting Signatures for Search

Published:07 August 2017Publication History

ABSTRACT

Since the mid-90s there has been a widely-held belief that signature files are inferior to inverted files for text indexing. In recent years the Bing search engine has developed and deployed an index based on bit-sliced signatures. This index, known as BitFunnel, replaced an existing production system based on an inverted index. The driving factor behind the shift away from the inverted index was operational cost savings. This paper describes algorithmic innovations and changes in the cloud computing landscape that led us to reconsider and eventually field a technology that was once considered unusable. The BitFunnel algorithm directly addresses four fundamental limitations in bit-sliced block signatures. At the same time, our mapping of the algorithm onto a cluster offers opportunities to avoid other costs associated with signatures. We show these innovations yield a significant efficiency gain versus classic bit-sliced signatures and then compare BitFunnel with Partitioned Elias-Fano Indexes, MG4J, and Lucene.

References

  1. Burton H. Bloom. 1970. Space/time trade-offs in hash coding with allowable errors. Commun. ACM 13, 7 (1970), 422--426. Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. Sergey Brin and Lawrence Page. 1998. The Anatomy of a Large-Scale Hypertextual Web Search Engine. Computer Networks 30, 1-7 (1998), 107--117.Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. Jehoshua Bruck, Jie Gao, and Anxiao Jiang. 2006. Weighted bloom filter. In 2006 IEEE International Symposium on Information Theory. IEEE. Google ScholarGoogle ScholarCross RefCross Ref
  4. Stefan Büttcher, Charles L. A. Clarke, and Gordon V. Cormack. 2016. Information retrieval: Implementing and evaluating search engines. Mit Press.Google ScholarGoogle Scholar
  5. Berkant Barla Cambazoglu and Ricardo A. Baeza-Yates. 2015. Scalability Challenges in Web Search Engines. Morgan & Claypool Publishers.Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. J. Shane Culpepper and Alistair Moffat. 2010. Efficient set intersection for inverted indexing. ACM Transactions on Information Systems (TOIS) 29, 1 (2010), 1.Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. Bolin Ding and Arnd Christian König. 2011. Fast set intersection in memory. Proceedings of the VLDB Endowment 4, 4 (2011), 255--266. Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. Chris Faloutsos. 1985. Access methods for text. ACM Computing Surveys (CSUR) 17, 1 (1985), 49--74. Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. Christos Faloutsos. 1992. Information retrieval: data structures and algorithms. Prentice Hall PTR, 44--65.Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. Christos Faloutsos and Stavros Christodoulakis. 1985. Design of a Signature File Method that Accounts for Non-Uniform Occurrence and Query Frequencies. In VLDB. 165--170.Google ScholarGoogle Scholar
  11. Edward Fox, Donna Harman, w. Lee, and Ricardo Baeza-Yates. 1992. Information retrieval: data structures and algorithms. Prentice Hall PTR, 28--43.Google ScholarGoogle Scholar
  12. Shlomo Geva and Christopher M. De Vries. 2011. Topsig: Topology preserving document signatures. In Proceedings of the 20th ACM international conference on Information and knowledge management. ACM, 333--338. Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. Andrew Kane and Frank Wm. Tompa. 2014. Skewed partial bitvectors for list intersection. In Proceedings of the 37th international ACM SIGIR conference on Research & development in information retrieval. ACM, 263--272.Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. A. Kent, Ron Sacks-Davis, and Kotagiri Ramamohanarao. 1990. A signature file scheme based on multiple organizations for indexing very large text databases. Journal of the American Society for Information Science 41, 7 (1990), 508.Google ScholarGoogle ScholarCross RefCross Ref
  15. Donald E. Knuth. 1998. The Art of Computer Programming, Vol. 3, Sorting and Searching (2nd ed.). Vol. 3. Addison-Wesley, 567--573.Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. Roberto Konow, Gonzalo Navarro, Charles L. A. Clarke, and Alejandro López-Ortíz. 2013. Faster and smaller inverted indices with treaps. In Proceedings of the 36th international ACM SIGIR conference on Research and development in information retrieval. ACM, 193--202. Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. Daniel Lemire and Leonid Boytsov. 2015. Decoding billions of integers per second through vectorization. Software: Practice and Experience 45, 1 (2015), 1--29. Google ScholarGoogle ScholarDigital LibraryDigital Library
  18. Jimmy Lin, Matt Crane, Andrew Trotman, Jamie Callan, Ishan Chattopadhyaya, John Foley, Grant Ingersoll, Craig Macdonald, and Sebastiano Vigna. 2016. Toward reproducible baselines: The open-source ir reproducibility challenge. In European Conference on Information Retrieval. Springer, 408--420. Google ScholarGoogle ScholarCross RefCross Ref
  19. Sergey Melnik, Sriram Raghavan, Beverly Yang, and Hector Garcia-Molina. 2001. Building a distributed full-text index for the Web. In Proceedings of the Tenth International World Wide Web Conference, WWW 10, Hong Kong, China, May 1-5, 2001. 396--406. Google ScholarGoogle ScholarDigital LibraryDigital Library
  20. Alistair Moffat and Justin Zobel. 1996. Self-indexing inverted files for fast text retrieval. ACM Transactions on Information Systems (TOIS) 14, 4 (1996), 349--379. Google ScholarGoogle ScholarDigital LibraryDigital Library
  21. Calvin N Mooers. 1948. Application of random codes to the gathering of statistical information. Ph.D. Dissertation. Massachusetts Institute of Technology.Google ScholarGoogle Scholar
  22. Calvin N. Mooers. 1951. Zatocoding applied to mechanical organization of knowledge. American documentation 2, 1 (1951), 20--32. Google ScholarGoogle ScholarCross RefCross Ref
  23. Giuseppe Ottaviano and Rossano Venturini. 2014. Partitioned elias-fano indexes. In Proceedings of the 37th international ACM SIGIR conference on Research & development in information retrieval. ACM, 273--282. Google ScholarGoogle ScholarDigital LibraryDigital Library
  24. Knut Magne Risvik, Trishul M. Chilimbi, Henry Tan, Karthik Kalyanaraman, and Chris Anderson. 2013. Maguro, a system for indexing and searching over very large text collections. In Sixth ACM International Conference on Web Search and Data Mining, WSDM 2013, Rome, Italy, February 4-8, 2013. 727--736. Google ScholarGoogle ScholarDigital LibraryDigital Library
  25. Charles S. Roberts. 1979. Partial-match retrieval via the method of superimposed codes. Proc. IEEE 67, 12 (1979), 1624--1642. Google ScholarGoogle ScholarCross RefCross Ref
  26. Ron Sacks-Davis, A. Kent, and Kotagiri Ramamohanarao. 1987. Multikey access methods based on superimposed coding techniques. ACM Transactions on Database Systems (TODS) 12, 4 (1987), 655--696.Google ScholarGoogle ScholarDigital LibraryDigital Library
  27. Harry K. T. Wong, Hsiu-Fen Liu, Frank Olken, Doron Rotem, and Linda Wong. 1985. Bit Transposed Files.. In VLDB, Vol. 85. Citeseer, 448--457.Google ScholarGoogle Scholar
  28. Justin Zobel, Alistair Moffat, and Kotagiri Ramamohanarao. 1998. Inverted files versus signature files for text indexing. ACM Transactions on Database Systems (TODS) 23, 4 (1998), 453--490 Google ScholarGoogle ScholarDigital LibraryDigital Library

Index Terms

  1. BitFunnel: Revisiting Signatures for Search

          Recommendations

          Comments

          Login options

          Check if you have access through your login credentials or your institution to get full access on this article.

          Sign in
          • Published in

            cover image ACM Conferences
            SIGIR '17: Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval
            August 2017
            1476 pages
            ISBN:9781450350228
            DOI:10.1145/3077136

            Copyright © 2017 ACM

            Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

            Publisher

            Association for Computing Machinery

            New York, NY, United States

            Publication History

            • Published: 7 August 2017

            Permissions

            Request permissions about this article.

            Request Permissions

            Check for updates

            Qualifiers

            • research-article

            Acceptance Rates

            SIGIR '17 Paper Acceptance Rate78of362submissions,22%Overall Acceptance Rate792of3,983submissions,20%

          PDF Format

          View or Download as a PDF file.

          PDF

          eReader

          View online with eReader.

          eReader