ABSTRACT
Since the mid-90s there has been a widely-held belief that signature files are inferior to inverted files for text indexing. In recent years the Bing search engine has developed and deployed an index based on bit-sliced signatures. This index, known as BitFunnel, replaced an existing production system based on an inverted index. The driving factor behind the shift away from the inverted index was operational cost savings. This paper describes algorithmic innovations and changes in the cloud computing landscape that led us to reconsider and eventually field a technology that was once considered unusable. The BitFunnel algorithm directly addresses four fundamental limitations in bit-sliced block signatures. At the same time, our mapping of the algorithm onto a cluster offers opportunities to avoid other costs associated with signatures. We show these innovations yield a significant efficiency gain versus classic bit-sliced signatures and then compare BitFunnel with Partitioned Elias-Fano Indexes, MG4J, and Lucene.
- Burton H. Bloom. 1970. Space/time trade-offs in hash coding with allowable errors. Commun. ACM 13, 7 (1970), 422--426. Google ScholarDigital Library
- Sergey Brin and Lawrence Page. 1998. The Anatomy of a Large-Scale Hypertextual Web Search Engine. Computer Networks 30, 1-7 (1998), 107--117.Google ScholarDigital Library
- Jehoshua Bruck, Jie Gao, and Anxiao Jiang. 2006. Weighted bloom filter. In 2006 IEEE International Symposium on Information Theory. IEEE. Google ScholarCross Ref
- Stefan Büttcher, Charles L. A. Clarke, and Gordon V. Cormack. 2016. Information retrieval: Implementing and evaluating search engines. Mit Press.Google Scholar
- Berkant Barla Cambazoglu and Ricardo A. Baeza-Yates. 2015. Scalability Challenges in Web Search Engines. Morgan & Claypool Publishers.Google ScholarDigital Library
- J. Shane Culpepper and Alistair Moffat. 2010. Efficient set intersection for inverted indexing. ACM Transactions on Information Systems (TOIS) 29, 1 (2010), 1.Google ScholarDigital Library
- Bolin Ding and Arnd Christian König. 2011. Fast set intersection in memory. Proceedings of the VLDB Endowment 4, 4 (2011), 255--266. Google ScholarDigital Library
- Chris Faloutsos. 1985. Access methods for text. ACM Computing Surveys (CSUR) 17, 1 (1985), 49--74. Google ScholarDigital Library
- Christos Faloutsos. 1992. Information retrieval: data structures and algorithms. Prentice Hall PTR, 44--65.Google ScholarDigital Library
- Christos Faloutsos and Stavros Christodoulakis. 1985. Design of a Signature File Method that Accounts for Non-Uniform Occurrence and Query Frequencies. In VLDB. 165--170.Google Scholar
- Edward Fox, Donna Harman, w. Lee, and Ricardo Baeza-Yates. 1992. Information retrieval: data structures and algorithms. Prentice Hall PTR, 28--43.Google Scholar
- Shlomo Geva and Christopher M. De Vries. 2011. Topsig: Topology preserving document signatures. In Proceedings of the 20th ACM international conference on Information and knowledge management. ACM, 333--338. Google ScholarDigital Library
- Andrew Kane and Frank Wm. Tompa. 2014. Skewed partial bitvectors for list intersection. In Proceedings of the 37th international ACM SIGIR conference on Research & development in information retrieval. ACM, 263--272.Google ScholarDigital Library
- A. Kent, Ron Sacks-Davis, and Kotagiri Ramamohanarao. 1990. A signature file scheme based on multiple organizations for indexing very large text databases. Journal of the American Society for Information Science 41, 7 (1990), 508.Google ScholarCross Ref
- Donald E. Knuth. 1998. The Art of Computer Programming, Vol. 3, Sorting and Searching (2nd ed.). Vol. 3. Addison-Wesley, 567--573.Google ScholarDigital Library
- Roberto Konow, Gonzalo Navarro, Charles L. A. Clarke, and Alejandro López-Ortíz. 2013. Faster and smaller inverted indices with treaps. In Proceedings of the 36th international ACM SIGIR conference on Research and development in information retrieval. ACM, 193--202. Google ScholarDigital Library
- Daniel Lemire and Leonid Boytsov. 2015. Decoding billions of integers per second through vectorization. Software: Practice and Experience 45, 1 (2015), 1--29. Google ScholarDigital Library
- Jimmy Lin, Matt Crane, Andrew Trotman, Jamie Callan, Ishan Chattopadhyaya, John Foley, Grant Ingersoll, Craig Macdonald, and Sebastiano Vigna. 2016. Toward reproducible baselines: The open-source ir reproducibility challenge. In European Conference on Information Retrieval. Springer, 408--420. Google ScholarCross Ref
- Sergey Melnik, Sriram Raghavan, Beverly Yang, and Hector Garcia-Molina. 2001. Building a distributed full-text index for the Web. In Proceedings of the Tenth International World Wide Web Conference, WWW 10, Hong Kong, China, May 1-5, 2001. 396--406. Google ScholarDigital Library
- Alistair Moffat and Justin Zobel. 1996. Self-indexing inverted files for fast text retrieval. ACM Transactions on Information Systems (TOIS) 14, 4 (1996), 349--379. Google ScholarDigital Library
- Calvin N Mooers. 1948. Application of random codes to the gathering of statistical information. Ph.D. Dissertation. Massachusetts Institute of Technology.Google Scholar
- Calvin N. Mooers. 1951. Zatocoding applied to mechanical organization of knowledge. American documentation 2, 1 (1951), 20--32. Google ScholarCross Ref
- Giuseppe Ottaviano and Rossano Venturini. 2014. Partitioned elias-fano indexes. In Proceedings of the 37th international ACM SIGIR conference on Research & development in information retrieval. ACM, 273--282. Google ScholarDigital Library
- Knut Magne Risvik, Trishul M. Chilimbi, Henry Tan, Karthik Kalyanaraman, and Chris Anderson. 2013. Maguro, a system for indexing and searching over very large text collections. In Sixth ACM International Conference on Web Search and Data Mining, WSDM 2013, Rome, Italy, February 4-8, 2013. 727--736. Google ScholarDigital Library
- Charles S. Roberts. 1979. Partial-match retrieval via the method of superimposed codes. Proc. IEEE 67, 12 (1979), 1624--1642. Google ScholarCross Ref
- Ron Sacks-Davis, A. Kent, and Kotagiri Ramamohanarao. 1987. Multikey access methods based on superimposed coding techniques. ACM Transactions on Database Systems (TODS) 12, 4 (1987), 655--696.Google ScholarDigital Library
- Harry K. T. Wong, Hsiu-Fen Liu, Frank Olken, Doron Rotem, and Linda Wong. 1985. Bit Transposed Files.. In VLDB, Vol. 85. Citeseer, 448--457.Google Scholar
- Justin Zobel, Alistair Moffat, and Kotagiri Ramamohanarao. 1998. Inverted files versus signature files for text indexing. ACM Transactions on Database Systems (TODS) 23, 4 (1998), 453--490 Google ScholarDigital Library
Index Terms
- BitFunnel: Revisiting Signatures for Search
Recommendations
Inverted indexes vs. bitmap indexes in decision support systems
CIKM '09: Proceedings of the 18th ACM conference on Information and knowledge managementBitmap indexes are widely used in Decision Support Systems (DSSs) to improve query performance. In this paper, we evaluate the use of compressed inverted indexes with adapted query processing strategies from Information Retrieval as an alternative. In a ...
A Hybrid BitFunnel and Partitioned Elias-Fano Inverted Index
WWW '19: The World Wide Web ConferenceSearch engines encounter a time vs. space trade-off: search responsiveness (i.e., a short query response time) comes at the cost of increased index storage. We propose a hybrid method which uses both (a) the recently published mapping-matrix-style index ...
TOPSIG: topology preserving document signatures
CIKM '11: Proceedings of the 20th ACM international conference on Information and knowledge managementComparisons between file signatures and inverted files for text retrieval have shown the shortcomings of traditional file signatures. It has been widely accepted that traditional file signatures are inferior alternatives to inverted files. This paper ...
Comments