ABSTRACT
Prior research on resource selection for federated search has focused mainly on selecting a small number of information sources that are most relevant to a user query. Result novelty and diversification, however, remain largely unexplored, so existing methods do not reflect the varied information needs of users in real-world applications.
This paper proposes two general approaches that model both result relevance and diversification when selecting sources, in order to provide more comprehensive coverage of the multiple aspects of a user query. The first approach diversifies the document ranking on a centralized sample database before selecting information sources under the framework of Relevant Document Distribution Estimation (ReDDE). The second approach first evaluates the relevance of each information source with respect to each aspect of the query, and then ranks the sources by the novelty and relevance that they offer. Both approaches can be applied with a wide range of existing resource selection algorithms, such as ReDDE, CRCS, CORI and Big Document. In addition, this paper proposes a learning-based approach that combines multiple resource selection algorithms for result diversification, which can further improve performance. We also propose a set of new metrics for resource selection in federated search to evaluate the diversification performance of different approaches. To the best of our knowledge, this is the first work to address the problem of search result diversification in federated search. The effectiveness of the proposed approaches is demonstrated by an extensive set of experiments on the federated search testbed of the ClueWeb dataset.
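The abstract gives no formulas, so the following is only a minimal sketch of the idea behind the second approach, not the authors' actual method. It assumes that a per-aspect relevance score in [0, 1] is already available for every source (for example from ReDDE or CORI run once per query aspect) and uses a simple greedy, novelty-discounted gain. The function name `select_diverse_sources`, the source names, the aspect names and all scores are hypothetical.

```python
from typing import Dict, List


def select_diverse_sources(
    rel: Dict[str, Dict[str, float]],   # rel[source][aspect], assumed in [0, 1]
    aspect_weight: Dict[str, float],    # assumed importance of each query aspect
    k: int,
) -> List[str]:
    """Greedily pick up to k sources, discounting aspects already covered."""
    # uncovered[a] tracks how much of aspect a is still unsatisfied.
    uncovered = {a: 1.0 for a in aspect_weight}
    remaining = set(rel)
    selected: List[str] = []

    while remaining and len(selected) < k:
        def gain(source: str) -> float:
            # Relevance to an aspect counts only to the extent that the
            # aspect is not yet covered by previously selected sources.
            return sum(
                aspect_weight[a] * uncovered[a] * rel[source].get(a, 0.0)
                for a in aspect_weight
            )

        # Sorting first makes tie-breaking deterministic (by source name).
        best = max(sorted(remaining), key=gain)
        selected.append(best)
        remaining.remove(best)
        for a in aspect_weight:
            uncovered[a] *= 1.0 - rel[best].get(a, 0.0)

    return selected


# Hypothetical scores: two sources are strong on aspect "a1", one on "a2".
rel = {
    "src_a": {"a1": 0.9, "a2": 0.0},
    "src_b": {"a1": 0.8, "a2": 0.0},
    "src_c": {"a1": 0.0, "a2": 0.6},
}
weights = {"a1": 0.5, "a2": 0.5}
picked = select_diverse_sources(rel, weights, k=2)
print(picked)
```

With these made-up scores, a ranking by relevance alone would pick `src_a` and `src_b`, which both cover only aspect `a1`. The greedy sketch picks `src_a` first and then `src_c`, because once `a1` is mostly covered the second source about `a1` adds little novelty.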
Index Terms
- Search result diversification in resource selection for federated search