ABSTRACT
We propose a novel framework for determining the conceptual difficulty of a domain-specific text document without using any external lexicon. Conceptual difficulty captures the reading difficulty of domain-specific documents. Previous approaches to the domain-specific readability problem have relied heavily on an external lexicon, which limits their scalability to other domains. Our model can be readily applied in domain-specific vertical search engines to re-rank documents according to their conceptual difficulty. We develop an unsupervised and principled approach for computing a term's conceptual difficulty in the latent space. Our approach also considers transitions between the segments generated in sequence. It outperforms current state-of-the-art comparative methods.
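The abstract does not spell out how terms are placed in the latent space or how transitions between segments are scored. As a rough illustration only, and not the authors' method, the sketch below assumes a truncated SVD (LSA-style) embedding of a term-document count matrix and measures the cosine similarity between the centroids of consecutive segments. The function names, the toy matrix, and the centroid-based similarity are all assumptions introduced for this example.

```python
import numpy as np

def latent_term_vectors(term_doc, k):
    """Embed each term (row) in a k-dimensional latent space.

    Assumption: a plain truncated SVD of the term-document matrix,
    as in LSA. The paper may use a different construction.
    """
    u, s, _ = np.linalg.svd(term_doc, full_matrices=False)
    return u[:, :k] * s[:k]

def segment_cohesion(term_vecs, segments):
    """Cosine similarity between centroids of consecutive segments.

    `segments` is a list of lists of term indices (hypothetical input
    format). Returns one similarity per adjacent pair of segments.
    """
    centroids = [term_vecs[idx].mean(axis=0) for idx in segments]
    sims = []
    for a, b in zip(centroids, centroids[1:]):
        denom = np.linalg.norm(a) * np.linalg.norm(b)
        sims.append(float(a @ b / denom) if denom > 0 else 0.0)
    return sims

# Toy term-document counts: 6 terms (rows) by 4 documents (columns).
term_doc = np.array([
    [2, 0, 1, 0],
    [1, 0, 2, 0],
    [0, 3, 0, 1],
    [0, 1, 0, 2],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
], dtype=float)

vecs = latent_term_vectors(term_doc, k=2)
sims = segment_cohesion(vecs, [[0, 1], [2, 3], [4, 5]])
print(sims)
```

A low similarity between adjacent segments would suggest an abrupt conceptual transition. How such a signal is combined with term-level difficulty to rank documents is left open here, since the abstract does not specify it.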
- Ranking Text Documents Based on Conceptual Difficulty Using Term Embedding and Sequential Discourse Cohesion