DOI: 10.5555/2457524.2457675
Article

Ranking Text Documents Based on Conceptual Difficulty Using Term Embedding and Sequential Discourse Cohesion

Published: 04 December 2012

ABSTRACT

We propose a novel framework for determining the conceptual difficulty of a domain-specific text document without using any external lexicon. Conceptual difficulty refers to the reading difficulty of domain-specific documents. Previous approaches to the domain-specific readability problem have relied heavily on an external lexicon, which limits their scalability to other domains. Our model can be readily applied in domain-specific vertical search engines to re-rank documents according to their conceptual difficulty. We develop an unsupervised and principled approach for computing a term's conceptual difficulty in the latent space. Our approach also considers transitions between the segments generated in sequence. It outperforms the current state-of-the-art comparative methods.

Published in

WI-IAT '12: Proceedings of the 2012 IEEE/WIC/ACM International Joint Conferences on Web Intelligence and Intelligent Agent Technology - Volume 01
December 2012
585 pages
ISBN: 9780769548807

Publisher: IEEE Computer Society, United States
