Article

Ranking Text Documents Based on Conceptual Difficulty Using Term Embedding and Sequential Discourse Cohesion

WI-IAT '12: Proceedings of the 2012 IEEE/WIC/ACM International Joint Conferences on Web Intelligence and Intelligent Agent Technology - Volume 01, December 2012, Pages 145–152

Published: 04 December 2012

ABSTRACT

We propose a novel framework for determining the conceptual difficulty of a domain-specific text document without using any external lexicon. Conceptual difficulty refers to the reading difficulty of domain-specific documents. Previous approaches to the domain-specific readability problem have relied heavily on an external lexicon, which limits their scalability to other domains. Our model can be readily applied in domain-specific vertical search engines to re-rank documents according to their conceptual difficulty. We develop an unsupervised and principled approach for computing a term's conceptual difficulty in the latent space. Our approach also considers the transitions between segments generated in sequence. It outperforms the current state-of-the-art comparative methods.
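The idea of scoring transitions between sequential segments can be illustrated with a minimal sketch. This is not the authors' formulation, which the abstract does not specify. It assumes that each term already has a vector in some latent space (for example, one produced by LSI), that a document has already been split into ordered segments, and that cohesion between neighbouring segments is measured as the cosine similarity of their centroids. The names `centroid`, `cosine`, `transition_cohesion`, and the toy `embedding` are hypothetical and chosen only for illustration.

```python
import math


def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def transition_cohesion(segments, embedding):
    """Similarity of each pair of consecutive segment centroids.

    Assumption: cohesion is approximated by centroid cosine similarity.
    Segments with no known terms are skipped.
    """
    centroids = []
    for segment in segments:
        vectors = [embedding[term] for term in segment if term in embedding]
        if vectors:
            centroids.append(centroid(vectors))
    return [cosine(a, b) for a, b in zip(centroids, centroids[1:])]


# Toy latent-space vectors (hypothetical values, two latent dimensions).
embedding = {
    "cell": [1.0, 0.0],
    "membrane": [0.9, 0.1],
    "enzyme": [0.0, 1.0],
    "kinase": [0.1, 0.9],
}

# A document as an ordered list of segments, each a list of terms.
segments = [["cell", "membrane"], ["cell"], ["enzyme", "kinase"]]

scores = transition_cohesion(segments, embedding)
print(scores)  # one score per transition between consecutive segments
```

In this toy example the first transition stays within one topic and scores high, while the second jumps to a different topic and scores low. A ranking model could combine such transition scores with per-term difficulty, but how the paper combines them is not stated in the abstract.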


Index Terms
1. Information systems
  1. Information retrieval


Published in

WI-IAT '12: Proceedings of the 2012 IEEE/WIC/ACM International Joint Conferences on Web Intelligence and Intelligent Agent Technology - Volume 01
December 2012
585 pages
ISBN: 9780769548807
Publisher
IEEE Computer Society
United States
Publication History
- Published: 4 December 2012
Author Tags
Conceptual Difficulty
K-means
LSI
Term Embedding
Qualifiers
- Article
