research-article

Building a topic hierarchy using the bag-of-related-words representation

Authors:
Rafael Geraldeli Rossi
Institute of Mathematics and Computer Science - University of São Paulo, São Carlos, Brazil

Solange Oliveira Rezende
Institute of Mathematics and Computer Science - University of São Paulo, São Carlos, Brazil

DocEng '11: Proceedings of the 11th ACM symposium on Document engineering, September 2011, Pages 195–204. https://doi.org/10.1145/2034691.2034733

Published: 19 September 2011

ABSTRACT

A simple and intuitive way to organize a huge document collection is through a topic hierarchy. Building a topic hierarchy automatically generally involves two steps: 1) hierarchical document clustering and 2) cluster labeling. For both steps, a good textual document representation is essential. The bag-of-words is the usual way to represent text collections. In this representation, each document is represented by a vector in which each word of the document collection corresponds to a dimension (feature). This approach has well-known problems, such as the high dimensionality and sparsity of the data. Moreover, most concepts are composed of more than one word, such as "document engineering" or "text mining". In this paper, an approach called bag-of-related-words is proposed to generate features composed of sets of related words, with a lower dimensionality than the bag-of-words. The features are extracted from each textual document of a collection using association rules. We analyze different ways of mapping a document into transactions, so that association rules can be extracted, and different interest measures for pruning the number of features. To evaluate how much the proposed approach can aid topic hierarchy building, we carried out an objective evaluation of the clustering structure and a subjective evaluation of the topic hierarchies. All results were compared with those of the bag-of-words. The results showed that the proposed representation is better than the bag-of-words for topic hierarchy building.
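The idea of extracting features from a single document with association rules can be illustrated with a minimal sketch. The abstract does not say which transaction mapping or interest measures the authors adopt, so everything below is an assumption made for illustration: each sentence is treated as one transaction, only pairs of words are considered, and support and confidence serve as the pruning measures. The function names, the stopword list, and the thresholds are hypothetical.

```python
import re
from collections import Counter
from itertools import combinations

# Tiny illustrative stopword list (an assumption, not taken from the paper).
STOPWORDS = {"a", "an", "and", "by", "for", "in", "is", "of", "on", "the", "to", "with"}


def to_transactions(document):
    """Map a document to transactions: one set of words per sentence.

    Sentence-level mapping is only one possible choice; it is assumed here.
    Only ASCII letters are kept, which is a simplification.
    """
    transactions = []
    for sentence in re.split(r"[.!?]+", document.lower()):
        words = set(re.findall(r"[a-z]+", sentence)) - STOPWORDS
        if words:
            transactions.append(words)
    return transactions


def related_word_features(document, min_support=0.5, min_confidence=0.75):
    """Return word-pair features backed by a rule that meets both thresholds.

    A pair {a, b} becomes the feature "a_b" when its support is at least
    min_support and the rule a -> b or b -> a has confidence of at least
    min_confidence.
    """
    transactions = to_transactions(document)
    n = len(transactions)
    if n == 0:
        return set()
    word_count = Counter(w for t in transactions for w in t)
    pair_count = Counter(
        frozenset(pair) for t in transactions for pair in combinations(sorted(t), 2)
    )
    features = set()
    for pair, count in pair_count.items():
        if count / n < min_support:
            continue  # pruned by support
        a, b = sorted(pair)
        confidence = max(count / word_count[a], count / word_count[b])
        if confidence >= min_confidence:
            features.add(f"{a}_{b}")
    return features


example = (
    "Text mining organizes documents. Text mining uses association rules. "
    "Association rules prune features. Text mining builds a topic hierarchy."
)
print(sorted(related_word_features(example)))
# prints ['association_rules', 'mining_text']
```

In this toy document the bag-of-words would need twelve dimensions (one per distinct non-stopword), while the sketch keeps two features, each joining words that tend to occur together. Real collections would need proper preprocessing and tuned thresholds.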

Index Terms

1. Information systems
  1. Information retrieval
    1. Retrieval tasks and goals
      1. Clustering and classification
  2. Information systems applications
    1. Data mining
      1. Clustering

Published in
DocEng '11: Proceedings of the 11th ACM symposium on Document engineering
September 2011
296 pages
ISBN: 9781450308632
DOI: 10.1145/2034691
Conference Chair: Matthew Hardy, Adobe Systems, Inc., USA
Program Chair: Frank Wm. Tompa, University of Waterloo, Canada
Copyright © 2011 ACM
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
document representation
text mining
topic hierarchy
Acceptance Rates
Overall Acceptance Rate: 178 of 537 submissions, 33%