ABSTRACT
Taxonomy construction is not only a fundamental task for semantic analysis of text corpora, but also an important step for applications such as information filtering, recommendation, and Web search. Existing pattern-based methods extract hypernym-hyponym term pairs and then organize these pairs into a taxonomy. However, by considering each term as an independent concept node, they overlook the topical proximity and the semantic correlations among terms. In this paper, we propose a method for constructing topic taxonomies, wherein every node represents a conceptual topic and is defined as a cluster of semantically coherent concept terms. Our method, TaxoGen, uses term embeddings and hierarchical clustering to construct a topic taxonomy in a recursive fashion. To ensure the quality of the recursive process, TaxoGen comprises two modules: (1) an adaptive spherical clustering module that allocates terms to the proper levels when splitting a coarse topic into fine-grained ones; and (2) a local embedding module that learns term embeddings which maintain strong discriminative power at different levels of the taxonomy. Our experiments on two real datasets demonstrate the effectiveness of TaxoGen compared with baseline methods.
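To make the idea of splitting a coarse topic while keeping general terms at the parent level more concrete, the following is a minimal sketch and not the TaxoGen implementation. It clusters term embeddings by cosine similarity (spherical k-means) and then, as an assumed stand-in for the adaptive step, leaves at the parent any term whose similarity to its assigned centroid falls below a threshold. The function names, the `min_sim` threshold rule, and the toy two-dimensional embeddings are all hypothetical.

```python
import numpy as np


def spherical_kmeans(X, k, n_iter=20, seed=0):
    """Cluster the rows of X by cosine similarity (spherical k-means)."""
    rng = np.random.default_rng(seed)
    # Project every embedding onto the unit sphere (rows assumed non-zero).
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    # Start from k distinct data points; fancy indexing returns a copy.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # On the unit sphere, the dot product equals cosine similarity.
        labels = np.argmax(X @ centroids.T, axis=1)
        for j in range(k):
            members = X[labels == j]
            if len(members) == 0:
                continue  # keep the previous centroid for an empty cluster
            total = members.sum(axis=0)
            norm = np.linalg.norm(total)
            if norm > 0:
                centroids[j] = total / norm
    labels = np.argmax(X @ centroids.T, axis=1)
    return labels, centroids


def split_topic(terms, X, k, min_sim=0.9, seed=0):
    """Split a parent topic into k child topics.

    Assumption (not from the paper): a term whose cosine similarity to its
    assigned centroid is below `min_sim` is treated as too general for any
    child and stays at the parent level.
    """
    labels, centroids = spherical_kmeans(X, k, seed=seed)
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    # Row-wise cosine similarity between each term and its own centroid.
    sims = np.einsum("ij,ij->i", Xn, centroids[labels])
    children = [[] for _ in range(k)]
    kept_at_parent = []
    for term, label, sim in zip(terms, labels, sims):
        if sim >= min_sim:
            children[label].append(term)
        else:
            kept_at_parent.append(term)
    return children, kept_at_parent


# Toy, hand-made 2-D embeddings (hypothetical values for illustration only).
terms = ["neural_network", "deep_learning", "backpropagation",
         "sql", "query_optimization", "database_index",
         "computer_science"]
X = np.array([[1.0, 0.05], [1.0, -0.05], [1.0, 0.1],
              [0.05, 1.0], [-0.05, 1.0], [0.1, 1.0],
              [1.0, 1.0]])

children, kept_at_parent = split_topic(terms, X, k=2)
print(children)
print(kept_at_parent)
```

In this toy setup the generic term sits between the two clusters, so it does not pass the threshold for either child and remains with the parent topic. Applying `split_topic` again to each child's terms would give the recursive construction the abstract describes, and relearning embeddings on the documents relevant to each child would correspond to the local embedding module, which this sketch omits.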
Index Terms
- TaxoGen: Unsupervised Topic Taxonomy Construction by Adaptive Term Embedding and Clustering