Abstract
Semi-structured documents are a common type of data that combine free text in natural language (unstructured data) with additional information about the document, or metadata, which typically follows a schema or controlled vocabulary (structured data). Analyzing unstructured and structured data simultaneously enables the discovery of hidden relationships that cannot be identified when either source is analyzed on its own. In this work, we present a visual text analytics tool for semi-structured documents (ViTA-SSD) that aims to support the user in exploring a semi-structured document collection and finding insightful patterns in a visual and interactive manner. It achieves this goal by presenting the user with a set of coordinated visualizations that link the metadata with interactively generated clusters of documents, so that relevant patterns can be easily spotted. The system's back end contains two novel approaches: a feature-learning method that learns a compact representation of the corpus and a fast clustering approach that has been redesigned to allow user supervision. These contributions make it possible for the user to interact with a large and dynamic document collection and to perform several text analytics tasks more efficiently. Finally, we present two use cases that illustrate the suitability of the system for in-depth interactive exploration of semi-structured document collections, two user studies, and the results of several evaluations of our text-mining components.
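The abstract does not spell out how the clustering or the metadata linking is implemented, so the following is only a minimal sketch of the general idea, not the ViTA-SSD method. It assumes that documents have already been mapped to short numeric vectors (standing in for the learned compact representation), uses a mini-batch k-means variant whose centroids are initialized from user-chosen seed documents (standing in for user supervision), and then cross-tabulates the resulting clusters against one metadata field. All names (`seeded_minibatch_kmeans`, `points`, `seeds`, `metadata`) and the toy data are hypothetical.

```python
import random
from collections import Counter


def nearest(point, centroids):
    """Index of the centroid closest to `point` (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda j: sum((a - b) ** 2 for a, b in zip(point, centroids[j])),
    )


def seeded_minibatch_kmeans(points, seeds, batch_size=4, n_iter=50, rng_seed=0):
    """Mini-batch k-means with centroids initialized from user-picked seeds.

    points: list of equal-length lists of floats (one vector per document).
    seeds:  list of lists; seeds[j] holds indices of documents the user
            assigned to cluster j (assumed non-empty for every cluster).
    """
    rng = random.Random(rng_seed)
    dim = len(points[0])
    centroids, counts = [], []
    for members in seeds:
        vecs = [points[i] for i in members]
        centroids.append([sum(v[d] for v in vecs) / len(vecs) for d in range(dim)])
        # Assumption of this sketch: seed documents count as prior
        # observations, so the user's choice is not erased by the first update.
        counts.append(len(vecs))
    for _ in range(n_iter):
        batch = rng.sample(points, min(batch_size, len(points)))
        # Assign the whole batch first, then update centroids incrementally.
        assigned = [nearest(p, centroids) for p in batch]
        for p, j in zip(batch, assigned):
            counts[j] += 1
            eta = 1.0 / counts[j]  # per-centroid learning rate
            centroids[j] = [(1 - eta) * c + eta * x for c, x in zip(centroids[j], p)]
    labels = [nearest(p, centroids) for p in points]
    return labels, centroids


# Toy corpus: six documents as 2-D vectors plus one metadata field each.
points = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.3], [5.0, 5.1], [5.2, 4.9], [4.8, 5.0]]
metadata = ["manual", "manual", "manual", "report", "report", "report"]
seeds = [[0], [3]]  # the user marks document 0 and document 3 as cluster seeds

labels, centroids = seeded_minibatch_kmeans(points, seeds)

# Link clusters to metadata: count documents per (cluster, metadata value).
table = Counter(zip(labels, metadata))
for (cluster, value), n in sorted(table.items()):
    print(f"cluster {cluster} / {value}: {n} documents")
```

In an interactive setting, a table like this is the kind of data a coordinated view could display next to the clusters; a strong imbalance in one cell suggests a relationship between the text content and that metadata value that is worth inspecting.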
Supplemental Material
Supplemental movie, appendix, image, and software files for Exploratory Visual Analysis and Interactive Pattern Extraction from Semi-Structured Data