research-article
Best Paper

Exploratory Visual Analysis and Interactive Pattern Extraction from Semi-Structured Data

Published: 08 September 2015

Abstract

Semi-structured documents are a common type of data that combine free text in natural language (unstructured data) with additional information about the document, or metadata, that typically follows a schema or controlled vocabulary (structured data). Analyzing the unstructured and structured data together enables the discovery of hidden relationships that cannot be identified when either source is analyzed on its own. In this work, we present a visual text analytics tool for semi-structured documents (ViTA-SSD) that supports the user in exploring a semi-structured document collection and finding insightful patterns in it in a visual and interactive manner. It does so by presenting a set of coordinated visualizations that link the metadata with interactively generated clusters of documents, so that relevant patterns can be easily spotted. The back end of the system contains two novel approaches: a feature-learning method that learns a compact representation of the corpus, and a fast clustering approach that has been redesigned to allow user supervision. These contributions make it possible for the user to interact with a large and dynamic document collection and to perform several text analytics tasks more efficiently. Finally, we present two use cases that illustrate the suitability of the system for in-depth interactive exploration of semi-structured document collections, two user studies, and the results of several evaluations of our text-mining components.
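To make the idea of linking metadata with document clusters more concrete, the sketch below cross-tabulates cluster assignments against one metadata field and scores the association with Pearson's chi-squared statistic. This is only an illustration under stated assumptions: the abstract does not specify how ViTA-SSD computes or ranks such links, and the toy cluster labels, the metadata values, and the function names `contingency` and `chi_squared` are all hypothetical.

```python
from collections import Counter


def contingency(clusters, metadata):
    """Count how often each (cluster, metadata value) pair co-occurs."""
    return Counter(zip(clusters, metadata))


def chi_squared(clusters, metadata):
    """Pearson's chi-squared statistic for independence of the two labelings.

    A value near zero means the metadata field is spread evenly across
    clusters; a large value means some clusters are dominated by
    particular metadata values.
    """
    n = len(clusters)
    observed = contingency(clusters, metadata)
    row_totals = Counter(clusters)   # documents per cluster
    col_totals = Counter(metadata)   # documents per metadata value
    stat = 0.0
    for c, row_n in row_totals.items():
        for m, col_n in col_totals.items():
            expected = row_n * col_n / n
            # Counter returns 0 for pairs that never co-occur.
            stat += (observed[(c, m)] - expected) ** 2 / expected
    return stat


# Hypothetical toy data: six documents, two clusters, one metadata field.
clusters = [0, 0, 0, 1, 1, 1]
metadata = ["hotel", "hotel", "hotel", "flight", "flight", "flight"]

if __name__ == "__main__":
    print(contingency(clusters, metadata))
    print(chi_squared(clusters, metadata))
```

In an interactive setting, a score like this could be recomputed each time the user revises the clusters, so that metadata fields strongly associated with the current clustering are surfaced first. That workflow is an assumption for illustration, not a description of the published system.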

      • Published in

        ACM Transactions on Interactive Intelligent Systems, Volume 5, Issue 3
        Special Issue on Behavior Understanding for Arts and Entertainment (Part 2 of 2) and Regular Articles
        October 2015
        181 pages
        ISSN: 2160-6455
        EISSN: 2160-6463
        DOI: 10.1145/2821459

        Copyright © 2015 ACM

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        • Published: 8 September 2015
        • Accepted: 1 June 2015
        • Revised: 1 May 2015
        • Received: 1 August 2014
        Published in tiis Volume 5, Issue 3

        Qualifiers

        • research-article
        • Research
        • Refereed
