research-article

Exploratory Visual Analysis and Interactive Pattern Extraction from Semi-Structured Data

Authors:
Axel J. Soto, Dalhousie University, Nova Scotia, Canada
Ryan Kiros, University of Toronto, Ontario, Canada
Vlado Kešelj, Dalhousie University, Nova Scotia, Canada
Evangelos Milios, Dalhousie University, Nova Scotia, Canada

ACM Transactions on Interactive Intelligent Systems, Volume 5, Issue 3, Article No. 16, pp. 1–36. https://doi.org/10.1145/2812115

Published: 08 September 2015

Abstract

Semi-structured documents are a common type of data containing free text in natural language (unstructured data) as well as additional information about the document, or metadata, typically following a schema or controlled vocabulary (structured data). Simultaneous analysis of unstructured and structured data enables the discovery of hidden relationships that cannot be identified from either of these sources when analyzed independently. In this work, we present a visual text analytics tool for semi-structured documents (ViTA-SSD) that supports the user in exploring a semi-structured document collection and finding insightful patterns in a visual and interactive manner. It achieves this goal by presenting the user with a set of coordinated visualizations that link the metadata with interactively generated clusters of documents so that relevant patterns can be easily spotted. The system contains two novel approaches in its back end: a feature-learning method that learns a compact representation of the corpus and a fast clustering approach that has been redesigned to allow user supervision. These contributions make it possible for the user to interact with a large and dynamic document collection and to perform several text analytics tasks more efficiently. Finally, we present two use cases that illustrate the suitability of the system for in-depth interactive exploration of semi-structured document collections, two user studies, and the results of several evaluations of our text-mining components.
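To make the idea of a fast clustering approach that accepts user supervision more concrete, the following is a minimal sketch only. It assumes, purely for illustration, a k-means-style method in which the user seeds each cluster with a few hand-picked documents and the centroids are then refined on small random batches. The abstract does not specify the algorithm used in ViTA-SSD, so the function name `seeded_minibatch_kmeans`, its parameters, and the toy data are all hypothetical.

```python
import random


def seeded_minibatch_kmeans(points, seeds, batch_size=4, iterations=50, rng_seed=0):
    """Illustrative mini-batch k-means initialised from user-chosen seeds.

    points: list of equal-length float lists (e.g. compact document vectors).
    seeds:  dict mapping a cluster id to the indices of `points` the user
            assigned to that cluster (hypothetical form of supervision).
    Returns one cluster id per point.
    """
    rng = random.Random(rng_seed)
    dim = len(points[0])
    cluster_ids = sorted(seeds)

    # Each initial centroid is the mean of the user's seed documents.
    centroids = []
    for cid in cluster_ids:
        members = [points[i] for i in seeds[cid]]
        centroids.append([sum(p[d] for p in members) / len(members) for d in range(dim)])
    counts = [len(seeds[cid]) for cid in cluster_ids]

    def nearest(p):
        # Index of the centroid with the smallest squared Euclidean distance.
        return min(
            range(len(centroids)),
            key=lambda k: sum((p[d] - centroids[k][d]) ** 2 for d in range(dim)),
        )

    for _ in range(iterations):
        batch = rng.sample(points, min(batch_size, len(points)))
        # Assign the whole batch first, then update centroids incrementally.
        assigned = [(p, nearest(p)) for p in batch]
        for p, k in assigned:
            counts[k] += 1
            eta = 1.0 / counts[k]  # per-centroid learning rate
            centroids[k] = [(1 - eta) * c + eta * x for c, x in zip(centroids[k], p)]

    return [cluster_ids[nearest(p)] for p in points]


# Toy usage: two well-separated groups, one user-labelled seed per cluster.
points = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.2], [5.0, 5.1], [5.2, 4.9], [4.9, 5.0]]
labels = seeded_minibatch_kmeans(points, {"a": [0], "b": [3]})
```

In this sketch, supervision enters only through the initial centroids; a real interactive system might also let the user re-seed or merge clusters between runs, which is not shown here.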

Supplemental Material

soto.zip (31.3 MB): supplemental movie, appendix, image, and software files for "Exploratory Visual Analysis and Interactive Pattern Extraction from Semi-Structured Data"

Index Terms

Exploratory Visual Analysis and Interactive Pattern Extraction from Semi-Structured Data
1. Information systems
  1. Information retrieval
  2. Information systems applications
    1. Data mining

Published in
ACM Transactions on Interactive Intelligent Systems Volume 5, Issue 3
Special Issue on Behavior Understanding for Arts and Entertainment (Part 2 of 2) and Regular Articles
October 2015
181 pages
ISSN: 2160-6455
EISSN: 2160-6463
DOI: 10.1145/2821459
Editors:
Anthony Jameson, German Research Center for Artificial Intelligence (DFKI), Germany
Krzysztof Gajos, Harvard University, U.S.A.
Copyright © 2015 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 8 September 2015
- Accepted: 1 June 2015
- Revised: 1 May 2015
- Received: 1 August 2014
Badges
- Best Paper
Author Tags
Visual text analytics
dimensionality reduction
interactive clustering
Qualifiers
- research-article
- Research
- Refereed