Data mining algorithms have been the focus of much recent research. Most previous data mining algorithms have either assumed that the input data is static or have been designed for arbitrary insertions and deletions of data records. In practice, the input data to a data mining process resides in a large data warehouse whose data is kept up to date through periodic or occasional addition of blocks of data. In this dissertation, we study two important issues: (1) exploiting this systematic data evolution to maintain data mining models efficiently, and (2) monitoring changes in data characteristics.
Considering a dynamic environment that evolves through systematic addition or deletion of blocks of data, we introduce a new dimension called the data span dimension, which allows user-defined selections of a time-varying subset of the database. We then describe efficient model maintenance algorithms for such time-varying subsets.
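As a rough illustration of block-wise maintenance over a time-varying subset, the sketch below keeps item counts for only the most recent blocks and updates them incrementally as blocks arrive and expire. It is not the dissertation's algorithm: the class `WindowedCounts`, its `window` parameter, and the choice of a "most recent blocks" selection are all assumptions made for this example.

```python
from collections import Counter, deque


class WindowedCounts:
    """Hypothetical helper: item counts over the most recent `window` blocks."""

    def __init__(self, window):
        self.window = window      # number of most recent blocks to keep (assumed selection)
        self.blocks = deque()     # per-block item counts, oldest first
        self.total = Counter()    # aggregate counts over the blocks currently selected

    def add_block(self, transactions):
        # Summarize the new block once: count each item at most once per transaction.
        block = Counter(item for t in transactions for item in set(t))
        self.blocks.append(block)
        self.total.update(block)
        # Expire the oldest block by subtracting its summary,
        # rather than rescanning the remaining data.
        if len(self.blocks) > self.window:
            expired = self.blocks.popleft()
            self.total.subtract(expired)
            self.total = +self.total  # unary plus drops zero and negative entries
        return self.total


wc = WindowedCounts(window=2)
wc.add_block([["a", "b"], ["a"]])
wc.add_block([["b"]])
wc.add_block([["c"]])  # the first block falls out of the window here
```

The point of the sketch is that each block is summarized once when it arrives, so adding or removing a block touches only that block's summary.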
A data mining algorithm builds a model that captures interesting characteristics of the underlying data. Building on this observation, we develop the FOCUS framework for quantifying the difference, called deviation, between two datasets in terms of the models they induce. Our framework covers a wide variety of models, including frequent itemsets, decision tree classifiers, and clusters, and captures standard measures of deviation such as the misclassification rate (in Machine Learning) and the chi-squared metric (in Statistics) as special cases. We also show how statistical techniques can be applied to the deviation measure to assess whether the difference between two models is meaningful (i.e., whether the underlying datasets have statistically significant differences in their characteristics). We then apply the FOCUS framework to monitor changes in data characteristics and to interactively explore datasets for unusual behavior.
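To make the idea of a model-based deviation concrete, here is a minimal sketch for the frequent-itemset case. It is an illustration under stated assumptions, not the FOCUS definition itself: the function names, the brute-force itemset enumeration, and the choice of summing absolute support differences over the union of both datasets' frequent itemsets are all assumptions made for this example.

```python
from itertools import combinations


def support(dataset, itemset):
    """Fraction of transactions in `dataset` that contain every item of `itemset`."""
    if not dataset:
        return 0.0
    return sum(1 for t in dataset if itemset <= t) / len(dataset)


def frequent_itemsets(dataset, min_support, max_size=2):
    """Brute-force enumeration of frequent itemsets (fine for tiny examples only)."""
    items = sorted(set().union(*dataset)) if dataset else []
    found = set()
    for k in range(1, max_size + 1):
        for combo in combinations(items, k):
            candidate = frozenset(combo)
            if support(dataset, candidate) >= min_support:
                found.add(candidate)
    return found


def deviation(d1, d2, min_support, max_size=2):
    """Assumed measure: sum of absolute support differences over the itemsets
    that are frequent in at least one of the two datasets."""
    regions = (frequent_itemsets(d1, min_support, max_size)
               | frequent_itemsets(d2, min_support, max_size))
    return sum(abs(support(d1, r) - support(d2, r)) for r in regions)


d1 = [{"a", "b"}, {"a"}]
d2 = [{"a", "b"}, {"b"}]
print(deviation(d1, d2, min_support=0.5))
```

Both datasets are measured over the same set of itemsets, so the comparison is made on a common footing. Other difference or aggregation functions could be swapped in without changing the overall shape of the computation.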