Mining and monitoring evolving data
Publisher: The University of Wisconsin - Madison
ISBN: 978-0-599-88402-1
Order Number: AAI9981907
Pages: 137
Abstract

Data mining algorithms have been the focus of much recent research. Most previous data mining algorithms have either assumed that the input data is static, or have been designed for arbitrary insertions and deletions of data records. In practice, the input data to a data mining process resides in a large data warehouse whose data is kept up-to-date through periodic or occasional addition of blocks of data. In this dissertation, we study two important issues: (1) exploiting the systematic data evolution for efficiently maintaining data mining models, and (2) monitoring changes in data characteristics.

Considering a dynamic environment that evolves through systematic addition or deletion of blocks of data, we introduce a new dimension called the data span dimension, which allows user-defined selections of a time-varying subset of the database. We then describe efficient model maintenance algorithms for such time-varying subsets.
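As a rough illustration of block-wise maintenance, the sketch below keeps itemset support counts for a selection consisting of the most recent blocks, adding the counts of each new block and subtracting those of the block that falls out of the window. The class name `WindowedItemsetCounts`, the "most recent `window` blocks" selection, and the `max_size` cap on itemset length are assumptions made for this example; the abstract does not specify the dissertation's actual algorithms.

```python
from collections import Counter, deque
from itertools import combinations


class WindowedItemsetCounts:
    """Itemset counts over the most recent `window` blocks (hypothetical helper)."""

    def __init__(self, window, max_size=2):
        self.window = window        # number of most recent blocks to keep
        self.max_size = max_size    # largest itemset size that is counted
        self.blocks = deque()       # (per-block counts, block size), oldest first
        self.counts = Counter()     # aggregate counts over the current window
        self.n_records = 0          # number of records in the current window

    def _count_block(self, block):
        # Count every itemset of size <= max_size occurring in the block.
        block_counts = Counter()
        for record in block:
            items = sorted(set(record))
            for k in range(1, self.max_size + 1):
                for itemset in combinations(items, k):
                    block_counts[itemset] += 1
        return block_counts

    def add_block(self, block):
        # Add the new block's counts, then retire the oldest block if needed.
        block_counts = self._count_block(block)
        self.blocks.append((block_counts, len(block)))
        self.counts.update(block_counts)
        self.n_records += len(block)
        if len(self.blocks) > self.window:
            old_counts, old_size = self.blocks.popleft()
            self.counts.subtract(old_counts)
            self.n_records -= old_size

    def frequent(self, min_support):
        # Itemsets whose support within the window is at least min_support.
        if self.n_records == 0:
            return {}
        return {
            itemset: count / self.n_records
            for itemset, count in self.counts.items()
            if count > 0 and count / self.n_records >= min_support
        }


tracker = WindowedItemsetCounts(window=2)
tracker.add_block([["a", "b"], ["a"]])
tracker.add_block([["b"], ["a", "b"]])
tracker.add_block([["c"], ["c"]])   # the first block now leaves the window
print(tracker.frequent(0.5))
```

Because each block's counts are stored separately, retiring a block costs one counter subtraction rather than a rescan of the remaining data, which is the kind of saving that systematic block evolution makes possible.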

A data mining algorithm builds a model that captures interesting characteristics in the underlying data. Therefore, we develop the FOCUS framework for quantifying the difference, called deviation, between two datasets in terms of the models they induce. Our framework covers a wide variety of models including frequent itemsets, decision tree classifiers, and clusters, and captures standard measures of deviation such as the misclassification rate (in Machine Learning) and the chi-squared metric (in Statistics) as special cases. We also show how statistical techniques can be applied to the deviation measure to assess whether the difference between two models is meaningful (i.e., whether the underlying datasets have statistically significant differences in their characteristics). We then apply the FOCUS framework to monitor changes in data characteristics and to interactively explore datasets for unusual behavior.
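To make the idea of model-based deviation concrete, here is a minimal sketch for frequent itemsets. It assumes one plausible instantiation: take the union of the itemsets that are frequent in either dataset, measure each itemset's support in both datasets, and sum the absolute differences. The function names, the brute-force candidate enumeration, and the choice of absolute difference with a sum are assumptions for this example, not definitions taken from the dissertation.

```python
from itertools import combinations


def support(dataset, itemset):
    """Fraction of records in `dataset` that contain every item of `itemset`."""
    if not dataset:
        return 0.0
    wanted = set(itemset)
    return sum(1 for record in dataset if wanted <= set(record)) / len(dataset)


def frequent_itemsets(dataset, min_support, max_size=2):
    """Brute-force frequent itemsets up to `max_size` items (illustration only)."""
    items = sorted({item for record in dataset for item in record})
    result = set()
    for k in range(1, max_size + 1):
        for candidate in combinations(items, k):
            if support(dataset, candidate) >= min_support:
                result.add(candidate)
    return result


def deviation(d1, d2, min_support, max_size=2):
    """Sum of absolute support differences over itemsets frequent in d1 or d2."""
    regions = (frequent_itemsets(d1, min_support, max_size)
               | frequent_itemsets(d2, min_support, max_size))
    return sum(abs(support(d1, r) - support(d2, r)) for r in regions)


d1 = [["a", "b"], ["a", "b"], ["a"], ["c"]]
d2 = [["a"], ["c"], ["c"], ["b", "c"]]
print(deviation(d1, d2, min_support=0.5))
```

Comparing both datasets over the same set of itemsets is what makes the two models commensurable. A dataset compared with itself yields a deviation of zero under this definition.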

Contributors
  • Google LLC
  • Microsoft Corporation
