demonstration

Data mining algorithms as a service in the cloud exploiting relational database systems

Authors:
Carlos Ordonez, University of Houston, Houston, USA
Javier García-García, Instituto Politécnico Nacional, Mexico, TX, Mexico
Carlos Garcia-Alvarado, Greenplum/EMC, San Mateo, CA, USA
Wellington Cabrera, University of Houston, Houston, TX, USA
Veerabhadran Baladandayuthapani, UT MD Anderson C.C., Houston, TX, USA
Mohammed S. Quraishi, University of Houston, Houston, TX, USA

SIGMOD '13: Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data, June 2013, Pages 1001–1004
https://doi.org/10.1145/2463676.2465240

Published: 22 June 2013

ABSTRACT

We present a novel cloud system based on DBMS technology in which data mining algorithms are offered as a service. A local DBMS connects to the cloud, and the cloud system returns computed data mining models as small relational tables that are archived and can be easily transferred, queried and integrated with the client database. Unlike other analytic systems, our solution is not based on MapReduce. Our system avoids exporting large tables outside the local DBMS and thus avoids transmitting large volumes of data to the cloud. The system offers three processing modes (local, cloud and hybrid), and a linear cost model is used to choose among them. In hybrid mode, processing is split between the local DBMS and the cloud DBMS. Our system has a job scheduler with FIFO, SJF and RR policies to improve response time and return partial results early. The cloud DBMS performs dynamic job scheduling, model computation and model archive management. Our system incorporates several optimizations: local data set summarization with sufficient statistics, sampling, caching matrices in RAM and selectively transmitting small matrices back and forth. We show that, in general, the most efficient computing mechanism is hybrid processing: summarizing or sampling the data set in the local DBMS, transferring small matrices back and forth, and leaving mathematically complex methods to the cloud DBMS.
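
The hybrid mode described in the abstract can be illustrated with a small sketch. The abstract does not define the sufficient statistics or any interface, so the following assumes a commonly used summary for multidimensional models: the row count n, the per-dimension sums L, and the matrix of cross-product sums Q. The function names summarize_local and model_from_summary are hypothetical, and plain Python lists stand in for the DBMS tables. The point is only that the "local" step scans the full data set once, while the "cloud" step needs nothing but the small (n, L, Q) summary.

```python
# Illustrative sketch only. The (n, L, Q) layout and both function names are
# assumptions made for this example, not the system's actual interface.

def summarize_local(rows, d):
    """Local side: one pass over d-dimensional rows, returning small summaries.

    n is the row count, L[a] is the sum of dimension a,
    and Q[a][b] is the sum of x[a] * x[b] over all rows.
    """
    n = 0
    L = [0.0] * d
    Q = [[0.0] * d for _ in range(d)]
    for x in rows:
        n += 1
        for a in range(d):
            L[a] += x[a]
            for b in range(d):
                Q[a][b] += x[a] * x[b]
    return n, L, Q


def model_from_summary(n, L, Q):
    """'Cloud' side: derive mean and population covariance from (n, L, Q) only."""
    d = len(L)
    mean = [L[a] / n for a in range(d)]
    cov = [[Q[a][b] / n - mean[a] * mean[b] for b in range(d)]
           for a in range(d)]
    return mean, cov


if __name__ == "__main__":
    # Toy data set standing in for a large local table.
    data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
    n, L, Q = summarize_local(data, d=2)   # only these small values are "sent"
    mean, cov = model_from_summary(n, L, Q)
    print("n =", n, "L =", L, "Q =", Q)
    print("mean =", mean, "cov =", cov)
```

The summary has size O(d^2) regardless of the number of rows, which is why transferring it instead of the table keeps network traffic small in this sketch.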

Index Terms

Data mining algorithms as a service in the cloud exploiting relational database systems
1. Information systems
  1. Data management systems
    1. Database design and models
      1. Relational database model
    2. Query languages
      1. Relational database query languages

Published in
SIGMOD '13: Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data
June 2013
1322 pages
ISBN: 9781450320375
DOI: 10.1145/2463676
General Chairs: Kenneth Ross (Columbia University), Divesh Srivastava (AT&T Research)
Program Chair: Dimitris Papadias (HKUST)
Copyright © 2013 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
algorithm
cloud
dbms
statistical model
udf
Qualifiers
- demonstration
Acceptance Rates
SIGMOD '13 Paper Acceptance Rate: 76 of 372 submissions, 20%
Overall Acceptance Rate: 785 of 4,003 submissions, 20%