research-article

Public Access

Simba: Efficient In-Memory Spatial Analytics

Authors:
Dong Xie

University of Utah, Salt Lake City, UT, USA

University of Utah, Salt Lake City, UT, USA
View Profile

,
Feifei Li

University of Utah, Salt Lake City, UT, USA

University of Utah, Salt Lake City, UT, USA
View Profile

,
Bin Yao

Shanghai Jiao Tong University, Shanghai, China

Shanghai Jiao Tong University, Shanghai, China
View Profile

,
Gefei Li

Shanghai Jiao Tong University, Shanghai, China

Shanghai Jiao Tong University, Shanghai, China
View Profile

,
Liang Zhou

Shanghai Jiao Tong University, Shanghai, China

Shanghai Jiao Tong University, Shanghai, China
View Profile

,
Minyi Guo

Shanghai Jiao Tong University, Shanghai, China

Shanghai Jiao Tong University, Shanghai, China
View Profile

SIGMOD '16: Proceedings of the 2016 International Conference on Management of DataJune 2016Pages 1071–1085https://doi.org/10.1145/2882903.2915237

Published:14 June 2016Publication History

SIGMOD '16: Proceedings of the 2016 International Conference on Management of Data

Pages 1071–1085

ABSTRACT

Large spatial data becomes ubiquitous. As a result, it is critical to provide fast, scalable, and high-throughput spatial queries and analytics for numerous applications in location-based services (LBS). Traditional spatial databases and spatial analytics systems are disk-based and optimized for IO efficiency. But increasingly, data are stored and processed in memory to achieve low latency, and CPU time becomes the new bottleneck. We present the Simba (Spatial In-Memory Big data Analytics) system that offers scalable and efficient in-memory spatial query processing and analytics for big spatial data. Simba is based on Spark and runs over a cluster of commodity machines. In particular, Simba extends the Spark SQL engine to support rich spatial queries and analytics through both SQL and the DataFrame API. It introduces indexes over RDDs in order to work with big spatial data and complex spatial operations. Lastly, Simba implements an effective query optimizer, which leverages its indexes and novel spatial-aware optimizations, to achieve both low latency and high throughput. Extensive experiments over large data sets demonstrate Simba's superior performance compared against other spatial analytics system.

References

http://www.comp.nus.edu.sg/~dbsystem/source.html.Google Scholar
https://github.com/amplab/spark-indexedrdd.Google Scholar
Apache accumulo. http://accumulo.apache.org.Google Scholar
Apache avro project. http://avro.apache.org.Google Scholar
Apache parquet project. http://parquet.incubator.apache.org.Google Scholar
Apache spark project. http://spark.apache.org.Google Scholar
Apache zookeeper. https://zookeeper.apache.org/.Google Scholar
Gdelt project. http://www.gdeltproject.org.Google Scholar
Openstreepmap project. http://www.openstreetmap.org.Google Scholar
R project for statistical computing. http://www.r-project.org.Google Scholar
A. Aji, F. Wang, H. Vo, R. Lee, Q. Liu, X. Zhang, and J. Saltz. Hadoop gis: a high performance spatial data warehousing system over mapreduce. In VLDB, 2013. Google ScholarDigital Library
A. Akdogan, U. Demiryurek, F. Banaei-Kashani, and C. Shahabi. Voronoi-based geospatial query processing with mapreduce. In CouldCom, 2010. Google ScholarDigital Library
M. Armbrust, R. S. Xin, C. Lian, Y. Huai, D. Liu, J. K. Bradley, X. Meng, T. Kaftan, M. J. Franklin, A. Ghodsi, et~al. Spark sql: Relational data processing in spark. In SIGMOD, 2015. Google ScholarDigital Library
N. Beckmann, H. Kriegel, R. Schneider, and B. Seeger. The r*-tree: An efficient and robust access method for points and rectangles. In SIGMOD, 1990. Google ScholarDigital Library
A. Cary, Z. Sun, V. Hristidis, and N. Rishe. Experiences on processing spatial data with mapreduce. In Scientific and Statistical Database Management, 2009. Google ScholarDigital Library
F. Chang, J. Dean, S. Ghemawat, W. C. Hsieh, D. A. Wallach, M. Burrows, T. Chandra, A. Fikes, and R. E. Gruber. Bigtable: A distributed storage system for structured data. TOCS, 2008. Google ScholarDigital Library
S. Chaudhuri. An overview of query optimization in relational systems. In PODS, pages 34--43, 1998. Google ScholarDigital Library
J. Dean and S. Ghemawat. Mapreduce: Simplified data processing on large clusters. In OSDI, 2004. Google ScholarDigital Library
A. Eldawy, L. Alarabi, and M. F. Mokbel. Spatial partitioning techniques in spatial hadoop. PVLDB, 2015. Google ScholarDigital Library
A. Eldawy, Y. Li, M. F. Mokbel, and R. Janardan. Cg_hadoop: computational geometry in mapreduce. In SIGSPATIAL, 2013. Google ScholarDigital Library
A. Eldawy and M. F. Mokbel. Pigeon: A spatial mapreduce language. In ICDE, 2014.Google ScholarCross Ref
A. Eldawy and M. F. Mokbel. Spatialhadoop: A mapreduce framework for spatial data. In ICDE, 2015.Google ScholarCross Ref
A. Guttman. R-trees: A dynamic index structure for spatial searching. In SIGMOD, 1984. Google ScholarDigital Library
J. N. Hughes, A. Annex, C. N. Eichelberger, A. Fox, A. Hulbert, and M. Ronquest. Geomesa: a distributed architecture for spatio-temporal fusion. In SPIE DefenseGoogle Scholar
Security, 2015.Google Scholar
V. Leis, A. Kemper, and T. Neumann. The adaptive radix tree: Artful indexing for main-memory databases. In ICDE, 2013. Google ScholarDigital Library
S. T. Leutenegger, M. Lopez, J. Edgington, et~al. STR: A simple and efficient algorithm for R-tree packing. In ICDE, 1997. Google ScholarDigital Library
W. Lu, Y. Shen, S. Chen, and B. C. Ooi. Efficient processing of k nearest neighbor joins using mapreduce. In VLDB, 2012. Google ScholarDigital Library
Q. Ma, B. Yang, W. Qian, and A. Zhou. Query processing of massive trajectory data based on mapreduce. In Proceedings of the first international workshop on Cloud data management, 2009. Google ScholarDigital Library
S. Nishimura, S. Das, D. Agrawal, and A. El~Abbadi. MD-hbase: design and implementation of an elastic data infrastructure for cloud-scale location services. In DAPD, 2013. Google ScholarDigital Library
C. Olston, B. Reed, U. Srivastava, R. Kumar, and A. Tomkins. Pig latin: a not-so-foreign language for data processing. In SIGMOD, 2008. Google ScholarDigital Library
N. Roussopoulos, S. Kelley, and F. Vincent. Nearest neighbor queries. In SIGMOD, 1995. Google ScholarDigital Library
H. Samet. Foundations of Multidimensional and Metric Data Structures (The Morgan Kaufmann Series in Computer Graphics and Geometric Modeling). Morgan Kaufmann Publishers Inc., 2005. Google ScholarDigital Library
A. Thusoo, J. S. Sarma, N. Jain, Z. Shao, P. Chakka, S. Anthony, H. Liu, P. Wyckoff, and R. Murthy. Hive: A warehousing solution over a map-reduce framework. In PVDLB, 2009. Google ScholarDigital Library
H. Vo, A. Aji, and F. Wang. Sato: A spatial data partitioning framework for scalable query processing. In SIGSPATIAL, 2014. Google ScholarDigital Library
R. S. Xin, J. Rosen, M. Zaharia, M. J. Franklin, S. Shenker, and I. Stoica. Shark: Sql and rich analytics at scale. In SIGMOD, 2013. Google ScholarDigital Library
S. You, J. Zhang, and L. Gruenwald. Large-scale spatial join query processing in cloud. In IEEE CloudDM workshop (To Appear), 2015.Google ScholarCross Ref
J. Yu, J. Wu, and M. Sarwat. Geospark: A cluster computing framework for processing large-scale spatial data. In SIGSPATIAL GIS, 2015. Google ScholarDigital Library
M. Zaharia, M. Chowdhury, T. Das, A. Dave, J. Ma, M. McCauley, M. J. Franklin, S. Shenker, and I. Stoica. Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. In NSDI, 2012. Google ScholarDigital Library
C. Zhang, F. Li, and J. Jestes. Efficient parallel knn joins for large data in mapreduce. In EDBT, 2012. Google ScholarDigital Library
S. Zhang, J. Han, Z. Liu, K. Wang, and S. Feng. Spatial queries evaluation with mapreduce. In ICGCC, 2009. Google ScholarDigital Library
S. Zhang, J. Han, Z. Liu, K. Wang, and Z. Xu. Sjmr: Parallelizing spatial join with mapreduce on clusters. In IEEE ICCC, 2009.Google ScholarCross Ref

Index Terms

Simba: Efficient In-Memory Spatial Analytics
1. Information systems
  1. Data management systems
    1. Database management system engines
      1. Parallel and distributed DBMSs

Recommendations

A Performance Study of Big Spatial Data Systems
BigSpatial '18: Proceedings of the 7th ACM SIGSPATIAL International Workshop on Analytics for Big Geospatial Data

With the accelerated growth in spatial data volume, being generated from a wide variety of sources, the need for efficient storage, retrieval, processing and analyzing of spatial data is ever more important. Hence, spatial data processing system has ...
Read More
Making sense of performance in in-memory computing frameworks for scientific data analysis: A case study of the spark system
Abstract
Over the last five years, Apache Spark has become a major software platform for in-memory data analysis. Acknowledging its widespread use, we present a comprehensive study of system characteristics of Spark targeting scientific data ...
Highlights
- We develop a benchmark, ArrayBench, for benchmarking scientific data analytics that process gene expression matrices using Spark and SciDB.
Read More
The Era of Big Spatial Data: Challenges and Opportunities
MDM '15: Proceedings of the 2015 16th IEEE International Conference on Mobile Data Management - Volume 02

This seminar describes the state-of-the-art research in the area of big spatial data and it consists of four parts. Part I gives a background about big spatial data and the limitations of traditional systems in handling such data. Part II gives an ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

Published in
SIGMOD '16: Proceedings of the 2016 International Conference on Management of Data
June 2016
2300 pages
ISBN:9781450335317
DOI:10.1145/2882903
General Chairs:
Fatma Özcan
IBM Research, USA
,
Georgia Koutrika
HP Labs, USA
,
Program Chair:
Sam Madden
Massachusetts Institute of Technology, USA
Copyright © 2016 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 14 June 2016
Permissions
Request permissions about this article.
Request Permissions

Check for updates
Author Tags
big spatial data
in-memory computing
main-memory DBMS
spark
spark SQL
spatial analytics
spatial queries
Qualifiers
- research-article
Conference

Acceptance Rates
Overall Acceptance Rate785of4,003submissions,20%
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 199
  Total Citations
  View Citations
- 2,282
  Total Downloads
- Downloads (Last 12 months)225
- Downloads (Last 6 weeks)49
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Simba: Efficient In-Memory Spatial Analytics

SIGMOD '16: Proceedings of the 2016 International Conference on Management of Data

ABSTRACT

References

Cited By

Index Terms

Recommendations

A Performance Study of Big Spatial Data Systems

Making sense of performance in in-memory computing frameworks for scientific data analysis: A case study of the spark system

The Era of Big Spatial Data: Challenges and Opportunities

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Permissions

Check for updates

Author Tags

Qualifiers

Conference

Acceptance Rates

Funding Sources

Other Metrics

Article Metrics

Other Metrics

Cited By

PDF Format

eReader

Digital Edition

Caption

Simba: Efficient In-Memory Spatial Analytics

SIGMOD '16: Proceedings of the 2016 International Conference on Management of Data

ABSTRACT

References

Cited By

Index Terms

Recommendations

A Performance Study of Big Spatial Data Systems

Making sense of performance in in-memory computing frameworks for scientific data analysis: A case study of the spark system

The Era of Big Spatial Data: Challenges and Opportunities

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Permissions

Check for updates

Author Tags

Qualifiers

Conference

Acceptance Rates

Funding Sources

Article Metrics

Other Metrics

PDF Format

eReader

Digital Edition

Share this Publication link

Share on Social Media