ABSTRACT
Large spatial data becomes ubiquitous. As a result, it is critical to provide fast, scalable, and high-throughput spatial queries and analytics for numerous applications in location-based services (LBS). Traditional spatial databases and spatial analytics systems are disk-based and optimized for IO efficiency. But increasingly, data are stored and processed in memory to achieve low latency, and CPU time becomes the new bottleneck. We present the Simba (Spatial In-Memory Big data Analytics) system that offers scalable and efficient in-memory spatial query processing and analytics for big spatial data. Simba is based on Spark and runs over a cluster of commodity machines. In particular, Simba extends the Spark SQL engine to support rich spatial queries and analytics through both SQL and the DataFrame API. It introduces indexes over RDDs in order to work with big spatial data and complex spatial operations. Lastly, Simba implements an effective query optimizer, which leverages its indexes and novel spatial-aware optimizations, to achieve both low latency and high throughput. Extensive experiments over large data sets demonstrate Simba's superior performance compared against other spatial analytics system.
- http://www.comp.nus.edu.sg/~dbsystem/source.html.Google Scholar
- https://github.com/amplab/spark-indexedrdd.Google Scholar
- Apache accumulo. http://accumulo.apache.org.Google Scholar
- Apache avro project. http://avro.apache.org.Google Scholar
- Apache parquet project. http://parquet.incubator.apache.org.Google Scholar
- Apache spark project. http://spark.apache.org.Google Scholar
- Apache zookeeper. https://zookeeper.apache.org/.Google Scholar
- Gdelt project. http://www.gdeltproject.org.Google Scholar
- Openstreepmap project. http://www.openstreetmap.org.Google Scholar
- R project for statistical computing. http://www.r-project.org.Google Scholar
- A. Aji, F. Wang, H. Vo, R. Lee, Q. Liu, X. Zhang, and J. Saltz. Hadoop gis: a high performance spatial data warehousing system over mapreduce. In VLDB, 2013. Google ScholarDigital Library
- A. Akdogan, U. Demiryurek, F. Banaei-Kashani, and C. Shahabi. Voronoi-based geospatial query processing with mapreduce. In CouldCom, 2010. Google ScholarDigital Library
- M. Armbrust, R. S. Xin, C. Lian, Y. Huai, D. Liu, J. K. Bradley, X. Meng, T. Kaftan, M. J. Franklin, A. Ghodsi, et~al. Spark sql: Relational data processing in spark. In SIGMOD, 2015. Google ScholarDigital Library
- N. Beckmann, H. Kriegel, R. Schneider, and B. Seeger. The r*-tree: An efficient and robust access method for points and rectangles. In SIGMOD, 1990. Google ScholarDigital Library
- A. Cary, Z. Sun, V. Hristidis, and N. Rishe. Experiences on processing spatial data with mapreduce. In Scientific and Statistical Database Management, 2009. Google ScholarDigital Library
- F. Chang, J. Dean, S. Ghemawat, W. C. Hsieh, D. A. Wallach, M. Burrows, T. Chandra, A. Fikes, and R. E. Gruber. Bigtable: A distributed storage system for structured data. TOCS, 2008. Google ScholarDigital Library
- S. Chaudhuri. An overview of query optimization in relational systems. In PODS, pages 34--43, 1998. Google ScholarDigital Library
- J. Dean and S. Ghemawat. Mapreduce: Simplified data processing on large clusters. In OSDI, 2004. Google ScholarDigital Library
- A. Eldawy, L. Alarabi, and M. F. Mokbel. Spatial partitioning techniques in spatial hadoop. PVLDB, 2015. Google ScholarDigital Library
- A. Eldawy, Y. Li, M. F. Mokbel, and R. Janardan. Cg_hadoop: computational geometry in mapreduce. In SIGSPATIAL, 2013. Google ScholarDigital Library
- A. Eldawy and M. F. Mokbel. Pigeon: A spatial mapreduce language. In ICDE, 2014.Google ScholarCross Ref
- A. Eldawy and M. F. Mokbel. Spatialhadoop: A mapreduce framework for spatial data. In ICDE, 2015.Google ScholarCross Ref
- A. Guttman. R-trees: A dynamic index structure for spatial searching. In SIGMOD, 1984. Google ScholarDigital Library
- J. N. Hughes, A. Annex, C. N. Eichelberger, A. Fox, A. Hulbert, and M. Ronquest. Geomesa: a distributed architecture for spatio-temporal fusion. In SPIE DefenseGoogle Scholar
- Security, 2015.Google Scholar
- V. Leis, A. Kemper, and T. Neumann. The adaptive radix tree: Artful indexing for main-memory databases. In ICDE, 2013. Google ScholarDigital Library
- S. T. Leutenegger, M. Lopez, J. Edgington, et~al. STR: A simple and efficient algorithm for R-tree packing. In ICDE, 1997. Google ScholarDigital Library
- W. Lu, Y. Shen, S. Chen, and B. C. Ooi. Efficient processing of k nearest neighbor joins using mapreduce. In VLDB, 2012. Google ScholarDigital Library
- Q. Ma, B. Yang, W. Qian, and A. Zhou. Query processing of massive trajectory data based on mapreduce. In Proceedings of the first international workshop on Cloud data management, 2009. Google ScholarDigital Library
- S. Nishimura, S. Das, D. Agrawal, and A. El~Abbadi. MD-hbase: design and implementation of an elastic data infrastructure for cloud-scale location services. In DAPD, 2013. Google ScholarDigital Library
- C. Olston, B. Reed, U. Srivastava, R. Kumar, and A. Tomkins. Pig latin: a not-so-foreign language for data processing. In SIGMOD, 2008. Google ScholarDigital Library
- N. Roussopoulos, S. Kelley, and F. Vincent. Nearest neighbor queries. In SIGMOD, 1995. Google ScholarDigital Library
- H. Samet. Foundations of Multidimensional and Metric Data Structures (The Morgan Kaufmann Series in Computer Graphics and Geometric Modeling). Morgan Kaufmann Publishers Inc., 2005. Google ScholarDigital Library
- A. Thusoo, J. S. Sarma, N. Jain, Z. Shao, P. Chakka, S. Anthony, H. Liu, P. Wyckoff, and R. Murthy. Hive: A warehousing solution over a map-reduce framework. In PVDLB, 2009. Google ScholarDigital Library
- H. Vo, A. Aji, and F. Wang. Sato: A spatial data partitioning framework for scalable query processing. In SIGSPATIAL, 2014. Google ScholarDigital Library
- R. S. Xin, J. Rosen, M. Zaharia, M. J. Franklin, S. Shenker, and I. Stoica. Shark: Sql and rich analytics at scale. In SIGMOD, 2013. Google ScholarDigital Library
- S. You, J. Zhang, and L. Gruenwald. Large-scale spatial join query processing in cloud. In IEEE CloudDM workshop (To Appear), 2015.Google ScholarCross Ref
- J. Yu, J. Wu, and M. Sarwat. Geospark: A cluster computing framework for processing large-scale spatial data. In SIGSPATIAL GIS, 2015. Google ScholarDigital Library
- M. Zaharia, M. Chowdhury, T. Das, A. Dave, J. Ma, M. McCauley, M. J. Franklin, S. Shenker, and I. Stoica. Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. In NSDI, 2012. Google ScholarDigital Library
- C. Zhang, F. Li, and J. Jestes. Efficient parallel knn joins for large data in mapreduce. In EDBT, 2012. Google ScholarDigital Library
- S. Zhang, J. Han, Z. Liu, K. Wang, and S. Feng. Spatial queries evaluation with mapreduce. In ICGCC, 2009. Google ScholarDigital Library
- S. Zhang, J. Han, Z. Liu, K. Wang, and Z. Xu. Sjmr: Parallelizing spatial join with mapreduce on clusters. In IEEE ICCC, 2009.Google ScholarCross Ref
Index Terms
- Simba: Efficient In-Memory Spatial Analytics
Recommendations
A Performance Study of Big Spatial Data Systems
BigSpatial '18: Proceedings of the 7th ACM SIGSPATIAL International Workshop on Analytics for Big Geospatial DataWith the accelerated growth in spatial data volume, being generated from a wide variety of sources, the need for efficient storage, retrieval, processing and analyzing of spatial data is ever more important. Hence, spatial data processing system has ...
Making sense of performance in in-memory computing frameworks for scientific data analysis: A case study of the spark system
AbstractOver the last five years, Apache Spark has become a major software platform for in-memory data analysis. Acknowledging its widespread use, we present a comprehensive study of system characteristics of Spark targeting scientific data ...
Highlights- We develop a benchmark, ArrayBench, for benchmarking scientific data analytics that process gene expression matrices using Spark and SciDB.
The Era of Big Spatial Data: Challenges and Opportunities
MDM '15: Proceedings of the 2015 16th IEEE International Conference on Mobile Data Management - Volume 02This seminar describes the state-of-the-art research in the area of big spatial data and it consists of four parts. Part I gives a background about big spatial data and the limitations of traditional systems in handling such data. Part II gives an ...
Comments