ABSTRACT
This paper introduces GeoSpark, an in-memory cluster computing framework for processing large-scale spatial data. GeoSpark consists of three layers: the Apache Spark Layer, the Spatial RDD Layer, and the Spatial Query Processing Layer. The Apache Spark Layer provides basic Spark functionality, including loading/storing data to disk as well as regular RDD operations. The Spatial RDD Layer consists of three novel Spatial Resilient Distributed Datasets (SRDDs) that extend regular Apache Spark RDDs to support geometrical and spatial objects. GeoSpark provides a geometrical operations library that accesses Spatial RDDs to perform basic geometrical operations (e.g., Overlap, Intersect), and system users can leverage the newly defined SRDDs to effectively develop spatial data processing programs in Spark. The Spatial Query Processing Layer efficiently executes spatial query processing algorithms (e.g., Spatial Range, Join, and KNN queries) on SRDDs. GeoSpark also allows users to create a spatial index (e.g., R-tree, Quad-tree) in each SRDD partition to boost spatial data processing performance. Preliminary experiments show that GeoSpark achieves better run-time performance than its Hadoop-based counterparts (e.g., SpatialHadoop).
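To make the partition-then-query idea concrete, the following is a minimal, single-machine sketch in plain Python. It is not GeoSpark's actual API (all function names here are hypothetical): it only illustrates how points split into spatial partitions, as in an SRDD, let a range query prune partitions before filtering individual points, which is the role the per-partition spatial index plays.

```python
# Illustrative sketch (not GeoSpark's real API): a spatial range query over
# grid-partitioned point data, mimicking how a Spatial RDD splits points
# across partitions and filters each relevant partition independently.

from collections import defaultdict

def partition_points(points, cell_size):
    """Group (x, y) points into grid-cell partitions, a stand-in for
    Spark's distributed partitioning of a Spatial RDD."""
    partitions = defaultdict(list)
    for x, y in points:
        key = (int(x // cell_size), int(y // cell_size))
        partitions[key].append((x, y))
    return partitions

def range_query(partitions, cell_size, xmin, ymin, xmax, ymax):
    """Return all points inside the query rectangle, visiting only grid
    cells that overlap it (the pruning a spatial index provides)."""
    result = []
    for cx in range(int(xmin // cell_size), int(xmax // cell_size) + 1):
        for cy in range(int(ymin // cell_size), int(ymax // cell_size) + 1):
            for x, y in partitions.get((cx, cy), []):
                if xmin <= x <= xmax and ymin <= y <= ymax:
                    result.append((x, y))
    return result

pts = [(1.0, 1.0), (5.5, 2.0), (9.0, 9.0), (2.5, 3.5)]
parts = partition_points(pts, cell_size=4.0)
print(sorted(range_query(parts, 4.0, 0.0, 0.0, 4.0, 4.0)))
# -> [(1.0, 1.0), (2.5, 3.5)]
```

In GeoSpark itself, the partitions live on cluster nodes and each can carry an R-tree or Quad-tree instead of this flat grid, but the query structure (prune partitions, then filter locally) is the same.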
Index Terms
- GeoSpark: a cluster computing framework for processing large-scale spatial data