ABSTRACT
This paper introduces GeoSpark, an in-memory cluster computing framework for processing large-scale spatial data. GeoSpark consists of three layers: the Apache Spark Layer, the Spatial RDD Layer, and the Spatial Query Processing Layer. The Apache Spark Layer provides basic Spark functionality, including loading/storing data to disk as well as regular RDD operations. The Spatial RDD Layer consists of three novel Spatial Resilient Distributed Datasets (SRDDs) that extend regular Apache Spark RDDs to support geometrical and spatial objects. GeoSpark provides a geometrical operations library that accesses Spatial RDDs to perform basic geometrical operations (e.g., Overlap, Intersect), and system users can leverage the newly defined SRDDs to effectively develop spatial data processing programs in Spark. The Spatial Query Processing Layer efficiently executes spatial query processing algorithms (e.g., Spatial Range, Join, and KNN queries) on SRDDs. GeoSpark also allows users to create a spatial index (e.g., R-tree, Quad-tree) in each SRDD partition to boost spatial data processing performance. Preliminary experiments show that GeoSpark achieves better run-time performance than its Hadoop-based counterparts (e.g., SpatialHadoop).
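To make the partition-then-query idea concrete, the following is a minimal, single-machine sketch in plain Python. It is not GeoSpark's actual API (all function names here are hypothetical): it only illustrates how points split into spatial partitions, as in an SRDD, let a range query prune partitions before filtering individual points, which is the role the per-partition spatial index plays.

```python
# Illustrative sketch (not GeoSpark's real API): a spatial range query over
# grid-partitioned point data, mimicking how a Spatial RDD splits points
# across partitions and filters each relevant partition independently.

from collections import defaultdict

def partition_points(points, cell_size):
    """Group (x, y) points into grid-cell partitions, a stand-in for
    Spark's distributed partitioning of a Spatial RDD."""
    partitions = defaultdict(list)
    for x, y in points:
        key = (int(x // cell_size), int(y // cell_size))
        partitions[key].append((x, y))
    return partitions

def range_query(partitions, cell_size, xmin, ymin, xmax, ymax):
    """Return all points inside the query rectangle, visiting only grid
    cells that overlap it (the pruning a spatial index provides)."""
    result = []
    for cx in range(int(xmin // cell_size), int(xmax // cell_size) + 1):
        for cy in range(int(ymin // cell_size), int(ymax // cell_size) + 1):
            for x, y in partitions.get((cx, cy), []):
                if xmin <= x <= xmax and ymin <= y <= ymax:
                    result.append((x, y))
    return result

pts = [(1.0, 1.0), (5.5, 2.0), (9.0, 9.0), (2.5, 3.5)]
parts = partition_points(pts, cell_size=4.0)
print(sorted(range_query(parts, 4.0, 0.0, 0.0, 4.0, 4.0)))
# -> [(1.0, 1.0), (2.5, 3.5)]
```

In GeoSpark itself, the partitions live on cluster nodes and each can carry an R-tree or Quad-tree instead of this flat grid, but the query structure (prune partitions, then filter locally) is the same.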
Index Terms
- GeoSpark: a cluster computing framework for processing large-scale spatial data