research-article

Spatio-Temporal Join on Apache Spark

Authors:
Randall T. Whitman

Environmental Systems Research Institute (Esri)

Environmental Systems Research Institute (Esri)
View Profile

,
Michael B. Park

Environmental Systems Research Institute (Esri)

Environmental Systems Research Institute (Esri)
View Profile

,
Bryan G. Marsh

Environmental Systems Research Institute (Esri)

Environmental Systems Research Institute (Esri)
View Profile

,
Erik G. Hoel

Environmental Systems Research Institute (Esri)

Environmental Systems Research Institute (Esri)
View Profile

SIGSPATIAL '17: Proceedings of the 25th ACM SIGSPATIAL International Conference on Advances in Geographic Information SystemsNovember 2017Article No.: 20Pages 1–10https://doi.org/10.1145/3139958.3139963

Published:07 November 2017Publication History

SIGSPATIAL '17: Proceedings of the 25th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems

Pages 1–10

ABSTRACT

Effective processing of extremely large volumes of spatial data has led to many organizations employing distributed processing frameworks. Apache Spark is one such open-source framework that is enjoying widespread adoption. Within this data space, it is important to note that most of the observational data (i.e., data collected by sensors, either moving or stationary) has a temporal component, or timestamp. In order to perform advanced analytics and gain insights, the temporal component becomes equally important as the spatial and attribute components. In this paper, we detail several variants of a spatial join operation that addresses both spatial, temporal, and attribute-based joins. Our spatial join technique differs from other approaches in that it combines spatial, temporal, and attribute predicates in the join operator.

In addition, our spatio-temporal join algorithm and implementation differs from others in that it runs in commercial off-the-shelf (COTS) application. The users of this functionality are assumed to be GIS analysts with little if any knowledge of the implementation details of spatio-temporal joins or distributed processing. They are comfortable using simple tools that do not provide the ability to tweak the configuration of the

References

Abel, D. J., Ooi, B. C., Tan, K.-L., Power, R., and Yu, J. X. 1995. Spatial join strategies in distributed spatial DBMS. In Advances in Spatial Databases -- 4th International Symposium, SSD'95, vol. 1619 of Springer-Verlag Lecture Notes in Computer Science. Portland, ME, 348--367. Google ScholarDigital Library
Aji A., Wang, F., Vo H., Lee, R., Liu, Q., Zhang, X. and Saltz, J. 2013. Hadoop-GIS: a high performance spatial data warehousing system over mapreduce. In Proceedings of the VLDB Endowment, 6 (11), (pp. 1009--1020). Google ScholarDigital Library
Baig F., Mehrotra M., Vo H., Wang F., Saltz J., Kurc T. 2016. SparkGIS: efficient comparison and evaluation of algorithm results in tissue image analysis studies. In Biomedical Data Management and Graph Online Querying. Big-O(Q) 2015, DMAH 2015. Lecture Notes in Computer Science, vol 9579. Springer.Google Scholar
Brinkhoff, T., Kriegel, H. P., and Seeger, B. 1996. Parallel processing of spatial joins using r-trees. In Proceedings of the 12th International Conference on Data Engineering (pp. 258--265). IEEE. Google ScholarDigital Library
Dittrich, J. P., and Seeger, B. 2000. Data redundancy and duplicate detection in spatial join processing. In Data Engineering, 2000. Proceedings of the 16th International Conference on (pp. 535--546). IEEE. Google ScholarDigital Library
Du, Z., Zhao, X., Ye, X., Zhou, J., Zhang, F., and Liu, R. 2017. An effective high-performance multiway spatial join algorithm with spark. ISPRS International Journal of Geo-Information, 6 (4): 96.Google ScholarCross Ref
Eldawy, A., and Mokbel, M. F. 2015. SpatialHadoop: a mapreduce framework for spatial data. In Data Engineering (ICDE), 2015 IEEE 31st International Conference on (pp. 1352--1363). IEEE.Google Scholar
Esri. 2013. GIS Tools for Hadoop. https://github.com/Esri/gis-tools-for-hadoop (referenced 2017/06).Google Scholar
Esri. 2016. ArcGIS GeoAnalytics Server. http://server.arcgis.com/en/server/latest/get-started/windows/what-is-arcgis-geoanalytics-server-.htm (referenced 2017/06).Google Scholar
Ester, M., Kriegel, H. P., Sander, J., and Xu, X. 1996. A density-based algorithm for discovering clusters in large spatial databases with noise. In Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining (KDD-96), pp. 226--231. Google ScholarDigital Library
Gargantini, I. 1982. An effective way to represent quadtrees. Communications of the ACM 25, 12 (December 1982), 905--910. Google ScholarDigital Library
Guttman, A. 1984. "R-trees: a dynamic index structure for spatial searching," in Proceedings of the 1984 ACM SIGMOD International Conference on Management of Data, (pp. 47--57). Google ScholarDigital Library
Hagedorn, S., Götze, P., and Sattler, K. U. 2017. The STARK framework for spatio-temporal data analytics on spark. In Proceedings of the 17th Conference on Database Systems for Business, Technology, and the Web (BTW 2017), Stuttgart, Germany, March 2017.Google Scholar
Hjaltason, G., and Samet, H. 2002. Speeding up construction of PMR quadtree-based spatial indexes. VLDB Journal, 11, 2 (October 2002), 109--137. Google ScholarDigital Library
Hoel, E. and Samet, H. 1994. Data-parallel spatial join algorithms. In Proceedings of the 23rd International Conference on Parallel Processing. Vol. 3. St. Charles, IL, 227--234. Google ScholarDigital Library
Jacox, E. H., and Samet, H. 2007. Spatial join techniques. ACM Transactions on Database Systems (TODS), 32(1), 7. Google ScholarDigital Library
Kornacker, M., and Erickson, J. 2012. Cloudera Impala: Real Time Queries in Apache Hadoop, For Real. http://blog.cloudera.com/blog/2012/10/cloudera-impala-real-time-queries-in-apache-hadoop-for-real.Google Scholar
Mamoulis, N. and Papadias, D. 2001. Multiway spatial joins. ACM Transactions on Database Systems 26, 4 (Dec.), 424--475. Google ScholarDigital Library
Nelson, R. C., and Samet, H. 1986. A consistent hierarchical representation for vector data. In ACM SIGGRAPH Computer Graphics, 20 (4), pp. 197--206). ACM. Google ScholarDigital Library
Nievergelt, J., Hinterberger, H., and Sevcik, K. C. 1984. The grid file: An adaptable, symmetric multikey file structure. ACM Transactions on Database Systems (TODS), 9(1), 38--71. Google ScholarDigital Library
Orenstein, Jack A. "Multidimensional tries used for associative searching." Information Processing Letters 14.4 (1982): 150--157.Google Scholar
Raad, M. 2013. BigData Spatial Joins, Blog post. http://thunderheadxpler.blogspot.com/2013/10/bigdata-spatial-joins.html (referenced 2017/06).Google Scholar
Sellis, T. K., Roussopoulos, N., and Faloutsos, C. 1987. The R+-tree: a dynamic index for multi-dimensional objects, in Proceedings of the 13th International Conference on Very Large Data Bases (VLDB), (pp. 507--518). Google ScholarDigital Library
Sriharsha, R., 2015. Magellan: geospatial analytics on spark, https://hortonworks.com/blog/magellan-geospatial-analytics-in-spark/ (referenced 2017/06).Google Scholar
Tang, MingJie, Yongyang Yu, Qutaibah M. Malluhi, Mourad Ouzzani and Walid G. Aref. "LocationSpark: A Distributed In-Memory Data Management System for Big Spatial Data." PVLDB 9 (2016): 1565--1568. Google ScholarDigital Library
Valduriez, P., and Gardarin, G. 1984. Join and semijoin algorithms for a multiprocessor database machine. ACM Transactions on Database Systems (TODS), 9(1), 133--161. Google ScholarDigital Library
White, T. 2009. Hadoop: The Definitive Guide (1st ed.). O'Reilly Media, Inc., Sebastopol, CA, USA. Google ScholarDigital Library
Whitman, R. T., Park, M. B., Ambrose, S. M., and Hoel, E. G. 2014. Spatial indexing and analytics on hadoop. In Proceedings of the 22nd ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems (pp. 73--82). ACM. Google ScholarDigital Library
Dong Xie, Feifei Li, Bin Yao, Gefei Li, Liang Zhou, and Minyi Guo. 2016. Simba: Efficient In-Memory Spatial Analytics. In Proceedings of the 2016 International Conference on Management of Data (SIGMOD '16). ACM, New York, NY, USA, 1071--1085. Google ScholarDigital Library
You, S., Zhang, J., and Gruenwald, L. 2015. Large-scale spatial join query processing in cloud. In Data Engineering Workshops (ICDEW), 2015 31st IEEE International Conference on (pp. 34--41). IEEE.Google Scholar
Yu, J., Wu, J., and Sarwat, M. 2015. Geospark: A cluster computing framework for processing large-scale spatial data. In Proceedings of the 23rd SIGSPATIAL International Conference on Advances in Geographic Information Systems (p. 70). ACM. Google ScholarDigital Library
Zaharia, M., Chowdhury, M., Franklin, M., Shenker, S., and Stoica, I. 2010. Spark: cluster computing with working sets. In Proceedings of the 2nd USENIX Workshop on Hot Topics in Cloud Computing (HotCloud '10), Boston, June 2010. Google ScholarDigital Library
Zhang, S., Han, J., Liu, Z., Wang, K., and Xu, Z. 2009. SJMR: Parallelizing spatial join with mapreduce on clusters. In Cluster Computing and Workshops, 2009. CLUSTER'09. IEEE international conference on (pp. 1--8). IEEE.Google Scholar
Zhong, Yunqin, Jizhong Han, Tieying Zhang, Zhenhua Li, Jinyun Fang and Guihai Chen. "Towards Parallel Spatial Query Processing for Big Spatial Data." 2012 IEEE 26th International Parallel and Distributed Processing Symposium Workshops & PhD Forum (2012): 2085--2094. Google ScholarDigital Library

Index Terms

Spatio-Temporal Join on Apache Spark
1. Information systems
  1. Data management systems
    1. Database management system engines
      1. Database query processing
        Join algorithms
  2. Information systems applications
    1. Spatial-temporal systems
      1. Geographic information systems

Recommendations

Join Algorithms under Apache Spark: Revisited
ICCTA '19: Proceedings of the 2019 5th International Conference on Computer and Technology Applications

Currently, we are dealing with large scale applications, which in turn generate massive amount of data and information. Large amount of data often requires processing algorithms using massive parallelism, where the main performance metrics is the ...
Read More
Distributed Spatial and Spatio-Temporal Join on Apache Spark
Special Issue on SIGSPATIAL 2017

Effective processing of extremely large volumes of spatial data has led to many organizations employing distributed processing frameworks. Apache Spark is one such open source framework that is enjoying widespread adoption. Within this data space, it is ...
Read More
Impact of Memory Size on Bigdata Processing based on Hadoop and Spark
RACS '17: Proceedings of the International Conference on Research in Adaptive and Convergent Systems

Hadoop and Spark are well-known big data processing platforms. The main technologies of Hadoop are Hadoop Distributed File System and MapReduce processing. Hadoop stores intermediary data on Hadoop Distributed File System, which is a disk-based ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

Published in
SIGSPATIAL '17: Proceedings of the 25th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems
November 2017
677 pages
ISBN:9781450354905
DOI:10.1145/3139958
Editors:
Erik Hoel,
Shawn Newsam,
Siva Ravada,
Roberto Tamassia,
Goce Trajcevski
Copyright © 2017 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 7 November 2017
Permissions
Request permissions about this article.
Request Permissions

Check for updates
Author Tags
HDFS
Hadoop
Spark
Spatial join
distributed processing
spatio-temporal join
Qualifiers
- research-article
- Research
- Refereed limited
Conference

Acceptance Rates
SIGSPATIAL '17 Paper Acceptance Rate39of193submissions,20%Overall Acceptance Rate220of1,116submissions,20%
More
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 15
  Total Citations
  View Citations
- 514
  Total Downloads
- Downloads (Last 12 months)28
- Downloads (Last 6 weeks)4
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Spatio-Temporal Join on Apache Spark

SIGSPATIAL '17: Proceedings of the 25th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems

ABSTRACT

References

Cited By

Index Terms

Recommendations

Join Algorithms under Apache Spark: Revisited

Distributed Spatial and Spatio-Temporal Join on Apache Spark

Impact of Memory Size on Bigdata Processing based on Hadoop and Spark