ABSTRACT
Effective processing of extremely large volumes of spatial data has led to many organizations employing distributed processing frameworks. Apache Spark is one such open-source framework that is enjoying widespread adoption. Within this data space, it is important to note that most of the observational data (i.e., data collected by sensors, either moving or stationary) has a temporal component, or timestamp. In order to perform advanced analytics and gain insights, the temporal component becomes equally important as the spatial and attribute components. In this paper, we detail several variants of a spatial join operation that addresses both spatial, temporal, and attribute-based joins. Our spatial join technique differs from other approaches in that it combines spatial, temporal, and attribute predicates in the join operator.
In addition, our spatio-temporal join algorithm and implementation differs from others in that it runs in commercial off-the-shelf (COTS) application. The users of this functionality are assumed to be GIS analysts with little if any knowledge of the implementation details of spatio-temporal joins or distributed processing. They are comfortable using simple tools that do not provide the ability to tweak the configuration of the
- Abel, D. J., Ooi, B. C., Tan, K.-L., Power, R., and Yu, J. X. 1995. Spatial join strategies in distributed spatial DBMS. In Advances in Spatial Databases -- 4th International Symposium, SSD'95, vol. 1619 of Springer-Verlag Lecture Notes in Computer Science. Portland, ME, 348--367. Google ScholarDigital Library
- Aji A., Wang, F., Vo H., Lee, R., Liu, Q., Zhang, X. and Saltz, J. 2013. Hadoop-GIS: a high performance spatial data warehousing system over mapreduce. In Proceedings of the VLDB Endowment, 6 (11), (pp. 1009--1020). Google ScholarDigital Library
- Baig F., Mehrotra M., Vo H., Wang F., Saltz J., Kurc T. 2016. SparkGIS: efficient comparison and evaluation of algorithm results in tissue image analysis studies. In Biomedical Data Management and Graph Online Querying. Big-O(Q) 2015, DMAH 2015. Lecture Notes in Computer Science, vol 9579. Springer.Google Scholar
- Brinkhoff, T., Kriegel, H. P., and Seeger, B. 1996. Parallel processing of spatial joins using r-trees. In Proceedings of the 12th International Conference on Data Engineering (pp. 258--265). IEEE. Google ScholarDigital Library
- Dittrich, J. P., and Seeger, B. 2000. Data redundancy and duplicate detection in spatial join processing. In Data Engineering, 2000. Proceedings of the 16th International Conference on (pp. 535--546). IEEE. Google ScholarDigital Library
- Du, Z., Zhao, X., Ye, X., Zhou, J., Zhang, F., and Liu, R. 2017. An effective high-performance multiway spatial join algorithm with spark. ISPRS International Journal of Geo-Information, 6 (4): 96.Google ScholarCross Ref
- Eldawy, A., and Mokbel, M. F. 2015. SpatialHadoop: a mapreduce framework for spatial data. In Data Engineering (ICDE), 2015 IEEE 31st International Conference on (pp. 1352--1363). IEEE.Google Scholar
- Esri. 2013. GIS Tools for Hadoop. https://github.com/Esri/gis-tools-for-hadoop (referenced 2017/06).Google Scholar
- Esri. 2016. ArcGIS GeoAnalytics Server. http://server.arcgis.com/en/server/latest/get-started/windows/what-is-arcgis-geoanalytics-server-.htm (referenced 2017/06).Google Scholar
- Ester, M., Kriegel, H. P., Sander, J., and Xu, X. 1996. A density-based algorithm for discovering clusters in large spatial databases with noise. In Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining (KDD-96), pp. 226--231. Google ScholarDigital Library
- Gargantini, I. 1982. An effective way to represent quadtrees. Communications of the ACM 25, 12 (December 1982), 905--910. Google ScholarDigital Library
- Guttman, A. 1984. "R-trees: a dynamic index structure for spatial searching," in Proceedings of the 1984 ACM SIGMOD International Conference on Management of Data, (pp. 47--57). Google ScholarDigital Library
- Hagedorn, S., Götze, P., and Sattler, K. U. 2017. The STARK framework for spatio-temporal data analytics on spark. In Proceedings of the 17th Conference on Database Systems for Business, Technology, and the Web (BTW 2017), Stuttgart, Germany, March 2017.Google Scholar
- Hjaltason, G., and Samet, H. 2002. Speeding up construction of PMR quadtree-based spatial indexes. VLDB Journal, 11, 2 (October 2002), 109--137. Google ScholarDigital Library
- Hoel, E. and Samet, H. 1994. Data-parallel spatial join algorithms. In Proceedings of the 23rd International Conference on Parallel Processing. Vol. 3. St. Charles, IL, 227--234. Google ScholarDigital Library
- Jacox, E. H., and Samet, H. 2007. Spatial join techniques. ACM Transactions on Database Systems (TODS), 32(1), 7. Google ScholarDigital Library
- Kornacker, M., and Erickson, J. 2012. Cloudera Impala: Real Time Queries in Apache Hadoop, For Real. http://blog.cloudera.com/blog/2012/10/cloudera-impala-real-time-queries-in-apache-hadoop-for-real.Google Scholar
- Mamoulis, N. and Papadias, D. 2001. Multiway spatial joins. ACM Transactions on Database Systems 26, 4 (Dec.), 424--475. Google ScholarDigital Library
- Nelson, R. C., and Samet, H. 1986. A consistent hierarchical representation for vector data. In ACM SIGGRAPH Computer Graphics, 20 (4), pp. 197--206). ACM. Google ScholarDigital Library
- Nievergelt, J., Hinterberger, H., and Sevcik, K. C. 1984. The grid file: An adaptable, symmetric multikey file structure. ACM Transactions on Database Systems (TODS), 9(1), 38--71. Google ScholarDigital Library
- Orenstein, Jack A. "Multidimensional tries used for associative searching." Information Processing Letters 14.4 (1982): 150--157.Google Scholar
- Raad, M. 2013. BigData Spatial Joins, Blog post. http://thunderheadxpler.blogspot.com/2013/10/bigdata-spatial-joins.html (referenced 2017/06).Google Scholar
- Sellis, T. K., Roussopoulos, N., and Faloutsos, C. 1987. The R+-tree: a dynamic index for multi-dimensional objects, in Proceedings of the 13th International Conference on Very Large Data Bases (VLDB), (pp. 507--518). Google ScholarDigital Library
- Sriharsha, R., 2015. Magellan: geospatial analytics on spark, https://hortonworks.com/blog/magellan-geospatial-analytics-in-spark/ (referenced 2017/06).Google Scholar
- Tang, MingJie, Yongyang Yu, Qutaibah M. Malluhi, Mourad Ouzzani and Walid G. Aref. "LocationSpark: A Distributed In-Memory Data Management System for Big Spatial Data." PVLDB 9 (2016): 1565--1568. Google ScholarDigital Library
- Valduriez, P., and Gardarin, G. 1984. Join and semijoin algorithms for a multiprocessor database machine. ACM Transactions on Database Systems (TODS), 9(1), 133--161. Google ScholarDigital Library
- White, T. 2009. Hadoop: The Definitive Guide (1st ed.). O'Reilly Media, Inc., Sebastopol, CA, USA. Google ScholarDigital Library
- Whitman, R. T., Park, M. B., Ambrose, S. M., and Hoel, E. G. 2014. Spatial indexing and analytics on hadoop. In Proceedings of the 22nd ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems (pp. 73--82). ACM. Google ScholarDigital Library
- Dong Xie, Feifei Li, Bin Yao, Gefei Li, Liang Zhou, and Minyi Guo. 2016. Simba: Efficient In-Memory Spatial Analytics. In Proceedings of the 2016 International Conference on Management of Data (SIGMOD '16). ACM, New York, NY, USA, 1071--1085. Google ScholarDigital Library
- You, S., Zhang, J., and Gruenwald, L. 2015. Large-scale spatial join query processing in cloud. In Data Engineering Workshops (ICDEW), 2015 31st IEEE International Conference on (pp. 34--41). IEEE.Google Scholar
- Yu, J., Wu, J., and Sarwat, M. 2015. Geospark: A cluster computing framework for processing large-scale spatial data. In Proceedings of the 23rd SIGSPATIAL International Conference on Advances in Geographic Information Systems (p. 70). ACM. Google ScholarDigital Library
- Zaharia, M., Chowdhury, M., Franklin, M., Shenker, S., and Stoica, I. 2010. Spark: cluster computing with working sets. In Proceedings of the 2nd USENIX Workshop on Hot Topics in Cloud Computing (HotCloud '10), Boston, June 2010. Google ScholarDigital Library
- Zhang, S., Han, J., Liu, Z., Wang, K., and Xu, Z. 2009. SJMR: Parallelizing spatial join with mapreduce on clusters. In Cluster Computing and Workshops, 2009. CLUSTER'09. IEEE international conference on (pp. 1--8). IEEE.Google Scholar
- Zhong, Yunqin, Jizhong Han, Tieying Zhang, Zhenhua Li, Jinyun Fang and Guihai Chen. "Towards Parallel Spatial Query Processing for Big Spatial Data." 2012 IEEE 26th International Parallel and Distributed Processing Symposium Workshops & PhD Forum (2012): 2085--2094. Google ScholarDigital Library
Index Terms
- Spatio-Temporal Join on Apache Spark
Recommendations
Join Algorithms under Apache Spark: Revisited
ICCTA '19: Proceedings of the 2019 5th International Conference on Computer and Technology ApplicationsCurrently, we are dealing with large scale applications, which in turn generate massive amount of data and information. Large amount of data often requires processing algorithms using massive parallelism, where the main performance metrics is the ...
Distributed Spatial and Spatio-Temporal Join on Apache Spark
Special Issue on SIGSPATIAL 2017Effective processing of extremely large volumes of spatial data has led to many organizations employing distributed processing frameworks. Apache Spark is one such open source framework that is enjoying widespread adoption. Within this data space, it is ...
Impact of Memory Size on Bigdata Processing based on Hadoop and Spark
RACS '17: Proceedings of the International Conference on Research in Adaptive and Convergent SystemsHadoop and Spark are well-known big data processing platforms. The main technologies of Hadoop are Hadoop Distributed File System and MapReduce processing. Hadoop stores intermediary data on Hadoop Distributed File System, which is a disk-based ...
Comments