ABSTRACT
Indexing and delivering spatial data to a massive user base of over a billion devices around the world stretches the limits of traditional infrastructure and operational tools. For instance, existing approaches to offline bulk indexing and loading fall short at this scale: integration with distributed systems such as Apache Hadoop© or Spark© is sparse, and data loading is often performed sub-optimally by relying on intermediate file formats.
We present in this paper an approach toward a hybrid online/offline indexing framework called Gloria, which has been running in production for the past year at over 350k requests per second with lookup latencies under 5μs. The resulting output is an in-memory key-value store, and we show that by leveraging higher-level MapReduce [7] constructs as defined in FlumeJava [5], Gloria can achieve large-scale offline key-value indexing in a fraction of the time required by traditional datastores while maintaining similar operational performance. Gloria also provides a spatial layer based on improvements to pointer-less quadtrees [12] and on locational identifiers we call shift keys, which reduces the nearest neighbor problem in spatial data to simple key-value lookups. Shift keys have been shown to outperform well-established solutions such as Google S2 on locational key operations.
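To make the idea of reducing a spatial query to key-value lookups concrete, the following is a minimal sketch of a generic pointer-less (linear) quadtree key in the spirit of [12]. It is not Gloria's shift key, whose definition the abstract does not give. The names `MAX_LEVEL`, `locational_key`, `parent_key`, and `lookup`, the plain dictionary standing in for the key-value store, and the mapping of latitude/longitude onto a square grid are all assumptions made for illustration.

```python
from typing import Dict, Optional, Tuple

MAX_LEVEL = 16  # hypothetical maximum quadtree depth for this sketch


def locational_key(lat: float, lon: float, level: int = MAX_LEVEL) -> int:
    """Interleave the bits of a cell's column and row into one integer key."""
    cells = 1 << level  # the grid is 2^level x 2^level cells
    # Quantise the coordinates onto the grid, clamping the upper edge.
    col = min(int((lon + 180.0) / 360.0 * cells), cells - 1)
    row = min(int((lat + 90.0) / 180.0 * cells), cells - 1)
    key = 0
    for bit in range(level):
        key |= ((col >> bit) & 1) << (2 * bit)      # column bits: even positions
        key |= ((row >> bit) & 1) << (2 * bit + 1)  # row bits: odd positions
    return key


def parent_key(key: int, levels_up: int = 1) -> int:
    """Each quadtree level contributes two bits, so an ancestor is a right shift."""
    return key >> (2 * levels_up)


def lookup(store: Dict[Tuple[int, int], str], lat: float, lon: float) -> Optional[str]:
    """Walk from the finest cell to the root using only dictionary lookups."""
    key = locational_key(lat, lon)
    for level in range(MAX_LEVEL, -1, -1):
        hit = store.get((level, key))
        if hit is not None:
            return hit
        key = parent_key(key)
    return None


# Usage: index one coarse cell, then resolve a point that falls inside it.
store = {(3, locational_key(48.85, 2.35, 3)): "cell-A"}
result = lookup(store, 48.85, 2.35)
```

Because an ancestor cell's key is a prefix of its descendants' keys, the tree never needs to be materialised with pointers: finding the enclosing cell for a point costs at most `MAX_LEVEL + 1` hash lookups in this sketch.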
References
[1] A. Aji, F. Wang, H. Vo, R. Lee, Q. Liu, X. Zhang, and J. Saltz. Hadoop gis: a high performance spatial data warehousing system over mapreduce. Proceedings of the VLDB Endowment, 6(11):1009--1020, 2013.
[2] P. Bagwell. Ideal hash trees. Technical report, 2001.
[3] A. Barbuzzi, P. Michiardi, E. Biersack, and G. Boggia. Parallel bulk insertion for large-scale analytics applications. In Proceedings of the 4th International Workshop on Large Scale Distributed Systems and Middleware, pages 27--31. ACM, 2010.
[4] B. H. Bloom. Space/time trade-offs in hash coding with allowable errors. Communications of the ACM, 13(7):422--426, 1970.
[5] C. Chambers, A. Raniwala, F. Perry, S. Adams, R. Henry, R. Bradshaw, and N. Weizenbaum. Flumejava: Easy, efficient data-parallel pipelines. pages 363--375, 2010. URL: http://dl.acm.org/citation.cfm?id=1806638.
[6] Apache Crunch. https://crunch.apache.org, 2016. [online, accessed 20-April-2016].
[7] J. Dean and S. Ghemawat. Mapreduce: Simplified data processing on large clusters. In Proceedings of the 6th Conference on Symposium on Operating Systems Design & Implementation - Volume 6, OSDI'04, pages 10--10, Berkeley, CA, USA, 2004. USENIX Association. URL: http://dl.acm.org/citation.cfm?id=1251254.1251264.
[8] G. DeCandia, D. Hastorun, M. Jampani, G. Kakulapati, A. Lakshman, A. Pilchin, S. Sivasubramanian, P. Vosshall, and W. Vogels. Dynamo: amazon's highly available key-value store. In ACM SIGOPS Operating Systems Review, volume 41, pages 205--220. ACM, 2007.
[9] A. Eldawy and M. F. Mokbel. Spatialhadoop: A mapreduce framework for spatial data. In Data Engineering (ICDE), 2015 IEEE 31st International Conference on, pages 1352--1363. IEEE, 2015.
[10] L. Fan, P. Cao, J. Almeida, and A. Z. Broder. Summary cache: a scalable wide-area web cache sharing protocol. IEEE/ACM Transactions on Networking (TON), 8(3):281--293, 2000.
[11] R. A. Finkel and J. L. Bentley. Quad trees a data structure for retrieval on composite keys. Acta informatica, 4(1):1--9, 1974.
[12] I. Gargantini. An effective way to represent quadtrees. Commun. ACM, 25(12):905--910, Dec. 1982. URL: http://doi.acm.org/10.1145/358728.358741, doi:10.1145/358728.358741.
[13] G. H. Gonnet. Expected length of the longest probe sequence in hash code searching. J. ACM, 28(2):289--304, Apr. 1981. URL: http://doi.acm.org/10.1145/322248.322254, doi:10.1145/322248.322254.
[14] D. Guo, J. Wu, H. Chen, Y. Yuan, and X. Luo. The dynamic bloom filters. Knowledge and Data Engineering, IEEE Transactions on, 22(1):120--133, 2010.
[15] P. Gupta, A. Wildani, E. L. Miller, D. Rosenthal, I. F. Adams, C. Strong, and A. Hospodor. An economic perspective of disk vs. flash media in archival storage. In Modelling, Analysis & Simulation of Computer and Telecommunication Systems (MASCOTS), 2014 IEEE 22nd International Symposium on, pages 249--254. IEEE, 2014.
[16] Apache Hadoop. https://hadoop.apache.org, 2016. [online, accessed 20-April-2016].
[17] M. Herlihy, N. Shavit, and M. Tzafrir. Hopscotch hashing. In Distributed Computing, pages 350--364. Springer, 2008.
[18] G. R. Hjaltason and H. Samet. Speeding up construction of pmr quadtree-based spatial indexes. The VLDB Journal-The International Journal on Very Large Data Bases, 11(2):109--137, 2002.
[19] S. L. Horowitz and T. Pavlidis. Picture segmentation by a tree traversal algorithm. Journal of the ACM (JACM), 23(2):368--388, 1976.
[20] Google Inc. https://github.com/sparsehash/sparsehash, 2016. [online, accessed 20-April-2016].
[21] M. Mitzenmacher and E. Upfal. Probability and computing: Randomized algorithms and probabilistic analysis. Cambridge University Press, 2005.
[22] Oracle. https://docs.oracle.com/javase/7/docs/api/java/util/IdentityHashMap.html, 2016. [online, accessed 20-April-2016].
[23] R. Pagh and F. F. Rodler. Cuckoo hashing. Journal of Algorithms, 51(2):122--144, 2004.
[24] Y. Perl, A. Itai, and H. Avni. Interpolation search-a log log n search. Communications of the ACM, 21(7):550--553, 1978.
[25] P. Rigaux, M. Scholl, and A. Voisard. Spatial databases: with application to GIS. Morgan Kaufmann, 2001.
[26] H. Samet. The quadtree and related hierarchical data structures. ACM Computing Surveys (CSUR), 16(2):187--260, 1984.
[27] A. Silberstein, B. F. Cooper, U. Srivastava, E. Vee, R. Yerneni, and R. Ramakrishnan. Efficient bulk insertion into a distributed ordered table. In Proceedings of the 2008 ACM SIGMOD international conference on Management of data, pages 765--778. ACM, 2008.
[28] R. Sumbaly, J. Kreps, L. Gao, A. Feinberg, C. Soman, and S. Shah. Serving large-scale batch computed data with project voldemort. In Proceedings of the 10th USENIX conference on File and Storage Technologies, pages 18--18. USENIX Association, 2012.
Index Terms
- Gloria: a batch friendly indexing and storage framework