skip to main content
10.1145/2996913.2997013acmotherconferencesArticle/Chapter ViewAbstractPublication PagesgisConference Proceedingsconference-collections
research-article
Open Access

Gloria: a batch friendly indexing and storage framework

Published:31 October 2016Publication History

ABSTRACT

Indexing and delivering spatial data to a massive user base composed of over a billion devices around the world stretches the limits of traditional infrastructure and operational tools. For instance, offline bulk indexing and loading fall short of viable solutions when it comes to data at scale; Integration with distributed systems such as Apache Hadoop© or Spark© is sparse, while data loading is often performed in a sub-optimal fashion by relying on intermediate file formats.

We present in this paper an approach toward a hybrid on- line/offline indexing framework called Gloria that has been running in production settings for the past year at over 350k requests per seconds with lookup latencies under 5μs. The resulting output is an in-memory key-value store and we show that by leveraging higher level MapReduce [7] constructs as defined in FlumeJava [5], Gloria can achieve large scale key-value offline indexing in a fraction of the time required by traditional datastores while maintaining similar operational performance. Gloria also provides a spatial layer based on improvements to pointer-less quadtrees [12] and locational identifiers we call shift key that reduces the nearest neighbor problem in spatial data to simple key-value lookups. Shift keys have shown to outperform well established solutions such as Google S2 with locational key operations.

References

  1. A. Aji, F. Wang, H. Vo, R. Lee, Q. Liu, X. Zhang, and J. Saltz. Hadoop gis: a high performance spatial data warehousing system over mapreduce. Proceedings of the VLDB Endowment, 6(11):1009--1020, 2013. Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. P. Bagwell. Ideal hash trees. Technical report, 2001.Google ScholarGoogle Scholar
  3. A. Barbuzzi, P. Michiardi, E. Biersack, and G. Boggia. Parallel bulk insertion for large-scale analytics applications. In Proceedings of the 4th International Workshop on Large Scale Distributed Systems and Middleware, pages 27--31. ACM, 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. B. H. Bloom. Space/time trade-offs in hash coding with allowable errors. Communications of the ACM, 13(7):422--426, 1970. Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. C. Chambers, A. Raniwala, F. Perry, S. Adams, R. Henry, R. Bradshaw, and Nathan. Flumejava: Easy, efficient data-parallel pipelines. pages 363--375, 2010. URL: http://dl.acm.org/citation.cfm?id=1806638.Google ScholarGoogle Scholar
  6. A. Crunch. https://crunch.apache.org, 2016. [online, accessed 20-April-2016].Google ScholarGoogle Scholar
  7. J. Dean and S. Ghemawat. Mapreduce: Simplified data processing on large clusters. In Proceedings of the 6th Conference on Symposium on Opearting Systems Design & Implementation - Volume 6, OSDI'04, pages 10--10, Berkeley, CA, USA, 2004. USENIX Association. URL: http://dl.acm.org/citation.cfm?id=1251254.1251264.Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. G. DeCandia, D. Hastorun, M. Jampani, G. Kakulapati, A. Lakshman, A. Pilchin, S. Sivasubramanian, P. Vosshall, and W. Vogels. Dynamo: amazon's highly available key-value store. In ACM SIGOPS Operating Systems Review, volume 41, pages 205--220. ACM, 2007. Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. A. Eldawy and M. F. Mokbel. Spatialhadoop: A mapreduce framework for spatial data. In Data Engineering (ICDE), 2015 IEEE 31st International Conference on, pages 1352--1363. IEEE, 2015. Google ScholarGoogle ScholarCross RefCross Ref
  10. L. Fan, P. Cao, J. Almeida, and A. Z. Broder. Summary cache: a scalable wide-area web cache sharing protocol. IEEE/ACM Transactions on Networking (TON), 8(3):281--293, 2000.Google ScholarGoogle Scholar
  11. R. A. Finkel and J. L. Bentley. Quad trees a data structure for retrieval on composite keys. Acta informatica, 4(1):1--9, 1974. Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. I. Gargantini. An effective way to represent quadtrees. Commun. ACM, 25(12):905--910, Dec. 1982. URL: http://doi.acm.org/10.1145/358728.358741, doi:10.1145/358728.358741. Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. G. H. Gonnet. Expected length of the longest probe sequence in hash code searching. J. ACM, 28(2):289--304, Apr. 1981. URL: http://doi.acm.org/10.1145/322248.322254, doi:10.1145/322248.322254. Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. D. Guo, J. Wu, H. Chen, Y. Yuan, and X. Luo. The dynamic bloom filters. Knowledge and Data Engineering, IEEE Transactions on, 22(1):120--133, 2010.Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. P. Gupta, A. Wildani, E. L. Miller, D. Rosenthal, I. F. Adams, C. Strong, and A. Hospodor. An economic perspective of disk vs. flash media in archival storage. In Modelling, Analysis & Simulation of Computer and Telecommunication Systems (MASCOTS), 2014 IEEE 22nd International Symposium on, pages 249--254. IEEE, 2014.Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. A. Hadoop. https://hadoop.apache.org, 2016. [online, accessed 20-April-2016].Google ScholarGoogle Scholar
  17. M. Herlihy, N. Shavit, and M. Tzafrir. Hopscotch hashing. In Distributed Computing, pages 350--364. Springer, 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  18. G. R. Hjaltason and H. Samet. Speeding up construction of pmr quadtree-based spatial indexes. The VLDB Journal-The International Journal on Very Large Data Bases, 11(2): 109--137, 2002. Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. S. L. Horowitz and T. Pavlidis. Picture segmentation by a tree traversal algorithm. Journal of the ACM (JACM), 23(2):368--388, 1976. Google ScholarGoogle ScholarDigital LibraryDigital Library
  20. G. Inc. https://github.com/sparsehash/sparsehash, 2016. [online, accessed 20-April-2016].Google ScholarGoogle Scholar
  21. M. Mitzenmacher and E. Upfal. Probability and computing: Randomized algorithms and probabilistic analysis. Cambridge University Press, 2005. Google ScholarGoogle ScholarDigital LibraryDigital Library
  22. Oracle. https://docs.oracle.com/javase/7/docs/api/java/util/IdentityHashMap.html, 2016. [online, accessed 20-April-2016].Google ScholarGoogle Scholar
  23. R. Pagh and F. F. Rodler. Cuckoo hashing. Journal of Algorithms, 51(2):122--144, 2004. Google ScholarGoogle ScholarDigital LibraryDigital Library
  24. Y. Perl, A. Itai, and H. Avni. Interpolation search-a log log n search. Communications of the ACM, 21(7):550--553, 1978. Google ScholarGoogle ScholarDigital LibraryDigital Library
  25. P. Rigaux, M. Scholl, and A. Voisard. Spatial databases: with application to GIS. Morgan Kaufmann, 2001.Google ScholarGoogle Scholar
  26. H. Samet. The quadtree and related hierarchical data structures. ACM Computing Surveys (CSUR), 16(2):187--260, 1984. Google ScholarGoogle ScholarDigital LibraryDigital Library
  27. A. Silberstein, B. F. Cooper, U. Srivastava, E. Vee, R. Yerneni, and R. Ramakrishnan. Efficient bulk insertion into a distributed ordered table. In Proceedings of the 2008 ACM SIGMOD international conference on Management of data, pages 765--778. ACM, 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  28. R. Sumbaly, J. Kreps, L. Gao, A. Feinberg, C. Soman, and S. Shah. Serving large-scale batch computed data with project voldemort. In Proceedings of the 10th USENIX conference on File and Storage Technologies, pages 18--18. USENIX Association, 2012.Google ScholarGoogle ScholarDigital LibraryDigital Library

Index Terms

  1. Gloria: a batch friendly indexing and storage framework

        Recommendations

        Comments

        Login options

        Check if you have access through your login credentials or your institution to get full access on this article.

        Sign in
        • Published in

          cover image ACM Other conferences
          SIGSPACIAL '16: Proceedings of the 24th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems
          October 2016
          649 pages
          ISBN:9781450345897
          DOI:10.1145/2996913

          Copyright © 2016 Owner/Author

          Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

          Publisher

          Association for Computing Machinery

          New York, NY, United States

          Publication History

          • Published: 31 October 2016

          Check for updates

          Qualifiers

          • research-article

          Acceptance Rates

          SIGSPACIAL '16 Paper Acceptance Rate40of216submissions,19%Overall Acceptance Rate220of1,116submissions,20%

        PDF Format

        View or Download as a PDF file.

        PDF

        eReader

        View online with eReader.

        eReader