DOI: 10.1145/3083187.3083200
research-article

Towards Engineering a Web-Scale Multimedia Service: A Case Study Using Spark

Published: 20 June 2017

ABSTRACT

Computing power has now become abundant with multi-core machines, grids and clouds, but it remains a challenge to harness the available power and move towards gracefully handling web-scale datasets. Several researchers have used automatically distributed computing frameworks, notably Hadoop and Spark, for processing multimedia material, but mostly using small collections on small clusters. In this paper, we describe the engineering process for a prototype of a (near) web-scale multimedia service using the Spark framework running on the AWS cloud service. We present experimental results using up to 43 billion SIFT feature vectors from the public YFCC 100M collection, making this the largest high-dimensional feature vector collection reported in the literature. The design of the prototype and performance results demonstrate both the flexibility and scalability of the Spark framework for implementing multimedia services.

Published in

MMSys'17: Proceedings of the 8th ACM on Multimedia Systems Conference
June 2017
407 pages
ISBN: 9781450350020
DOI: 10.1145/3083187

Copyright © 2017 ACM


Publisher

Association for Computing Machinery, New York, NY, United States


Qualifiers

• research-article
• Research
• Refereed limited

Acceptance Rates

MMSys'17 Paper Acceptance Rate: 13 of 47 submissions, 28%
Overall Acceptance Rate: 176 of 530 submissions, 33%
