ABSTRACT
Computing power has now become abundant with multi-core machines, grids and clouds, but it remains a challenge to harness the available power and move towards gracefully handling web-scale datasets. Several researchers have used automatically distributed computing frameworks, notably Hadoop and Spark, for processing multimedia material, but mostly using small collections on small clusters. In this paper, we describe the engineering process for a prototype of a (near) web-scale multimedia service using the Spark framework running on the AWS cloud service. We present experimental results using up to 43 billion SIFT feature vectors from the public YFCC 100M collection, making this the largest high-dimensional feature vector collection reported in the literature. The design of the prototype and performance results demonstrate both the flexibility and scalability of the Spark framework for implementing multimedia services.
Index Terms
- Towards Engineering a Web-Scale Multimedia Service: A Case Study Using Spark