ABSTRACT
In this paper we investigate the scalable processing of complex SPARQL queries on very large RDF datasets. As the underlying platform we use Apache Hadoop, an open-source implementation of Google's MapReduce for massively parallel computation on a computer cluster. We introduce PigSPARQL, a system that processes complex SPARQL queries on a MapReduce cluster. To this end, SPARQL queries are translated into Pig Latin, a data analysis language developed by Yahoo! Research; Pig Latin programs are in turn executed as a series of MapReduce jobs on a Hadoop cluster. We evaluate PigSPARQL using SP2Bench, a SPARQL-specific performance benchmark, and demonstrate that it enables scalable execution of SPARQL queries on Hadoop without any additional programming effort.
Index Terms
- PigSPARQL: mapping SPARQL to Pig Latin