ABSTRACT
With the proliferation of the RDF data format, engines for RDF query processing are faced with very large graphs that contain hundreds of millions of RDF triples. This paper addresses the resulting scalability problems. Recent prior work along these lines has focused on indexing and other physical-design issues. The current paper focuses on join processing, as the fine-grained and schema-relaxed use of RDF often entails star- and chain-shaped join queries with many input streams from index scans.
We present two contributions for scalable join processing. First, we develop very light-weight methods for sideways information passing between separate joins at query run-time, to provide highly effective filters on the input streams of joins. Second, we improve previously proposed algorithms for join-order optimization by more accurate selectivity estimations for very large RDF graphs. Experimental studies with several RDF datasets, including the UniProt collection, demonstrate the performance gains of our approach, outperforming the previously fastest systems by more than an order of magnitude.
- D. J. Abadi et al. Scalable semantic web data management using vertical partitioning. In VLDB, 2007. Google ScholarDigital Library
- S. Auer et al. Dbpedia: A nucleus for a web of open data. In ISWC/ASWC, 2007. Google ScholarDigital Library
- R. Avnur and J. M. Hellerstein. Eddies: Continuously adaptive query processing. In SIGMOD, 2000. Google ScholarDigital Library
- P. A. Bernstein and D.-M. W. Chiu. Using semi-joins to solve relational queries. J. ACM, 28(1):25--40, 1981. Google ScholarDigital Library
- Bio2RDF, semantic web atlas of postgenomic knowledge about human and mouse. http://www.bio2rdf.org/.Google Scholar
- U. Bojars et al. Interlinking the social web with semantics. IEEE Intelligent Systems, 23(3), 2008. Google ScholarDigital Library
- J. Broekstra et al. Sesame: An architecture for storing and querying rdf data and schema information. In Spinning the Semantic Web, 2003. Google ScholarDigital Library
- E. I. Chong et al. An efficient sql-based rdf querying scheme. In VLDB, 2005. Google ScholarDigital Library
- Dbpedia 3.2 downloads. http://wiki.dbpedia.org/Downloads32.Google Scholar
- D. DeHaan and F. W. Tompa. Optimal top-down join enumeration. In SIGMOD, 2007. Google ScholarDigital Library
- A. Deshpande et al. Adaptive query processing. Foundations and Trends in Databases, 1(1), 2007. Google ScholarDigital Library
- Open directory RDF dump. http://rdf.dmoz.org/.Google Scholar
- Metaweb technologies: Freebase data dumps. http://download.freebase.com/datadumps/.Google Scholar
- C. A. Galindo-Legaria, A. Pellenkoft, and M. L. Kersten. Fast, randomized join-order selection - why use transformations? In VLDB, 1994. Google ScholarDigital Library
- G. Graefe. Query evaluation techniques for large databases. ACM Comput. Surv., 25(2):73--170, 1993. Google ScholarDigital Library
- A. Y. Halevy, M. J. Franklin, and D. Maier. Principles of dataspace systems. In PODS, pages 1--9, 2006. Google ScholarDigital Library
- A. Harth et al. Yars2: A federated repository for querying graph structured data from the web. In ISWC/ASWC, 2007. Google ScholarDigital Library
- Z. G. Ives and N. E. Taylor. Sideways information passing for push-style query processing. In ICDE, 2008. Google ScholarDigital Library
- Jena: a Semantic Web Framework for Java. http://jena.sourceforge.net/.Google Scholar
- D. Kossmann. The state of the art in distributed query processing. ACM Comput. Surv., 32(4):422--469, 2000. Google ScholarDigital Library
- A. Maduko et al. Estimating the cardinality of rdf graph patterns. In WWW, 2007. Google ScholarDigital Library
- G. Moerkotte and T. Neumann. Analysis of two existing and one new dynamic programming algorithm for the generation of optimal bushy join trees without cross products. In VLDB, 2006. Google ScholarDigital Library
- MonetDB. http://monetdb.cwi.nl/.Google Scholar
- I. S. Mumick and H. Pirahesh. Implementation of magic-sets in a relational database system. In SIGMOD, 1994. Google ScholarDigital Library
- T. Neumann and G. Weikum. Rdf-3x: a risc-style engine for rdf. PVLDB, 1(1):647--659, 2008. Google ScholarDigital Library
- OpenRDF. http://www.openrdf.org/index.jsp.Google Scholar
- PostgreSQL. http://www.postgresql.org/.Google Scholar
- RDF-3X. http://www.mpi-inf.mpg.de/~neumann/rdf3x.Google Scholar
- Semantic web challenge 2008. billion triples track. http://challenge.semanticweb.org/.Google Scholar
- P. Seshadri et al. Cost-based optimization for magic: Algebra and implementation. In SIGMOD, 1996. Google ScholarDigital Library
- L. Sidirourgos et al. Column-store support for rdf data management: not all swans are white. PVLDB, 1(2):1553--1563, 2008. Google ScholarDigital Library
- M. Steinbrunn et al. Bypassing joins in disjunctive queries. In VLDB, 1995. Google ScholarDigital Library
- K. Stocker et al. Integrating semi-join-reducers into state of the art query processors. In ICDE, 2001. Google ScholarDigital Library
- M. Stocker et al. Sparql basic graph pattern optimization using selectivity estimation. In WWW, 2008. Google ScholarDigital Library
- F. M. Suchanek, G. Kasneci, and G. Weikum. Yago: a core of semantic knowledge. In WWW, 2007. Google ScholarDigital Library
- O. Udrea, A. Pugliese, and V. S. Subrahmanian. Grin: A graph based rdf index. In AAAI, 2007. Google ScholarDigital Library
- Uniprot RDF. http://dev.isb-sib.ch/projects/uniprot-rdf/.Google Scholar
- C. Weiss, P. Karras, and A. Bernstein. Hexastore: sextuple indexing for semantic web data management. PVLDB, 1(1):1008--1019, 2008. Google ScholarDigital Library
- K. Wilkinson et al. Efficient rdf storage and retrieval in jena2. In SWDB, 2003.Google Scholar
- F. Wu and D. S. Weld. Automatically refining the wikipedia infobox ontology. In WWW, 2008. Google ScholarDigital Library
Index Terms
- Scalable join processing on very large RDF graphs
Recommendations
The RDF-3X engine for scalable management of RDF data
RDF is a data model for schema-free structured information that is gaining momentum in the context of Semantic-Web data, life sciences, and also Web 2.0 platforms. The "pay-as-you-go" nature of RDF and the flexible pattern-matching capabilities of its ...
Combining Joint and Semi-Join Operations for Distributed Query Processing
The application of a combination of join and semi-join operations to minimize the amount of data transmission required for distributed query processing is discussed. Specifically, two important concepts that occur with the use of join operations as ...
Comments