Small SPARQL, RDQL, etc. Cheat Sheet

Lacking a better place for them, I'll put here some of my notes on SPARQL, RDQL, graph databases, and semantic-web-related topics in general... This will probably branch out into several pages in the future, but for now it's just a small mess.

Introduction

I'm using Redland 1.0.13, Raptor 2.0.4, and Rasqal 0.9.26 as the reference implementations of SPARQL 1.0, SPARQL 1.1, and RDQL.

Most of the timing and optimization hints presented here are derived from experiments with Redland (and, to a lesser extent, with 4store).

Basic Observations

Optimize WHERE's

The main rule of thumb I've observed in many systems: try to guess which statement of the WHERE clause restricts the set of triples the most, and order the statements in increasing order of generality (most restrictive first).

For example, let's find all items that "user X" bought that are blue. Let's presume that there are many more blue items in the DB than items that "user X" bought.

Then the query (get all things that "user X" bought, that are blue):

SELECT ?thing WHERE { :userX :bought ?thing . ?thing :color "blue" . }

will typically run (much) faster than (get all things that are blue, that "user X" bought):

SELECT ?thing WHERE { ?thing :color "blue" . :userX :bought ?thing . }

Note that the result set is identical, but the latter query first takes all the blue things and picks those bought by "user X", while the former takes the small set of bought items and picks just the blue ones.

In general - graph databases are incredibly powerful tools, but it's up to you to make them smart!

Loading vs Insertion

In general, loading a model from a file is faster than inserting triples one by one from code. Especially if the model first loads the data and then indexes it, the gain can be significant for large(r) amounts of data.

The gain is storage and application specific - e.g. the Redland library loads a ~100K-statement model 5 times faster using "hashes" storage than when adding statements one by one, but the difference becomes negligible using MySQL storage.
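
For reference, here is a minimal, untested C sketch of the two approaches using the Redland librdf API; the file name, example.org URIs, and the "hashes" options string are placeholders to adapt to your own setup:

#include <stdio.h>
#include <redland.h>

int main(void)
{
    librdf_world *world = librdf_new_world();
    librdf_world_open(world);

    /* in-memory "hashes" storage; use a different storage name/options for BDB or MySQL */
    librdf_storage *storage = librdf_new_storage(world, "hashes", "test",
                                                 "hash-type='memory'");
    librdf_model *model = librdf_new_model(world, storage, NULL);

    /* variant 1: bulk load - parse a whole file straight into the model */
    librdf_parser *parser = librdf_new_parser(world, "turtle", NULL, NULL);
    librdf_uri *uri = librdf_new_uri(world, (const unsigned char *)"file:///tmp/data.ttl");
    librdf_parser_parse_into_model(parser, uri, NULL, model);

    /* variant 2: add statements one by one - fine for a few triples,
       but noticeably slower than bulk loading for large data sets */
    librdf_statement *st = librdf_new_statement_from_nodes(world,
        librdf_new_node_from_uri_string(world, (const unsigned char *)"http://example.org/userX"),
        librdf_new_node_from_uri_string(world, (const unsigned char *)"http://example.org/bought"),
        librdf_new_node_from_uri_string(world, (const unsigned char *)"http://example.org/item42"));
    librdf_model_add_statement(model, st);
    librdf_free_statement(st);

    printf("model now holds %d statements\n", librdf_model_size(model));

    librdf_free_uri(uri);
    librdf_free_parser(parser);
    librdf_free_model(model);
    librdf_free_storage(storage);
    librdf_free_world(world);
    return 0;
}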

Query Complexity Factors

Please note that while a persistent storage backend inherently makes data retrieval slower, the system may also take advantage of the backend's built-in capabilities.

In my tests, "hashes" storage was 10-100 times faster than "mysql" storage for "trivial" queries requiring a single step or a single comparison (but both finished in under 10 ms). On the other hand, "hashes" storage explodes on queries requiring more steps through the graph, using joins/intersections, or including boolean operations - e.g. selecting the number of items that have 2+ attributes in common is 100+ times faster using "mysql" storage (where "hashes" queries already take seconds on small graphs).
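
To make the difference concrete, here is a hypothetical pair of queries (:item42, :color, and :size are made-up names). The first is a single-lookup query of the "trivial" kind; the second has to join across the graph to find items sharing both attributes with :item42, which is where an SQL-backed store tends to win:

SELECT ?color WHERE { :item42 :color ?color . }

SELECT ?other WHERE {
  :item42 :color ?c . ?other :color ?c .
  :item42 :size ?s . ?other :size ?s .
  FILTER (?other != :item42)
}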

Avoid FILTER and Other Comparisons where Speed Matters

I know this sounds trivial, but you'll often filter the input data just "to make it more precise" or "to be more accurate".

But if you can avoid the filtering, or can afford to neglect the error that skipping it introduces, do it.

E.g. if you want the average value of an attribute for all but one user, rather go for "for all", and either remove that user's contribution in a post-processing step or neglect its influence. The speed-up may be quite significant (5-100 times).
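
As a sketch of this (with made-up :score and :userX names): instead of the filtered query

SELECT ?value WHERE { ?user :score ?value . FILTER (?user != :userX) }

run the unrestricted one

SELECT ?value WHERE { ?user :score ?value . }

and either subtract :userX's contribution from the average in post-processing, or simply ignore it when the number of users is large.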

Turtle and SPARQL

SPARQL and Turtle share part of their syntax, and personally I prefer Turtle over other RDF syntaxes especially for this reason.

The following groupings of triples make the data easier to read, and might also give good hints to query parsers or ease the work of RDF importers. The same shorthand works in SPARQL too - see the example at the end of this section.

Grouping by same subject and predicate

:a :b :c , :d , :e .

is equivalent to

:a :b :c .
:a :b :d .
:a :b :e .

Grouping by same subject

:a :b :c ;
   :d :e .

is equivalent to

:a :b :c .
:a :d :e .
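
Since SPARQL shares this syntax, the same shorthand can be used inside a WHERE clause as well - for instance (with made-up :color and :size properties):

SELECT ?thing WHERE { ?thing :color "blue" ; :size "large" . }

is equivalent to

SELECT ?thing WHERE { ?thing :color "blue" . ?thing :size "large" . }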