Small 4store How-To

From PaskvilWiki

Latest revision as of 21:53, 7 September 2012

The 4store RDF storage is "an efficient, scalable and stable RDF database".

Even though its creators at Garlik now use the newer 5store, the project is still being developed, and honestly, it became even more interesting with v1.1.5, especially since that version adds support for ORDER BY together with GROUP BY (e.g. ordering by an average), which previous versions did not support.

It's recommended that you download the release tarball rather than the Git snapshot. Then run:

./configure --prefix /some/folder/ --with-storage-path=/some/folder/4store --with-config-file=/some/folder/4store.conf CFLAGS=-O2 CPPFLAGS=-O3
make -j8
make install

Of course, you can leave the folders to their defaults if you so choose.

Now, let's start 4store:

# create the KB called "kbname"
# do this only ONCE! this call destroys any previous data in KB
4s-backend-setup [kbname]

# start the backend to support the KB
4s-backend [kbname]

# start the HTTP SPARQL endpoint using one of these:
4s-httpd [kbname]                           # start endpoint listening on port 8080
4s-httpd -p [port] [kbname]                 # -"- on given port
4s-httpd -H [host] -p [port] [kbname]       # -"- on given port and host (e.g. 127.0.0.1 will limit access to localhost)

And import some data using:

curl -T data.ttl 'http://localhost:[port]/data/data.ttl'

Then go to http://localhost:[port]/test/ and test some SPARQL queries!

Python to 4store

I recommend using the Requests library, not just for talking to 4store but in general. It makes urllib2 look like it's written using runes.

Also, while PUT works fine from Requests, you need to load the whole file into memory first and then PUT it to 4store, because 4store requires the data length (when you PUT a file with Requests directly from a file object, the length is not provided). For this reason I prefer to call curl via subprocess: it saves the process's memory and is in fact faster. Alternatively, you can use POST to add statements to the KB bit by bit.
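
For the POST route, here is a minimal sketch. The form field names (data, graph, mime-type) and the /data/ path are how I recall the 4store HTTP interface, so treat them as assumptions and check them against the 4store documentation; the helper names and the example.org URIs are made up for illustration.

```python
def build_append_payload(turtle, graph, mime_type="application/x-turtle"):
    """Form fields for appending statements to a graph (field names are assumed)."""
    return {"data": turtle, "graph": graph, "mime-type": mime_type}


def append_statements(host, turtle, graph):
    """POST a small chunk of Turtle to host + 'data/' and return the HTTP status."""
    import requests  # imported here so the payload helper works without Requests
    r = requests.post(host + "data/", data=build_append_payload(turtle, graph))
    return r.status_code
```

A call would look like append_statements("http://localhost:8000/", '<http://example.org/s> <http://example.org/p> "o" .', "http://example.org/graph"). Since each call carries its own small body, nothing large has to be held in memory.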

The following is a simple example that loads the example.ttl file into 4store and then queries it for the first 10 "rows":

#! /usr/bin/python
import requests
import subprocess

# SPARQL endpoint; use the port your 4s-httpd listens on (8080 unless you passed -p)
host = "http://localhost:8000/"

subprocess.call(["curl", "-T", "example.ttl", "-H", "Content-Type: text/turtle", host + "data/example.ttl"])

query = "SELECT ?s ?p ?o WHERE { ?s ?p ?o . } LIMIT 10"
data = { "query": query, "output": "text" }

r = requests.post(host + "sparql/", data=data)
if r.status_code != requests.codes.ok:   # something went wrong
    # 'return' is not allowed at module level, so exit the script instead
    raise SystemExit("query failed with HTTP status %d" % r.status_code)

# print the results; for output=text we get TSV (tab separated values)
# the first line of the output holds the names of the variables in columns, thus the [1:]
for line in r.text.split("\n")[1:]:
    if line != "":
        print(line.split("\t"))
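
If you'd rather get the rows keyed by variable name, the TSV handling above can be wrapped into a small helper. This is only a sketch: parse_tsv is a name I made up, and the sample text is invented to mimic the header-plus-rows layout described above.

```python
def parse_tsv(text):
    """Split TSV output into (variable names, list of row dicts)."""
    lines = [line for line in text.split("\n") if line != ""]
    if not lines:
        return [], []
    names = lines[0].split("\t")          # first line: column (variable) names
    rows = [dict(zip(names, line.split("\t"))) for line in lines[1:]]
    return names, rows


# invented sample response: one header line and one result row
sample = "?s\t?p\t?o\n<http://example.org/a>\t<http://example.org/b>\t\"c\"\n"
names, rows = parse_tsv(sample)
print(names)
print(rows[0]["?o"])
```

In the script above you would call parse_tsv(r.text) instead of looping over the lines by hand.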

Turtle File Format

Here's a simple Turtle format printer. IMO, Python's generator functions are an ideal solution for this kind of task: you yield each generated triple without worrying about what it's used for (printed to a file, directly POSTed, dumped to a DB, ...), which keeps these obviously orthogonal tasks separate in the code as well.

# this is the statements generator; do whatever you need to
# generate the data to store, and just 'yield' them
def gen_statement():
    # placeholder data; replace it with some loop over DB data, generator, ...
    # the subject, predicate, and object may be numbers, URIs or strings
    rows = [("http://example.org/s", "http://example.org/p", 42)]
    for subject, predicate, obj in rows:
        yield subject, predicate, obj

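def print_triplet_item(i):
    tname = type(i).__name__
    if tname == 'int' or tname == "float":
        return str(i)
    if tname == 'str':
        if i.startswith("http://"):
            # URIs are written in angle brackets
            return '<' + i + '>'
        # plain strings become quoted literals; escape backslashes and quotes
        return '"' + i.replace('\\', '\\\\').replace('"', '\\"') + '"'
    # a bare string cannot be raised, so raise a real exception
    raise TypeError("invalid type provided to print_triplet_item()")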

# example - printer of Turtle files, using the condensed markup where possible
def prep_for_load():
    with open("example.ttl", "wt") as f:
        gen = gen_statement()
        try:
            ps, pp, po = s, p, o = next(gen)
        except StopIteration:
            return   # the generator produced nothing, so there is nothing to write
        f.write("%s %s %s "  % (print_triplet_item(s), print_triplet_item(p), print_triplet_item(o)))
        for s, p, o in gen:
            if ps == s and pp == p:
                # same subject and predicate: extend the object list
                f.write(", %s " % print_triplet_item(o))
            elif ps == s:
                # same subject only: start a new predicate for it
                f.write(";\n    %s %s "  % (print_triplet_item(p), print_triplet_item(o)))
            else:
                # new subject: close the previous statement
                f.write(".\n%s %s %s "  % (print_triplet_item(s), print_triplet_item(p), print_triplet_item(o)))
            ps, pp, po = s, p, o
        f.write(".\n")
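
To see what the condensed markup looks like, here is a self-contained sketch of the same grouping idea that builds a string instead of writing a file. The names format_item and to_turtle and the example.org data are made up for illustration; it is a variant of the printer above, not a replacement for it.

```python
def format_item(item):
    """Render one subject/predicate/object: numbers bare, URIs in <>, strings quoted."""
    if isinstance(item, (int, float)):
        return str(item)
    if isinstance(item, str):
        if item.startswith("http://"):
            return "<" + item + ">"
        return '"' + item + '"'
    raise TypeError("unsupported item type: %r" % type(item))


def to_turtle(statements):
    """Join (s, p, o) tuples into condensed Turtle, grouping adjacent repeats."""
    parts = []
    prev_s = prev_p = None
    for s, p, o in statements:
        if parts and s == prev_s and p == prev_p:
            parts.append(", " + format_item(o))            # same subject and predicate
        elif parts and s == prev_s:
            parts.append(" ;\n    %s %s" % (format_item(p), format_item(o)))
        else:
            if parts:
                parts.append(" .\n")                       # close the previous subject
            parts.append("%s %s %s" % (format_item(s), format_item(p), format_item(o)))
        prev_s, prev_p = s, p
    if parts:
        parts.append(" .\n")
    return "".join(parts)


sample = [
    ("http://example.org/alice", "http://example.org/knows", "http://example.org/bob"),
    ("http://example.org/alice", "http://example.org/knows", "http://example.org/carol"),
    ("http://example.org/alice", "http://example.org/age", 30),
    ("http://example.org/bob", "http://example.org/name", "Bob"),
]
print(to_turtle(sample))
```

Note that the grouping only works on adjacent statements, so the generator should yield statements sorted by subject and then by predicate to get the most compact output.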