TopDown Querying of Linked Data
Recently, I was reading an article written for this year’s ISWC, which takes place in Shanghai. It looks really cool by the way - I’m trying to follow it on Twitter, but it seems they have trouble reaching it from behind the “Great Firewall”. The article, “Linked Data Query Processing Strategies”, presents three strategies for retrieving data from Linked Data as a whole, as opposed to a single source.
Querying Linked Data is quite a challenge, in the sense that it is one of the biggest distributed databases available in terms of volume. Its content is highly dynamic, and new triples are added to that massive graph every second.
In the end, querying Linked Data isn’t much different from querying the Internet. They are of the same nature: both are inconstant, changing, volatile.
One difference is that when we Google something on the Internet, we don’t expect to see all the results, only those that are relevant to our search. On Linked Data, the story changes a bit.
Take SPARQL, the official standard for querying RDF graphs and one of the pillars of the Semantic Web movement: the queries we write in it are of a very different nature from those we type into Google. On Google, we use natural language containing keywords. A keyword can mean different things depending on the context of the whole sentence. For example, “glass” can refer to a container, to optical glass or to the material.
SPARQL, on the other hand, is a formal language without ambiguity. To perform a query, we define a partial graph made of constants and variables, which the engine tries to match against the triples of the target source. If a match occurs, we have a result. Every result is a match, so there is no easy way to say that one result is more relevant than another.
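To make the matching idea concrete, here is a minimal Ruby sketch of a single triple pattern matched against a handful of triples. It is only an illustration, not how a real SPARQL engine is implemented: the data, the prefixed names and the helpers `match_triple` and `solutions` are all made up for this example, with symbols standing in for variables and strings for constants.

```ruby
# Toy data (hypothetical): real sources hold RDF triples with full IRIs.
TRIPLES = [
  ["dbpedia:Shanghai", "rdf:type",    "dbo:City"],
  ["dbpedia:Shanghai", "dbo:country", "dbpedia:China"],
  ["dbpedia:Paris",    "rdf:type",    "dbo:City"]
].freeze

# Returns a hash of variable bindings if the pattern matches the triple,
# or nil if it does not. Symbols are variables, strings are constants.
def match_triple(pattern, triple)
  bindings = {}
  pattern.zip(triple).each do |term, value|
    if term.is_a?(Symbol)
      # A variable: bind it, or reject if it is already bound to another value.
      return nil if bindings.key?(term) && bindings[term] != value
      bindings[term] = value
    elsif term != value
      # A constant that does not match: this triple is not a result.
      return nil
    end
  end
  bindings
end

# Every triple that matches is a result; there is no notion of ranking.
def solutions(pattern, triples)
  triples.map { |triple| match_triple(pattern, triple) }.compact
end

cities = solutions([:city, "rdf:type", "dbo:City"], TRIPLES)
```

Here `cities` holds one binding per matching triple, and nothing in it says that Shanghai is a "better" answer than Paris. That is the point made above: a result either matches or it doesn't.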
Another difference is the way Google accesses its results versus the way we do it on the Semantic Web. This difference is a good illustration of the paradigm shift from relational databases to distributed graph databases, although Google’s database is most likely distributed too.
In a relational database, you know the boundaries of your database. You control what’s inside, even if it changes a lot. The information is indexed according to the criteria you have defined.
In a distributed graph database where anybody can add their own dataset, you neither own nor control anything. You don’t know the limits of the database and you can only estimate its size… which is why it’s hard to perform queries on Linked Data.
One way to do so is to assume that the query engine holds a list of all the sources available on Linked Data. When a query is made, the engine sends it to every source, each in its own thread, and aggregates the results of all the queries. This approach is naive, because having a complete list of the sources available on Linked Data is like knowing the address of every website. That aside, the strategy yields good results in terms of performance. It is called the TopDown method because it goes from the “top” (the list of sources) down to the sources.
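The fan-out described above can be sketched in a few lines of plain Ruby. This is a simplified illustration under stated assumptions, not the SparqlTransmission API: the source names, the `top_down_query` helper and the result format are hypothetical, and each source is stubbed with a lambda so the sketch runs offline. A real engine would send the query to each endpoint over HTTP.

```ruby
# Hypothetical source list: each lambda stands in for a SPARQL endpoint.
SOURCES = {
  "source_a" => ->(_query) { [{ "city" => "Shanghai" }] },
  "source_b" => ->(_query) { [{ "city" => "Paris" }, { "city" => "Shanghai" }] },
  "source_c" => ->(_query) { raise IOError, "endpoint unreachable" }
}.freeze

# Sends the same query to every known source, one thread per source,
# then merges the answers into a single list.
def top_down_query(query, sources)
  threads = sources.map do |name, endpoint|
    Thread.new do
      begin
        endpoint.call(query)
      rescue StandardError => e
        # One failing source should not sink the whole query.
        warn "#{name} failed: #{e.message}"
        []
      end
    end
  end

  # Thread#value waits for the thread and returns its block's result.
  # Dropping duplicates is one simple way to aggregate, assumed here.
  threads.flat_map(&:value).uniq
end

query = "SELECT ?city WHERE { ?city a <http://dbpedia.org/ontology/City> }"
results = top_down_query(query, SOURCES)
```

Because every source is queried in parallel, the total time is roughly that of the slowest source rather than the sum of all of them, which is probably where the good performance of this strategy comes from. The weak point remains the source list itself: the engine only ever sees what is in `SOURCES`.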
I investigated this strategy and worked on an implementation of it. My experiment can be found on this demo website: http://pure-day-56.heroku.com/
For those interested, it’s built with SparqlTransmission, a Ruby library I’ve been working on, and Sinatra as the web framework.
At the time of this writing, DBpedia is the only entry in the list of available sources. I’ll add more as soon as I have the time.
