Talis Consultancy
World leading expertise in Linked Data and the Semantic Web

Barriers to ontology reuse

confusion © 2009 Tim Hodson

Recent work on the British Library’s bibliographic data model has given me some examples of identifying appropriate usage of properties in the Dublin Core and ISBD vocabularies. These were the vocabularies I was using, but you can generalise these examples too.

I should point out that this post is written from the point of view of a developer wanting to work out whether the use of a property or class found in a vocabulary is pertinent to a particular use case. I am assuming no prior knowledge of the debate surrounding the creation of these vocabularies, and realise that there will be many of you who can cite good reasons for decisions taken. I have also tried to keep my general dislike of record centric data models out of this post (I am not even going to mention that there may be some issues with the existence of ISBD properties in the first place).

As already eloquently demonstrated on the FOAF wiki, the dcterms:creator definition is somewhat inconsistent, leaving room for misinterpretation. The documentation appears to contradict itself, and suggests that both literals and resources are suitable as values for the property.

This leads to a mixture of implementations. In the British Library’s model, a decision was taken to treat the value as a resource. This makes the most sense for lots of reasons, as it allows our data to continue growing, rather than be stuck at a ‘literal’ dead end. We cannot say more things about a literal.
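A minimal sketch of the difference, using plain Python tuples as stand-ins for triples rather than a real RDF library. The example.org URIs, the `bornIn` property and the helper functions are all made up for illustration; only the dcterms:creator and foaf:name property URIs are real.

```python
DCT_CREATOR = "http://purl.org/dc/terms/creator"
FOAF_NAME = "http://xmlns.com/foaf/0.1/name"

# Hypothetical identifiers, used only for this illustration.
BOOK = "http://example.org/book/1"
AUTHOR = "http://example.org/person/42"
BORN_IN = "http://example.org/schema#bornIn"

# Creator as a literal: the string is the end of the line.
literal_model = {
    (BOOK, DCT_CREATOR, '"Jane Example"'),
}

# Creator as a resource: the URI can be the subject of further triples.
resource_model = {
    (BOOK, DCT_CREATOR, AUTHOR),
    (AUTHOR, FOAF_NAME, '"Jane Example"'),
    (AUTHOR, BORN_IN, "http://example.org/place/leeds"),
}


def creators(graph, work):
    """Return the values of dcterms:creator for the given work."""
    return [o for s, p, o in graph if s == work and p == DCT_CREATOR]


def describe(graph, node):
    """Return every (predicate, object) pair stated about node."""
    return sorted((p, o) for s, p, o in graph if s == node)
```

Calling `describe` on the creator found in `literal_model` returns nothing, because nothing can be said about the literal. In `resource_model` the same call returns the name and birthplace, and more statements can be added later without changing the link from the book.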

From the Dublin Core Terms example, we learn that when defining a property (or class), we should use clear definitions that use unambiguous wording.

The ISBD element sets have committed these ‘sins’ which make it very hard to choose to work with such a vocabulary.

  • Fundamental problem: the HTTP URIs for the properties and classes do not resolve to anything. This means that I as a developer cannot look up any useful definition of the property. This may change in the future, but doesn’t help me now…
  • Fundamental problem: the names of the properties are alphanumeric codes. This is compounded by the first problem, which means I cannot look up any definition of the property behind the code.
  • What information I can find via a Google search leads me to a confusing metadata registry.

I am told that the ISBD property names are named with codes so as to render them language neutral. In the context of the semantic web, where language plays an important part in defining what a term means, it seems rather obtuse to hide that meaning behind a code.

Even if the property name itself is named in English, multiple labels can be given to that term in as many languages as necessary. Then, by making sure that the URI for the property dereferences to some useful data, anyone can easily find out what the term means by looking at whichever language they are comfortable with. There is also a possibility that a term in another language may have a subtly different meaning which should be expressed in that language. Again, use of multilingual labels can be used to make it clear what the differences are and why.
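As a rough sketch of how multilingual labels might be used by a client, the following picks a label according to the reader's language preferences. The property URI, the labels themselves and the `label_for` helper are hypothetical, and a plain dictionary stands in for label triples carrying language tags.

```python
# Hypothetical property URI and labels, for illustration only.
HAS_TITLE = "http://example.org/schema#hasTitle"

# Stand-in for label triples with language tags (en, fr, de).
labels = {
    HAS_TITLE: {
        "en": "has title",
        "fr": "a pour titre",
        "de": "hat Titel",
    }
}


def label_for(term, preferred_langs, default="en"):
    """Pick the first label matching the reader's language preferences.

    Falls back to the default language, and returns None if the term
    has no labels at all.
    """
    available = labels.get(term, {})
    for lang in preferred_langs:
        if lang in available:
            return available[lang]
    return available.get(default)
```

The property keeps one stable, readable URI, while a French reader asking for `["fr", "en"]` sees the French label and a reader whose language is missing falls back to English.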

My basic message is: don’t make it so hard for people to find out how to use your vocabulary or ontology. And if it is hard, you might ask yourself whether it makes sense to model the vocabulary or ontology in the way that you have.

Specific lessons to learn:

  • Make sure that your properties and classes are clearly defined and use unambiguous wording.
  • Use sensible and descriptive names for your properties and classes.
  • Use additional labels with appropriate language types to make your ontology multilingual.
  • Make sure that the URIs for your properties and classes resolve to the definitions of those properties and classes (use RDF).
  • Don’t hide your vocabularies in complex registries, just publish them as a document, or series of documents.

I am aware that I haven’t mentioned domains and ranges. These are used to add extra descriptive information about the types of thing to be found on the left and right of properties (the subject and the object). Sometimes this is a good thing, and sometimes it can add too many ‘restrictions’. They are not truly restrictions: if you use a property in a way that doesn’t fit its domain or range, you are asserting a fact that doesn’t make sense, but in RDF it is still a valid fact. I might deal with this issue in another post.
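To illustrate why domains and ranges are not really restrictions, here is a small sketch of the RDFS domain and range inference rules applied to tuples. The rdf:type, rdfs:domain and rdfs:range URIs are real; the example.org schema, the teapot and the `infer_types` helper are invented for this example.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
RDFS_DOMAIN = "http://www.w3.org/2000/01/rdf-schema#domain"
RDFS_RANGE = "http://www.w3.org/2000/01/rdf-schema#range"

EX = "http://example.org/schema#"  # hypothetical vocabulary
TEAPOT = "http://example.org/id/teapot"
TUESDAY = "http://example.org/id/tuesday"

# The vocabulary says: ex:author links a Book to a Person.
schema = {
    (EX + "author", RDFS_DOMAIN, EX + "Book"),
    (EX + "author", RDFS_RANGE, EX + "Person"),
}

# Somebody uses the property in a way that makes no sense.
data = {
    (TEAPOT, EX + "author", TUESDAY),
}


def infer_types(schema, data):
    """Apply the RDFS domain and range rules once; return new type triples."""
    domains, ranges = {}, {}
    for s, p, o in schema:
        if p == RDFS_DOMAIN:
            domains.setdefault(s, set()).add(o)
        elif p == RDFS_RANGE:
            ranges.setdefault(s, set()).add(o)

    inferred = set()
    for s, p, o in data:
        for cls in domains.get(p, ()):
            inferred.add((s, RDF_TYPE, cls))  # subject gets the domain class
        for cls in ranges.get(p, ()):
            inferred.add((o, RDF_TYPE, cls))  # object gets the range class
    return inferred
```

Nothing rejects the odd triple. Instead the teapot is inferred to be a Book and Tuesday a Person, which is valid RDF even though it is nonsense.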

If you want to know more, then why not try one of Talis Consulting’s training courses. We have an open course coming up in November, or we can run bespoke training tailored to your organisation’s needs.

Why resources in Linked Data are good.

One of the characteristics of creating a set of Linked Data is the ability, inherent in RDF, for merging of things which have the same URI. This allows us to say different things about something in different contexts, and have the statements attributed to the same thing.
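A minimal sketch of that merging behaviour, treating each source as a Python set of triples. The example.org URIs, the property names and the `about` helper are assumptions for illustration, and blank nodes (which need extra care when merging real RDF graphs) are ignored.

```python
EX = "http://example.org/schema#"       # hypothetical vocabulary
THING = "http://example.org/id/event/1"  # hypothetical resource

# Two independent sources that happen to use the same URI.
source_a = {
    (THING, EX + "label", '"Launch party"'),
}
source_b = {
    (THING, EX + "venue", "http://example.org/id/place/town-hall"),
}

# Merging is just the union of the two sets of triples.
merged = source_a | source_b


def about(graph, subject):
    """Return every (predicate, object) pair stated about subject."""
    return {(p, o) for s, p, o in graph if s == subject}
```

After the union, `about(merged, THING)` holds both the label and the venue, with no join keys or record matching needed, because both sources used the same URI.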

A thing sometimes overlooked when modelling a dataset is just how useful the merging characteristics of URIs can be.

Take the scenario “finding all the things that share a characteristic”. In a Relational Database model we would have a table containing a row about each thing, and a column describing an attribute of that thing. Each cell of the table would then hold a value. To find all rows which share a common characteristic, we would have to look through each value in the column and return the row if it matched what we were looking for. Indexes on tables in Relational Database Management Systems allow this sort of thing to happen quickly, but even so, doing some sort of value comparison lookup in a table with hundreds of millions of rows will still be slow.

How can we simplify the model to achieve the same thing in a Linked Data dataset?

Rather than modelling the data to contain literal values for nodes that hold data we would like to compare, we can use URIs to represent the thing that data describes. The use of URIs means that simple patterns in SPARQL queries can be used to find things that share characteristics.

An example:

In our first example data, things have literal values representing dates.
We want to find all things that happened on a certain day.

In this image we see how we might have modelled that in a RDBMS.

Table model

We would have to compare each value to the others in the column, and so in our SQL query we would do this:

SELECT * FROM things WHERE date = '2010-12-05';

In the next image, the same data is modelled (badly) in RDF as a table like structure.

Modelled as literals

Which leads us to build a SPARQL query that mimics the SQL query, and uses a FILTER to do a comparison on the date literals.

PREFIX ex: <http://example.com/schema#>
SELECT ?thing WHERE {
     ?thing a ex:Thing ;
            ex:date ?date .
    FILTER (?date = "2010-12-05")
}

FILTERs basically mean that we have to test each value returned, so in most triple stores FILTER queries are slower than simply matching against a portion of the graph.

This image shows the same data modelled using URIs to represent the days. Days are, after all, things in their own right, about which we can say more stuff.

Modelled as resources

In this data we can query for everything on one day really easily:

PREFIX ex: <http://example.com/schema#>
SELECT ?thing WHERE {
     ?thing ex:date <http://reference.data.gov.uk/id/day/2010-12-05> .
}

Because the dates described by reference.data.gov.uk use a very detailed model of how one date is related to another, we can do more complex queries. For example, there is a URI for each month, which is related to the days it contains, so we can find all things in a particular month with a query like:

PREFIX ex: <http://example.com/schema#>
PREFIX interval: <http://reference.data.gov.uk/def/intervals/>
SELECT ?thing WHERE {
     <http://reference.data.gov.uk/id/month/2010-12> interval:intervalContainsDay ?date .
     ?thing ex:date ?date .
}

These last two SPARQL queries are simply matching a portion of the graph and not having to use the more computationally expensive FILTER.
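The difference between the two query styles can be sketched with a toy in-memory store. This is only an illustration under simplifying assumptions: real triple stores use more elaborate indexes and optimisers, and the example.com identifiers and both helper functions are made up.

```python
from collections import defaultdict

EX_DATE = "http://example.com/schema#date"
DAY = "http://reference.data.gov.uk/id/day/2010-12-05"
OTHER_DAY = "http://reference.data.gov.uk/id/day/2010-12-06"

# Hypothetical things, each linked to a day resource.
triples = [
    ("http://example.com/id/a", EX_DATE, DAY),
    ("http://example.com/id/b", EX_DATE, OTHER_DAY),
    ("http://example.com/id/c", EX_DATE, DAY),
]

# Index from (predicate, object) to subjects, built once at load time.
by_predicate_object = defaultdict(set)
for s, p, o in triples:
    by_predicate_object[(p, o)].add(s)


def match_pattern(predicate, obj):
    """Graph pattern match: a single lookup in the index."""
    return by_predicate_object.get((predicate, obj), set())


def filter_scan(predicate, test):
    """FILTER-style match: run a test against every candidate value."""
    return {s for s, p, o in triples if p == predicate and test(o)}
```

Both `match_pattern(EX_DATE, DAY)` and `filter_scan(EX_DATE, lambda o: o == DAY)` return the same things, but the first is one dictionary lookup while the second has to visit every triple.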

Resources are your friends.