I noted in the first post in this mini series that the original “can you just add in links to the OS post code data” task threw up some surprises and extra issues that needed attention. In particular, the data behind the Research Funding Explorer used a very simple data model that assumed a single location for each participating business. If you ask the SPARQL service to DESCRIBE <http://business.data.gov.uk/id/company/1996687> you’ll see that this URI identifies BAE Systems, which has, among others, 9 rdfs:label properties and 12 addressLine1 properties, and that it is located in the counties of Bristol, Essex, Hants, Kent, Lancashire and Lancashires!
Oh boy, this is going to take some sorting out. And because the data for the multiple locations of BAE Systems is all mashed in together, we’re going to need to sort it out by hand, with reference to BAE’s Web site, to find out which address line 2 goes with which address line 1 and so on. This is where the organisation ontology really comes into its own. It models the concept of a single legal entity with a formal name and an informal one (using the well-known skos:prefLabel and skos:altLabel properties) and of things like a primary site and other sites (org:hasPrimarySite and org:hasSite respectively). Then you can say something about each of those sites, including giving its address – for which the ontology requires the use of VCard.
I am not alone in disliking VCard. The RDF encoding created by Harry Halpin et al. is a very good encoding of VCard itself, but that’s not to say it’s a good encoding of addresses! I raised a question on the Linked Open Data mailing list about this. Follow the thread if you like, but what comes out, as ever, is that existing implementation and re-use of vocabularies beats fitness for purpose every time. In other words, many people agree that VCard is not a good model for addresses (at least, for non-US addresses). However, it’s what is widely used and therefore it’s what everyone uses. The org ontology is right for modelling organisations so I wanted to use it, but I used the VCard ontology with some reluctance. Be that as it may, I ended up with a much better model of, for example, BAE Systems:
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix rs: <http://research.data.gov.uk/def/project/> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix v: <http://www.w3.org/2006/vcard/ns#> .
@prefix org: <http://www.w3.org/ns/org#> .
@prefix geo: <http://www.w3.org/2003/01/geo/wgs84_pos#> .
<http://business.data.gov.uk/id/company/1996687> a org:FormalOrganization;
    skos:altLabel "BAE Systems" ;
    skos:prefLabel "BAE Systems (Operations) Ltd" ;
    rdfs:seeAlso <http://opencorporates.com/companies/uk/1996687> ;
    org:identifier "25.40"^^<http://research.data.gov.uk/def/project/SICCode>,
        "2713"^^<http://research.data.gov.uk/def/project/ICBCode>,
        "30.30"^^<http://research.data.gov.uk/def/project/SICCode> ;
    org:hasSite <http://business.data.gov.uk/id/company/1996687/BB27LF> ;
    org:hasSite <http://business.data.gov.uk/id/company/1996687/BS347QW> ;
    org:hasSite <http://business.data.gov.uk/id/company/1996687/CM28HN> ;
    org:hasSite <http://business.data.gov.uk/id/company/1996687/PR41AX> ;
    org:hasSite <http://business.data.gov.uk/id/company/1996687/ME12XX> .

<http://business.data.gov.uk/id/company/1996687/BS347QW> a org:Site ;
    org:siteAddress <http://business.data.gov.uk/id/company/1996687/BS347QW/vcard> ;
    geo:lat "51.5265302" ;
    geo:long "-2.5622897" .

<http://business.data.gov.uk/id/company/1996687/BS347QW/vcard> a v:VCard ;
    v:fn "BAE Systems" ;
    v:geo <http://business.data.gov.uk/id/company/1996687/BS347QW/vcard/geo> ;
    v:adr <http://business.data.gov.uk/id/company/1996687/BS347QW/vcard/adr> .

<http://business.data.gov.uk/id/company/1996687/BS347QW/vcard/geo> a v:Location ;
    v:latitude "51.5265302" ;
    v:longitude "-2.5622897" .

<http://business.data.gov.uk/id/company/1996687/BS347QW/vcard/adr> a v:Address ;
    v:post-office-box "PO Box 5" ;
    v:extended-address "New Filton House" ;
    v:street-address "Filton" ;
    v:locality "Bristol" ;
    v:postal-code "BS34 7QW" ;
    v:country-name "United Kingdom" ;
    <http://data.ordnancesurvey.co.uk/ontology/postcode/postcode> <http://data.ordnancesurvey.co.uk/id/postcodeunit/BS347QW> ;
    v:label """BAE Systems
New Filton House
PO Box 5
Filton
Bristol
BS34 7QW
United Kingdom""" .
…
More triples to describe the other locations.
Here we see org:hasSite properties pointing to nodes for each of 5 different sites, where we can use the VCard structures to give address information. It may appear a little long-winded but it is very clear. Furthermore, it allows us to include links to the OS post code data in a meaningful way… which was the point of the exercise in the first place.
Also, notice that I’ve included a link to the Open Corporates URI for this company. Sensibly, Open Corporates uses the Companies House registration number in the identifier for each company, as the original data did here, so that for the identifier <http://business.data.gov.uk/id/company/1996687> we have an rdfs:seeAlso link to <http://opencorporates.com/companies/uk/1996687>. As an aside, notice that the value for the v:label is triple quoted. This is a long string, i.e. one that can include line breaks, and therefore (according to the Turtle spec) needs the triple quotes. For VCard, the label is an address label suitable for printing and sticking on a letter, so it allows you to include a real address formatted the way you want rather than the way VCard demands.
OK, this is starting to look easy and ‘right’ – I’m feeling comfortable. So I set up a little tool to let me see all the companies with multiple rs:location properties. I did this using the following SPARQL query which includes some features in the SPARQL 1.1 spec that the Talis Platform already implements:
SELECT ?firm (count(DISTINCT ?location) as ?count)
WHERE {
  ?firm <http://research.data.gov.uk/def/project/location> ?location .
}
GROUP BY ?firm
ORDER BY DESC(?count)
This helped a lot. It gave me a list of companies I could look at, remodel the data and write out new descriptions for – all the while thinking in terms of adding new triples, not deleting any of them (remember the demo works – don’t fix it!). Although it was possible, this process was time-consuming, especially since the original data didn’t include company registration numbers for all the businesses. In such cases, the data set used the company name in the identifier: <http://business.data.gov.uk/id/company/CambridgeDisplayTechnologyLimited> for example. But that threw up yet another problem: if a company was involved in more than one research project, and many of them were, any variance or error in the way the name was typed created a new record. So for Cambridge Display Technology we had 2 quite separate entries. Smith & Nephew (Smith and Nephew Ltd, Smith and Nephew Limited, Smith & Nephew UK Ltd…) had 5 separate identifiers! I handled this by including owl:sameAs statements to link the various identifiers for the same business but, nonetheless, included the org:hasSite property for each alias for ease of querying.
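The name-variant problem described above lends itself to a rough first pass before the hand checking. The following is only a sketch of one way to surface likely duplicates, not the method used for this data set: the function names and the list of ignored suffixes are assumptions made for illustration.

```python
import re
from collections import defaultdict

# Suffixes ignored when comparing names. This list is an assumption,
# not something taken from the original data set.
LEGAL_SUFFIXES = {"ltd", "limited", "plc"}


def name_key(name):
    """Reduce a company name to a crude comparison key."""
    words = re.findall(r"[a-z0-9]+", name.lower().replace("&", " and "))
    return " ".join(w for w in words if w not in LEGAL_SUFFIXES)


def candidate_aliases(names):
    """Group names that share a key; only groups of two or more are returned."""
    groups = defaultdict(list)
    for name in names:
        groups[name_key(name)].append(name)
    return {key: members for key, members in groups.items() if len(members) > 1}
```

Run over the three Smith & Nephew spellings quoted above, this groups "Smith and Nephew Ltd" with "Smith and Nephew Limited" but leaves "Smith & Nephew UK Ltd" on its own, which is a fair illustration of why each group still needs a human decision before any owl:sameAs statement is written.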
Where a company had just one location, the process of generating the new triples (using the org ontology and adding the links to the OS post code and Open Corporates data sets) was much more amenable to automation. Even so, an element of manual editing was necessary, especially where different descriptions of the same business had to be matched up or where the data about a business was incomplete. For some businesses there was no address data at all and it wasn’t always possible to find an unambiguous entry in the Open Corporates data set. For example, search for Cambridge University and you’ll see 9 entries in the Open Corporates data set – and none of them seem to be relevant to our needs here. In addition, Companies House (on whose UK data Open Corporates is based) only covers registered companies, not sole traders, so the data is not a complete list of every trading entity.
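For the single-location case, the generation step could be sketched roughly as below. This is an illustration only: it follows the URI patterns visible in the BAE Systems example (site URI = company URI plus the post code without its space, with /vcard and /vcard/adr beneath it), while the function names and the choice of fields are assumptions. It emits statements only and expects the org: and v: prefixes to be declared elsewhere in the file.

```python
OS_POSTCODE_PROP = "http://data.ordnancesurvey.co.uk/ontology/postcode/postcode"
OS_POSTCODE_UNIT = "http://data.ordnancesurvey.co.uk/id/postcodeunit/"


def turtle_literal(text):
    """Return text as a double-quoted Turtle literal with basic escaping."""
    return '"' + text.replace("\\", "\\\\").replace('"', '\\"') + '"'


def single_site_turtle(company_uri, postcode, locality):
    """Build org/VCard statements for a company with one known site.

    Assumes the org: and v: prefixes are declared elsewhere in the file.
    """
    compact = postcode.replace(" ", "")  # "BS34 7QW" -> "BS347QW"
    site = f"{company_uri}/{compact}"
    adr = f"{site}/vcard/adr"
    lines = [
        f"<{company_uri}> org:hasSite <{site}> .",
        f"<{site}> a org:Site ;",
        f"    org:siteAddress <{site}/vcard> .",
        f"<{site}/vcard> a v:VCard ;",
        f"    v:adr <{adr}> .",
        f"<{adr}> a v:Address ;",
        f"    v:locality {turtle_literal(locality)} ;",
        f"    v:postal-code {turtle_literal(postcode)} ;",
        f"    <{OS_POSTCODE_PROP}> <{OS_POSTCODE_UNIT}{compact}> .",
    ]
    return "\n".join(lines)
```

Calling single_site_turtle("http://business.data.gov.uk/id/company/1996687", "BS34 7QW", "Bristol") produces a cut-down version of the Filton site description shown earlier. The output would still want validating and, where the source data is patchy, editing by hand.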
I ended up generating and then hand-editing Turtle files for each business (validated in turn using Joshua Tauberer’s validator). I was then able to use a combination of Redland’s rapper utility and a simple cat command to generate a file with all the new triples that I wanted.
Only now did I go through and correct the erroneous post codes (there are only a few of these, remember, so a simple search and replace was all that was needed). The original text-only post code data, like all the original data, remains unaffected – and therefore the demo (and any other applications) won’t break.
As a final action, I looked at the patents data. This is a much simpler data set that just gives minimal information about patent holders. The address data is given as a single string, so here I just ran a script that (using a regular expression) pulled out the post codes and checked that they exist in the Ordnance Survey data set. There were rather more 404s than we got for the main businesses data but only a handful of typos (the letter O where there should be a zero, the letter I where there should be a number 1), so this part of the process was achieved in a very short time.
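The original script and its regular expression are not shown here, so the following is just a sketch of the kind of thing described. The pattern is a deliberately loose assumption rather than a full post code validator, and it only repairs an O or I typed in place of the digit that starts the second half of the code. The Ordnance Survey URI pattern is the one used in the BAE Systems example; the HTTP lookup itself is left out.

```python
import re

# Loose UK post code pattern: outward part, optional space, inward part.
# The first inward character also accepts the letters O and I so that
# the common typos can be caught and repaired.
POSTCODE_RE = re.compile(r"\b([A-Z]{1,2}[0-9][A-Z0-9]?)\s*([0-9OI])([A-Z]{2})\b")

OS_POSTCODE_BASE = "http://data.ordnancesurvey.co.uk/id/postcodeunit/"


def extract_postcodes(address):
    """Return the post codes found in a free-text address, typos repaired."""
    found = []
    for outward, digit, unit in POSTCODE_RE.findall(address.upper()):
        digit = digit.replace("O", "0").replace("I", "1")
        found.append(f"{outward} {digit}{unit}")
    return found


def os_postcode_uri(postcode):
    """Build the Ordnance Survey postcode unit URI to be checked."""
    return OS_POSTCODE_BASE + postcode.replace(" ", "")
```

Each URI returned by os_postcode_uri can then be requested in turn; a 404 response marks a post code that needs a closer look.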
In summary, the data available from business.data.gov.uk and patents.data.gov.uk concerns companies involved with research projects and the outcome of those projects. As a result of the work done in recent weeks it now has:
- better modelling of each company, using Dave Reynolds’ org ontology;
- links, where possible, to the Ordnance Survey post code data;
- links, where possible, to the Open Corporates data;
- fewer errors.
In the final post in this mini series, I’ll look at some queries that can be run against the new data.