Talis Consultancy
World leading expertise in Linked Data and the Semantic Web

Category: Development

Open data cities demonstrator

Photo credit: Tim Hodson

At the Open Data Cities Conference in Brighton, Talis unveiled their demonstrator app (link below), which shows how a city might begin to engage with its citizens and promote digital economy innovation.

The demo is designed to highlight the ways in which a city and its citizens might be brought together in an information marketplace, and to trigger questions about how cities might use such an interactive marketplace to measure social impact. It is the software equivalent of a thought piece, allowing us to talk about the things that might change the way people in your city think about engaging with each other in social enterprise.

Talis have been exploring ways in which a data marketplace might add value to individual datasets, and have built Kasabi which allows anyone to publish their data easily, and then harness the power of multiple data access channels.

Key demonstrator themes:

  • citizens can request data about their local area
  • citizens can use data, from the city and local businesses, to build apps
  • the city might fund the building of apps that are in demand
  • citizens can share apps they have built
  • businesses can use the marketplace to publish the data that will power other applications
  • cities can easily publish data about anything
  • citizens are able to add data to existing datasets
  • developers have several tools for accessing indexed and structured data
  • all data added to the site is indexed as it arrives and becomes available to applications within a very short time
  • the information marketplace is a data hub providing a revenue share opportunity

Behind the lightweight demonstrator sits a technology stack that provides data hosting and integration. The simple datasets used as examples in the demo can be explored by both developers who understand working with data, and citizens with no programming background.

I could throw the names of some technologies at you, such as graph databases, geo indexes, full-text indexes and application programming interfaces using a variety of protocols, but it is the self-service nature of Kasabi combined with the interactive and social aspects of our demonstrator that we think will make the difference to your city.

We think that, as a city, you probably know your citizens quite well; however, I am sure that there are ways that they can surprise you. Maybe it is a loosely organised not-for-profit company that sets itself the mission of providing the best quality data about where to park in your city. Maybe they take data that you provide about where the parking spaces are and how often they are used, and combine it with a calendar of city-wide events sourced from several other data providers. Maybe they build an indispensable app that helps people to choose the best parking site in the city. Maybe it even integrates with an existing drive-sharing scheme to provide parking booking services for commuters and tourists alike.

An idea like that is only possible if the people wanting to build a data driven application have easy access to data.

Of course there is no reason why that access should be free. A car parking app might charge a small fee for the provision of the service, and that fee might be shared with the data providers and the city playing host to the data in a marketplace. Everyone gets to have a share in the success of the idea.

For cities that might have a perceived poor parking experience, an app like this might improve the image of the city and reposition it as an easy place to find parking. It might even change people’s parking behaviour for the better, a social impact that becomes easier to measure.

At Talis, we are keen to work with you to explore how your city data and your citizens’ data might be brought together in a marketplace that allows new businesses to start and thrive.

See the demo >>

Talk to us

App Fund

Talis has been working with data for many years, and helping others make the most of their information. An area that we’ve been focusing on over the past few months is applications which make rich use of data.

We found three applications which excited us, and offered them financial backing to get off the ground. For the last few months, the teams have been building up and testing out their applications, and we are watching as they get them going!

I’ll post an original roundup of the app projects below, and there will be follow-up posts here and on the Kasabi blog.

If you work on a similar project, or would like to share your ideas with us, please get in touch!

Exploring Botanical Gardens from a Smart Phone

StrongSteam: Ian Ozsvald and Kyran Dale

The StrongSteam team are working on an iPhone app that opens up new levels of exploration for visitors to botanical gardens. The app will let people access tons of information about the plants they find by taking a photo of the label. The app uses advanced character recognition to read the Latin name from descriptive labels, and pulls in data from a variety of sources to tell the user far more about the plant than could be available on signs.

They’re using the StrongSteam datamining API for matching plant labels and IDs, then using datasets in Kasabi (GeoSpecies, DBPedia and BBC Wildlife for example) to extract detailed information about plant species. The user will then be given facts, figures and other pieces of information, letting them learn far more about the plants they find interesting.

FixMyStreet (Latest from MySociety Here)

My Society: Paul Lenz and Myfanwy Nixon

Through the popular app, FixMyStreet, My Society has been giving people the ability to report damaged infrastructure to their local authorities for a few years. Using smartphones, people have been highlighting things like potholes and broken streetlights across the UK since 2008. The app is now going through a complete overhaul, upgrading to a more sophisticated, HTML5-based service. The new FixMyStreet is a more powerful, responsively designed mobile-web version of the older native apps, and uses Kasabi to store a continuously-updated list of new problem reports. The new dataset includes information about councils, kinds of damage, timestamp and status of repairs along with detailed lat/long locations.

John Peel Time Machine

Storm: Dave Kelly, Mike Ellis, and Paul Leader

Developers from Storm are putting together a time machine travelling back through some of the greatest musical events of the 20th century under the watchful eyes of the legendary BBC radio DJ, the late John Peel. Building on the dataset of John Peel Sessions, the web app will guide users’ journeys on their search for artists who appeared on the live recordings of John Peel’s long-running show.

The Time Machine will work on a timeline, giving a high-level view of the Peel sessions by year, and highlight some of the relationships amongst musical artists. Where it can, it will link to recordings of the live sessions, and provide biographical information about the artists. The time machine will also provide information about the albums and tracks featured, and point users towards playlists of sessions, which they can purchase or listen to via the likes of iTunes or Spotify.

Schema.org Déjà vu

The Web has been around for getting on for a couple of decades now, and massive industries have grown up around the magic of making it work for you and your organisation. Some of it, it has to be said, can be considered snake-oil. Much of it is the output of some of the best brains on the planet. Where the Web is placed on the hit parade of technological revolutions to influence mankind is oft disputed, but it is definitely up there with fire, steam, electricity, computing, and of course the wheel. Similar debates are raging, and will continue to rage virtually, around the hit parade of web features that will in retrospect have been most influential – pick your favourites: HTTP, XML, REST, Flash, RSS, SVG, the URL, the href, CSS, RDF – the list is a long one.

I have observed a pattern as each of the successful new enhancements to the web has been introduced and then generally adopted. Firstly there is a disconnect between the proponents of the new approach/technology/feature and the rest of us. The former split their passions between focusing on the detailed application, rules, and syntax of its use, and broadcasting its worth to the world, not quite understanding why the web masses do not ‘get it’ and adopt it immediately. This phase is then followed by one of post-hype disillusionment from the creators, especially when others start suggesting simplifications to their baby. Also at this time, back-room adoption starts to occur among those who find it interesting but are not evangelistic about it. The real kick for the web comes from those back-room folks who just use this next thing to deliver stuff and solve problems in a better way. It is the results of their work that the wider world starts to emulate, so that they can keep up with the pack and remain competitive. Soon this new feature is adopted by the majority, because all the big boys are using it, and it becomes just part of the tool kit.

A great example of this was RSS. It was not a technological leap but a pragmatic mix of current techniques and technologies, combined with some lateral thinking and a group of people agreeing to do it in ‘this way’ and then sharing it with the world. As you will see from the Wikipedia page on RSS, the syntax wars raged in the early days – I remember it well: 0.9, 0.91, 1.0, 1.1, 2.0, 2.01, etc. I also remember trying, not always with success, to convince people around me to use it, because it was so simple. Looking back it is difficult to say exactly when it became mainstream, but this line from Wikipedia gives me a clue: “In December 2005, the Microsoft Internet Explorer team and Microsoft Outlook team announced on their blogs that they were adopting the feed icon first used in the Mozilla Firefox browser. In February 2006, Opera Software followed suit.” From then on, the majority of consumers of RSS were not aware of what they were using and it became just one of the web technologies you use to get stuff done.

I am now seeing the pattern starting to repeat itself, with structured and linked data. Many, including me, have been evangelising the benefits of web friendly, structured, linked data for some time now – preaching to a crowd that has been slow in growing, but growing it is. Serious benefit is now being gained by organisations adopting these techniques and technologies, as our selection of case studies demonstrates. They are getting on with it, often with our help, using it to deliver stuff. We haven’t hit the mainstream yet. For instance, the SEO folks still need to get their heads around the difference between content and data.

Something is stirring around the edge of the Semantic Web/Linked Data community that has the potential to give structured web enabled data the kick towards mainstream that RSS got when Microsoft adopted the RSS logo and all that came with it. That something is schema.org, an initiative backed by the heavyweights of the search engine world, Google, Yahoo, and Bing. For the SEO and web developer folks, schema.org offers a simple attractive proposition – embed some structured data in your HTML and, via things like Google’s Rich Snippets, we will give you a value added display in our search results. Result: happy web developers with their sites getting an improved listing display. Result: lots of structured data starting to be published by people whom it would have been impossible to convince that publishing structured data on the web was a good idea.

I was at Semtech in San Francisco in June, just after schema.org was launched, and it caused a bit of a stir. The complaints ran along these lines: they’ve over-simplified the standards that we have been working on for years, dumbed down RDF, diluted the capability with too small a set of attributes, etc., etc. When you get under the skin of schema.org, however, you see that, with support for RDFa and RDFa 1.1 Lite, they are not that far from the RDF/Linked Data community.

Schema.org should be welcomed as an enabler for getting loads more structured and linked data on the web. Is their approach perfect now? No. Will it influence the development of Linked Data? Yes. Will the introduction be messy? Yes. Is it about more than just rich snippets? Oh yes. Do the webmasters care at the moment? No.

If you want a friendly insight into what schema.org is about, I suggest a listen to this month’s Semantic Link podcast, with their guest from Google/schema.org, Ramanathan V. Guha.

Now where have I seen that name before? Oh yes, back on the Wikipedia RSS page: “The basic idea of restructuring information about websites goes back to as early as 1995, when Ramanathan V. Guha and others in Apple Computer’s Advanced Technology Group developed the Meta Content Framework.” So it probably isn’t just me who is getting a feeling of Déjà vu.

Triplification

Why I think triplification is an unhelpful term

I have heard the term ‘triplification’ used to describe yet another ETL process that you need to do to get your existing data into a form that is ready for the semantic web. Claims such as “You don’t have to struggle with ontologies, mapping languages, logics or other scary things” [http://triplify.org/About] hide some complexity that will either make or break the utility of the data represented as triples.

My thought is that the data needs to be modelled well, and I am afraid that you will need to take time to understand the things you have data about. I have seen examples of triplified datasets where concepts and their relationships look like they have just been reformatted from a database table.

Simply representing the source data using one triple per intersection of column and row is not useful. It essentially models the data as a table, but expressed as triples. A simple mapping of column names to property names may reinforce this very basic model. I have seen property names such as country1 and country2, which raises the question of why those two things are not the same. The likely cause was one or more tables which talked about the same thing in different contexts.
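To make the problem concrete, here is a minimal Python sketch of the cell-by-cell approach. The table row, the column names and the `http://example.org/` URIs are all invented for illustration; they are not taken from any real triplified dataset.

```python
# Naive "triplification": one triple per (row, column) cell.
# The row, column names and example.org URIs are hypothetical.
BASE = "http://example.org/"

rows = [
    {"id": "1", "name": "Acme Ltd", "country1": "UK", "country2": "France"},
]


def naive_triples(rows):
    """Emit a (subject, predicate, object) tuple for every non-key cell."""
    triples = []
    for row in rows:
        subject = BASE + "row/" + row["id"]
        for column, value in row.items():
            if column == "id":
                continue
            # The column name becomes the property name, unchanged,
            # and the value stays a plain string.
            triples.append((subject, BASE + "prop/" + column, value))
    return triples


for triple in naive_triples(rows):
    print(triple)
```

The output is still a table: the subject is a row, the properties are column headings, and nothing says that the two country columns refer to the same kind of thing.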

What is the alternative?

Look at the data and identify which things are objects of interest, then define relationships between these objects. This isn’t just triplification, this is modelling the data without context.

  • Context is the source data format.
  • Context is the place the source data value sits alongside other values.
  • Context is the application that might be using the data.
  • Context is the use case that allowed your project to get funding.
  • Context is the place the dataset is likely to sit in future projects.
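As a contrast to a cell-by-cell conversion, the following Python sketch treats a record as things and relationships. Everything in it is an assumption made for illustration: the row, the `http://example.org/` URIs, and in particular the guess that the two country columns mean "registered in" and "operates in". In a real project that meaning would come from a domain expert, not from the column names.

```python
# Modelling a record as things and the relationships between them.
# All names and example.org URIs are hypothetical.
BASE = "http://example.org/"

row = {"id": "1", "name": "Acme Ltd", "country1": "UK", "country2": "France"}


def modelled_triples(row):
    """Return triples that describe a company and two countries."""
    company = BASE + "company/" + row["id"]
    registered = BASE + "country/" + row["country1"]
    operates = BASE + "country/" + row["country2"]
    return [
        (company, "rdf:type", BASE + "def/Company"),
        (company, "rdfs:label", row["name"]),
        # Both columns point at the same kind of thing: a country.
        (registered, "rdf:type", BASE + "def/Country"),
        (operates, "rdf:type", BASE + "def/Country"),
        # The relationships say why the company is linked to each one.
        (company, BASE + "def/registeredIn", registered),
        (company, BASE + "def/operatesIn", operates),
    ]


for triple in modelled_triples(row):
    print(triple)
```

The countries now have identifiers of their own, so other datasets can link to them, and the property names describe the real world rather than the layout of the source table.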

Rob Styles recently used these slides to show the difference between a simple CSV2RDF approach and a more expressive data modelling exercise.

The intervention of humans who are domain experts in the data modelling process is the key ingredient for making your data ready for the semantic web. It will also make your data more readily reusable.

Whether your data comes from sensors monitoring sea quality, observations of how a protein reacts to the presence of chemical compounds, academic courses run in this academic year, or the location of traffic census points, it is all data that describes values and conditions of real world things.

None of that data describes how your application should use the data, nor what it looked like in the source, nor how your project is going to use it.

So rather than rushing to triplify your data, be the first to describe the relationships of the real world things that you know the most about.

Get your model right and the novel uses for your data will be much easier to find.

 

US Environmental Protection Agency Prepares Facilities Data for the Linked Data Cloud

Publishing high quality Government data is part of the US Administration’s Open Government Initiative.  Linked Open Data reduces IT costs because it is easier to integrate data between agency programs and government organizations using International Data Standards for the Web.  As more and more Government Agencies publish high quality data sets, the objectives of the Open Government Initiative, to strengthen democracy, and to promote efficiency and effectiveness of government, are realized.

Talis recently assisted the Environmental Protection Agency’s Office of Environmental Information to model and publish an important data set called the Facility Registry System (FRS) as high quality Linked Open Data.  FRS is a source of comprehensive environmental information (e.g., air, water, and waste) for facilities, sites, and places.  This recent effort assisted the Agency to better understand the process of publishing high quality data sets that are increasingly being made available through data.gov.

The EPA brought together Linked Data experts from Talis, along with FRS subject matter experts from EPA and their contractors.   We held face to face sessions during February and March 2011 to walk through the data modeling process.  Data modeling involved a two day deep dive on the existing relational database model, followed by an interim model review.   The team discussed strategies for URI selection, vocabulary re-use, creation, and query support via SPARQL.  Talis delivered the modeling guide, along with the RDF for 2.6M Facilities of interest to the EPA, which was represented in approximately 103M triples.  Together with EPA, we identified next steps and reviewed future maintenance activities.  US EPA now has a ‘reference implementation’ for a high quality data set modeled and converted to RDF that can be used for future Linked Open Data initiatives.

One of the major insights gained upon completion of the project was how easy it was to put the data into any standards-compliant RDF database. There were no complex conversion scripts to load the data into different databases. Using International Data Standards, it is trivial to load RDF into a store, then query and visualize the data, in literally one day. The same effort using a traditional three-tier approach with relational technologies would have taken six months or longer. This is a major saving in cost and time for federal systems integrators and Government Agencies.

Another major insight for the participants was being able to view their data through a variety of freely available Linked Data visualization tools. Several of the tools demonstrated were developed by or with support from Talis, including LinkSailor, Talis Platform, Callimachus and Morph. Two other useful visualization tools demonstrated were Spark and Exhibit, both of which are available as Open Source.

The first thing I looked at was, “what facilities of interest are in my backyard?” We all care about our communities. The more government agencies publish data that we can readily understand and visualize via the Web, the more we as citizens are able to make informed decisions. The easier it is for government authorities and public industry to access and visualize high quality data published by authoritative sources such as the US EPA, the more efficiently they can manage their limited resources, and do so responsibly. With projects such as the EPA’s Facilities Registry published as high quality Linked Open Data, everyone wins. The EPA, other government agencies, citizens, journalists and bloggers will all be able to view valuable data and create mash ups with well-modeled Linked Open Data sets. We look forward to the EPA Facilities data set being made available via data.gov in the near future.

Linking and Cleaning Data Part 2

I noted in the first post in this mini series that the original “can you just add in links to the OS post code data” task threw up some surprises and extra issues that needed attention. In particular, the data used in the Research Funding Explorer used a very simple data model that assumed a single location for each participating business. If you ask the SPARQL service to DESCRIBE <http://business.data.gov.uk/id/company/1996687> you’ll see that <http://business.data.gov.uk/id/company/1996687> identifies BAE Systems which has, among others, 9 rdfs:label properties, 12 addressLine1 properties, and that BAE Systems is located in the counties of Bristol, Essex, Hants, Kent, Lancashire and Lancashires!

Oh boy, this is going to take some sorting out. And because the data for the multiple locations of BAE Systems is all mashed in together, we’re going to need to sort it out by hand and with reference to BAE’s Web site to find out which address line 2 goes with which address line 1 and so on. This is where the organisation ontology really comes into its own. It models the concept of a single legal entity with a formal name and an informal one (using the well-known skos:prefLabel and skos:altLabel properties) and of things like a primary site and other sites (org:hasPrimarySite and org:hasSite respectively). Then you can say something about each of those sites including giving its address – for which the ontology requires the use of VCard.

I am not alone in disliking VCard. The RDF encoding created by Harry Halpin et al is a very good encoding of VCard itself but that’s not to say it’s a good encoding of addresses! I raised a question on the Linked Open Data mailing list about this. Follow the thread if you like but what comes out, as ever, is that existing implementation and re-use of vocabularies beats fitness for purpose every time. In other words, many people agree that VCard is not a good model for addresses (at least, for non-US addresses); however, it’s what is widely used and therefore it’s what everyone uses. The org ontology is right for modelling organisations so I wanted to use it, but I used the VCard ontology with some reluctance. Be that as it may, I ended up with a much better model of, for example, BAE Systems:

@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix rs: <http://research.data.gov.uk/def/project/> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix v: <http://www.w3.org/2006/vcard/ns#> .
@prefix org: <http://www.w3.org/ns/org#> .
@prefix geo: <http://www.w3.org/2003/01/geo/wgs84_pos#> .

<http://business.data.gov.uk/id/company/1996687>  a org:FormalOrganization;
  skos:altLabel "BAE Systems" ;
  skos:prefLabel "BAE Systems (Operations) Ltd" ;
  rdfs:seeAlso <http://opencorporates.com/companies/uk/1996687> ;
  org:identifier "25.40"^^<http://research.data.gov.uk/def/project/SICCode>, "2713"^^<http://research.data.gov.uk/def/project/ICBCode>, "30.30"^^<http://research.data.gov.uk/def/project/SICCode> ;

  org:hasSite  <http://business.data.gov.uk/id/company/1996687/BB27LF> ;
  org:hasSite  <http://business.data.gov.uk/id/company/1996687/BS347QW> ;
  org:hasSite  <http://business.data.gov.uk/id/company/1996687/CM28HN> ;
  org:hasSite  <http://business.data.gov.uk/id/company/1996687/PR41AX> ;
  org:hasSite  <http://business.data.gov.uk/id/company/1996687/ME12XX> .

<http://business.data.gov.uk/id/company/1996687/BS347QW> a org:Site ;
  org:siteAddress <http://business.data.gov.uk/id/company/1996687/BS347QW/vcard> ;
  geo:lat "51.5265302" ;
  geo:long "-2.5622897" .

<http://business.data.gov.uk/id/company/1996687/BS347QW/vcard> a v:VCard ;
  v:fn "BAE Systems" ;
  v:geo <http://business.data.gov.uk/id/company/1996687/BS347QW/vcard/geo> ;
  v:adr <http://business.data.gov.uk/id/company/1996687/BS347QW/vcard/adr> .

<http://business.data.gov.uk/id/company/1996687/BS347QW/vcard/geo> a v:Location ;
  v:latitude "51.5265302" ;
  v:longitude "-2.5622897" .

<http://business.data.gov.uk/id/company/1996687/BS347QW/vcard/adr> a v:Address ;
  v:post-office-box "PO Box 5" ;
  v:extended-address "New Filton House" ;
  v:street-address "Filton" ;
  v:locality "Bristol" ;
  v:postal-code "BS34 7QW" ;
  v:country-name "United Kingdom" ;
  <http://data.ordnancesurvey.co.uk/ontology/postcode/postcode> <http://data.ordnancesurvey.co.uk/id/postcodeunit/BS347QW> ;
  v:label """BAE Systems
New Filton House
PO Box 5
Filton
Bristol
BS34 7QW
United Kingdom""" .
…
More triples to describe the other locations.

Here we see org:hasSite properties pointing to nodes for each of 5 different sites where we can use the VCard structures to give address information. It may appear to be a little long winded but it is very clear. Furthermore, it allows us to include links to the OS post code data in a meaningful way… which was the point of the exercise in the first place.

Also, notice that I’ve included a link to the Open Corporates URI for this company. Sensibly, Open Corporates uses the Companies House registration number in the identifier for each company, as the original data did here so that for the identifier <http://business.data.gov.uk/id/company/1996687> we have an rdfs:seeAlso link to <http://opencorporates.com/companies/uk/1996687>. As an aside, notice that the value for the v:label is triple quoted. This is a long string, i.e. one that can include line breaks, and therefore (according to the Turtle spec) needs the triple quotes. For VCard, the label is an address label suitable for printing and sticking on a letter so it allows you to include a real address formatted the way you want cf. the way VCard demands.
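The URI patterns in the Turtle above are regular enough to generate from a Companies House number and a post code. The following Python sketch shows one way that might be done; the patterns are read off the example, but the function names are hypothetical and this is not necessarily how the original triples were produced.

```python
# Build the linked URIs seen in the Turtle example from a Companies House
# number and a post code. Function names are hypothetical.
COMPANY_BASE = "http://business.data.gov.uk/id/company/"
OPEN_CORPORATES_BASE = "http://opencorporates.com/companies/uk/"
OS_POSTCODE_BASE = "http://data.ordnancesurvey.co.uk/id/postcodeunit/"


def compact_postcode(postcode):
    """Upper-case a post code and remove all whitespace."""
    return "".join(postcode.split()).upper()


def company_uri(number):
    return COMPANY_BASE + number


def open_corporates_uri(number):
    # Open Corporates also uses the registration number in its identifiers.
    return OPEN_CORPORATES_BASE + number


def site_uri(number, postcode):
    # Each site hangs off the company URI, keyed by its compact post code.
    return company_uri(number) + "/" + compact_postcode(postcode)


def os_postcode_uri(postcode):
    return OS_POSTCODE_BASE + compact_postcode(postcode)


print(site_uri("1996687", "BS34 7QW"))
print(os_postcode_uri("BS34 7QW"))
print(open_corporates_uri("1996687"))
```

Keying the site on the post code only works while a company has at most one site per post code, which is an assumption worth checking against the data.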

OK, this is starting to look easy and ‘right’ – I’m feeling comfortable. So I set up a little tool to let me see all the companies with multiple rs:location properties. I did this using the following SPARQL query which includes some features in the SPARQL 1.1 spec that the Talis Platform already implements:

SELECT ?firm (count(DISTINCT ?location) as ?count)
WHERE {
  ?firm <http://research.data.gov.uk/def/project/location> ?location .
}
GROUP BY ?firm
ORDER BY DESC(?count)

This helped a lot. It gave me a list of companies I could look at, remodel the data and write out new descriptions for – all the while thinking in terms of adding new triples, not deleting any of them (remember the demo works – don’t fix it!). Although it was possible, this process was time consuming, especially since the original data didn’t include company registration numbers for all the businesses. In such cases, the data set used the company name in the identifier: <http://business.data.gov.uk/id/company/CambridgeDisplayTechnologyLimited> for example. But that threw up yet another problem since if a company was involved in more than one research project, and many of them were, any variance or error in the way the name was typed created a new record. So for Cambridge Display Technology we had 2 quite separate entries. Smith & Nephew (Smith and Nephew Ltd, Smith and Nephew Limited, Smith & Nephew UK Ltd…) had 5 separate identifiers! I handled this by including owl:sameAs statements to link the various identifiers for the same businesses but, nonetheless, included the org:hasSite property for each alias for ease of querying.
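Matching up the aliases was done by hand here, but a crude normalisation step could at least suggest candidates. The Python sketch below is an assumption about how that might look, not the method used for this data set: the normalisation rules and the two Smith and Nephew identifiers written without "UK" are invented, following the name-in-identifier pattern described above.

```python
import re
from collections import defaultdict

COMPANY_BASE = "http://business.data.gov.uk/id/company/"


def normalise(name):
    """Crude matching key: lower case, '&' as 'and', 'limited' as 'ltd'."""
    key = name.lower().replace("&", " and ")
    key = re.sub(r"\blimited\b", "ltd", key)
    key = re.sub(r"[^a-z0-9 ]", "", key)
    return " ".join(key.split())


def same_as_statements(names_by_id):
    """Group identifiers by normalised name; link aliases to the first."""
    groups = defaultdict(list)
    for identifier, name in names_by_id.items():
        groups[normalise(name)].append(identifier)
    statements = []
    for identifiers in groups.values():
        canonical, *aliases = sorted(identifiers)
        for alias in aliases:
            statements.append(
                "<%s%s> owl:sameAs <%s%s> ."
                % (COMPANY_BASE, alias, COMPANY_BASE, canonical)
            )
    return statements


# Hypothetical identifiers in the name-based style used by the data set.
companies = {
    "SmithandNephewLtd": "Smith and Nephew Ltd",
    "SmithandNephewLimited": "Smith and Nephew Limited",
    "SmithNephewUKLtd": "Smith & Nephew UK Ltd",
}

for statement in same_as_statements(companies):
    print(statement)
```

Only the first two names collapse to the same key. The "UK" variant is left alone, which is the kind of case that still needs a person to decide whether it is the same legal entity.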

Where a company had just one location, the process of generating the new triples to use the org ontology, add in the links to the OS post code and Open Corporates data sets was much more amenable to automation. Even so, an element of manual editing was necessary, especially where it was necessary to match up different descriptions of the same business or handle businesses about which the data was incomplete. For some businesses there was no address data at all and it wasn’t always possible to find an unambiguous entry in the Open Corporates data set. For example, search for Cambridge University and you’ll see 9 entries in the Open Corporates data set – and none of them seem to be relevant to our needs here. Conversely, Companies House (on whose UK data Open Corporates is based) only covers registered companies, not sole traders, so the data is not a complete list of every trading entity.

I ended up generating and then hand-editing Turtle files for each business (validated in turn using Joshua Tauberer’s validator). I was then able to use a combination of Redland’s rapper utility and a simple cat command to generate a file with all the new triples that I wanted.

Only now did I go through and correct the erroneous post codes (there are only a few of these remember so a simple search and replace was all that was needed). The original text-only post code data, like all the original data, remains unaffected – and therefore the demo (and any other applications) won’t break.

As a final action, I looked at the patents data. This is a much simpler data set that just gives minimal information about patent holders. The address data is given as a single string so here I just ran a script that (using a regular expression) pulled out the post codes and checked that they exist in the Ordnance Survey data set. There were rather more 404s than we got for the main businesses data but only a handful of typos (the letter O where it should be a zero, a letter I where there should be a number 1), so this part of the process was achieved in a very short time.
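A script of that kind might look something like the Python sketch below. It is a reconstruction under assumptions, not the original script: the regular expression is a simplified post code pattern that does not cover every special case, and the `known_postcodes` set stands in for the lookup against the Ordnance Survey data, where a missing post code would come back as an HTTP 404.

```python
import re

# Simplified UK post code pattern: outward code, optional space, inward code.
POSTCODE = re.compile(r"\b([A-Z]{1,2}[0-9][0-9A-Z]?)\s*([0-9][A-Z]{2})\b")
OS_POSTCODE_BASE = "http://data.ordnancesurvey.co.uk/id/postcodeunit/"


def extract_postcodes(address):
    """Return compact post codes (no space) found in a free-text address."""
    return [outward + inward
            for outward, inward in POSTCODE.findall(address.upper())]


def check(address, known_postcodes):
    """Split the post codes in an address into found and missing URIs.

    known_postcodes is a stand-in for the Ordnance Survey lookup.
    """
    found, missing = [], []
    for code in extract_postcodes(address):
        target = found if code in known_postcodes else missing
        target.append(OS_POSTCODE_BASE + code)
    return found, missing


known = {"BS347QW"}
print(check("PO Box 5, Filton, Bristol BS34 7QW", known))
# A letter O typed for a zero still looks like a post code, so it is
# extracted but then reported as missing. The address is hypothetical.
print(check("1 Example Road, Manchester M1O 2AB", known))
```

Keeping extraction and lookup separate means the short list of missing post codes can be reviewed by eye for the O/0 and I/1 slips before anything is corrected.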

In summary, the data available from business.data.gov.uk and patents.data.gov.uk concerns companies involved with research projects and the outcome of those projects. As a result of the work done in recent weeks it now has:

  • better modelling of each company, using Dave Reynolds’ org ontology;
  • links, where possible, to the Ordnance Survey post code data;
  • links, where possible, to the Open Corporates data;
  • fewer errors.

In the final post in this mini series, I’ll look at some queries that can be run against the new data.