Alan Dix, Talis, 10th March 2011
This is an experiment in creating a meta-description for CSV files (or tables in general). It relates to ongoing discussions at Talis about ways to talk about data as well as represent the data itself.
There are basically three levels at which one can use semantic descriptions (read RDF!) with data:
This mini-project is addressed at level 2 above (with just a hint of 3) and at lifting tabular data to Tim Berners-Lee's 5 star rating (addressable by URI at component level and linked to external standard resources) [Ha10, Su10]. Because this is at level 2, when the underlying tabular data changes the meta-description can stay the same so that (inter alia) fresh RDF can be generated. However, also by being a meta-description, the underlying data can be rendered in RDF in different ways (see multiple examples below), or not materialised at all and potentially queried uniformly in a federated manner with other RDF or level 2 described non-RDF resources.
While there are various tools for importing CSV files into RDF (e.g. csv2rdf), these address level 1. In contrast, this mini-project is about addressing level 2 (with just a hint of 3). RDF123 is hitting a similar level with graphical specification of row formats [HF08]. RDF123 does not deal with linking to external resources (TBL 4 star but not 5 star), but more recent work by the same group has been mining Wikipedia articles to help create semantic links in tables [FS10]. Southampton's Grinder is also similar, but so far aiming at 4 star rating only [Gu10, Gu10b, ET21]. Another recent system, csv2rdf4lod, seems to be addressing similar issues, developed as part of the work of the TWC group at Rensselaer Polytechnic Institute on the Linking Open Government Data portal – I need to look at this in more detail, especially their description ontology.
Berners-Lee on the 5 star rating
The linked table data at star-scheme-by-example, which demonstrates Tim Berners-Lee's 5 star rating for open government data [Ha10, Su10], has been used as a driving example. The core question is whether this kind of additional linking can be achieved by some form of generic description about the table data without having to hand edit/clean individual entries, and furthermore allowing data to be re-imported (or dynamically accessed) on demand without fresh effort. This is in many ways like D2RQ, except also automating outward linking (e.g. a country code column linking to external data sets).
The CSV file temp_data.txt (below) is used as a starting point; it takes the data from star-scheme-by-example, but adds the place name as an additional column.
Galway,"Saturday, 13 November 2010",2
Galway,"Sunday, 14 November 2010",4
Galway,"Monday, 15 November 2010",7
Isle of Tiree,"Tuesday, 8 March 2011",7
Isle of Tiree,"Wednesday, 9 March 2011",5
A table specification gives names and types to the columns (but can be omitted or generated from table header row), and this can be used, amongst other things, to create the turtle RDF version of the table at: http://tiree.snipit.org/talis/csvmapper/temp/all/ (add ?format={ turtle | json | xml | html | ntriples } for different formats and mime={mimetype} or encode=html to help viewing in browser).
All the URLs for columns, rows etc. are live LOD; for instance the day field of row 3 (counting from 1) is at csv:temp/row/3/field/day (note using the prefix 'csv:' for 'http://tiree.snipit.org/talis/csvmapper/'). This returns an owl:sameAs link to csv:temp/row/3/col/2, which in turn links to the actual date.
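As a rough illustration of this URI scheme, the field-to-column correspondence could be sketched as follows (the helper names and the idea of deriving the column number from the spec's column order are assumptions, not the tool's actual internals):

```python
# Sketch: relating csv:temp/row/N/field/NAME to csv:temp/row/N/col/M.
# Column order is taken from the table specification; counting starts at 1.
BASE = "http://tiree.snipit.org/talis/csvmapper/"   # the 'csv:' prefix
COLUMNS = ["place", "day", "lowest_temp"]           # ts:name values, in order

def field_uri(table, row, name):
    return f"{BASE}{table}/row/{row}/field/{name}"

def col_uri(table, row, name):
    return f"{BASE}{table}/row/{row}/col/{COLUMNS.index(name) + 1}"

# csv:temp/row/3/field/day is owl:sameAs csv:temp/row/3/col/2
print(field_uri("temp", 3, "day"))
print(col_uri("temp", 3, "day"))
```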
Several data files can be configured to use the same table specification; csv:temp2/ uses the same specification applied to the data file temp2_data.txt.
Note that the date in csv:temp/row/3/col/2 is encoded both as entered in the text and also converted to XSD standard format:
<http://tiree.snipit.org/talis/csvmapper/temp/row/3/col/2> msg0:has_value "Monday, 15 November 2010" ;
msg0:as_xsd_date "2010-11-15T00:00:00+00:00"^^<xsd:dateTime> .
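A date rule of this kind could be sketched in Python as below; the format string and the UTC timezone are assumptions inferred from the output above, and the real rule engine is generic rather than hard-coded to one format:

```python
from datetime import datetime, timezone

def as_xsd_date(text):
    """Convert a date as entered ('Monday, 15 November 2010')
    to an xsd:dateTime string; assumes UTC, matching the output shown."""
    d = datetime.strptime(text, "%A, %d %B %Y")
    return d.replace(tzinfo=timezone.utc).isoformat()

print(as_xsd_date("Monday, 15 November 2010"))
# 2010-11-15T00:00:00+00:00
```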
This mapping between date as entered and xsd:dateTime is managed by generic pluggable rules. These rules can chain, so that the place field gets transformed to a geonames Feature name [Ge10] and then linked to matching feature resources (csv:temp/row/4/col/1):
<http://tiree.snipit.org/talis/csvmapper/temp/row/4/col/1>
msg0:has_value "Isle of Tiree" ;
msg0:as_geoname_feature_name "Tiree" ;
msg0:as_geoname_feature <http://sws.geonames.org/2635781/> ,
<http://sws.geonames.org/6296635/> ,
<http://sws.geonames.org/2635780/> .
This is giving 5 star linking to geonames. The linking is enabled because the table specification has a type 'place_name' for the column and this triggers the matching rules.
Note that the table contained the name "Isle of Tiree", which does not occur in the Geonames database, even as an alternativeName. However, an exceptions list can be added for each mapping (in this case saying that "Isle of Tiree" should be transformed to "Tiree"). Potential links are generated automatically but they can be overridden. Indeed in this example there are a lot of matches for Galway. This can be dealt with using the exception list to disambiguate, but maybe the tool should automatically choose a 'best' option.
Exception lists can be tailored for a particular data set; for example, while csv:temp2/ uses the same specification as csv:temp/ it has a different exception list that maps "Isle of Tiree" to "Crossapol".
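A minimal sketch of how a per-dataset exception list might be consulted before the geonames lookup (the dictionary structure and function name are illustrative, not the actual implementation; the key point is that the raw value is left untouched and only the mapped name changes):

```python
# Per-dataset exception lists applied to the as_geoname_feature_name
# mapping, before any Geonames lookup happens.
EXCEPTIONS = {
    "temp":  {"Isle of Tiree": "Tiree"},
    "temp2": {"Isle of Tiree": "Crossapol"},
}

def as_geoname_feature_name(dataset, raw_value):
    # fall through to the raw value when no exception applies
    return EXCEPTIONS.get(dataset, {}).get(raw_value, raw_value)

print(as_geoname_feature_name("temp", "Isle of Tiree"))   # Tiree
print(as_geoname_feature_name("temp2", "Isle of Tiree"))  # Crossapol
print(as_geoname_feature_name("temp", "Galway"))          # Galway (unchanged)
```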
So far the RDF structure follows closely the raw table structure. However, a more complex output can be seen at http://tiree.snipit.org/talis/csvmapper/temp/all/meteo, which builds a structure using the meteorological ontology [Pa11] at http://purl.org/ns/meteo.
This follows the example in star-scheme-by-example, minting entities that are not simply the lines and columns of the original table. There is a distinct place entity for each named place (for example, csv:temp/entities/#place/Galway), and two entities are minted for each row: one for the forecast itself and one for the temperature, as determined by the ontology.
<http://tiree.snipit.org/talis/csvmapper/temp/entities/#forecast_50b3a39570fac178401699bd163b2999>
meteo:predicted "2010-11-13T00:00:00+00:00"^^<xsd:dateTime> ;
meteo:temperature <http://tiree.snipit.org/talis/csvmapper/temp/entities/#temp_50b3a39570fac178401699bd163b2999> .
<http://tiree.snipit.org/talis/csvmapper/temp/entities/#temp_50b3a39570fac178401699bd163b2999>
meteo:celsius "2" ;
tempuri:units "celsius"^^<xsd:string> .
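The shared 32-hex-digit suffix on the forecast and temperature entities suggests a hash of the identifying columns ('place' and 'day', per the ts:identifiedby setting in the extension spec below). A sketch of such minting, where MD5 and the '|' separator are my assumptions rather than the tool's documented scheme:

```python
import hashlib

def mint_entity(base, prefix, *key_fields):
    """Mint a stable entity URI from identifying field values; the same
    keys always give the same digest, so the forecast and temperature
    entities for one row share a common suffix."""
    digest = hashlib.md5("|".join(key_fields).encode("utf-8")).hexdigest()
    return f"{base}entities/{prefix}{digest}"

base = "http://tiree.snipit.org/talis/csvmapper/temp/"
f = mint_entity(base, "#forecast_", "Galway", "Saturday, 13 November 2010")
t = mint_entity(base, "#temp_", "Galway", "Saturday, 13 November 2010")
print(f)
print(t)
```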
As well as raw CSV files, Google Fusion tables can be used as a data source; for example, csv:flags/ shows this applied to the Country Flags table (be patient: there is no caching at present, so it is a bit slow!).
You can install specification files on your own web space and link to the mapper. A full set of examples can be downloaded from csv_mapper_remote.zip (7K), including both plain CSV files and Google Fusion tables. There is an example remote CSV installation at http://meandeviation.com/talis/remotecsv/remote_temp/ (see its configuration file) and also a Google Fusion example at http://meandeviation.com/talis/remotecsv/flags/ (configuration file). The installation is using a simple Apache redirect, so the mapping URL is revealed after you click through the URL, but an Apache proxy or small script can hide this if desired.
The table specification (below) is given in turtle and is at temp_spec.txt.
@prefix dt: <http://example.com/datatypes/> .
@prefix units: <http://example.com/units/> .
@prefix ts: <http://example.com/ts/> .
@prefix : <http://temp.example.com/spec/> .
@prefix tt: <http://temp.example.com/spec/tables/temp> .
tt: ts:columns ( tt:/cols/place tt:/cols/day tt:/cols/temp ).
tt:/cols/place ts:name "place" ;
ts:title "Location" ;
ts:type dt:place_name .
tt:/cols/day ts:name "day" ;
ts:title "Day" ;
ts:type dt:date .
tt:/cols/temp ts:name "lowest_temp" ;
ts:title "Lowest Temperature (°C)" ;
ts:type dt:number ;
ts:represents dt:temperature ;
ts:units units:celsius .
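Putting the spec and the data together: parsing one line of temp_data.txt (quoted commas and all) and labelling the fields with the ts:name values above might look like the following sketch, using only Python's csv module:

```python
import csv, io

# ts:name values from the table specification, in column order
SPEC_NAMES = ["place", "day", "lowest_temp"]

# one raw line of temp_data.txt; csv.reader handles the quoted comma
line = 'Galway,"Saturday, 13 November 2010",2'
row = next(csv.reader(io.StringIO(line)))
record = dict(zip(SPEC_NAMES, row))
print(record)
# {'place': 'Galway', 'day': 'Saturday, 13 November 2010', 'lowest_temp': '2'}
```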
Exceptions can be specified with respect to each mapping. This would be expected to be done using an interactive tool, but is currently in the file exceptions.txt (below). This says that whenever applying the 'as_geoname_feature_name' mapping to the column 'place' in the table 'csv:temp/', the system will map "Isle of Tiree" to "Tiree". The 'path' to the transformation means that other references to "Isle of Tiree" (for example if there were a title) are not affected. Furthermore, as the exception is applied to the mapping 'as_geoname_feature_name', not the 'place' field itself, the original data as entered is preserved in raw form if required.
@prefix dt: <http://example.com/datatypes/> .
@prefix units: <http://example.com/units/> .
@prefix ts: <http://example.com/ts/> .
@prefix tt: <http://temp.example.com/spec/tables/temp> .
@prefix tab: <http://tiree.snipit.org/talis/csvmapper/temp/> .
@prefix : <http://tiree.snipit.org/talis/data/exceptions.txt> .

: ts:exceptions (
    [ ts:ex/path ( tab: tt:/cols/place ts:as_geoname_feature_name ) ;
      ts:ex/mapping (
        [ ts:ex/from "Isle of Tiree" ;
          ts:ex/to "Tiree" ]
      )
    ]
) .
A slightly more complicated configuration (below and in temp_ext_spec.txt) is required to specify the connection with the meteorological ontology. The first part labeled 'entities' is about the minting of fresh entities, and the second part labeled 'generate' describes patterns for the triples generated (rather like a SPARQL construct). Note special schemes ('var:' and 'col:') are used to denote 'variables' of different kinds.
@prefix dt: <http://example.com/datatypes/> .
@prefix units: <http://example.com/units/> .
@prefix ts: <http://example.com/ts/> .
@prefix tt: <http://temp.example.com/spec/tables/temp> .
@prefix ext: <http://temp.example.com/spec/extend/meteo/> .
@prefix gn: <http://www.geonames.org/ontology#> .
@prefix meteo: <http://purl.org/ns/meteo#> .
@prefix : <http://temp.example.com/spec/extend/meteo/> .

ext: ts:refersto ts:tables/temp ;
  ts:foreachrow [
    ts:entities (
      [ ts:name "myplaceuri" ;
        ts:identifiedby ( "place" ) ;
        ts:relativePrefixUri "#place/" ;
        ts:linkFromRow ts:refers_to ;
        ts:linkToRow ts:referred_to_by ]
      [ ts:name "forecasturi" ;
        ts:identifiedby ( "place" "day" ) ;
        ts:hash "true" ;
        ts:relativePrefixUri "#forecast_" ;
        ts:linkFromRow ts:represents ]
      [ ts:name "tempuri" ;
        ts:identifiedby ( "place" "day" ) ;
        ts:hash "true" ;
        ts:relativePrefixUri "#temp_" ]
    ) ;
    ts:generate (
      ( "var:myplaceuri" ts:prop/myplace/rawname "col:place" "inherit" )
      ( "var:myplaceuri" gn:name "col:place/asGeonameFeatureName" "inherit" )
      ( "var:myplaceuri" <owl:sameAs> "col:place/asGeonameFeatureName/asGeonameFeature" )
      ( "var:myplaceuri" <meteo:forecast> "var:forecasturi" )
      ( "var:forecasturi" <meteo:predicted> "col:day/asXsdDate" "inherit" )
      ( "var:forecasturi" <meteo:temperature> "var:tempuri" )
      ( "var:tempuri" meteo:celsius "col:lowest_temp" "override" "" "xsd:decimal" )
      ( "var:tempuri" ts:prop/tempuri/units "celsius" "literal" "" "xsd:string" )
    )
  ] .
All the above are linked with a single configuration file (below and at temp.conf.txt), that simply points to the relevant parts of the specification. Note the tc:uri and tc:file are there because the storage location (file) and logical name (uri) may be different ... especially as it is still full of 'example.com' namespaces :-(
For a Google Fusion data set simply set the tc:source to point to the data set page such as http://www.google.com/fusiontables/DataSource?dsrcid=86424 . Do make sure it is the original dataset page, not a 'snap' ('dsrcid' rather than 'snapid' in the URL). While snaps are dealt with, they are less efficient.
@prefix tc: <http://example.com/ts/config/> .
@prefix : <http://tiree.snipit.org/talis/data/config/temp.conf.txt> .

: tc:base <http://tiree.snipit.org/talis/csvmapper/temp/> ;
  tc:source <http://tiree.snipit.org/talis/data/temp_data.txt> ;
  tc:tablespec [
    tc:uri <http://temp.example.com/spec/tables/temp> ;
    tc:file <http://tiree.snipit.org/talis/data/temp_spec.txt> ] ;
  tc:extensions (
    [ tc:uri <http://temp.example.com/spec/extend/meteo/> ;
      tc:file <http://tiree.snipit.org/talis/data/temp_ext_spec.txt> ] ) ;
  tc:exceptions (
    [ tc:uri <http://tiree.snipit.org/talis/data/exceptions.txt> ;
      tc:file <http://tiree.snipit.org/talis/data/exceptions.txt> ] ) ;
  tc:options () .