CSV Meta Description

Alan Dix, Talis, 10th March 2011

updated
11/3/2011: turtle version of main specification files added
12/3/2011: minor edits, ref to Grinder
14/3/2011: support for user's own configuration files and Google Fusion data files
19/3/2011: added ref to csv2rdf4lod

Introduction

This is an experiment in creating a meta description for CSV files (or tables in general). It relates to ongoing discussions at Talis about ways to talk about data as well as represent the data itself.

There are basically three levels where one can use semantic descriptions (read RDF!) with data:

  1. representing the data itself (ingest into RDF)
  2. talking about the internals of semi-structured data
    (e.g. third column is population in millions, 1st col is country name)
    this can mean you can do (1) on demand if required
  3. talking about the data as a data set - just as library catalogue
    data describes books
    (e.g. this is a data set produced by the Home Office, it is about
    health, regions of the UK, cancer, nutrition)
    This would include provenance, etc.

This mini-project is addressed at level 2 above (with just a hint of 3) and at lifting tabular data to Tim Berners-Lee's 5 star rating (addressable by URI at component level and linked to external standard resources) [Ha10, Su10]. Because this is at level 2, when the underlying tabular data changes the meta-description can stay the same, so that (inter alia) fresh RDF can be generated. However, also by being a meta-description, the underlying data can be rendered in RDF in different ways (see multiple examples below), or not materialised at all and potentially queried uniformly in a federated manner alongside other RDF or level 2 described non-RDF resources.

Related Systems

While there are various tools for importing CSV files into RDF (e.g. csv2rdf), these address level 1. In contrast, this mini-project is about addressing level 2 (with just a hint of 3). RDF123 hits a similar level with graphical specification of row formats [HF08]. RDF123 does not deal with linking to external resources (TBL 4 star but not 5 star), but more recent work by the same group has been mining Wikipedia articles to help create semantic links in tables [FS10]. Southampton's Grinder is also similar, but so far aiming at a 4 star rating only [Gu10, Gu10b, ET21]. Another recent system, csv2rdf4lod, seems to be addressing similar issues; it was developed as part of the work of the TWC group at Rensselaer Polytechnic Institute on the Linking Open Government Data portal. I need to look at this in more detail, especially their description ontology.

A Live Example


(Image: Tim Berners-Lee on the 5 star rating)

The linked table data at star-scheme-by-example, which demonstrates Tim Berners-Lee's 5 star rating for open government data [Ha10, Su10], has been used as a driving example. The core question is whether this kind of additional linking can be achieved by some form of generic description of the table data, without having to hand-edit/clean individual entries, and furthermore allowing data to be re-imported (or dynamically accessed) on demand without fresh effort. This is in many ways like D2RQ, except that it also automates outward linking (e.g. a country code column linking to external data sets).

The CSV file temp_data.txt (below) is used as a starting point; it takes the data from star-scheme-by-example, but adds the place name as an additional column.

Galway,"Saturday, 13 November 2010",2 .
Galway,"Sunday, 14 November 2010",4
Galway,"Monday, 15 November 2010",7
Isle of Tiree,"Tuesday, 8 March 2011",7
Isle of Tiree,"Wednesday, 9 March 2011",5
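Since the day fields contain commas inside quotes, a real CSV parser is needed rather than a naive split. A minimal sketch in Python (the column names place/day/lowest_temp are taken from the table specification later in this document):

```python
import csv
import io

# A few rows of temp_data.txt, including the quoted day fields.
raw = '''Galway,"Saturday, 13 November 2010",2 .
Galway,"Sunday, 14 November 2010",4
Isle of Tiree,"Tuesday, 8 March 2011",7
'''

# Column names follow the table specification: place, day, lowest_temp.
rows = [dict(zip(("place", "day", "lowest_temp"), rec))
        for rec in csv.reader(io.StringIO(raw))]

print(rows[0]["day"])  # the comma inside the quotes survives parsing
```

Note that the first temperature value comes through as the raw string "2 .", exactly as entered; cleaning it up is left to the mapping rules.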

A table specification gives names and types to the columns (but it can be omitted or generated from a table header row), and this can be used, amongst other things, to create the turtle RDF version of the table at: http://tiree.snipit.org/talis/csvmapper/temp/all/ (add ?format={ turtle | json | xml | html | ntriples } for different formats, and mime={mimetype} or encode=html to help viewing in a browser).

All the URLs for columns, rows, etc. are live LOD; for instance, the day field of row 3 (counting from 1) is at csv:temp/row/3/field/day (note the prefix 'csv:' is used for 'http://tiree.snipit.org/talis/csvmapper/'). This returns an owl:sameAs link to csv:temp/row/3/col/2, which in turn links to the actual date.
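The row/field addressing just described can be sketched as simple URI templates (these helper functions are hypothetical illustrations, not part of the mapper itself):

```python
CSV_BASE = "http://tiree.snipit.org/talis/csvmapper/"

def field_uri(table, row, field):
    """URI of a row cell addressed by column name (rows counted from 1)."""
    return "%s%s/row/%d/field/%s" % (CSV_BASE, table, row, field)

def col_uri(table, row, col):
    """URI of the same cell addressed by column position."""
    return "%s%s/row/%d/col/%d" % (CSV_BASE, table, row, col)

# The field URI is asserted owl:sameAs the positional col URI.
print(field_uri("temp", 3, "day"))
print(col_uri("temp", 3, 2))
```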

Several data files can be configured to use the same table specification; csv:temp2/ uses the same specification applied to the data file temp2_data.txt.

Value and Resource Mappings

Note that the date in csv:temp/row/3/col/2 is encoded both as entered in the text and also converted to XSD standard format:

<http://tiree.snipit.org/talis/csvmapper/temp/row/3/col/2> msg0:has_value "Monday, 15 November 2010" ;
                                                           msg0:as_xsd_date "2010-11-15T00:00:00+00:00"^^<xsd:dateTime> .
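The text-to-xsd:dateTime step can be illustrated with Python's datetime (assuming an English locale for the day and month names; the mapper's own rules are generic and pluggable, so this only shows the transformation itself):

```python
from datetime import datetime, timezone

def as_xsd_date(text):
    """Parse a 'Monday, 15 November 2010' style date into xsd:dateTime form."""
    dt = datetime.strptime(text, "%A, %d %B %Y")
    return dt.replace(tzinfo=timezone.utc).isoformat()

print(as_xsd_date("Monday, 15 November 2010"))  # 2010-11-15T00:00:00+00:00
```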

This mapping between the date as entered and xsd:dateTime is managed by generic, pluggable rules. These rules can chain, so that the place field gets transformed to a Geonames feature name [Ge10] and then linked to matching feature resources (csv:temp/row/4/col/1):

<http://tiree.snipit.org/talis/csvmapper/temp/row/4/col/1>
                                        msg0:has_value "Isle of Tiree" ;
                                        msg0:as_geoname_feature_name "Tiree" ;
                                        msg0:as_geoname_feature <http://sws.geonames.org/2635781/> ,
                                                                <http://sws.geonames.org/6296635/> ,
                                                                <http://sws.geonames.org/2635780/> .

This gives 5 star linking to Geonames. The linking is enabled because the table specification has the type 'place_name' for the column, and this triggers the matching rules.

Note that the table contained the name "Isle of Tiree", which does not occur in the Geonames database, even as an alternativeName. However, an exceptions list can be added for each mapping (in this case saying that "Isle of Tiree" should be transformed to "Tiree"). Potential links are generated automatically, but they can be overridden. Indeed, in this example there are a lot of matches for Galway. This can be dealt with using the exception list to disambiguate, but perhaps the tool should automatically choose a 'best' option.
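A minimal sketch of how an exception list might sit in front of the name lookup (the dictionary shape here is an assumption; the real exceptions live in the exceptions file shown later):

```python
# Exceptions keyed by (mapping, column), mirroring the exceptions file.
EXCEPTIONS = {
    ("as_geoname_feature_name", "place"): {"Isle of Tiree": "Tiree"},
}

def feature_name(value, mapping="as_geoname_feature_name", column="place"):
    """Apply any exception before the value reaches the Geonames lookup."""
    overrides = EXCEPTIONS.get((mapping, column), {})
    return overrides.get(value, value)

print(feature_name("Isle of Tiree"))  # mapped via the exception
print(feature_name("Galway"))         # passed through unchanged
```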

Exception lists can be tailored for a particular data set; for example, while csv:temp2/ uses the same specification as csv/temp/ it has a different exception list that maps "Isle of Tiree" to "Crossapol".

Adding More Semantics – the Meteorological Ontology

So far the RDF structure follows closely the raw table structure. However, a more complex output can be seen at http://tiree.snipit.org/talis/csvmapper/temp/all/meteo, which builds a structure using the meteorological ontology [Pa11] at http://purl.org/ns/meteo.

This follows the example in star-scheme-by-example, minting entities that are not simply the lines and columns of the original table. There is a distinct place entity for each named place (for example, csv:temp/entities/#place/Galway), and two entities are minted for each row: one for the forecast itself and one for the temperature, as determined by the ontology.

<http://tiree.snipit.org/talis/csvmapper/temp/entities/#forecast_50b3a39570fac178401699bd163b2999>
        meteo:predicted "2010-11-13T00:00:00+00:00"^^<xsd:dateTime> ;
        meteo:temperature <http://tiree.snipit.org/talis/csvmapper/temp/entities/#temp_50b3a39570fac178401699bd163b2999> .
<http://tiree.snipit.org/talis/csvmapper/temp/entities/#temp_50b3a39570fac178401699bd163b2999>
        meteo:celsius "2 ." ;
        tempuri:units "celsius"^^<xsd:string> .
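The 32-hex-digit suffixes above look like a digest of the row's identifying fields. A hypothetical sketch (MD5 and the '|' separator are guesses, not the mapper's documented scheme):

```python
import hashlib

def mint_entity(prefix, *identifying_fields):
    """Mint a stable fragment id from identifying fields.
    Hypothetical: the hash algorithm and separator are assumptions."""
    digest = hashlib.md5("|".join(identifying_fields).encode("utf-8")).hexdigest()
    return prefix + digest

# The same identifying fields always yield the same entity id, so the
# forecast and temperature entities for one row share a suffix.
print(mint_entity("#forecast_", "Galway", "Saturday, 13 November 2010"))
```

The key property is determinism: re-importing the table re-mints the same entity URIs.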

Google Fusion Data

As well as raw CSV files, Google Fusion tables can be used as a data source; for example, csv:flags/ shows this applied to the Country Flags table (be patient: there is no caching at present, so it is a bit slow!).

Try It Yourself!

You can install specification files on your own web space and link to the mapper. A full set of examples can be downloaded from csv_mapper_remote.zip (7K), including both plain CSV files and Google Fusion tables. There is an example remote CSV installation at http://meandeviation.com/talis/remotecsv/remote_temp/ (see its configuration file) and also a Google Fusion example at http://meandeviation.com/talis/remotecsv/flags/ (configuration file). The installation uses a simple Apache redirect, so the mapping URL is revealed after you click through, but an Apache proxy or small script can hide this if desired.

Configuration file examples

Table specification

The table specification (below) is given in turtle and is at temp_spec.txt.

@prefix dt: <http://example.com/datatypes/> .
@prefix units: <http://example.com/units/> .
@prefix ts: <http://example.com/ts/> .
@prefix : <http://temp.example.com/spec/> .
@prefix tt: <http://temp.example.com/spec/tables/temp> .
tt:  ts:columns ( tt:/cols/place tt:/cols/day tt:/cols/temp ).
tt:/cols/place    ts:name         "place" ;
	              ts:title        "Location" ;
                  ts:type         dt:place_name .
tt:/cols/day      ts:name         "day" ;
                  ts:title        "Day" ;
                  ts:type         dt:date .
tt:/cols/temp     ts:name         "lowest_temp" ;
                  ts:title        "Lowest Temperature (°C)" ;
                  ts:type         dt:number ;
                  ts:represents   dt:temperature ;
                  ts:units        units:celsius .	

Exceptions file

Exceptions can be specified with respect to each mapping. This would be expected to be done using an interactive tool, but is currently in the file exceptions.txt (below). This says that whenever applying the 'as_geoname_feature_name' mapping to the column 'place' in the table 'csv:temp/', the system will map "Isle of Tiree" to "Tiree". The 'path' to the transformation means that other references to "Isle of Tiree" (for example if there were a title) are not affected. Furthermore, as the exception is applied to the mapping 'as_geoname_feature_name', not the 'place' field itself, the original data as entered is preserved in raw form if required.

@prefix dt: <http://example.com/datatypes/> . 										
@prefix units: <http://example.com/units/> . 								
@prefix ts: <http://example.com/ts/> . 										
@prefix tt: <http://temp.example.com/spec/tables/temp> .
@prefix tab: <http://tiree.snipit.org/talis/csvmapper/temp/> . 
@prefix : <http://tiree.snipit.org/talis/data/exceptions.txt> . 										

:  ts:exceptions (	[	ts:ex/path		( tab:  tt:/cols/place  ts:as_geoname_feature_name ) ;
						ts:ex/mapping	( [ ts:ex/from "Isle of Tiree" ; ts:ex/to "Tiree" ; ] )
					]
				) .

Meteorology Ontology Extension

A slightly more complicated configuration (below and in temp_ext_spec.txt) is required to specify the connection with the meteorological ontology. The first part, labeled 'entities', is about the minting of fresh entities, and the second part, labeled 'generate', describes patterns for the triples generated (rather like a SPARQL CONSTRUCT). Note that special schemes ('var:' and 'col:') are used to denote 'variables' of different kinds.

@prefix dt: <http://example.com/datatypes/> . 										
@prefix units: <http://example.com/units/> . 								
@prefix ts: <http://example.com/ts/> . 										
@prefix tt: <http://temp.example.com/spec/tables/temp> . 						
@prefix ext: <http://temp.example.com/spec/extend/meteo/> . 						
@prefix gn: <http://www.geonames.org/ontology#> .
@prefix meteo: <http://purl.org/ns/meteo#> .
@prefix : <http://temp.example.com/spec/extend/meteo/> . 										

ext:	ts:refersto  ts:tables/temp ;
		ts:foreachrow [ 
					ts:entities (
						[ ts:name "myplaceuri" ; ts:identifiedby ( "place" ); ts:relativePrefixUri "#place/";
						  ts:linkFromRow  ts:refers_to ;  ts:linkToRow  ts:referred_to_by
						]
						[ ts:name "forecasturi" ; ts:identifiedby ( "place" "day" ); ts:hash "true";
						  ts:relativePrefixUri "#forecast_" ;  ts:linkFromRow  ts:represents
						]
						[ ts:name "tempuri" ; ts:identifiedby ( "place" "day" ) ; ts:hash "true" ;
						  ts:relativePrefixUri "#temp_"
						]
					) ;
		            ts:generate (
						  ( "var:myplaceuri"  ts:prop/myplace/rawname  "col:place"  "inherit" )
						  ( "var:myplaceuri"  gn:name  "col:place/asGeonameFeatureName"  "inherit" )
						  ( "var:myplaceuri"  <owl:sameAs>  "col:place/asGeonameFeatureName/asGeonameFeature" )
						  ( "var:myplaceuri"  <meteo:forecast>  "var:forecasturi" )
						  ( "var:forecasturi"  <meteo:predicted>  "col:day/asXsdDate"  "inherit" )
						  ( "var:forecasturi"  <meteo:temperature>  "var:tempuri" )
						  ( "var:tempuri"  meteo:celsius  "col:lowest_temp"  "override"  ""  "xsd:decimal" )
						  ( "var:tempuri"  ts:prop/tempuri/units  "celsius"  "literal"  ""  "xsd:string" )
					)
		].


Overall Configuration File

All the above are linked with a single configuration file (below and at temp.conf.txt), which simply points to the relevant parts of the specification. Note that tc:uri and tc:file are both there because the storage location (file) and logical name (uri) may be different ... especially as it is still full of 'example.com' namespaces :-(

For a Google Fusion data set, simply set tc:source to point to the data set page, such as http://www.google.com/fusiontables/DataSource?dsrcid=86424 . Do make sure it is the original dataset page, not a 'snap' ('dsrcid' rather than 'snapid' in the URL). While snaps are dealt with, they are less efficient.

@prefix tc: <http://example.com/ts/config/> . 										
@prefix : <http://tiree.snipit.org/talis/data/config/temp.conf.txt> . 										

:	tc:base		<http://tiree.snipit.org/talis/csvmapper/temp/> ;

	tc:source		<http://tiree.snipit.org/talis/data/temp_data.txt>   ;
	
	tc:tablespec  [
			tc:uri  <http://temp.example.com/spec/tables/temp> ;
			tc:file  <http://tiree.snipit.org/talis/data/temp_spec.txt>
			] ;
	
	tc:extensions ( [ 
				tc:uri  <http://temp.example.com/spec/extend/meteo/> ;
				tc:file  <http://tiree.snipit.org/talis/data/temp_ext_spec.txt>
			] ) ;
	
	tc:exceptions (
			[
				tc:uri  <http://tiree.snipit.org/talis/data/exceptions.txt> ;
				tc:file  <http://tiree.snipit.org/talis/data/exceptions.txt>
			]
			) ;
	
	tc:options	() .
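On the Fusion Tables point above, checking that a URL is an original dataset page ('dsrcid') rather than a snap ('snapid') can be sketched as follows (illustrative only; not part of the mapper):

```python
from urllib.parse import urlparse, parse_qs

def fusion_dsrcid(url):
    """Return the dsrcid from a Fusion Tables DataSource URL, or None
    when the link is a snap ('snapid') or has no dsrcid at all."""
    qs = parse_qs(urlparse(url).query)
    if "snapid" in qs:
        return None  # snaps work but are less efficient
    ids = qs.get("dsrcid")
    return ids[0] if ids else None

print(fusion_dsrcid("http://www.google.com/fusiontables/DataSource?dsrcid=86424"))
```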
	

References

  1. [ET21] ET21. Dev8D 2012 wiki, dated 15 Feb. 2012. http://wiki.2012.dev8d.org/w/ET21
  2. [FS10] Finin, Tim and Syed, Zareen and Mulwad, Varish and Joshi, Anupam (2010) Exploiting a Web of Semantic Data for Interpreting Tables. In: Proceedings of the WebSci10: Extending the Frontiers of Society On-Line, April 26-27th, 2010, Raleigh, NC: US. http://journal.webscience.org/322/
  3. [Ge10] Geonames ontology. (Documentation) dated Oct. 2010. http://www.geonames.org/ontology/documentation.html
  4. [Gu10] Christopher Gutteridge (2010). Getting Machine-readable Spreadsheets. Southampton ECS Web Team. dated 17th Dec. 2010. http://blogs.ecs.soton.ac.uk/webteam/2010/12/17/600/
  5. [Gu10b] Christopher Gutteridge (2010). Publishing CSV to RDF part 2. Southampton ECS Web Team. dated 17th Dec. 2010. http://blogs.ecs.soton.ac.uk/webteam/2010/12/21/publishing-csv-to-rdf-part-2/
  6. [HF08] Han, L., Finin, T., Parr, C., Sachs, J., and Joshi, A. (2008). RDF123: from Spreadsheets to RDF, Proceedings of the Seventh International Semantic Web Conference, Springer, October 2008. http://ebiquity.umbc.edu/paper/html/id/414/
  7. [Ha10] Michael Hausenblas. Linked Open Data star scheme by example. Linked Data Research Centre, DERI at NUI Galway. dated 15th Nov. 2010. http://lab.linkeddata.deri.ie/2010/star-scheme-by-example/
  8. [Pa11] Sean B. Palmer. An Ontology for Meteorological Data. http://inamidst.com/sw/ont/meteo
  9. [Su10] Ed Summers. the 5 stars of open linked data. Inkdroid. dated 4th June 2010. http://inkdroid.org/journal/2010/06/04/the-5-stars-of-open-linked-data/