Semantics through Action - How it Works

About 1500 lines of JavaScript written in a week, so not pretty! View the source (datachooser-child.js, datachooser.js) to see for yourself, but it needs some radical refactoring before I'd be happy with it ;-)

Behind the scenes it is almost entirely client-side JavaScript, although it is intended to link with previously written backend semantic data-scraping code.

The demo page includes a pane with the underlying target pages, each enclosed in an iframe. These are in fact slightly rewritten copies of the target pages, automatically transformed by a backend 'decorator' script, which injects a little JavaScript into each page (compare the source of the raw An Cladh Beag page with its decorated version - look at the very beginning and end). The JavaScript captures the onclick event for each element and also adds spans for un-enclosed text (e.g. text that sits directly in a <div> before a list, rather than inside a <p> tag). The decorator script also has the effect of making the iframe contents come from the same source domain as the main page, thus allowing the JavaScript on each to communicate easily. Finally, a <base> tag is added to make sure images and links are not broken.
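The span-wrapping step can be sketched roughly as follows. This is not the decorator's actual code: it works on a simplified tree in which an element is `{ tag, children }` and a text node is a plain string, and both the class name `sxa-text` and the rule "wrap any non-blank text that shares a parent with an element" are assumptions made for illustration.

```javascript
// Simplified tree model (an assumption): elements are { tag, children },
// text nodes are plain strings.
function wrapLooseText(node) {
  if (typeof node === "string") return node;
  const hasElementChild = node.children.some((c) => typeof c !== "string");
  const children = node.children.map((child) => {
    if (typeof child === "string") {
      // Non-blank text that sits next to element siblings gets its own span,
      // so that it can be clicked and addressed by a CSS-like path.
      const loose = hasElementChild && child.trim() !== "";
      return loose
        ? { tag: "span", className: "sxa-text", children: [child] } // hypothetical class name
        : child;
    }
    return wrapLooseText(child);
  });
  // Return a copy rather than mutating the input tree.
  return { ...node, children };
}

// A <div> with text before a list, but not inside a <p> tag.
const div = {
  tag: "div",
  children: [
    "Some intro text",
    { tag: "ul", children: [{ tag: "li", children: ["first item"] }] },
  ],
};
const decorated = wrapLooseText(div);
```

Text that is the only content of its element (the `li` above) is left alone, since the enclosing element already gives it an address.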

As they load, the target pages all 'register' themselves with the parent and, as they do so, pass a small proxy object. The proxy allows the parent frame to ask code in each iframe to perform certain tasks. This is largely because jQuery does not seem to work well across iframes, but it also prepares for other forms of cross-frame interaction, for example if the target page continues to run in its original domain, but with script injected through a bookmarklet.
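A minimal sketch of this register-and-proxy arrangement is given below. All the names (`createRegistry`, `createChildProxy`, `highlight`, `clearHighlight`) are hypothetical and not taken from datachooser.js; the two sides are plain objects here so the example runs outside a browser.

```javascript
// Parent side: keeps one proxy per registered child page.
function createRegistry() {
  const proxies = new Map();
  return {
    register(pageId, proxy) {
      proxies.set(pageId, proxy);
    },
    // Ask one child page to perform a task through its proxy.
    ask(pageId, task, ...args) {
      const proxy = proxies.get(pageId);
      if (!proxy || typeof proxy[task] !== "function") {
        throw new Error(`No handler for '${task}' on page '${pageId}'`);
      }
      return proxy[task](...args);
    },
    pages() {
      return [...proxies.keys()];
    },
  };
}

// Child side: the small object handed to the parent at registration time.
// It closes over the child's own state, so the parent never touches it directly.
function createChildProxy(pageState) {
  return {
    highlight(path) {
      pageState.highlighted = path;
      return true;
    },
    clearHighlight() {
      pageState.highlighted = null;
      return true;
    },
  };
}

const registry = createRegistry();
const childState = { highlighted: null };
// In a same-origin iframe the child might reach the registry via window.parent.
registry.register("page-1", createChildProxy(childState));
registry.ask("page-1", "highlight", "div#content > h1");
```

Because the parent only ever calls methods on the proxy, the same interface could in principle sit on top of a different transport if the frames were not on the same domain.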

When the user clicks an element in the target page, the JavaScript walks the DOM tree to build a complete path description, and then passes this to the parent frame, to become the selected element. The path is rendered using a list with CSS styling (not so beautiful, but to give the idea!). When the user clicks one of these path items, the parent page asks the child page to adjust the highlighting appropriately (using the proxy objects).
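The tree walk might look something like the sketch below. The step format (`tag`, `id`, `classes`) and the function names are assumptions for illustration, not the demo's internal representation. The functions only read `tagName`, `id`, `className` and `parentNode`, so they work on real DOM elements and on the plain stand-in objects used here.

```javascript
// Walks from the clicked node up to the root, recording one step per element.
// The loop stops at the first ancestor without a tagName (e.g. the document).
function buildPath(node) {
  const steps = [];
  for (let n = node; n && n.tagName; n = n.parentNode) {
    steps.unshift({
      tag: n.tagName.toLowerCase(),
      id: n.id || null,
      classes:
        typeof n.className === "string"
          ? n.className.split(/\s+/).filter(Boolean)
          : [],
    });
  }
  return steps;
}

// Renders the steps as a CSS-like selector string.
function pathToString(steps) {
  return steps
    .map(
      (s) =>
        s.tag + (s.id ? "#" + s.id : "") + s.classes.map((c) => "." + c).join("")
    )
    .join(" > ");
}

// Stand-in objects in place of real DOM elements.
const html = { tagName: "HTML", id: "", className: "", parentNode: null };
const body = { tagName: "BODY", id: "", className: "", parentNode: html };
const content = { tagName: "DIV", id: "content", className: "entry place", parentNode: body };
const heading = { tagName: "H1", id: "", className: "", parentNode: content };

console.log(pathToString(buildPath(heading)));
// html > body > div#content.entry.place > h1
```

Keeping the path as a list of steps, rather than a single string, is what makes it easy to render each step as a clickable item.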

The left hand data pane is populated with placeholders based on a number of predefined schemas. This is precisely as one would expect for scenario 1 and scenario 2, but these schemas themselves would be editable for scenario 3. Currently the schema is simply included in the JavaScript for the parent page as a series of JavaScript objects, but could easily be fetched using a JSONP API.
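To give a flavour of what a schema held as JavaScript objects could look like, here is a small sketch. The schema shape and every field name in it are invented for illustration and are not the demo's actual schemas.

```javascript
// Hypothetical schema: field names and structure are illustrative only.
const placeSchema = {
  name: "place",
  fields: [
    { name: "title", label: "Name" },
    { name: "description", label: "Description" },
    {
      name: "location",
      label: "Location",
      fields: [{ name: "gridRef", label: "Grid reference" }],
    },
  ],
};

// Builds an empty placeholder for every field, recursing into sub-fields.
function makePlaceholders(schema) {
  return schema.fields.map((field) => ({
    name: field.name,
    label: field.label,
    path: null, // filled in when the user presses 'set'
    value: null,
    subFields: field.fields ? makePlaceholders(field) : [],
  }));
}

const placeholders = makePlaceholders(placeSchema);
```

Since the schema is plain data, the same structure would serialise directly to JSON if it were fetched from a server instead.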

When there is a current selection, the 'set' button on each data field is enabled. If the user clicks this, the CSS path and value of the currently selected element are associated with the relevant data item (only the value is displayed, but the full path is stored internally).

A separate data structure is maintained for each page being viewed. In a real application, these would be saved to backend data storage, but in the demo they are just held in memory while the web page is viewed.
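One way to picture the 'set' action and the per-page storage is the in-memory sketch below. The names (`createPageStore`, `set`, `shown`, `stored`) and the example URL are hypothetical.

```javascript
// In-memory store: page URL -> (field name -> { path, value }).
function createPageStore() {
  const pages = new Map();
  return {
    // Called when 'set' is pressed: records path and value of the selection.
    set(pageUrl, fieldName, selection) {
      if (!selection) throw new Error("No current selection");
      if (!pages.has(pageUrl)) pages.set(pageUrl, new Map());
      pages.get(pageUrl).set(fieldName, {
        path: selection.path,
        value: selection.value,
      });
    },
    // What the data pane displays: the value only.
    shown(pageUrl, fieldName) {
      const entry = pages.has(pageUrl) ? pages.get(pageUrl).get(fieldName) : undefined;
      return entry ? entry.value : "";
    },
    // What is kept internally: both path and value.
    stored(pageUrl, fieldName) {
      return pages.has(pageUrl) ? pages.get(pageUrl).get(fieldName) || null : null;
    },
  };
}

const store = createPageStore();
const selection = { path: "div#content > h1", value: "Example title" };
store.set("https://example.org/place/1", "title", selection);
```

Each page's entries are independent, so the examples from several pages remain available later when generalising.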

Each time the user switches pages, the system attempts to generalise from the examples it has seen already using a simple form of machine learning.

Note that the current demo only applies this to top-level fields. Extending it to sub-fields is straightforward, but repeated fields are a little more complex (like generalising across pages, but slightly different).

The algorithm can be seen as a form of concept learning, but has aspects that are special to web pages.

There are two forms of learning.

  1. When there is only a single hand-selected page as an example, it simply strips any ids or classes that are unique to that page.
  2. When there are multiple pages it looks for commonality between the separate CSS paths specified for each page.

In fact (1) seems to work so well, I've not been able to test (2) properly yet!
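The two forms can be sketched as below, under stated assumptions rather than as the demo's actual algorithm. A path is taken to be a list of `{ tag, id, classes }` steps. For form 1, the ids and classes seen on other pages are assumed to be supplied as sets. For form 2, the example paths are assumed to have the same number of steps. The function names are hypothetical.

```javascript
// Form 1: a single example page. Keep only ids and classes that also occur
// on other pages; anything else is treated as specific to this page.
function stripPageSpecific(path, seenElsewhere) {
  return path.map((step) => ({
    tag: step.tag,
    id: step.id && seenElsewhere.ids.has(step.id) ? step.id : null,
    classes: step.classes.filter((c) => seenElsewhere.classes.has(c)),
  }));
}

// Form 2: several example pages. Keep, step by step, only what all the
// example paths agree on. Returns null if the paths differ in length.
function commonPath(paths) {
  if (paths.length === 0) return null;
  const [first, ...rest] = paths;
  if (rest.some((p) => p.length !== first.length)) return null;
  return first.map((step, i) => ({
    tag: rest.every((p) => p[i].tag === step.tag) ? step.tag : "*",
    id: rest.every((p) => p[i].id === step.id) ? step.id : null,
    classes: step.classes.filter((c) =>
      rest.every((p) => p[i].classes.includes(c))
    ),
  }));
}

const pathA = [
  { tag: "div", id: "post-123", classes: ["entry", "post-123"] },
  { tag: "h1", id: null, classes: ["title"] },
];
const pathB = [
  { tag: "div", id: "post-456", classes: ["entry", "featured"] },
  { tag: "h1", id: null, classes: ["title"] },
];

const seenElsewhere = {
  ids: new Set(["content"]),
  classes: new Set(["entry", "title"]),
};
const generalisedFromOne = stripPageSpecific(pathA, seenElsewhere);
const generalisedFromMany = commonPath([pathA, pathB]);
```

In this example both forms reduce the first step to `div.entry`, which hints at why the single-example form may often be enough on template-generated pages.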

Note that the paths are not 100% CSS, but have some extra bits, in particular to deal with the fact that CSS cannot refer to pure text nodes (hence wrapping them in spans in the decorated version of the page). In addition, the hand-crafted specification notation allows more complex things, such as prefix stripping, that are not yet in the demonstrator.

For examples of hand-written rules see datamapper prototype for:

tireeplacenames A-Z: HTML | JSON | original page
  page with a lot of repeats (over 3000!), only the first 100 listed (see 'limit' field in the specification)
  (N.B. the HTML development page messes up the UTF-8 characters, but I think the JSON is fine.)
tireeplacenames place: HTML | JSON | original page
  as in the demo
  the hand specification includes things like regular expression prefixes that are not generated in the automatic authoring demo; some of these may be inferable, but some may always be for expert hand-editing only
openbible twitter stream: HTML | JSON | original page
  both dynamic information and a 'deep web' search resource
  edit the end of the URL to see live semantics extraction based on different searches

Note that these are obtained by looking up the domain of the url parameter in a 'database' and then matching against the pattern rules in the specifications.
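The lookup described here could be sketched along the following lines. The 'database' contents, the spec names and the URL patterns are all invented for illustration; the real datamapper specifications are not shown in this document.

```javascript
// Hypothetical 'database': domain -> list of specifications, each with a
// pattern that is matched against the path and query string of the URL.
const specDatabase = new Map([
  [
    "example.org",
    [
      { name: "place", pattern: /^\/place\/\d+$/ },
      { name: "a-z", pattern: /^\/a-z\/?$/ },
    ],
  ],
]);

// Looks up the domain of the url parameter, then returns the first
// specification whose pattern matches, or null if there is none.
function findSpec(urlParam, database) {
  let url;
  try {
    url = new URL(urlParam);
  } catch {
    return null; // not a valid absolute URL
  }
  const specs = database.get(url.hostname) || [];
  return specs.find((spec) => spec.pattern.test(url.pathname + url.search)) || null;
}

const matched = findSpec("https://example.org/place/12", specDatabase);
```

Including the query string in the match is a design choice made here so that search-style resources, where the interesting part is at the end of the URL, can be covered by the same mechanism.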

The big picture idea is that this sort of specification should be easily sharable (SxSi!), so that other groups and apps can use, create and share them. For example, the Search Computing project (SeCo) at Milan Polytechnic produces specifications of search end-points (like the openbible one) and I had related kinds of spec in Snip!t. These specifications can then be used in a variety of browser plugins (e.g. external markup), client-side tools (e.g. Aspire resource embedding), or backend processing (links to Kasabi).