Transformation pipelines and domain mapping as semantic mashups

Dare Obasanjo: "Proponents of Semantic Web technologies tend to gloss over these harsh realities of mapping between vocabularies in the real world."

Blech, that's pretty weak. This stuff is hard, but really now, who doesn't know that? The people I'm aware of who work with such technologies (or any technologies) are under no illusions as to how difficult this is. It's no secret I've got little sympathy for the syntax-doesn't-matter position around RDF, but strawmen like the above are pointless. Herein some thoughts on domain mapping and metadata.

You can't talk about metadata unification sensibly without an economic angle to quantify what's meant by "quality". Without answering the question "when is the metadata good enough?" you're on a hiding to nothing. In the real world what you do to make the mapping is largely (and often exclusively) dependent on how much money and time you have to figure it out. Indeed being time and resource bound is an operational definition of "real world" for software developers. The people that will pay to have data mapped and suffer the most from variant models tend to be inside enterprises, and enterprises these days don't like to take risks and are most definitely resource constrained.

"What I'd be interested in seeing is whether there is a way to get some of the benefits of Semantic Web technologies while acknowledging the need for syntactical mappings as well."

Realistically? Transformation pipelines. Separating syntax and semantics sounds clean from some purist or architectural viewpoint, but practically speaking you often need to consider the two together for a given integration. You also want to manage any interesting mappings as discrete units of work so the system doesn't buckle under its own logic.

100% fidelity is a fallacy. As mentioned, the people that want metadata unification are most definitely resource constrained. 100% fidelity in most cases is probably not cost-effective or even needed. When you accept that 100% fidelity for metadata is a fallacy, this frees you up to look at new approaches instead of pursuing dead ends, in much the same way that accepting latency frees you up in designing distributed systems.

You don't need to do the mapping in one shot. This is one place where the pipelining idea knocks XSLT into a cocked hat. For example, if you can map vocabularies A.xml and B.xml into RDF/XML syntax, then you can apply an RDF or OWL based mapping in turn to the RDF variants to achieve a decent unification. You do not want to try all that in one shot. I think if we make any progress whatsoever at web scale on automated domain mapping we'll find that the point transformations are being layered and composed along a computational lattice. Given what we know about distributed systems and biological signal processing we should expect this approach of linking up specialized transforms to be more robust than a general purpose technique or canonical models. It also means you can start small and focused, which is critical for successful deployments. When you add URL query strings into the mix for obtaining the data, this is exactly how mashups work today - passing filtered data from script to script via chains of HTTP GETs.
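To make the layering idea concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the two sample documents standing in for A.xml and B.xml, the plain (subject, predicate, object) tuples standing in for RDF, and the "dc:" names used as a shared target vocabulary. The point is only that each stage is a small, separate unit of work and the unification falls out of composing them.

```python
import xml.etree.ElementTree as ET
from functools import reduce

# Hypothetical sample documents standing in for A.xml and B.xml.
A_XML = "<book id='b1'><title>Dune</title><author>Frank Herbert</author></book>"
B_XML = "<item ref='b1'><name>Dune</name><creator>Frank Herbert</creator></item>"


def xml_to_triples(id_attr):
    """Stage 1 (syntax): lift a flat XML record into a set of triples."""
    def stage(xml_text):
        root = ET.fromstring(xml_text)
        subject = root.get(id_attr)
        return {(subject, child.tag, child.text) for child in root}
    return stage


def rename_predicates(mapping):
    """Stage 2 (vocabulary): rewrite predicates; unknown ones pass through."""
    def stage(triples):
        return {(s, mapping.get(p, p), o) for s, p, o in triples}
    return stage


def pipeline(*stages):
    """Compose stages left to right into a single callable."""
    return lambda data: reduce(lambda acc, stage: stage(acc), stages, data)


# Assumed target vocabulary, chosen only for the example.
A_MAP = {"title": "dc:title", "author": "dc:creator"}
B_MAP = {"name": "dc:title", "creator": "dc:creator"}

unify_a = pipeline(xml_to_triples("id"), rename_predicates(A_MAP))
unify_b = pipeline(xml_to_triples("ref"), rename_predicates(B_MAP))

triples_a = unify_a(A_XML)
triples_b = unify_b(B_XML)

# Both sources now describe the record in the same terms.
print(triples_a == triples_b)
```

Adding a third vocabulary here would mean writing one more small mapping and reusing the existing stages, rather than touching a single monolithic transform.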

You can automate some of the work. John Sowa reported a few years back on some work on automatically extracting models from databases. The problem presented was to normalize a tranche of database models. In terms of the time+money angle already mentioned, the standard approach - deploy consultants and specialists to reverse engineer a canonical model - was slated at some years and some millions of dollars. The alternative tried out was to use an automated approach that combined some analogical reasoning and pattern matching to abduct* the models and then unify them. The results were frightening - the automated approach in combination with two guys did a good job, quickly, and at a fraction of the cost. At a lesser level I've seen web applications unified into a single application via automated scraping of the apps and automated form fill-ins, combined with an exception management system where a person steps in and does what the machine can't. That was cheaper and faster than the alternative suggestion of throwing out the webapps and unifying their underlying databases into a central one. In this approach, software does data cleansing for the users, whereas in the orthodox approach to data integration, users clean up data for software. In short, let the computers filter and preprocess the data en masse and have them spit out whatever they can't resolve at any stage for human analysis. The way the current Web is organised around people pulling down data from search engines and feeds makes for a rather large exception management system.
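The exception management idea can be sketched in a few lines of Python. This is not the system described above, just an illustration under assumed names: the stages, the country table, and the sample records are all made up. Each record runs through the automated stages, and anything a stage can't resolve is set aside, along with the stage that gave up and why, for a person to look at.

```python
# Hypothetical lookup table; a real one would be much larger.
COUNTRY_CODES = {"ireland": "IE", "eire": "IE", "france": "FR"}


def normalise_country(record):
    """Map free-text country names onto codes, or give up loudly."""
    key = record["country"].strip().lower()
    if key not in COUNTRY_CODES:
        raise ValueError(f"unknown country: {record['country']!r}")
    return {**record, "country": COUNTRY_CODES[key]}


def parse_year(record):
    """Accept only plain four-digit years; anything else needs a human."""
    text = str(record["year"]).strip()
    if not (text.isdigit() and len(text) == 4):
        raise ValueError(f"unparseable year: {record['year']!r}")
    return {**record, "year": int(text)}


def run_with_exceptions(records, stages):
    """Run every record through the stages; queue failures for human review."""
    resolved, needs_human = [], []
    for record in records:
        current = record
        try:
            for stage in stages:
                current = stage(current)
        except ValueError as err:
            # Keep the original record, the failing stage, and the reason.
            needs_human.append((record, stage.__name__, str(err)))
        else:
            resolved.append(current)
    return resolved, needs_human


records = [
    {"country": "Ireland", "year": "2006"},
    {"country": " Eire ", "year": "'06"},
    {"country": "Atlantis", "year": "2006"},
]

resolved, needs_human = run_with_exceptions(
    records, [normalise_country, parse_year]
)

print(len(resolved), "resolved;", len(needs_human), "queued for a person")
for original, stage_name, reason in needs_human:
    print(stage_name, "->", reason)
```

The design choice worth noting is that a failure is treated as routine output rather than as a crash: the machine does the bulk of the cleansing and the human queue only holds what is left over.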

There's no magic in an automated approach. It boils down to rather sophisticated pattern matching. Even if the final automated mapping results are off, so much heavy lifting has been done by the reasoning and transformation toolchains that the cost of sending in someone to clean things up is much reduced. Think of it as HTML Tidy or the Universal Feed Parser, but more involved. Using automation here is a classic disruptive play - the results from an automated approach to unifying metadata might not be as comprehensive or as accurate as hand-mapping or shared up-front agreements, but it's so cheap to do, it becomes plausible in its own right.

* in the technical sense: infer the best possible explanation from the data.

January 21, 2006 11:49 PM