Format mappings and transitivity

Dare Obasanjo has responded to my post Format Debt: what you can't say by asking "Can RDF really save us from data format proliferation?". Quoting him, quoting me*:

"Bill de hÓra has a blog post entitled Format Debt: what you can't say where he writes

The closest thing to a deployable web technology that might improve describing these kind of data mashups without parsing at any cost or patching is RDF. Once RDF is parsed it becomes a well defined graph structure - albeit not a structure most web programmers will be used to, it is however the same structure regardless of the source syntax or the code and the graph structure is closed under all allowed operations.

If we take the example of MediaRSS, which is not consistently used or placed in syndication and API formats, that class of problem more or less evaporates via RDF. Likewise if we take the current Zoo of contact formats and our seeming inability to commit to one, RDF/OWL can enable a declarative mapping between them. Mapping can reduce the number of man years it takes to define a "standard" format by not having to bother unifying "standards" or getting away with a few thousand less test cases.

I've always found this particular argument by RDF proponents to be suspect. When I complained about the lack of standards for representing rich media in Atom feeds, the thrust of the complaint is that you can't just plug in a feed from Picasa into a service that understands how to process feeds from Zooomr without making changes to the service or the input feed."

Being a proponent is relative. I'm not sure I'm considered an RDF proponent in the RDF community, having been critical in the past ;) But generally, I can't agree with the argument. Under the hood, it's just mapping and there's no magic here - technically the language (RDF in this case, there are others) will either be able to express the mappings or it won't. For example, RDF can't map Celsius to Fahrenheit, but I know it can map foo:title to atom:title.

"The issue I'm pointing out is that either way a developer has to create a mapping."

Right; the questions really are how many mappings, where they are declared and to what extent you can stand over them as being sound. We've been doing this in code for years for syndication formats by mapping them into internal object models - every library then having its own mappings that might or might not be consistent. Dare mentioned MediaRSS, and without an external configuration for extension formats, we'll have to do for MediaRSS as it appears in the wild today what we do for the 9+ RSS/Atom formats that are out there. The double whammy of Format Debt is that MediaRSS appears to need mapping to itself in Dare's examples, because parsing the syntax can result in different dict/tree data structures.

"The problem with this argument is that there is a declarative approach to mapping between XML data formats without having to boil the ocean by convincing everyone to switch to RDF; XSL Transformations (XSLT)."

Not quite the same thing (I'll explain why in a minute). XSLT is actually computationally more powerful than RDF - afaict XSLT could do the Celsius to Fahrenheit mapping. It can do the knight's tour.

"In my experience I've seen that creating a software system where you can drop in an XSLT, OWL or other declarative mapping document to deal with new data formats is cheaper and likely to be less error prone than having to alter parsing code written in C#, Python, Ruby or whatever. However we don't need RDF or other Semantic Web technologies to build such a solution today. XSLT works just fine as a tool for solving exactly that problem."

But XSLT is code. All we're saying by this is that XSLT code is cheaper and less likely to be error prone than Python et al. Which I can buy - an XSLT sheet done well can be an executable specification. All an RDF (or "interlingua") proponent will say is that RDF can be even cheaper and less error prone, and much of the reason not to adopt it is down to developer preferences, lack of familiarity, tooling and so on - i.e., much the same reason developers don't adopt XSLT, summarising the issue as "XSLT sucking".

Finally, I think you can easily argue that RDF/OWL gives more leverage for this kind of problem than XSLT, even though RDF is computationally less powerful, because it allows you to state relationships using formal semantics. For example if I write down that:

atom:title owl:sameAs foo:title

foo:title owl:sameAs bar:title

I can infer

bar:title owl:sameAs atom:title

without writing a line of code, and I can use that on seeing new data. The predicate "owl:sameAs" is what the formalists call transitive (and symmetric, which is what lets the conclusion run from bar:title back to atom:title), and this reasoning at a distance is the kind of thing RDF proponents are on about when they talk about "semantic webs". OWL in particular has a boatload of such predicates; sameAs is probably the best known.
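To make the inference concrete, here is a minimal Python sketch of the bookkeeping a reasoner does for you when a predicate is symmetric and transitive. It is not an OWL reasoner and uses no RDF library; the function names (same_as_classes, is_same) are made up for illustration, and the statements are just the two from the example above.

```python
from collections import defaultdict


def same_as_classes(pairs):
    """Group names into equivalence classes implied by sameAs pairs.

    Treats each pair as symmetric and transitive, using a small
    union-find structure. Hypothetical helper, for illustration only.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b  # merge the two classes

    classes = defaultdict(set)
    for name in list(parent):
        classes[find(name)].add(name)
    return list(classes.values())


def is_same(pairs, a, b):
    """True if a sameAs b follows from the stated pairs."""
    return any(a in cls and b in cls for cls in same_as_classes(pairs))


# The two statements written down in the post.
stated = [
    ("atom:title", "foo:title"),
    ("foo:title", "bar:title"),
]

# The inferred statement: never stated, but it follows.
print(is_same(stated, "bar:title", "atom:title"))
```

The point of the declarative approach is that this code lives once, inside the reasoner, and the mappings themselves stay as plain statements that can be added to without touching it.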

That kind of inference is not a remotely straightforward thing to do in XSLT. Rather than emulate Greenspun's 10th Rule by writing a half-baked, incomplete, buggy predicate reasoner in XSLT, you'll end up writing multiple XSLT sheets instead, and possibly trying to chain them together. This is the real problem with using XSLT in anger for this kind of work - it doesn't scale as the number of elements to map grows. In that scenario, people fall back to regular programming languages where you can use useful data structures like dicts and lists to manage the element names and their associations. That's why things like the feedparser don't (and won't) tend to get written in XSLT, and it's why the mappings will have to stay as private details of implementations for now.
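For a sense of what those private mappings tend to look like, here is a rough sketch of the dict-based approach. The alias table and the normalize function are hypothetical and not taken from feedparser or any other library; the prefixed element names are only illustrative.

```python
# Hypothetical alias table: source element name -> internal field name.
# Every library that does this keeps its own, slightly different, copy.
CANONICAL = {
    "atom:title": "title",
    "rss:title": "title",
    "dc:title": "title",
    "atom:summary": "summary",
    "rss:description": "summary",
}


def normalize(entry):
    """Rename known element names to internal ones.

    Unknown names pass through untouched; if two source names map to
    the same internal field, the first one seen wins.
    """
    out = {}
    for key, value in entry.items():
        out.setdefault(CANONICAL.get(key, key), value)
    return out


item = {"rss:title": "Hello", "rss:description": "A post", "media:content": "a.jpg"}
print(normalize(item))
```

The table is easy to extend, but it is buried in code, which is exactly why nobody else can reuse or check it.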

* on reflection, I blame Abba Singstar for that particular turn of phrase.




    Are there any multi-format parsers that use RDF inferencing to handle mapping between formats like you describe? If this approach is really as capable as you describe it should work in a more limited scope than the entire internet. If something like Beautiful Soup, or even UFP, existed that was implemented in terms of GRDDL (or whatever parser mechanism was convenient) and declarative RDF mapping it would make for a somewhat more compelling argument.

    I like the idea but I am concerned that the world might be just a little too messy for the declarative approach to work.

    The other benefit to the declarative style of RDF & OWL that you don't mention is that the equivalence (or other) relations can be easily re-published to allow those inferences to be shared. And those inferences can be used for data integration tasks that are outside of those that you originally envisaged.

    So within the specific example, creating and publishing an XSLT stylesheet to normalise or otherwise convert MediaRSS feeds into a common format, can only ever do that single task.

    By publishing OWL statements we can use those assertions in other contexts.

    It seems to me that it's time we started taking the same view of code as we do of data: opening it up and encouraging reuse. The way to achieve that is through declarative mechanisms that aren't tied to a particular environment.