Format Debt: what you can't say

Aristotle: "In passing, though, I have to note that it would be nice if we could do a better job of what media types tried to do with their type/subtype separation, ie. have a standardised way to specify a layering of specificity of formats, including multiple formats, so that it would be possible to say that a document is text, and specifically HTML, and specifically a combination of hCard+hTag+hEXIF+image-link, and specifically a Flickr photo, so as to allow clients to know what the representation means without having to parse it, at whatever their level of understanding of the specified format.

I don't know if this would work in practice, after all the type/subtype thing in media types is mostly a failure. Maybe that was just because it tried to constrain types to just two layers. It would also be necessary to do a better job of what media types tried to accommodate with the '+xml' suffix contortion, ie. make sure that types reliant on possibly multiple lower-level formats are expressible in a sensible fashion."

There are diminishing returns on patching around media types and formats: each round of doing a better job gets harder than the last. Let's call this "Format Debt". I think the media types construct is entirely inadequate for expressing mashed up formats in the way Aristotle wants, and we will be limited to patching around it, because the media type is deeply embedded into web architecture. I take a polarised position on this, because I think it's less important to be right than to push the debate along.

The syntax-first, liberally applied approach is good for adoption but has limits: inconsistent placement (eg MediaRSS in feeds), field duplication (eg Activity Streams in Atom), structural hacks (eg RDFa's QNames in content) and weakly-defined qualifiers (HTML/Atom rel, HTML5 data-*). Or it ends in parsing at all costs.

We say we want layered formats, because that's what the combination of IETF IDs, W3C Recommendations and deployed browsers and servers allow us to say. It's the Web version of the Blub paradox. What we want is layered data. What we want is not just to qualify a media type, but to describe the ingredients in the entity whose "shell" is the media type.

I think the argument that identifying and extracting mashed data from entities should happen at a higher layer than transfer is a good one. But an interim approach for dealing with Aristotle's wish might be media type extensions to well known formats that flag that mashed up data is contained within. These types won't be so specific as to say what exactly is contained (this is neapolitan, this is raspberry ripple), but it's enough information for a code switch.

Such an interim approach won't scale well - for example, trying to articulate the specific media type for an HTML document containing RDFa with a slew of vocabularies and divs with a slew of microformats is not viable. There is higher order data the way there is higher order programming, and this is too difficult to capture in general with media type declarations. Roughly - microformats are to HTML as closures are to functions, and RDFa is to microformats as a macro is to closures. Another limitation is that people doing basic publishing are not going to be speccing the served media type - most people don't know what a media type is. The frameworks will need to support that kind of specificity, which means the editing tools need to signal to the server that what's being published is mashed up.

A better interim approach might be to "just use" a new HTTP header.
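To make the "code switch" idea concrete, here is a rough sketch of a client deciding which extractors to run from a response header alone. The header name X-Embedded-Formats and the handler names are made up for illustration; nothing like this is standardised, and the token syntax (a comma-separated list) is just an assumption.

```python
# Hypothetical header name: an assumption for this sketch, not a real standard.
EMBEDDED_HEADER = "X-Embedded-Formats"


def embedded_formats(headers):
    """Return the set of format tokens a response claims to contain."""
    # HTTP header names are case-insensitive, so normalise before lookup.
    lowered = {name.lower(): value for name, value in headers.items()}
    raw = lowered.get(EMBEDDED_HEADER.lower(), "")
    return {token.strip().lower() for token in raw.split(",") if token.strip()}


def choose_handlers(headers, handlers):
    """Pick the handlers for the flagged formats, sorted by format name."""
    found = embedded_formats(headers)
    return [handlers[name] for name in sorted(found) if name in handlers]


# Example response headers and a (hypothetical) registry of extractors.
headers = {
    "Content-Type": "text/html; charset=utf-8",
    "x-embedded-formats": "hCard, RDFa",
}
handlers = {
    "hcard": "parse_hcard",
    "rdfa": "parse_rdfa",
    "mediarss": "parse_mediarss",
}
```

The point is only that the client switches on declared ingredients before touching the body; it says nothing about where in the body they sit.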

Another approach might be to ignore the syntactic structure altogether post-parse for extensions in code APIs. When the chunk of syndicated XML or RDFa/microformatted HTML is turned into code, that code can have a method that returns the list of the found extensions as data structures, instead of asking developers to go hit and miss through the code. The found extensions can in turn be iterated over. I've written code like this against Atom that allowed you to get all the links matching a rel value argument without caring about their placement. The HTML5 DOM does something similar for data-* attributes and it seems doable for syndication/HTML extensions in general in other libraries. The Universal Feed Parser and Beautiful Soup indicate syntax reality is messy but can be dealt with.
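A minimal sketch of the Atom case, using only Python's standard library. This is not the code I wrote at the time; the function name links_with_rel and the sample feed are invented for the example.

```python
import xml.etree.ElementTree as ET

ATOM_NS = "http://www.w3.org/2005/Atom"


def links_with_rel(xml_text, rel):
    """Return the href of every atom:link with the given rel, wherever it sits."""
    root = ET.fromstring(xml_text)
    matches = []
    # iter() walks the whole tree in document order, so placement is irrelevant.
    for link in root.iter("{%s}link" % ATOM_NS):
        # Atom treats a link with no rel attribute as rel="alternate".
        if link.get("rel", "alternate") == rel:
            matches.append(link.get("href"))
    return matches


# Sample feed with links at feed, entry and source level.
FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <link rel="self" href="http://example.org/feed"/>
  <entry>
    <link href="http://example.org/1"/>
    <link rel="enclosure" href="http://example.org/1.mp3"/>
    <source><link rel="self" href="http://example.org/other"/></source>
  </entry>
</feed>"""
```

Calling links_with_rel(FEED, "self") picks up both the feed-level link and the one buried in the source element, which is the "without caring about their placement" part.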

Colophon: RDF

The closest thing to a deployable web technology that might improve describing these kinds of data mashups, without parsing at any cost or patching, is RDF. Once RDF is parsed it becomes a well defined graph structure. It is not a structure most web programmers will be used to, but it is the same structure regardless of the source syntax or the code, and the graph structure is closed under all allowed operations.

If we take the example of MediaRSS, which is not consistently used or placed in syndication and API formats, that class of problem more or less evaporates via RDF. Likewise if we take the current zoo of contact formats and our seeming inability to commit to one, RDF/OWL can enable a declarative mapping between them. Mapping can reduce the number of man years it takes to define a "standard" format, by not having to bother unifying "standards" or by getting away with a few thousand fewer test cases.
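As a rough illustration of what a declarative mapping means here, the sketch below states the mapping itself as triples and applies it to a tiny graph. It is plain Python, not an RDF library or an OWL reasoner, and the claim that these particular vCard and FOAF properties are equivalent is an assumption made for the example, not something either vocabulary publishes.

```python
FOAF = "http://xmlns.com/foaf/0.1/"
VCARD = "http://www.w3.org/2006/vcard/ns#"
OWL_EQUIVALENT = "http://www.w3.org/2002/07/owl#equivalentProperty"

# The mapping is data, not code. These equivalences are illustrative assumptions.
MAPPING = {
    (VCARD + "fn", OWL_EQUIVALENT, FOAF + "name"),
    (VCARD + "nickname", OWL_EQUIVALENT, FOAF + "nick"),
}


def apply_mapping(graph, mapping):
    """Return graph plus the triples implied by the equivalences, in both directions."""
    equivalents = {}
    for s, p, o in mapping:
        if p == OWL_EQUIVALENT:
            equivalents.setdefault(s, set()).add(o)
            equivalents.setdefault(o, set()).add(s)
    result = set(graph)
    # Single pass only: chains of equivalences are not followed transitively.
    for s, p, o in graph:
        for other in equivalents.get(p, ()):
            result.add((s, other, o))
    return result


# A contact expressed with one vocabulary (example.org URI is a placeholder).
contact = {("http://example.org/people/a", VCARD + "fn", "A. Example")}
merged = apply_mapping(contact, MAPPING)
```

The input and output are both sets of triples, so the result can be fed straight back in or merged with another graph, and a consumer that only knows the FOAF property still finds the name.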

RDF has, after a decade, seen limited deployment, with developers and publishers preferring instead to incrementally patch syntax. Atom has XML extensions and rel attributes. HTML has RDFa and microformats. A few years ago RDF tended to be heavily criticised by syntax proponents. You are free to search through the xml-dev and rest-discuss archives, or search for "RDF Tax" or "RDF syndication war" to see what I mean. I hope that a year or two out, people will be less dismissive and at least willing to learn from RDF as the nuisance factor of format and media type limits increases.



    One detail: RDFa went out of its way to avoid using QNames in content. It uses an alternate abbreviation mechanism called CURIE. I might even argue that it's not a "structural hack". -m

    IMHO, it's not nice to reply to a mailing list thread discussion in your blog ... you should post it on the list!