

The problem: take XHTML fragments, parse out all the "a" tags, and check whether their linked resources are of a certain type. If they are, dereference that content and inline it into the fragment, leaving non-matching a tags alone. That ignores a raft of environmental details, like permissions, link type checking, link availability, testing on an app server, skinning the dereferenced content, speed, and so on. The difficulty: the markup fragment might not be well-formed.
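The difficulty is concrete: a strict XML parser simply rejects a fragment like that outright. A minimal illustration with Python's standard library (the fragment here is made up):

```python
import xml.etree.ElementTree as ET

# A typical hand-edited fragment: the <br> is never closed,
# so this is not well-formed XHTML.
fragment = '<p>see <a href="/notes/1">this</a><br></p>'

try:
    ET.fromstring(fragment)
    print("parsed")
except ET.ParseError as e:
    # A strict parser gives up on the whole fragment.
    print("strict parse fails:", e)
```

Nothing downstream of that exception gets to run, which is the whole problem.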

My first reaction was to use regexes, which meant I had two problems. I would have had to split the content into regex groups around the links, process the links, keep a memo of which links are up for expansion and which are not, dereference the content for the expandable ones, inline that content, stitch it all back together, and send on the output. It looked, at best, complicated. My second reaction was a stream parse intercepting the a tags, writing out embedded content where the links matched the inlinable types. I couldn't find tools in Python that would handle dodgy markup in streaming mode and write the content back out cleanly (as TagSoup does for Java).

Why not insist that the content come in well-formed? That would open up the toolchain. But it would also hurt the users, who want to be able to preview in mid-flight; being fascistic about well-formedness would just make the application frustrating to use. Well-formed markup is the end, not the means.


I wound up restating the problem - accept that the fragments would be a mess - now what?

I ended up using a library called BeautifulSoup. BeautifulSoup is Python code that will parse junk markup and give you a tree. Really, it's quite something: it'll take on any old nonsense and create an HTML tree in memory. It also goes a very long way to get your content into a decent state for Unicode.
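A small sketch of that behaviour, using bs4 (the modern descendant of the BeautifulSoup the post describes; the junk fragment is made up):

```python
from bs4 import BeautifulSoup

# Unclosed tags everywhere - nowhere near well-formed.
junk = '<p>read <a href="/notes/1">this<p>and <b>that'
soup = BeautifulSoup(junk, "html.parser")

# The parser recovers a tree anyway, and the tags are queryable.
hrefs = [a["href"] for a in soup.find_all("a")]
print(hrefs)  # ['/notes/1']
```

The point isn't that the recovered tree is "correct", it's that you get a tree at all, instead of an exception.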

It worked. I was eventually able to get inlined content to come out as a microformat. The lesson I (re)learned was that using BeautifulSoup, and in the past Universal Feed Parser and Tidy, makes it clear there's some economic value to be had in giving up on well-formedness in a judicious fashion.
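The overall flow (parse the mess, test each link, splice in content for the matches) can be sketched like this; bs4 stands in for the original BeautifulSoup, and the dereference table is a hypothetical stand-in for the app's type checks and HTTP fetch:

```python
from bs4 import BeautifulSoup

# Hypothetical: maps inlinable hrefs to their dereferenced content.
INLINABLE = {"/quotes/42": "<blockquote>fetched content</blockquote>"}

def inline_links(fragment):
    soup = BeautifulSoup(fragment, "html.parser")
    for a in soup.find_all("a", href=True):
        body = INLINABLE.get(a["href"])
        if body is not None:
            # Splice the dereferenced content in place of the link;
            # non-matching a tags are left alone.
            a.replace_with(BeautifulSoup(body, "html.parser"))
    return str(soup)

print(inline_links('see <a href="/quotes/42">q</a>, <a href="/x">x</a>'))
```

The matching link comes out replaced by its content, while the non-matching one survives untouched.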

[By the way, Effbot has announced an ElementSoup wrapper for BeautifulSoup.]


Engineers have a concept called tolerance. A tolerance specifies the variance in dimensions under which a part or component can be built and still be acceptable for production use. There are all kinds of ways to state tolerances, but perfect tolerances are neither physically possible nor desirable: they are too expensive. There is a diminishing-returns curve for manufacturing cost as you tighten a tolerance. Engineers (real ones, not programmers) use tolerances to actively manage cost and risk.

Every major commercial project I have worked on, every one, has had the issue of "data tolerances" being off, where two or more systems did not line up properly. The result invariably is to fix one end, both ends, or insert a compensating layer - what mechanics call a 'shim' and what programmers call "middleware". Software projects unfortunately don't have notions of tolerance. In software we lean more toward binary and highly discrete positions on the data: "well-formed" v "ill-formed", "valid" v "invalid", "pass" v "fail", "your fault" v "my fault". This doesn't just happen before go-live - interoperation is subject to entropy and decay - systems will drift apart over time unless they are tended to. Reality is Corrosive.

There's a political dimension to consider. If you accept that you might get junk every now and then, and introduce permissible levels of error, you get to mitigate the interminable and inevitable blame-slinging over who should pick up the tab because two systems' data do not line up as predicted. I've seen schedules put at risk over such arguments, when the costs could just as easily have been shared.

We don't have the tools or metrics just yet for defining data tolerances as acceptable practice, but that might happen if enough of these parse-anything libraries come online, so that we can put a dollar cost on insisting on perfect markup flying about end to end versus judiciously giving up on syntactic precision.


The code for BeautifulSoup is worth a read, along with Tidy, TagSoup, and Universal Feed Parser. Overall, they read like a bunch of error-correcting codes strangling a parser.

If we assume or allow that most data on the web is syntactic junk and will always be syntactic junk, and in truth there's no reason to assume otherwise, then there is a good argument that we'll need a layer of converters whose purpose is to parse content no matter what. My takeaway is that the Semantic Web, or anything less grandiose but essentially similar in aims, such as structured blogging, microformats, or enterprise CMSes and wikis, can embrace code like BeautifulSoup, TagSoup and Universal Feed Parser as necessary evils.

update via James: Ian Hickson is defining how parsers should deal with invalid HTML.

In the Semantic Web case, I think tag soup parsers are a fundamental layer of that architecture - syntactic converters that work just like analog-to-digital converters. They set you up for making sense of the data by actually allowing you to load it, instead of dropping it on the floor and failing. Without that layer, tools like GRDDL (a way of extracting RDF from XML/XHTML) don't get to execute at all. [By the way, there's plenty of prior art in robotics and physical agent systems for building these kinds of layered or hybrid architectures.]

Now, some people will find simply entertaining the idea of junk content a deplorable state of affairs, one that will inevitably lead to some kind of syntactic event horizon, where the Web collapses under the weight of its own ill-formedness. On the other hand, if you allow for some garbage in and try to do something with it, you get to ship something useful today, and perhaps build something more valuable on top tomorrow. Plus, we're already in a deplorable state of affairs. I find myself conflicted.

Last word to Anne Zelenka, speaking about the feed parser:

"I wouldn't call it a necessary evil, just necessary. Life is messy :)"

November 7, 2006 12:33 PM


Ian Bicking
(November 7, 2006 05:43 AM #)

I've had pretty good luck with lxml/libxml2's HTML parser. For me it's been the primary selling point for that library.

Paul Downey
(November 7, 2006 09:45 AM #)

I particularly like the notion of "tolerance" and the inescapable need for an A-D converter. I wonder if there is value in standardising how a soup parser should work, which bits they can ignore, a bit like the frequencies you can dump when encoding MP3s ...


(November 7, 2006 04:33 PM #)

I used the universal feed parser when I was doing a little python project, it was great. I wouldn't call it a necessary evil, just necessary. Life is messy :)

(November 9, 2006 06:10 AM #)

Wow, thanks for giving me the last word, cool!

And yet I still can't shut up.

Bill de hOra
(November 10, 2006 11:33 PM #)

"And yet I still can't shut up."

Please don't :)
