Fragged

[I'm putting this down so I can point at it the next time I'm asked about fragment identifiers and URIs]

Norm Walsh:

There's never been any argument about hashed URIs, only slashed ones. As a result, the identifier for me, my physical person, became:

http://norman.walsh.name/knows/who#norman-walsh

As time passed, this had a practical consequence. If you dereferenced that URI, the server would send you the whole 'who' file that contained all the metadata about everyone. That file got to be big.

I ignored this problem as long as I could, simply living with the inconvenience, but when I decided to support “link groups” I faced a real hurdle.

The obvious URI for the 'link group' about me is the URI that identifies me. But linking to the hashed URI made following the link way too expensive to be of practical value. I could have cooked up an alternate URI for the link group, but that would effectively have been an alias. Aliases: bad.

Ouch. But Norm is super smart and he has found a way to deal with deployed fragment identifiers. Most of the rest of us won't fare as well.
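The cost Norm describes comes from how fragments are handled: the client strips everything after the '#' before it makes the request, so the server only ever sees the URI of the whole 'who' document. A minimal sketch using Python's standard urllib.parse (the variable names are mine, for illustration only):

```python
from urllib.parse import urldefrag

# Norm's identifier, as quoted above.
uri = "http://norman.walsh.name/knows/who#norman-walsh"

# urldefrag splits off the fragment; only document_uri goes over the wire.
document_uri, fragment = urldefrag(uri)

print(document_uri)  # http://norman.walsh.name/knows/who
print(fragment)      # norman-walsh
```

Whatever the fragment names, the response is the entire document, and it's left to the client to find the bit it wanted.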

RDF

"There's never been any argument about hashed URIs, only slashed ones."

That's not quite true. The semantic web crowd spotted problems with '#' URIs years ago - they would have been one of the first groups using URIs outside a browser context. I recall Dan Brickley describing them as a "downright broken piece of web architecture" at one point. RDF and its family of specs have taken time out to deal with '#', and no small amount of effort was expended cleaning out that nasty corner of webarch.

HTML

The "#" fragment identifier is an HTML thing. It is now considered 'context dependent' on the media type, but it's really an HTML thing. That it bleeds up into URI space and gets used in names is an anti-pattern, similar to how two, maybe three, generations of application designers and architects ended up using POST for everything.

Every now and then I find myself telling people not to put # on the end of anything that's going to get used in a semantic context - which reduces to all URLs.

You really don't want your absolute naming system for a planet to be driven by an arbitrary feature of a markup format. Tying a URI that you intend to use for stable identification to an HTML fragment is fragile.

Hence I think that URIs ending in # are busted, because URIs with '#' present a layering problem which mixes up concerns you want to keep apart (formatting and naming). URLs with '#', however, are immensely useful, which presents something of a dilemma when it comes to making choices.

XML conveniences

The next thing to do after talking down '#' is to try and get people to put / on the end of their XML namespaces for convenience reasons. For example, here's the Atom namespace:

"http://www.w3.org/2005/Atom"

It doesn't end with /, so moving its names in and out of non-feed contexts becomes an architectural and data modeling exercise instead of a bunch of regexes and string ops.
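To make the convenience argument concrete, here's a rough sketch of the string handling involved. The helper names and the slash-terminated namespace are hypothetical, made up for this example; only the Atom namespace URI is real.

```python
ATOM_NS = "http://www.w3.org/2005/Atom"     # real Atom namespace, no trailing '/'
SLASHED_NS = "http://example.org/ns/feed/"  # hypothetical namespace ending in '/'


def to_uri(namespace, local_name):
    """Naively concatenate a namespace URI and a local name."""
    return namespace + local_name


def from_uri(uri):
    """Split a URI back into (namespace, local name) at the last '/'."""
    namespace, _, local_name = uri.rpartition("/")
    return namespace + "/", local_name


# With a trailing '/', the round trip is plain string manipulation.
assert from_uri(to_uri(SLASHED_NS, "category")) == (SLASHED_NS, "category")

# Without it, the boundary between namespace and name is lost.
print(to_uri(ATOM_NS, "category"))            # http://www.w3.org/2005/Atomcategory
print(from_uri(to_uri(ATOM_NS, "category")))  # ('http://www.w3.org/2005/', 'Atomcategory')
```

Getting 'category' back out of the second case means knowing the namespace up front, which is where the data modeling starts.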

The problem is that the combination of local name and namespace URI is a logical infoset qname fella-in-the-sky thingie and has no defined lexical representation within XML or the XML stack. There's a lot of history and argument about this in the SOAP/XML world going back years - suffice to say it would have been really, really useful to have a widely deployed lexical notation for a namespaced name, but that window is long gone.

Still, it's mostly a convenience and not so bad by comparison with using # in URIs. Heavy users of RDF might not agree entirely - it's nice to be able to move from an XML element to a URI and back to an XML element.

 

URIs and names

The W3C went round and round on the hash v slash issue in URIs and eventually decided to split the set of web resources into two types - "information" and "everything else" - in combination with switching on the response code sent back by a web server when you use the URI in HTTP. It's hard to know where to start, but after 2 years, I've decided the compromise is devoid of meaning and can safely be ignored, unless you are running semantic web inference tools that actually try to conform to this nonsense, where you'll be subject to logical GIGO because you got back 200 instead of 303 from the webserver.
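For anyone who hasn't met the rule, this is roughly what a conforming tool ends up doing, as I read the TAG's httpRange-14 resolution. It's a sketch only; the function name and return labels are mine, not anything from a spec or a library.

```python
def classify_resource(status_code):
    """Guess what a URI identifies from the HTTP status of a GET on it."""
    if 200 <= status_code < 300:
        # 2xx: the URI is taken to identify an "information resource".
        return "information resource"
    if status_code == 303:
        # 303 See Other: the URI could identify anything at all.
        return "any resource"
    if 400 <= status_code < 500:
        # 4xx: the nature of the resource is unknown.
        return "unknown"
    return "not covered"


# A server that answers 200 for a URI meant to name a person:
print(classify_resource(200))  # information resource
print(classify_resource(303))  # any resource
```

One misconfigured response code and the inference engine concludes a person is a document, which is the GIGO I mean.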

"Since the scope of a URI is global, the resource identified by a URI does not depend on the context in which the URI appears "

This is to be very confused about identification - the web is not a special place where ambiguity in names doesn't exist.

elementtree: an aside

One of the reasons I use elementtree for XML work in Python is that it uses Clark notation for namespaces, so instead of having to deal with

"http://www.w3.org/2005/Atomcategory"

or carry some QName object/tuple around, I can work with

"{http://www.w3.org/2005/Atom}category"

which is a joy - it's so nice to be able to write against XML like this.

# Clark notation: "{namespace-uri}localname"
atomns = "http://www.w3.org/2005/Atom"
atomcn = "{%s}" % atomns
entrycn = "%sentry" % atomcn
...
entries = atomfeed.findall(entrycn)

But Clark notation is not widely deployed.
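For completeness, here's a self-contained version of the same idea that runs as-is against the standard library's xml.etree.ElementTree. The feed content is made up for the example, and titlecn is a name I've added alongside the ones above.

```python
import xml.etree.ElementTree as ET

# A tiny made-up feed, used only for illustration.
FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Example</title>
  <entry><title>First</title></entry>
  <entry><title>Second</title></entry>
</feed>"""

atomns = "http://www.w3.org/2005/Atom"
entrycn = "{%s}entry" % atomns  # Clark notation for atom:entry
titlecn = "{%s}title" % atomns  # Clark notation for atom:title

atomfeed = ET.fromstring(FEED)

# findall matches direct children by their Clark-notation name.
titles = [entry.findtext(titlecn) for entry in atomfeed.findall(entrycn)]
print(titles)  # ['First', 'Second']
```

The element names are plain strings the whole way through, so there are no prefix maps or QName objects to carry around.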

Back to XML

I think I probably was +1 on the Atom namespace URI (I might have mumbled something about not using a URI with '#' for my sins, but probably didn't). Since 2005 when Atom shipped, nothing has changed to indicate that avoiding '#' was a bad idea. Today I'd argue that Atom's namespace URI should end with '/', but I wouldn't lie down in the road about it.

Atom Protocol

Then there's this guy:

"http://purl.org/atom/app#"

Which is the placeholder for the APP namespace URI. We're going into last call, and I'll be asking that the eventual namespace ends in '/' and definitely not in '#'.

Cool URIs

It's not just names. There's so much irony here around cool URIs and fragment identifiers:

http://www.dehora.net/journal/2007/02/journal_migration_i_export_entries_from_1.html

it took me a whole weekend to absorb it. It really knocked the wind out of me that I might not be able to support the permalinks to hundreds and hundreds of comments. The Movable Type software my weblog is based on provides a consistent way to link to comments via its template system (so it'll produce consistent fragment URLs), but fragment URLs will typically barely survive a reskinning, or a rewrite.
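The fragility is easy to demonstrate. Below is a small sketch, using only the standard library's html.parser, that checks whether a fragment still points at something in a page. The class and function names and both snippets of markup are invented for the example; they are not Movable Type's actual output.

```python
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Collect every id attribute (and legacy <a name>) in a page."""

    def __init__(self):
        super().__init__()
        self.anchors = set()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if value and (name == "id" or (tag == "a" and name == "name")):
                self.anchors.add(value)


def fragment_resolves(html, fragment):
    """Return True if the fragment names an anchor present in the markup."""
    collector = AnchorCollector()
    collector.feed(html)
    return fragment in collector.anchors


# Hypothetical comment markup before and after a template change.
old_page = '<div id="comment-42">Nice post</div>'
new_page = '<li id="c42">Nice post</li>'

print(fragment_resolves(old_page, "comment-42"))  # True
print(fragment_resolves(new_page, "comment-42"))  # False
```

The URL before the '#' still works after the reskin; the permalink to the comment quietly stops meaning anything.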

I'm coming round to the idea that you can't have Cool URIs if you have fragments in them. I'd love it, just love it, if the W3C would say something about Cool URIs and fragment identifiers. And it does make me wonder whether purple numbers are a good idea.

2007-02-20
