
Journal Migration I: export entries from

As mentioned elsewhere, I am moving off Movable Type in favor of my own codebase. I've decided weblogs are to this decade as editors were to the 1970s: you have to write your own. It's a pretty thin rationale - the 1970s more or less sucked as I recall*.

Cool URI Tax

I would very much like to preserve this journal's URL space. It's turning out that coding a weblog might be less work than porting the entries while preserving URLs. URLs matter. It's not just me you know - the W3C's Cool URI dogma more or less says I'm a random idiot if I don't do this, or didn't have the forethought to use software that makes this a snap. Nothing's worse than being a random idiot.

So. I've been writing a parser/loader against MT's export file. MT does not export Atom/RSS entries as you might expect; it outputs a line-based dump format.
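To give a feel for what such a loader involves, here is a minimal sketch in Python. It is not my actual parser. It assumes the export layout as I understand it from MT's import/export documentation: `KEY: value` metadata lines, sections closed by a line of five dashes, entries closed by a line of eight dashes, and `COMMENT:` / `PING:` sections that carry their own `KEY: value` lines before the text. The function names and the shape of the resulting dictionaries are made up for illustration.

```python
import re

ENTRY_SEP = "--------"   # assumption: eight dashes close an entry
SECTION_SEP = "-----"    # assumption: five dashes close a section
FIELD = re.compile(r"^([A-Z][A-Z ]*): ?(.*)$")  # e.g. "ALLOW COMMENTS: 1"


def split_on(lines, separator):
    """Split a list of lines into non-empty chunks at separator lines."""
    chunks, current = [], []
    for line in lines + [separator]:
        if line.strip() == separator:
            # Drop leading blank lines, then keep the chunk if anything is left.
            while current and not current[0].strip():
                current = current[1:]
            if current:
                chunks.append(current)
            current = []
        else:
            current.append(line)
    return chunks


def read_fields(lines):
    """Consume leading KEY: value lines; return (fields, remaining text)."""
    fields, i = {}, 0
    while i < len(lines):
        match = FIELD.match(lines[i])
        if not match:
            break
        # Repeated keys (several CATEGORY lines, say) overwrite each other here.
        fields[match.group(1)] = match.group(2)
        i += 1
    return fields, "\n".join(lines[i:]).strip()


def parse_export(text):
    """Parse a whole export dump into a list of entry dictionaries."""
    entries = []
    for chunk in split_on(text.splitlines(), ENTRY_SEP):
        sections = split_on(chunk, SECTION_SEP)
        meta, _ = read_fields(sections[0])
        entry = {"meta": meta, "body": "", "comments": [], "pings": []}
        for section in sections[1:]:
            name = section[0].strip().rstrip(":")
            if name in ("COMMENT", "PING"):
                fields, body = read_fields(section[1:])
                key = "comments" if name == "COMMENT" else "pings"
                entry[key].append({**fields, "text": body})
            else:
                # BODY, EXTENDED BODY, EXCERPT, KEYWORDS: plain multi-line text.
                entry[name.lower().replace(" ", "_")] = "\n".join(section[1:]).strip()
        entries.append(entry)
    return entries


# Invented sample data, in the assumed layout.
SAMPLE = """AUTHOR: Bill
TITLE: Hello
DATE: 02/05/2007 07:14:00 PM
-----
BODY:
First line.
Second line.
-----
COMMENT:
AUTHOR: James
DATE: 02/05/2007 10:44:00 PM
Nice post.
-----
--------
"""

entries = parse_export(SAMPLE)
```

Note what is absent from the parsed result: there is nothing in it that looks like an identifier.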


Trackbacks

It turns out that the MT dump format does not export trackback URLs. I thought this was bad. MT does tell me whether trackback is enabled for the entry and lists out received trackback details, but not the URL. I said to myself - "this renders MT's 'export entries from' feature, at best, underwhelming - I'll have to go to the DB to get at the trackback table. This in turn will be awkward as MT does not appear to output any kind of keys in the export that I can use to cross-reference the database with. I do mean any kind - not just pks, there are no URLs or slugs sent back - it's just data. I will have a Cool URI crisis if I don't solve this." 3 minutes (of actual) thought later, I realised it didn't matter. Trackback URLs are dynamically discovered either off the entry's html page (as commented out RDF! I know! It's crazy!) or perhaps in the entry's XML as an extension. They don't need Cool URIs.

Comments

The comments themselves do need Cool URI work because MT builds permalinks for them. Two points of mention on that.

Fragment IDs. First, my MT templates output links to 'comments' and 'permalinks' in each entry's HTML page as fragment identifiers (I think most MT templates keyed off the default one do that). That's ok I suppose - I can write a mapping for "#comments" and "#trackbacks" and have them redirect into some... urk, wait. I can't do that. Fragments don't function properly at the HTTP layer - for example they don't redirect**.

Primary keys. Second, the comments themselves have permalinks - again using fragment identifiers. Unfortunately these look like this: "#comment-104897". I suspect that number at the end is an autoincrementing primary key, and that key is not in the export file. This time, 3 minutes of thought has not dispelled the idea that I'm going back to the database to get the comment pks. Did I mention that MT does not appear to output any kind of keys in the export that I can use to cross-reference the database? Urk.

Conclusion

It would be a very good thing if future versions of weblog tools exported content as Atom/RSS and not some custom file format.

Irrespective of MT defaults and biases, my use of fragment identifiers has been shortsighted. I chose Django for my weblog's codebase, and with good reason. It has direct support for dealing with exposed URLs of this kind. Something like '#comment-104897' can be mapped so that '104897' can be selected against the 'random_idiot_legacy_fragid' field in the content model***. And it can be done so that you don't need to think about preserving legacy fragment identifiers in html for all eternity (read: I come up with some templating hack to auto-embed them into the entry html pages).
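One way the templating hack could look, as a rough sketch only. Because the browser never sends the fragment to the server, the old id has to be present in the entry's HTML for an old link to land anywhere. The `Comment` class, the `legacy_fragid` field name and the `c<pk>` anchor scheme below are all hypothetical, and this is plain Python rather than Django model or template code.

```python
from dataclasses import dataclass
from html import escape
from typing import Optional


@dataclass
class Comment:
    pk: int                               # key in the new system
    author: str
    legacy_fragid: Optional[int] = None   # hypothetical: the old MT comment key


def comment_anchors(comment: Comment) -> str:
    """Emit an anchor for the new id, plus one for the legacy fragment if known."""
    ids = [f"c{comment.pk}"]
    if comment.legacy_fragid is not None:
        # Keeps old links of the form entry.html#comment-104897 landing here.
        ids.append(f"comment-{comment.legacy_fragid}")
    return "".join(f'<a id="{escape(i)}"></a>' for i in ids)


migrated = comment_anchors(Comment(pk=1, author="Chris Dent", legacy_fragid=104897))
fresh = comment_anchors(Comment(pk=2, author="James"))
```

Comments written after the migration have no legacy id and get only the new anchor, so the old scheme does not leak into new content.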

As an aside, this exercise would be more problematic if the journal used MT's default of serving URLs with the entry primary key as the slug. At least the current approach of building the URL space with Years, Months and Titles means the current URL space can be reverse engineered from the data. Having to migrate auto incremented keys from an export file that doesn't supply them would bite, even though it looks like I will have a similar, smaller-scale problem with comments. I understand this might be counter-intuitive, as it seems I'm favoring natural keys over synthetic ones.
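Reverse engineering a path from the data might go something like the sketch below. Everything specific in it is an assumption for illustration: the `/archives/` prefix, the `.html` suffix, the `MM/DD/YYYY hh:mm:ss AM|PM` date format, and the slug rule (lowercase, drop punctuation, underscores for spaces). The real rule has to be checked against the URLs MT actually published.

```python
import re
from datetime import datetime


def slugify_title(title: str) -> str:
    """Hypothetical slug rule: lowercase, strip punctuation, spaces to underscores."""
    slug = re.sub(r"[^a-z0-9\s_]", "", title.lower())
    return re.sub(r"\s+", "_", slug.strip())


def legacy_path(date_str: str, title: str) -> str:
    """Rebuild a Year/Month/Title path from an entry's exported date and title."""
    posted = datetime.strptime(date_str, "%m/%d/%Y %I:%M:%S %p")  # assumed format
    return f"/archives/{posted:%Y/%m}/{slugify_title(title)}.html"


path = legacy_path("02/05/2007 07:14:00 PM", "Journal Migration I: export entries from")
```

Two entries with the same title in the same month would collide under this rule, which is one more thing to verify against the live site before trusting it.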


* I tell my children the 1970s in Ireland really were in black and white, just like old movies, but with more rain. Color came to Ireland around about 1982, and there was color reality in the US as far back as the 1960s. They don't believe me.

** At this point, I'm tempted to conclude that fragment identifiers or anything like them hurt if Cool URIs are the goal. Any URL that points into an HTML page (a representation that can change over time) is going to have a hard time remaining cool; it's a very high bar for publishers and especially for template/site designers. I missed the W3C memo explaining how the fragment id feature they recommend for HTML is Cool. I wonder how stable purple numbers will prove to be in the age of templates.

*** My second Django evaluation project involved a content migration exercise, and that was deliberate. Django can deal with legacy and/or rubbish URLs because it dispatches using regular expressions and has direct support for slugs in URLs, via regex groups. If you're going to commit to a web framework, a content migration exercise is a good way to stress its design.
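The idea of dispatching on regular expressions with named groups can be shown without Django at all. The sketch below uses only Python's `re` module; it is in the spirit of Django's URL dispatch but is not Django's API, and the patterns, view names and `resolve` helper are invented for illustration.

```python
import re

# (pattern, view name) pairs, tried in order. Named groups become view arguments.
URLPATTERNS = [
    (re.compile(r"^archives/(?P<year>\d{4})/(?P<month>\d{2})/(?P<slug>[\w-]+)\.html$"),
     "entry_detail"),
    # A legacy shape: a zero-padded numeric key used as the file name.
    (re.compile(r"^archives/(?P<legacy_id>\d{6})\.html$"),
     "entry_by_legacy_id"),
]


def resolve(path):
    """Return (view name, captured groups) for the first matching pattern, else None."""
    for pattern, view_name in URLPATTERNS:
        match = pattern.match(path)
        if match:
            return view_name, match.groupdict()
    return None


by_title = resolve("archives/2007/02/journal_migration.html")
by_key = resolve("archives/001234.html")
missing = resolve("about/")
```

Because old and new URL shapes are just entries in the same list, a rubbish legacy URL costs one extra pattern rather than a special case in the application.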


February 5, 2007 07:14 PM

Comments

Chris Dent
(February 5, 2007 09:48 PM #)

A long term goal of purple numbers has been: make each identified node a first class addressable thing with a unique identifier that redirects (insert vapor here) to some current home for that node.

So while these days most purple implementations display a #nid fragment for the URI of a node, another way to do it would be to display a non-fragment URI of the node at some kind of naming service. That service redirects.

This is all mostly talk at this point, but is where most of the interesting talk related to Purple has happened. Having some way to generate good IDs and then look them up is the key to having persistently identified nodes that can move around and still be referenced and most importantly transcluded.

Last year Eugene and I wrote a Purple server that takes the Purple Number generation part out of Purplewiki and makes it easy to distribute nids around to multiple sites and maintain an index of those nids. From that it is possible to do inter-server transclusions.

That's on CPAN, as Purple.

James
(February 5, 2007 10:44 PM #)

Maybe I'm missing something, but surely you can hack up the export script to give you the references you need?

Dominic Mitchell
(February 5, 2007 10:46 PM #)

Just a minor note; WordPress 2.1 now comes with an export function. It outputs RSS + WordPress-specific extensions in what looks like a namespace (though I didn't check it closely). This is a big improvement.

Darren Chamberlain
(February 13, 2007 03:07 PM #)

Re: URIs, have you considered simply building a big .htaccess that contains all of the old URIs and maps them to new URIs (whatever they might be)? I migrated a site from MT to a wiki, and all the URIs were going to change, so I built an index template in MT that basically iterated through the last 10000 entries and wrote RewriteRules for each entry. You could issue permanent redirects, or instruct Apache to use internal redirects. This lets you keep your existing URIs and still move to your new platform.

