" /> Bill de hÓra: April 2005 Archives


April 28, 2005

How to write a job spec


April 27, 2005

RSQ: Really Simple Querying?

Adam Bosworth, with his 'S4' criteria for mass adoption on the web (simple, sloppy, standard, scalable), seems to have clarified the worse-is-better debate over web data formats. Mike Champion had an interesting comment on RDF and Atom's slopworthiness as compared to RSS over on David Megginson's related entry 'RSS as the HTML for data':

"Bosworth's presentation is very well worth studying in this respect. He says that successful Web-scale technologies tend to be simple (for users), sloppy, standardized (widely deployed in a more or less interoperable way, irrespective of formal status), and scalable. I don't think Atom or RDF meet these criteria. Atom's main value over RSS is supposed to be its FORMAL standardization, but apparently nobody really cares. (Tim Bray's 'Mr. Safe' has not appeared, but RSS interop and even extensibility is happening and making it boss-friendly in practice). RDF is not simple for ordinary mortals, and its scalability is unproven. (I have been informed that actual RDF systems handle sloppiness well, even though one would think that its basis in formal logic would make it brittle... I don't know how to evaluate that)."

I can only assume that Mike's referring to recent discussion on xml-dev among other things. Over there I agreed that RDF has simplicity/comprehension issues, but pointed out with a few simple examples that RDF is a lot more tolerant of partial and missing information than some people realise. For example Daniel Steinberg, also commenting on Adam Bosworth's keynote, thinks that total agreement is a requirement:

"Bosworth predicts that RSS 2.0 and Atom will be the lingua franca that will be used to consume all data from everywhere. These are simple formats that are sloppily extensible. Anyone who wants to can use these formats to consume content or to author content. Contrast this with the Semantic Web, which requires that you get a large group of people to agree on the schema of everything."

In reality, what Daniel said is not true of RDF - RDF was designed with the unexpected in mind. A lot of this misunderstanding has had to do with early hype about the Semantic Web, which has at times sounded suspiciously like AI reborn. It also has to do with the way the benefits of RDF have been couched - critically, when WS technology adoption was on the up and up, the emphasis of Semantic Web standardization within the W3C was on formalization of the technology rather than useful applications.

RDF receives its robustness and flexibility properties from its design, and two design properties stand out.


First is the graph model that RDF is based on. All RDF data is organized as a graph, unlike XML's tree-based document structures and vaguely like relational databases, but without the idea of tables. The beauty of the graph model is that it is 'additive'. That means you can keep merging new items onto the graph without having to create new data structures to support new information. Using RDF as the data model, queries and merging operations end up producing new graphs as their results, in much the same way SQL query results are themselves tables. More importantly it makes for a clean programming model. It's extensible and uniform. It's also 'subtractive', which means you can take data out of the graph and leave a smaller graph behind, just the same way you'd remove an item from a hashmap, but without the hassle of doing something like dropping a column or table in a database (in the developer trenches, adding or dropping database columns can be the stuff of nightmares). For scalability, breaking up large graphs of data into smaller ones allows us to physically distribute datasets.
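A toy sketch makes the additive/subtractive behaviour concrete: model a graph as a plain set of (subject, predicate, object) triples. This is an illustration of the idea, not a real RDF store; the data and function names here are invented.

```python
# An RDF-ish graph as a bare set of triples. Merging is set union,
# removal is set difference - no schema migration needed either way.

def merge(graph_a, graph_b):
    """Merging two graphs yields a new graph, much as SQL queries yield tables."""
    return graph_a | graph_b

def subtract(graph, triples):
    """Removing triples leaves a smaller graph, like removing keys from a map."""
    return graph - triples

feed = {
    ("entry1", "title", "RSQ"),
    ("entry1", "date", "2005-04-27"),
}
extra = {
    ("entry1", "author", "Bill"),  # new information, no new structure required
}

bigger = merge(feed, extra)
smaller = subtract(bigger, {("entry1", "date", "2005-04-27")})
```

The point is that nothing about `feed` had to change to accommodate `extra` - addition and removal are uniform operations over the one data structure.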

The most interesting slides in Adam Bosworth's presentation are not just the ones that feature S4, but the diagrams which show queries being divvied up across servers (thanks to Mike for sending on the link). While it's known that Google break out their indices across a cluster into what they call 'shards', Bosworth's model looks like the late Gene Kan's InfraSearch query router, now part of the JXTA project. As a counterpoint, Doug Cutting of Lucene and Nutch fame has said, more or less, that there's no great advantage yet to distributed queries across the web in this way over downloading and centralizing the indexes:

"Widely distributed search is interesting, but I'm not sure it can yet be done and keep things as fast as they need to be. A faster search engine is a better search engine. When folks can quickly revise queries then they more frequently find what they're looking for before they get impatient. But building a widely distributed search system that can search billions of pages in a fraction of a second is difficult, since network latencies are high. Most of the half-second or so that Google takes to perform a search is network latency within a single datacenter. If you were to spread that same system over a bunch of PCs in people's houses, even connected by DSL and cable modems, the latencies are much higher and searches would probably take several seconds or longer. And hence it wouldn't be as good of a search engine."

Whether downloading the web into a cluster for indexing is the way to go indefinitely remains an open question if the amount of data being generated exceeds our ability to centralize it. At some point Jim Gray's distributed computing economics might flip in favour of sending the query out after the data rather than trying to localize the data. William Grosso has wondered whether Gray's model invalidates semantic web precepts:

"Now along comes Gray, making an argument that, when you think about it, implies that the semantic web, as currently conceived, might just be all wrong. His basic point is that it's far cheaper to vend high-level apis than give access to the data (because the cost of shipping large amounts of data around is prohibitive). Since the semantic web is basically a data web, one wonders: why doesn't Gray's argument apply?"


Second is the "open world" assumption of RDF. What that means is that not finding the answer to a query doesn't mean the query is false. For example if searching for an Atom entry's summary finds nothing and you conclude there's no summary for that entry, that's a closed world assumption. But in RSS1.0, which is RDF based, you'd conclude you don't have a summary to hand, not that it doesn't exist. The data might be incomplete at the time of asking. Dan Brickley describes this as 'missing isn't broken':

"Developers who come to the Semantic Web effort via XML technology often make an understandable mistake. They assume that missing is broken when it comes to the contents of RDF/XML documents, that if you omit some piece of information from an RDF file, you have in some formal, technical sense 'done something wrong' and should be punished. RDF doesn't work like that. Missing isn't broken. In the general case, you are free to say as much, or as little, in your RDF document as you like. RDF vocabularies such as FOAF, Dublin Core, MusicBrainz, RDF-Wordnet don't get to tell you what to do, what to write, what to say. Instead, they serve as an interconnected dictionary documenting the meaning of the terms you're using in your RDF documents."

The case of the Description Logics and ontology worlds coming to the Semantic Web and worrying over queries that will blow up in the engines is much like the case of the enterprise world coming to the Web worrying over type systems and discovery languages. The likeness is not fleeting - both the Semantic Web and Web Services advocates have been busy building competing technology stacks in the last decade. They have valid points and good technology but the need or demand for such precision in the Web context has been overestimated. As Pat Hayes put it:

"It is fundamentally unnecessary. The semantic web doesn't need all these DL guards and limitations, because it doesn't need to provide the industrial-quality guarantees of inferential performance. Using DLs as a semantic web content markup standard is a failure of imagination: it presumes that the Web is going to be something like a giant corporation, with the same requirements of predictability and provable performance. In fact (if the SW ever becomes a reality) it will be quite different from current industrial ontology practice in many ways. It will be far 'scruffier', for a start; people will use ingenious tricks to scrape partly-ill-formed content from ill-structured sources, and there is no point in trying to prevent them doing so, or tutting with disapproval. But aside from that, it will be on a scale that will completely defeat any attempt to restrict inference to manageable bounds. If one is dealing with 10|9 assertions, the difference between a polynomial complexity class and something worse is largely irrelevant."

Pat Hayes is an interesting person to have said that. He's a legend in the world of AI in the way Adam Bosworth is a legend as a software developer. Both have concluded in their own ways that the 'neat' orthodoxies implicit in Web Services and the Semantic Web are futile. Cleaning up the Web is infeasible.

If you come from an SQL/XML background, the open world idea of everything being effectively optional is going to seem weird and unworkable, but what it really means is that every addition of data is an act of extension - extensibility is intrinsic to the RDF way of doing things, not something that gets bolted on as with mustUnderstand/mustIgnore. The same intrinsic nature goes for distribution of datasets. Since RDF data can be distributed across any number of nodes, the technical challenge is not scaling the database across clusters, it's routing and distributing queries. Query routing is a special case of the kind of packet routing problems that occupy telecoms, peer-to-peer and internet engineers. Adam Bosworth is right, we need Really Simple Querying, but it's a bit early to rule out RDF as a good fit for returning the results or dealing with scale issues.
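The routing side can be sketched as scatter/gather over triple sets: send the query to each node, get partial graphs back, and merge them, the merge being just graph union. A sketch with invented data, not a real query router.

```python
# Scatter a query across nodes, gather and merge the partial result graphs.

def query(graph, predicate):
    """Select the triples matching a predicate; the result is itself a graph."""
    return {t for t in graph if t[1] == predicate}

def routed_query(nodes, predicate):
    """Send the query to every node and merge the partial graphs returned."""
    result = set()
    for node in nodes:
        result |= query(node, predicate)  # graph merge is just union
    return result

# Two nodes, each holding a slice of the data.
node_a = {("e1", "title", "RSQ"), ("e1", "author", "Bill")}
node_b = {("e2", "title", "Be Clear")}

titles = routed_query([node_a, node_b], "title")
```

The interesting engineering, as the post says, is in where to send the query and how to bound the fan-out, not in the merge itself.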

April 23, 2005

Sideshow bar

This is boring.

The Netbeans guys have it right. The JUnit green bar is too noisy. I wouldn't mind if Junit4 binned it altogether.

April 22, 2005

Bad Server!

What Yahoo! Groups has to say sans cookies:

"Your browser is not accepting our cookies. To view this page, please set your browser preferences to accept cookies. (Code 0)"

HTTP abuse?

Bad Framework!

Leigh, and Mark.

"In the Java world we've got Java Servlets. And what's the most commonly implemented method? It's not doGet its the generic service method. Thats the point at which the majority of frameworks hook into a Controller servlet to dispatch to server-side request handlers.

And yet, how many frameworks encourage or even allow the binding of handlers based on a combination of URL+method? In my experience the protocol independence anti-pattern kicks in at that point, and the request method is the last thing that a developer is encouraged to take into account. The end result are URLs that react identically to any request method. It might be an interesting experiment if Udell tried sending PUT, DELETE, HEAD requests to the same API calls." - Leigh Dodds

"All of this adds up to people not being able to count on the availability of mechanisms to set Web metadata, and therefore a failure to use what the Web provides. Take a look at Web applications like Wikis, Blog engines and commercial packages that you deploy on a Web server (I dont want to pick on anyone particular here, because everybodys in the same boat, and its not their fault)." - Mark Nottingham

Good to see HTTP abuse getting some attention. Maybe the W3C won't bake HTTP subsets into their specs anymore.

April 20, 2005


Malware Evolution, Kaspersky Labs:

"Virus analysts have conducted a number of tests to check whether or not automobile on-board computers running Symbian are infectable. At the time of writing, the tests show that the answer to this question is negative. However, this may well the next target for virus writers, and research will continue. Overall, the worms and Trojans created for smartphones are the harbingers of the malware storm to come - smartphones, smart houses, and the devices and technologies of the future will provide endless opportunities for generations of cyber criminals to come."

[via Steve Loughran]

Cruisecontrol not starting JBoss container

Came across this one in work yesterday. What's happening is that some of the guys are using Cruisecontrol (2.2) to run a nightly build along with a deploy/smoketest into JBoss; as part of the build JBoss is stopped, and then started via a Java target. This setup works fine when the build is called directly via Ant, but when run from Cruisecontrol, JBoss is not started and the Cruisecontrol cycle hangs.

Here's the Cruisecontrol fragment that calls the ant build file:

      <ant time="0300" 

The target in cc-build.xml being invoked looks like this:

  <target name="build" depends="stop.appserver, init, clean, get-code">
    <ant antfile="build.xml" 
    dir="${src.localpath}\build" target="nightly-build"/>    
    <antcall target="start.appserver"/>	  

And the start.appserver target looks like this:

  <target name="start.appserver" description="Start the Appserver server." depends="init">
    <java dir="${appserver.home.dir}/bin" classname="org.jboss.Main" fork="true" spawn="true">
      <arg line="-c default"/>
      <jvmarg value="-Xms32m"/>
      <jvmarg value="-Xmx200m"/>      
        <pathelement path="${appserver.home.dir}/bin/run.jar"/>
        <pathelement path="${java.home}/lib/tools.jar"/>

I suspect the JBoss JVM process is forking out in a way that perhaps has the Cruisecontrol JVM hung waiting for it to return. I haven't had time to really go digging into this, but I'm thinking that an exec task might work better than java. Another possibility I suppose is that the running JBoss is not being stopped fully before the new instance is started (but that doesn't happen via Ant). Anyhow, I thought I'd throw it out there to see if anyone had come across this before.

April 17, 2005


With the current furore over Andrew Tridgell reverse engineering the BitKeeper wire protocol, it's interesting to note that the argument seems to be not over the wire protocol itself, but over the access to the metadata that understanding the protocol enables. Tridgell has done this before with Samba. I imagine BitMover have every right to claim that the metadata is part of the product (presumably it's generated by the software), but it seems then to be difficult or impossible to manage the code without the metadata. Caveat emptor then.

If so, the Linux kernel SCM argument ratifies the notion that data is the new lock-in. Who owns data and metadata and who has access to them is an important issue.

From a technical perspective, it's arguable that the higher you go up the programming language stack the fuzzier the distinction between software and data is. If you had to look at a typical Java or C# system, it'd be clear enough for the most part what's data and metadata and what's code. Technologies like annotations make this fuzzier, but not impenetrably so. XSLT scripts can get fuzzy, as can systems utilizing code generation. A significant Lisp system could make for an interesting data ownership argument. Lisp advocates have been preaching code==data for decades. Consider that the configuration files for my emacs editor are in Lisp, or that using Python or Ruby source code to store configuration details (rather than XML) is a common idiom. Down the line, I can imagine a rules driven system based on Topic Maps or RDF data being equally fuzzy.
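The config-as-code idiom is tiny in practice: settings are ordinary Python, so they get imported (or exec'd) rather than parsed, and they can compute values. A minimal sketch; the setting names are made up, and a real setup would just `import settings` from a settings.py module.

```python
# Configuration as code: these "settings" are a Python source string standing
# in for a settings.py file on disk.

SETTINGS_SOURCE = """
editor = "emacs"
plugins = ["rdf", "atom"]
timeout = 30 * 2          # config can compute, because config is code
"""

config = {}
exec(SETTINGS_SOURCE, config)  # stand-in for `import settings`
```

Which is exactly where the data/code line blurs: is `timeout = 30 * 2` data, or a program?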

In short, a lot of innovation in enterprise and commercial software is about blurring the line between data and code. I would love to see those knowledgeable in Open Source, Web, Compliance and IT Governance matters pick up on this issue, and maybe focus less on software licencing. Most RFPs that pass my desk assume that what counts as the data in a system is largely obvious. They've no doubt been set straight on music, but I would guess that most folks think that they own their IM conversations, their email, their weblogs, and their photos. It's not just that people won't own their data - it's not infeasible to imagine a situation where a software provider had to turn the code over and give up a strategic technology advantage to enable access to the data.

April 15, 2005


"But think about their first impression of computing. It was scary. It is scary."

Dave Pawson!

Via Norm Walsh, Dave Pawson has a blog: http://nodesets.blogspot.com/. Seems like he's doing 12 rounds with Tomcat.

April 09, 2005

Python antpattern: UserDict as object scaffolding

In UserDict as object scaffolding I mentioned that:

"I have a habit, when working in Python, of starting classes by extending UserDict, usually because I dont have a strong idea of where I'm going just yet."

I've had enough feedback to suggest this is either pointless or a bad idea. So I'll be unlearning it.

MIT Press classics series

A while back, I reviewed the book How To Write Parallel Programs (HTWPP), which is sadly out of print. As an aside I said this:

"O'Reilly should run a "Classics" series like Penguin do for novels and plays, but for out of print or half-forgotten computer literature. I think they'd have a long-lived franchise."

Just yesterday Bob Prior at the MIT Press made the following comment on that review:

"Thanks for your great reviews of what happen to be two books published by The MIT Press. As to your suggestion for a 'classics' series, it so happens that we already have such a program and I will definitely consider adding How to Write Parallel Programs to that list. SICP, as others have pointed out, is already available on-line."

Aside from the possibility of getting such a great book back in print, MIT Press classics already has all kinds of great books up there. Here's a taster:

I had no idea the series existed. And it's not just CS books - economics, philosophy, architecture to name a few subject categories.

(Thanks to all the folks who pointed out that HTWPP is available online. Go read it!)

A little typing is a dangerous thing

Then again, with all this talk of dynamic typing, and Python, and Groovy and Ruby on Rails, perhaps we should stop and consider whether the Java world is ready for type freedom. Yow.

(From ScottMcPhee)

April 08, 2005

Jira - old school


April 07, 2005


Vincent Massol would love reusable Ant tasks:

"For example, you may think that deleting a directory is simple. But it's not so easy. Have a look at the Delete Ant task source code. You'll find portion of code like this one:
/**
 * Accommodate Windows bug encountered in both Sun and IBM JDKs.
 * Others possible. If the delete does not work, call System.gc(),
 * wait a little and try again.
 */
private boolean delete(File f) {
    if (!f.delete()) {
        if (Os.isFamily("windows")) {
            System.gc();
        }
        try {
            Thread.sleep(DELETE_RETRY_SLEEP_MILLIS);
            return f.delete();
        } catch (InterruptedException ex) {
            // Ignore Exception
            return f.delete();
        }
    }
    return true;
}
Would you have thought about this? Probably not and you would have been right not to as this only happens in some rare occasions."

I've thought of it, sure, because it's bitten me before. Repeatedly. (Here's the rant...) And it's not rare (or at least, not rare enough :). Try writing JUnit tests which add and remove enough directories or files between setups. Nightmare. Try doing industry standard .do/.done file weirdness and getting the support calls when the files are left lying around. Nightmare. And that System.gc() hack doesn't always work. I'm not even sure it's considered a JDK bug - Java by design is so abstracted from the actual filesystem that it can't offer guaranteed side effects for file operations. So you need to treat these things as best effort. Given that gc is also best effort, there's still room to fail in the Ant code above (I think). I wrote a countdown once to repeatedly try a deletion and, failing that, bail out with an email to ops. That's spinning the CPU more than a sleep() but you get more shots at deletion. These days I'd tend to the idiom which deals with files whose modification time is X milliseconds older than currentTimeMillis. Or if you must, fork a process (btw, the way Ant forks processes is great; everyone should re-use that).

Couldn't agree more about re-using Ant tasks however (exec being a great example):

"The problem is that the Ant tasks are a bit too much linked to the execution engine (the XML scripting engine). For example reusing an Ant tasks requires you to create a Project object. This in turn drags loggers, the Ant classloader (in some cases) and possibly other objects. I know it's possible to use Ant from Java (I've been doing it for a long time now) but I'd love it be even easier to do so... I'd like to see Ant separate into 2 subprojects: one for the XML scripting engine (let's call it engine) and one for the Ant tasks (let's call it tasks). The reason for the 2 projects is to ensure there's no dependency in the direction tasks->engine."

I think another reason the tasks are tied to the Ant engine is that Ant doesn't have standard I/O (e.g. the way Unix pipes do). Task.execute() is void. I use a set/get/execute(in,out,err) idiom a lot for XML pipelining in Java; it was taught to me by Sean McGrath. The reason that works under those circumstances, and any component in the pipeline is reusable and reorderable, is that the XML in and XML out provides uniform I/O. A uniform API might not be quite enough - you might have to ask what Ant's answer to | is. Without the I/O abstraction I don't know if you can achieve what Vincent wants in terms of dependency management.
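The uniform-I/O idiom can be sketched in a few lines: every stage implements the same execute(in, out) signature over the same kind of stream, so stages compose and reorder freely - which is exactly what a void Task.execute() prevents. A Python sketch of the idea, not Ant's or Sean McGrath's actual API; the stage classes are invented stand-ins for real XML transforms.

```python
# Uniform I/O pipelining: every stage reads one stream and writes another,
# so any stage can follow any other.
import io

class UpperCase:
    def execute(self, inp, out):
        out.write(inp.read().upper())

class Reverse:
    def execute(self, inp, out):
        out.write(inp.read()[::-1])

def pipeline(stages, text):
    """Thread the text through each stage; output of one is input to the next."""
    for stage in stages:
        out = io.StringIO()
        stage.execute(io.StringIO(text), out)
        text = out.getvalue()
    return text
```

Because the I/O contract is uniform, `pipeline([UpperCase(), Reverse()], ...)` and `pipeline([Reverse(), UpperCase()], ...)` both just work - the Unix `|` property the paragraph is after.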

But of course none of this stuff was the original intention of Ant.

antsh: turning Java into shell scripting, task by task!

Jaxen 1.1b4

Jaxen 1.1beta4. I moved some code from a custom API to Jaxen 1.0 a few months ago (even in moratorium it's a good library). So it's really good news that Jaxen is active and will make a 1.1. I had completely missed that it was active, or that Elliotte was working on it.

The open source world needs more CVS commit RSS feeds - it's way easier to stay on top of releases that way.

April 05, 2005

An RDF-backed Movable Type hack

Cool: MT-Redland

[via Danny, aka "I've-got-mucho-domains"]

Customize Me

Dare Obasanjo on attention.xml and collaborative filtering:

"Once one knows how to calculate the relative importance of various information sources to a reader, it does make sense that the next step would be to leverage this information collaboratively. The only cloud I see on the horizon is that if anyone figures out how to do this right, it is unlikely that it will be made available as an open pool of data. The 'attention.xml' for each user would be demographic data that would be worth its weight in gold to advertisers."

Collaborative filtering allegedly only works if you have a critical mass of items of interest and users to cross-reference. I heard once this needed to get to the low 1000s to ensure reasonable precision. That was back in 2000, by which time people had figured out how to process large in-memory datacubes in close to real time (i.e. updates occurring between user sessions).

That's on the server.

What we're not doing is considering how filtering might work on the client. When more specific information about the user is available, it's possible to optimize these algorithms to work with much smaller data sets, and in general to think about different algorithms or hybrid approaches. And it's probable the results can have higher relevance for the user. Commercially, collaboration has worked best for targeting mass goods for individuals, which is why it works well for Amazon.

But the choice of algorithm varies based on the nature of the data (a lot of this stuff tends to be fantastically sensitive to the data and how the data is represented). Think about how useless a Bayesian spam filter would be aggregated across a 100,000-user data set on Bloglines. It could be much better to work against a couple of users you trust and some candidate data of your own to seed the algorithms.

"By the way, why does every interesting wide spanning web service idea eventually end up sounding like Hailstorm?"

Probably the reason they all start to sound like Hailstorm is because they all work on the basis that the computation has to be done on the server against large aggregate datasets. One place, one owner. Cue the consequent privacy concerns. A few years ago, when asked how the trust problem could be solved, a senior executive from Egg bank had an immediate answer - "Branding". The extent people will trust your organisation with their information is largely based on their current perception of your organisation. That's not quite the same thing as branding, but you get the idea.

What do you do with all that information you're generating 24x7? How do you convert it to value? Today's answer is to sell it to the people who have something to sell or messages to tell. The money's not in whatever it is you're offering to users to gather up the data in the first place (like search) - the money's in the side effects. And while converting the data into value for you or for those who want to sell something, the users must not think they're being sold out. Or they're gone. Something of a highwire act - and you only get to fall once.

One way is to allow highly specific user information to inform the filters on the user's device, not on someone's VC-backed server farm. Really, that's a social solution.

It could be much more interesting to sell this technology directly to users for 5 dollars and let them run it on their phones against the data of their choice. To do that requires a certain amount of letting go of ways of doing things, right through from client-server technology to business models based on TV and print media. The current situation is hopelessly dependent on those systems of buying and selling.

The social networking phenomenon is interesting insofar as it attempts to join users to users, rather than users to services to advertisers. The next step is to get those lumbering servers out of the way and let people interact directly. That will require more imaginative and disruptive business models.

April 04, 2005

Be Clear

Brian McCallister tells a story about why clarity might be important in a programming language:

"It reminds me, a great deal, of a conversation I had with a really bright guy who re-implemented (okay, actually pre-implemented, or co-implemented) something very popular in the open source world (written in C) in ocaml. After beating on it for a while he concluded that the basis for the whole design was broken, but he attributes being able to see why the whole design was broken to the expressiveness of the language, not to any abstract conceptual model. The C version is in widespread use, releases bug fix versions quite frequently, and a lot of people wonder if it will ever actually be stable." - Expressiveness Matters

This reminds me of Jonathan Sobel's classic paper "Is Scheme Faster than C?". When I linked to that paper here, Jonathan left the following comment:

"It's still true. In the years since I wrote that little blurb, I have used the same kinds of techniques for everything from programming languages research (such as http://www.cs.indiana.edu/~jsobel/Recycling/recycling.pdf) to high-reliability, high-performance systems (even at the device driver level) in commercial systems. It still works. If your solution is clear enough, it will be obvious how to optimize it; if your solution is a prematurely optimized mess, you won't be able to figure out how to do anything to make it significantly faster."

April 03, 2005

RDF hacking for fun and profit

I have to say, after a few years in the wilderness, coming back to RDF to do some hacking has been both fun and instructive. So, what's changed?

The community. I'm slightly older and a lot less cynical about the whole technology after being very excited about RDF around the turn of the century. I became pretty annoyed at the direction the RDF community was taking starting in 2001; by 2002 I had lost much of my interest. During that time I moaned a lot and generally wasn't very helpful (sorry). The other thing that's changed now is that the community's expectations seem to have settled to something sane, especially around the extent and value of formal approaches on the Internet. The whole DL and formal logic gung-ho attitude seems to have eased up a lot in the last two years, thankfully. No doubt some people felt that was a necessary growing pain for the technology, but it was just as much a pain to have really smart KR people tell you you were wrong, wrong, wrong, at various levels of politeness, when you wanted to get something useful out the door and iterate. Especially tough if you knew your AI history and where the whole KR shebang could end up versus what counts for deployment on the Web.

The tools. The tools are so much better now. I've had Jena in a small-scale production environment for over 6 months, acting as the ham in an XMPP and Hibernate sandwich. It works a treat. At some point they might need to go back and clean up the APIs in a breaking way - there's some junk DNA lying about, understandable as the API has travelled through about 2.5 iterations of RDF at this point. But the core implementation seems to be solid. I find 4Suite to be stable software (tho' I'm not sure the RDF stuff is active anymore - Uche et al have been working on anobind most recently). I've been using rdflib and sparta recently and those are very neat. Sparta is in good shape for a 0.7, and the rdflib API is rather beautiful (tuplespace fans will love it). Dave Beckett's Redland is really impressive; the amount of work that has gone into it is incredible. Short version: the amount of work done by the RDF community in the last couple of years is humbling.

The web. The web is now more machine-oriented than a few years ago. Much more. The RDF community saw this would come to pass before anyone else, I think, but perhaps not quite in the way it has turned out - RSS, WS and REST-as-deployed, rather than intelligent software agents. Even so, those technologies are likely to start creaking on the data front - arguably WS and REST-as-deployed already are at that point. As the networking and application protocol work gets bedded down, the new low-hanging fruit becomes extensible data formats sprinkled with semantic constraint pixie dust rather than type annotations and namespaces (media-types remaining useful). RDF-Forms, some people's re-examination of description languages, and the interest in speech acts are just the beginning.

Shipping. I'm not sure how useful RDF is for explicit data representations over XML and relational tables, but as an internal format for applications and machine-level chit-chat it is a decent option that you could be looking at before rolling your own configuration formats. Less code, more data. Now, people will point at how Mozilla's RDF is a millstone (and they would be right), but we are 5 years on from that - the usage idioms are known today. You can even write something approaching sane RDF/XML once you avoid that nasty striping idiom.

Potential. My current work on a desktop client using RDF to manage application state makes me think that a simple reasoner (a la cwm) could get into a mobile device within two years, and such a reasoner is possible now for desktop aggregators, albeit a tough enough programming exercise. And when you're done, what's still needed is a reporting language from which to drive the views. But if you had all that? Then that could push the kinds of things the folks at Nature have been doing right into the client (the way Nature is using RSS is extremely cool, and also well beyond the commercial state of the art). Everyone would get the equivalent of an embedded SQL engine inside their aggregators working over their RSS data. Reasoners like that in consumer-grade software would turn the industry being built on RSS infrastructure on its head, as the ability to innovate with data would accelerate drastically. Imagine being able to cross-filter and repurpose data on your phone instead of waiting for Technorati, Amazon or Yahoo! to get round to providing a cool new service. Or put another way, why wait for the services when you can generate the same views locally? (And then SMS them to your mates.) The market emphasis could shift from rich clients to rich data very quickly and would, I imagine, force Web 2.0 businesses to expose their data much more transparently than happens today (otherwise they don't get to participate in the user's views). If that happens, the extensibility models available today in RSS and Atom might not offer any competitive advantage - writing new code and upgrading the aggregator is going to be too slow to matter. In this regard I think the WinFS approach was boiling the ocean. WinFS is like EAI for the desktop, when a few hacks and a webserver would get most of the way there. It would have been enough to have reporting and searching over incoming RSS data built into the desktop as a first cut. A smarter filesystem could have been done later, after the approach was proved to work and after you had proxied a My Documents feed behind an IIS daemon.
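
To make the "embedded SQL engine inside the aggregator" idea concrete, here's a minimal sketch using Python's bundled sqlite3 module. The feed items, schema and tag values are invented for illustration - the point is only that cross-filtering happens locally, with no remote service in the loop:

```python
import sqlite3

# Hypothetical items an aggregator might have pulled down over RSS
items = [
    ("Nature", "RNA interference roundup", "2005-04-20", "biology"),
    ("Nature", "Protein folding update", "2005-04-22", "biology"),
    ("dehora.net", "RSQ: Really Simple Querying?", "2005-04-27", "rdf"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE item (feed TEXT, title TEXT, published TEXT, tag TEXT)")
conn.executemany("INSERT INTO item VALUES (?, ?, ?, ?)", items)

# Cross-filter locally instead of waiting for a hosted service to offer the view
rows = conn.execute(
    "SELECT title FROM item WHERE feed = 'Nature' AND tag = 'biology' "
    "ORDER BY published"
).fetchall()
for (title,) in rows:
    print(title)
```

Swap the SQL for a small rule engine over triples and you have the reasoner-in-the-aggregator scenario sketched above.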

Anyway, enough analyst-speak :) All in all, I would say this RDF stuff is just about ready for a second look. The big question is whether the world can get past the Semantic Web hype and bluster from years gone by to see the value.

April 02, 2005

Representing HTTPLR exchanges in RDF

In the HTTPLR protocol, there are a few resources of interest that let us reason about a message exchange:

  • The URL of the message to be downloaded or uploaded
  • The URL of the exchange
  • the current state of the exchange
  • The authentication mechanism (digest, basic)

Outside the protocol proper, we'll also be interested in the following:

  • A pointer to the 'local location' of the message
  • The media type of the message

So here's the data in RDF/XML format:

      <httplr:message rdf:about="http://www.dehora.net/test/httplr/sub1/msg1.xml" >
        <httplr:exchange rdf:resource="http://www.dehora.net/test/httplr/sub1/msg1.xml?exchange"/>
        <httplr:state rdf:resource="http://purl.oclc.org/httplr/state/created/"/>
      </httplr:message>

And as a graph:


It turned out to be simpler to represent than I thought. As ever the XML isn't pretty, but the arrangement of information is clean. It does raise some questions that I don't have answers for:

  • Should this use a different property for media types? Actually, is there a vocabulary for media types? [update: Morten Frederiksen points out that DC is typically used to spec media types. Example changed.]
  • What about authentication vocabularies?
  • Should the current state be a property of the message URL or the exchange URL? Both?

Why is this useful? Well there are a few reasons:

  • State engines: that RDF graph has enough information that it can be used by a crashed agent after a restart to continue an exchange. Notably, it could be passed amongst HTTPLR-aware nodes and they could pick up the exchange with no fuss*. Dare I say it, but that's pretty cool.
  • Administration, administration, administration: we can build management tools around HTTPLR message exchanges without requiring the equivalent of WSDM or JMX. It's a no-brainer to stuff this data into an Atom feed.
  • Extensibility: I can keep adding properties to this graph without breaking existing code** or worrying about defining yet another extension mechanism. This is real extensibility folks, not the modular separation of concerns stuff you get in RSS and Atom.
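
To make the state-engine point concrete, here's a minimal Python sketch. Plain tuples stand in for a real triple store, and the recovery logic is invented for illustration - it is not part of the HTTPLR spec:

```python
# The exchange graph as plain triples, and an agent deciding what to
# do next after a restart. URIs follow the RDF/XML example above.
MSG = "http://www.dehora.net/test/httplr/sub1/msg1.xml"
STATE_CREATED = "http://purl.oclc.org/httplr/state/created/"

triples = {
    (MSG, "httplr:exchange", MSG + "?exchange"),
    (MSG, "httplr:state", STATE_CREATED),
}

def current_state(graph, msg_url):
    """Pull the recorded exchange state for a message out of the graph."""
    for s, p, o in graph:
        if s == msg_url and p == "httplr:state":
            return o
    return None

def next_action(state):
    # A restarted (or handed-off) agent picks up from the recorded state
    if state == STATE_CREATED:
        return "upload-message"
    return "reconcile"

print(next_action(current_state(triples, MSG)))
```

Any HTTPLR-aware node given those triples can make the same decision, which is the hand-off-with-no-fuss property.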

Probably, an example like this will go into the next HTTPLR draft as a non-normative appendix.

* This is the kind of thing that REST people are banging on about with regard to self-description, statelessness, and also why uniform interfaces matter. A WS approach would have to expose specific methods to support this. In REST we can carry on with uniform methods.

** And this is the kind of thing that RDF people are banging on about with regard to partial understanding and extensibility. It's also why RDF doesn't need mU or mI.

Python development: UserDict as object scaffolding

I have a habit, when working in Python, of starting classes by extending UserDict, usually because I don't have a strong idea of where I'm going just yet. The UserDict acts as a scaffold. So I might start with something like this to fill out against the initial tests:

    from UserDict import UserDict # Python 2; in Python 3 it's collections.UserDict

    class ExchangeState(UserDict):
      def __init__(self, msg_url='', exchange_url='', httplr_state=state_UNKNOWN):
          UserDict.__init__(self) # sets up the backing dict
          self['msg_url']=msg_url # the URL of the message
          self['httplr_state']=httplr_state  # the current exchange state
          self['exchange_url']=exchange_url # the HTTPLR exchange URL

As I'm working the code, some of the dict keys will get lifted to object fields:

    class ExchangeState(UserDict):
      def __init__(self, msg_url='', exchange_url='', mtype=None, httplr_state=state_UNKNOWN):
          UserDict.__init__(self) # sets up the backing dict
          self['msg_url']=msg_url # the URL of the message
          self['httplr_state']=httplr_state  # the current exchange state
          self['exchange_url']=exchange_url # the HTTPLR exchange URL
          self['mtype']=mtype # the message mimetype
          self.msg_url=msg_url # the URL of the message
          self.state=httplr_state  # the current exchange state

Eventually, the dict scaffolding will be taken away, leaving the object:

    class ExchangeState:
      def __init__(self, msg_url='', exchange_url='', mtype=None, httplr_state=state_UNKNOWN):
          self.msg_url=msg_url # the URL of the message
          self.state=httplr_state  # the current exchange state
          self.exchange_url=exchange_url # the HTTPLR exchange URL 
          self.msg_mtype=mtype # the message mimetype  

Of course sometimes there is no lifting, and the class gets left as an extended dictionary. But I'm wondering, does anyone else develop classes like this? I'm finding it a very natural way of working.
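
For what it's worth, the useful property of the middle stage is that both access styles work at once while tests migrate. A small illustration, using the modern collections.UserDict spelling and an invented stand-in for the state constant:

```python
from collections import UserDict

state_UNKNOWN = "unknown"  # stand-in; the real constant lives elsewhere

class ExchangeState(UserDict):
    def __init__(self, msg_url='', exchange_url='', httplr_state=state_UNKNOWN):
        UserDict.__init__(self)           # sets up the backing dict
        self['msg_url'] = msg_url
        self['httplr_state'] = httplr_state
        self['exchange_url'] = exchange_url
        self.msg_url = msg_url            # field lifted out of the dict
        self.state = httplr_state         # field lifted out of the dict

ex = ExchangeState(msg_url="http://example.org/msg1.xml")
# Old dict-style tests and new field-style code both keep working:
assert ex['msg_url'] == ex.msg_url == "http://example.org/msg1.xml"
```

Once the field-style callers win, the dict lines and the UserDict base go away together.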

Turn and then attack

Another best post ever from Ryan Tomayko:

"'Turn and then attack. That's it?' Jon asked, to which the dumb kid replied, 'Do you think I'll pass?'" - Insects and Entropy

In passing: "That is the essence of software engineering I think. It's not about writing cryptic programs to show how smart we as programmers are. It's about finding elegant forms of expression that maximise our return on behavioural complexity" - Sean's comment Simplicity on the attack

Vaguely related: "A general stopped by to give us a little speech about strategy. In infantry battles, he told us, there is only one strategy: Fire and Motion. You move towards the enemy while firing your weapon." - Joel Spolsky's Fire and Motion

RDF datatypes, literals, quads


Mark Nottingham is wondering:

"I'm talking about RDF datatypes, of course. As far as I can see, they're a special case to the data model; although the datatype itself is identified with a URI, the property 'RDF datatype' isn't, and as a result you can't meaningfully talk about (as in, reason with CWM, or access with most RDF APIs) them using that oh-so-delicious subject, predicate, object triple."

The charter when I was on the RDF wg said, when you got down to it, that RDF had to play nice with XML Schema. That was back when you could remember that XML Schema was meant to be a simple replacement for DTDs, and just before people started seeing serious problems with that technology (ie, it may not be sanely implementable). RDF datatypes attempted to cover that requirement off.

Anyway, if the RDF wg didn't address that, others would, over and over. Some folks are deeply, deeply attached to data typing - Web Services proves that beyond question. It does not matter whether machine-based types are needed or even appropriate; people want data to have them. There's a lot to be said for pre-empting that desire. For example, Atom is making much the same pre-emptive move with link types in atom:link[@rel].
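
To see why the datatype isn't addressable as a triple, a toy model helps. This is illustrative only - plain tuples, with the datatype URI baked into the literal node the way the RDF abstract syntax has it:

```python
# A typed literal modelled as (lexical form, datatype URI). The datatype
# lives inside the object node, not in a statement of its own.
XSD_INT = "http://www.w3.org/2001/XMLSchema#integer"

literal = ("42", XSD_INT)

triples = [
    ("http://example.org/book", "http://example.org/pages", literal),
]

# You can match on the object as a whole...
hits = [t for t in triples if t[2] == ("42", XSD_INT)]
assert len(hits) == 1

# ...but there is no triple of the form (literal, <some datatype
# property>, XSD_INT), so a triple-pattern query can never ask
# "what is the datatype of X?" - which is Mark's complaint.
datatype_triples = [t for t in triples if t[1].endswith("datatype")]
assert datatype_triples == []
```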


The literals are another special case that RDF datatypes try to cater for. XML literals in particular proved to be quite hairy; I seem to recall a few calls with Jeremy Carroll at the time while we were sent off to figure something out.

There's been a lot of back and forth on whether literals should be subjects of RDF statements. Some people think that anything worth talking about should have a name - so name it. Others will point out that a huge amount of legacy blob data exists out there that RDF excludes to some degree. Consider that you 'can't talk meaningfully' about HTTP representations in RDF either; that's probably a bigger problem than datatype inelegance.

However, none of this type stuff hurts a whole lot for real work, as RDF processors treat type information as inessential - it's optional metadata. What's likely to break is your application making unwarranted presumptions about what information will be available (if you haven't learned your data typing lesson from Web Services at this point, well... mU :)


Consider another, more significant problem RDF has. I'm currently integrating Sparta (Mark's RDF library) and rdflib into a desktop application, and I can see that soon I'm going to run into the situation where A says X Y Z and B says X Y Z and I will want to preserve the provenance of those two statements as coming from A and B.

The problem here is a straight-up loss of information - you can't easily ask 'who said X Y Z?' without the context of the statements. I've never worked on a real-world application of RDF that didn't come up against this issue. Solving it in pure RDF is very clumsy; APIs tend to add a fourth item to the statement, often called 'quads', but that can rope your data to the API in question, which is definitely not the point of using RDF. Plus the meaning of quads isn't necessarily shared between systems. I'm hoping not to have to switch to 4Suite to solve this problem; 4Suite is a big full-featured API and I want to keep things as light as possible. If I can.
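
A minimal sketch of the quad idea - the store and helpers here are invented for illustration and don't follow any particular API, which is exactly the portability worry:

```python
# The fourth slot records provenance (who said it), so
# "who said X Y Z?" stays answerable after the fact.
quads = set()

def say(source, s, p, o):
    quads.add((s, p, o, source))

def who_said(s, p, o):
    # Every source that asserted this exact statement
    return sorted(src for (s2, p2, o2, src) in quads
                  if (s2, p2, o2) == (s, p, o))

# A and B assert the same statement X Y Z
say("A", "X", "Y", "Z")
say("B", "X", "Y", "Z")

print(who_said("X", "Y", "Z"))
```

Drop the fourth slot - as pure triples do - and the two assertions collapse into one, which is the information loss described above.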

April 01, 2005

There was only one catch and that was Catch-22

From the Joseph Heller school of specification:

"We define Binary XML as a format which does not conform to the XML specification yet maintains a well-defined, useful relationship with XML."

To which I say: LOL.

Tim Bray and Elliotte Harold are not as amused as I am by the looks of things. Tim thinks these folks are heading down a slippery slope: I wager that slope begins with the Infoset. Amy Lewis has the best comment so far:

"There seems to be an off-by-one error in the URI"

XInclude: it depends

Norm Walsh points to a hole in how the XInclude and xml:base specs interact:

"I think what pains me most about this situation is that XInclude was in development for just over five years. It went through eleven drafts including three Candidate Recommendations. [...] If we can't get a 16 page spec right in three CRs, what hope do we have of getting the XSL/XML Query family of specifications right?"

Yow. Between this and xml:id|c14n, I wonder if there isn't a process issue with the core XML work. That's twice a group of baseline specs hasn't been specced to work properly together, and twice backward compatibility has been preferred over getting things straight. Over time that is going to have to be paid back in some form of technical debt.

It's not so much the size of the specs that matter as the surface area of the interactions between specs. I see Norm has the Dijkstra testing quote at the top of his entry, but I'm not sure Dijkstra had this class of coordination problems in mind.

One thing that's been bothering me for a while: it does not seem that XML integrates well with other XML in the general case. That is, when you move past XML 1.0 and into the XML 'family' of specs, things seem to unravel ever so slightly. I worry that the XML family has essential composition problems unless you stick to flat, dictionary-like structures a la Atom or RSS.

Jini up in non-geological time

The new Jini starter kit is great news - that community recognizes the barrier to adoption. I caught some heat about Jini failing the ten-minute test last year; the contrary arguments ran along the lines of 'this is a necessarily complicated problem', and I just didn't buy them. They've also sorted out the Jini licencing by moving to the ASF version 2.0 - more good news, as that takes the confusion out of anyone's obligations to Sun in production scenarios.

This comes via Tim Bray, who is doing some cool-sounding networking thingy skunknamed Zeppelin - wish he'd tell us more ;) He's wondering whether it's the simplest thing that could possibly work. That depends on the range Zeppelin is meant to operate at. Jini is a LAN/enterprise range technology that could be pushed out to the WAN if you hacked an XMPP transport underneath it (the beauty of the ASF licence means you can do this now). The things that seem to make Jini necessarily complicated beyond that range are Java code sharing, with all the attendant security and versioning issues, and discovery (which could be addressed via zeroconf - maybe).

JXTA went after the internet range and data sharing, and, in terms of the protocols at least, dropped the Java dependency. It's hard to argue that's not a safer bet in the long run. Something surely has to evolve on top of all that bittorrent traffic ;)