RSQ: Really Simple Querying?

Adam Bosworth, with his 'S4' criteria for mass adoption on the web (simple, sloppy, standard, scalable), seems to have clarified the 'worse is better' debate over web data formats. Mike Champion had an interesting comment on RDF and Atom's slopworthiness as compared to RSS over on David Megginson's related entry 'RSS as the HTML for data':

"Bosworth's presentation is very well worth studying in this respect. He says that successful Web-scale technologies tend to be simple (for users), sloppy, standardized (widely deployed in a more or less interoperable way, irrespective of formal status), and scalable. I don't think Atom or RDF meet these criteria. Atom's main value over RSS is supposed to be its FORMAL standardization, but apparently nobody really cares. (Tim Bray's 'Mr. Safe' has not appeared, but RSS interop and even extensibility is happening and making it boss-friendly in practice). RDF is not simple for ordinary mortals, and its scalability is unproven. (I have been informed that actual RDF systems handle sloppiness well, even though one would think that its basis in formal logic would make it brittle... I don't know how to evaluate that)."

I can only assume that Mike's referring to recent discussion on xml-dev, among other things. Over there I agreed that RDF has simplicity/comprehension issues, but pointed out with a few simple examples that RDF is a lot more tolerant of partial and missing information than some people realise. For example, Daniel Steinberg, also commenting on Adam Bosworth's keynote, thinks that total agreement is a requirement:

"Bosworth predicts that RSS 2.0 and Atom will be the lingua franca that will be used to consume all data from everywhere. These are simple formats that are sloppily extensible. Anyone who wants to can use these formats to consume content or to author content. Contrast this with the Semantic Web, which requires that you get a large group of people to agree on the schema of everything."

In reality, what Daniel said is not true of RDF - RDF was designed with the unexpected in mind. A lot of this misunderstanding has had to do with early hype about the Semantic Web, which has at times sounded suspiciously like AI reborn. It also has to do with the way the benefits of RDF have been couched - critically, when WS technology adoption was on the up and up, the emphasis of Semantic Web standardization within the W3C was on formalization of the technology rather than useful applications.

RDF receives its robustness and flexibility properties from its design, and two design properties stand out.

Graphs

First is the graph model that RDF is based on. All RDF data is organized as a graph, which is different from XML's tree-based document structures and vaguely like relational databases, but without the idea of tables. The beauty of the graph model is that it is 'additive'. That means you can keep merging new items onto the graph without having to create new data structures to support new information. Using RDF as the data model, queries and merging operations end up producing new graphs as their results, in much the same way SQL query results are also tables. More importantly, it makes for a clean programming model. It's extensible and uniform. It's also 'subtractive', which means you can take data out of the graph and leave a smaller graph behind, just the same way you'd remove an item from a hashmap, but without the hassle of doing something like dropping a column or table in a database (in the developer trenches, adding or dropping database columns can be the stuff of nightmares). For scalability, breaking up large graphs of data into smaller ones allows us to physically distribute datasets.
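To make 'additive' and 'subtractive' concrete, here is a minimal sketch in Python that models a graph as a plain set of (subject, predicate, object) triples. It is not a real RDF library; the example.org URIs, the short property names and the `match` helper are all made up for illustration. It also ignores blank nodes, which real RDF merging has to treat with care.

```python
# A graph is modelled as a set of (subject, predicate, object) triples.
# Illustrative stand-in only: URIs and property names are hypothetical.

feed_a = {
    ("http://example.org/entry/1", "title", "RSQ: Really Simple Querying?"),
    ("http://example.org/entry/1", "creator", "Bill"),
}
feed_b = {
    ("http://example.org/entry/1", "subject", "RDF"),
    ("http://example.org/entry/2", "title", "Another entry"),
}

# Additive: merging two graphs is a set union. No schema change is needed
# to accommodate the "subject" property that feed_a never mentioned.
merged = feed_a | feed_b


def match(graph, s=None, p=None, o=None):
    """Return the sub-graph of triples matching a pattern (None = wildcard)."""
    return {
        t for t in graph
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    }


# A query result is itself a graph (a set of triples).
titles = match(merged, p="title")

# Subtractive: removing data leaves a smaller, still valid graph behind.
smaller = merged - match(merged, p="creator")
```

The point of the sketch is the uniformity: merge, query and removal all take graphs in and give graphs back, so nothing special has to be done when a new kind of statement shows up.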

The most interesting slides in Adam Bosworth's presentation are not just the ones that feature S4, but also the diagrams which show queries being divvied up across servers (thanks to Mike for sending on the link). While it's known that Google break out their indices across a cluster into what they call 'shards', Bosworth's model looks like the late Gene Kan's Infrasearch query router, now part of the JXTA project. As a counterpoint, Doug Cutting of Lucene and Nutch fame has said, more or less, that there's no great advantage yet to distributing queries across the web in this way over downloading and centralizing the indexes:

"Widely distributed search is interesting, but I'm not sure it can yet be done and keep things as fast as they need to be. A faster search engine is a better search engine. When folks can quickly revise queries then they more frequently find what they're looking for before they get impatient. But building a widely distributed search system that can search billions of pages in a fraction of a second is difficult, since network latencies are high. Most of the half-second or so that Google takes to perform a search is network latency within a single datacenter. If you were to spread that same system over a bunch of PCs in people's houses, even connected by DSL and cable modems, the latencies are much higher and searches would probably take several seconds or longer. And hence it wouldn't be as good of a search engine."

Whether downloading the web into a cluster for indexing remains the way to go indefinitely is an open question, especially if the amount of data being generated comes to exceed our ability to centralize it. At some point Jim Gray's distributed computing economics might flip in favour of sending the query out after the data rather than trying to localize the data. William Grosso has wondered whether Gray's model invalidates semantic web precepts:

"Now along comes Gray, making an argument that, when you think about it, implies that the semantic web, as currently conceived, might just be all wrong. His basic point is that it's far cheaper to vend high-level apis than give access to the data (because the cost of shipping large amounts of data around is prohibitive). Since the semantic web is basically a data web, one wonders: why doesn't Gray's argument apply?"

Worlds

Second is the "open world" assumption of RDF. What that means is that not finding the answer to a query doesn't mean the query is false. For example, if searching for an Atom entry's summary finds nothing and you conclude there's no summary for that entry, that's a closed world assumption. But in RSS 1.0, which is RDF based, you'd conclude you don't have a summary to hand, not that it doesn't exist. The data might be incomplete at the time of asking. Dan Brickley describes this as 'missing isn't broken':
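The difference between the two assumptions can be sketched in a few lines of Python. This is only an illustration: the `Answer` type, the two `has_summary_*` functions, the "summary" property name and the example.org URI are all hypothetical, and the graph is again just a set of triples rather than anything from a real RDF toolkit.

```python
from enum import Enum


class Answer(Enum):
    YES = "yes"
    NO = "no"
    UNKNOWN = "unknown"


def has_summary_closed(graph, entry):
    """Closed world: whatever is not in the data is taken to be false."""
    found = any(s == entry and p == "summary" for s, p, _ in graph)
    return Answer.YES if found else Answer.NO


def has_summary_open(graph, entry):
    """Open world: whatever is not in the data is simply not known yet."""
    found = any(s == entry and p == "summary" for s, p, _ in graph)
    return Answer.YES if found else Answer.UNKNOWN


entry = "http://example.org/entry/1"
graph = {(entry, "title", "RSQ: Really Simple Querying?")}

closed_answer = has_summary_closed(graph, entry)
open_answer = has_summary_open(graph, entry)

# More data arrives later; merging it in is all that is needed.
graph |= {(entry, "summary", "A short summary")}
later_answer = has_summary_open(graph, entry)
```

Under the open reading, the first answer is never contradicted by the later merge; it just gets refined from "unknown" to "yes".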

"Developers who come to the Semantic Web effort via XML technology often make an understandable mistake. They assume that missing is broken when it comes to the contents of RDF/XML documents, that if you omit some piece of information from an RDF file, you have in some formal, technical sense 'done something wrong' and should be punished. RDF doesn't work like that. Missing isn't broken. In the general case, you are free to say as much, or as little, in your RDF document as you like. RDF vocabularies such as FOAF, Dublin Core, MusicBrainz, RDF-Wordnet don't get to tell you what to do, what to write, what to say. Instead, they serve as an interconnected dictionary documenting the meaning of the terms you're using in your RDF documents."

The case of the Description Logics and ontology worlds coming to the Semantic Web and worrying over queries that will blow up in the engines is much like the case of the enterprise world coming to the Web worrying over type systems and discovery languages. The likeness is not fleeting - both the Semantic Web and Web Services advocates have been busy building competing technology stacks in the last decade. They have valid points and good technology but the need or demand for such precision in the Web context has been overestimated. As Pat Hayes put it:

"It is fundamentally unnecessary. The semantic web doesn't need all these DL guards and limitations, because it doesn't need to provide the industrial-quality guarantees of inferential performance. Using DLs as a semantic web content markup standard is a failure of imagination: it presumes that the Web is going to be something like a giant corporation, with the same requirements of predictability and provable performance. In fact (if the SW ever becomes a reality) it will be quite different from current industrial ontology practice in many ways. It will be far 'scruffier', for a start; people will use ingenious tricks to scrape partly-ill-formed content from ill-structured sources, and there is no point in trying to prevent them doing so, or tutting with disapproval. But aside from that, it will be on a scale that will completely defeat any attempt to restrict inference to manageable bounds. If one is dealing with 10|9 assertions, the difference between a polynomial complexity class and something worse is largely irrelevant."

Pat Hayes is an interesting person to have said that. He's a legend in the world of AI in the way Adam Bosworth is a legend as a software developer. Both have concluded in their own ways that the 'neat' orthodoxies implicit in Web Services and the Semantic Web are futile. Cleaning up the Web is infeasible.

If you come from an SQL/XML background, the open world idea of everything being effectively optional is going to seem weird and unworkable, but what it really means is that every addition of data is an extension act - extensibility is intrinsic to the RDF way of doing things, not something that gets bolted on as with mustUnderstand/mustIgnore. The same intrinsic nature goes for distribution of datasets. Since RDF data can be distributed across any number of nodes, the technical challenge is not scaling the database across clusters; it's routing and distributing queries. Query routing is a special case of the kind of packet routing problems that occupy telecoms, peer-to-peer and internet engineers. Adam Bosworth is right, we need Really Simple Querying, but it's a bit early to rule out RDF as a good fit for returning the results or dealing with scale issues.
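As a rough sketch of what routing a query over distributed graphs might look like, the Python below fans a triple pattern out to every node and unions the partial results. It is a naive scatter-gather under stated assumptions, not Bosworth's design or any real query router: the node names, URIs, `match` and `route_query` are hypothetical, the "nodes" are in-memory sets, and a real router would pick which nodes to ask and would have to cope with latency and failures.

```python
def match(graph, s=None, p=None, o=None):
    """Return the sub-graph of triples matching a pattern (None = wildcard)."""
    return {
        t for t in graph
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    }


# Each node holds one piece of a larger graph (hypothetical data).
shards = {
    "node-a": {
        ("http://example.org/entry/1", "title", "RSQ: Really Simple Querying?"),
        ("http://example.org/entry/1", "creator", "Bill"),
    },
    "node-b": {
        ("http://example.org/entry/2", "title", "Another entry"),
        ("http://example.org/entry/1", "subject", "RDF"),
    },
}


def route_query(shards, s=None, p=None, o=None):
    """Send the pattern to every shard and merge the partial result graphs."""
    results = set()
    for graph in shards.values():
        results |= match(graph, s, p, o)
    return results


# Query sent out to the data...
titles = route_query(shards, p="title")

# ...versus the data pulled into one place and queried there.
centralized = match(set().union(*shards.values()), p="title")
```

For a single triple pattern like this one, the routed answer and the centralized answer are the same graph, because each partial result is itself a graph and merging is just union. Patterns that join across nodes are where the routing problem gets hard.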


April 27, 2005 09:25 PM

Comments

Mike Champion
(April 28, 2005 11:32 PM #)

Great post, thanks for the clarifications. I know I am guilty of thinking of RDF as KR koolaid with angle brackets. Likewise I hear more about ontologies and DLs from SW people than this "simple and sloppy" SW approach you discuss.

I don't really see how RDF is going to support RSQ, but there are some good links to explore here.

BTW, the PPT is at http://www.webratio.com/images/20050408Bosworth.pps

James Governor
(April 29, 2005 10:40 AM #)

Thanks. that helps clears up somewhat of a misconception on my part. RDF formalism

maybe the RDF folks need to meme "we like sloppy".

patrick Logan
(April 29, 2005 09:37 PM #)

Some thoughts:

1. It's funny that AI is disparaged in the world of the fuzzy. I am not so familiar with description logic. What little AI I did was with frame-based languages (Carnegie Representation Language, KnowledgeCraft). The data/objects/rules rather than logic-orientation may have been more naturally supportive of open-world attitudes.

2. My understanding of RDF is that it is a concept with many specific representations. Is there an assumed canonical format for gathering the results of a distributed world-wide query?

3. I assume most data query applications do not need to consider the entire web as does a completely free-text application like Google. Subsets of Google apply just to images, just to news, etc. Presumably "just to calendars" is coming. Whether you centralize or decentralize the query, another interesting question will be the conventions for finding the appropriate data.

4. I keep going back to Tom Malone's Information Lens, Object Lens, OVAL series of projects. This was on a much smaller scale than the world, but he did some interesting work on partially-shared views of distributed semi-structured data. (http://portal.acm.org/citation.cfm?id=78916, http://ideas.repec.org/p/mit/sloanp/2209.html)

5. The more that subsets of semi-structured world data comes into focus for specific applications, the more "queries" will want to include (fuzzy) calculations.

6. Jim Gray's April 2005 ACM Queue article is another interesting tangent on this theme. (http://acmqueue.com/modules.php?name=Content&pa=showpage&pid=293)

No shortage of interesting problems.

Dan Connolly
(April 29, 2005 09:47 PM #)

I picked up Don't Worry Be Crappy a while ago and have been advocating it just about every chance I get.

Murray Spork
(April 30, 2005 03:54 PM #)

In fact RDF was originally conceived by those from the "scruffy" side of AI. See, for example, this post by Guha.

But then the "neat" side of AI started to become more prominent when DAML+OIL and then OWL were conceived. You can see evidence of the tension between these two camps coming to the fore in this thread involving Guha, Patel-Schneider and Horrocks

But really we need to look to the non-logicians to see how RDF is really being used in the way Bill describes - people like Uche Ogbuji. This is a great post by Uche that I think reveals exactly some of the powerful qualities of RDF that Bill elaborates upon above.

I've been saying for a while now that what the Semantic Web needs is more algebra and less logic. - for example the "additive" and "subtractive" qualities of RDF that Bill talks about.

This is the "bottom-up" Semantic Web and I don't think it is so different to what Adam is trying to achieve or even that different to Clay Shirky's vision.

Trackback Pings

Listed below are links to weblogs that reference RSQ: Really Simple Querying?:

» Closed World? from franklinmint.fm
Bill de hÓra's just posted some interesting thoughts on Atom and RSS datastructures vs. RDF. He's quite right that RDF... [Read More]

Tracked on April 28, 2005 10:12 PM

» RSS: really something significant from Lorcan Dempsey's weblog
RSS has captured the headlines ;-) There have been a couple of major ripples in recent months: A9 introduced OpenSearch, a method for exchanging searches and results between applications. RSS is the format used for results. Adam Bosworth, the influentia... [Read More]

Tracked on May 3, 2005 08:59 PM

» Web 2.0 Weekly Wrap-up, 2-8 May 2005 from Read/Write Web

This week: business folk getting interested in Web 2.0, Adam Curry podcasting from 2.0 perspective, cool Web 2.0 'mini-apps', wrap-up of the adverts in RSS debate, Bosworth's Web of Data...

[Read More]

Tracked on May 9, 2005 09:48 AM

» Web 2.0 Weekly Wrap-Up from The Mediaburn Radio Weblog
Web 2.0 Weekly Wrap-up, 2-8 May 2005 . [Read More]

Tracked on May 14, 2005 05:57 PM