
First thoughts on a search project

Web search blows goats. Local search totally blows goats.

For the web case: we need to decentralize search by passing queries around from site to site (trackback chains, mod-pubsub, or hack the bejeesus out of mod_backhand) and by letting sites generate metadata locally and publish it, instead of having spiders reverse engineer it from HTML (duh). No matter how fast you can do it, downloading the Web into a cluster and indexing it: in what possible world is that a good idea?

For the local case: same thing, except we do the indexing and monitoring by hanging listeners onto the OS. The plumbing and UI are different, but the index material, metadata and plugin models for listeners and indexers should be much the same. We could do LAN-wide index sharing over zeroconf, which would be fun, as would a tuplespaces model instead of using MQs or interrupts. We can of course upload indices to the web or onto your phone.
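A rough sketch of the listener/indexer split described above, not a design. Everything here is hypothetical: `Index`, `EventBus` and `make_indexer` are made-up names, the events are published by hand rather than by a real OS hook, and the index is a toy in-memory inverted index.

```python
import re
from collections import defaultdict
from typing import Callable, Dict, List, Set


class Index:
    """Toy in-memory inverted index: term -> set of document paths."""

    def __init__(self) -> None:
        self.postings: Dict[str, Set[str]] = defaultdict(set)

    def add(self, path: str, text: str) -> None:
        self.remove(path)  # re-indexing a path replaces its old postings
        for term in re.findall(r"[a-z0-9]+", text.lower()):
            self.postings[term].add(path)

    def remove(self, path: str) -> None:
        for paths in self.postings.values():
            paths.discard(path)

    def search(self, query: str) -> Set[str]:
        """Return the paths that contain every term in the query."""
        terms = re.findall(r"[a-z0-9]+", query.lower())
        if not terms:
            return set()
        return set.intersection(*(self.postings.get(t, set()) for t in terms))


class EventBus:
    """Stand-in for the OS-level listener plumbing (assumed, not a real API)."""

    def __init__(self) -> None:
        self.handlers: List[Callable[[str, str, str], None]] = []

    def subscribe(self, handler: Callable[[str, str, str], None]) -> None:
        self.handlers.append(handler)

    def publish(self, kind: str, path: str, text: str = "") -> None:
        for handler in self.handlers:
            handler(kind, path, text)


def make_indexer(index: Index) -> Callable[[str, str, str], None]:
    """An indexer plugin: keeps the index in step with file events."""

    def on_event(kind: str, path: str, text: str) -> None:
        if kind == "deleted":
            index.remove(path)
        else:  # "created" or "modified"
            index.add(path, text)

    return on_event


index = Index()
bus = EventBus()
bus.subscribe(make_indexer(index))

bus.publish("created", "/notes/search.txt", "Decentralized search over zeroconf")
bus.publish("created", "/notes/build.txt", "Continuous build notes")
bus.publish("modified", "/notes/build.txt", "Continuous build and search notes")

print(sorted(index.search("search")))
```

The same `Index` could in principle sit behind a web-side plugin; only the thing calling `publish` would change.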

Let's use RDF for the data. Having seen that people figure using SOAP envelopes is not insane for UDP discovery broadcasts, content management or systems integration, I figure RDF is as production-worthy a technology as any for search and query. Or possibly an RDF that uses WikiNames instead of URIs.
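To make the RDF idea concrete, here is a minimal sketch of the kind of metadata a site might publish about one page, written out as N-Triples and queried by pattern. The page URL under `example.org` is a placeholder, the helper names (`Triple`, `to_ntriples`, `match`) are invented for illustration, and every object is assumed to be a plain string literal. The predicates are the Dublin Core element set.

```python
from typing import List, NamedTuple, Optional

DC = "http://purl.org/dc/elements/1.1/"  # Dublin Core element namespace
PAGE = "http://example.org/2004/02/search-project"  # placeholder URL


class Triple(NamedTuple):
    subject: str    # URI of the thing being described
    predicate: str  # URI of the property
    obj: str        # literal value (this sketch only handles string literals)


def escape(value: str) -> str:
    """Escape a string for use inside an N-Triples literal."""
    return (
        value.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )


def to_ntriples(triples: List[Triple]) -> str:
    """Serialize triples as N-Triples, one statement per line."""
    return "\n".join(
        f'<{t.subject}> <{t.predicate}> "{escape(t.obj)}" .' for t in triples
    )


def match(
    triples: List[Triple],
    subject: Optional[str] = None,
    predicate: Optional[str] = None,
    obj: Optional[str] = None,
) -> List[Triple]:
    """Return triples matching the pattern; None acts as a wildcard."""
    return [
        t
        for t in triples
        if (subject is None or t.subject == subject)
        and (predicate is None or t.predicate == predicate)
        and (obj is None or t.obj == obj)
    ]


triples = [
    Triple(PAGE, DC + "title", "First thoughts on a search project"),
    Triple(PAGE, DC + "creator", "Bill"),
    Triple(PAGE, DC + "date", "2004-02-27"),
]

print(to_ntriples(triples))
print(match(triples, predicate=DC + "date"))
```

A query passed from site to site could then be little more than a triple pattern like the one given to `match`, though how that would be routed is left open here.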

But basically: a) my continuous build thingy is going to be done in the next two months; b) I can't think of a fun mobile devices project; c) wiki, my favourite web technology, is now owned by Confluence and SnipSnap; d) I badly need better search over all my stuff.

So I'm going to give this 12-18 months. Cool names solicited.

[air: alpha beta gaga]

February 27, 2004 12:50 AM


Brett Morgan
(February 27, 2004 03:17 AM #)

I'm curious how RDF honestly helps in search. Watching RSS, most people generate crap feeds. Honestly. Expecting people to magically generate good RDF descriptions of their sites is almost laughable. And the obvious gambit of writing some AI pixie dust to automatically generate RDF from someone's ramblings is enough to keep me chuckling for most of the afternoon...

(February 27, 2004 01:50 PM #)

Bill - great idea.

Brett - there's less magic needed than in the current reverse-engineering of HTML. No pixie dust: it's possible to take considerably more advantage of data that is already available, and to make it easy to add more. If you read the RDF Primer and think for around 30 seconds, the benefit of RDF for search should be pretty clear.

Vincent D Murphy
(February 27, 2004 07:19 PM #)

I agree with Bill and Danny.

The Google status quo works, but surely there is room for improvement. Search would be better if it were decentralised, and each site should know how to index its data better than Google (e.g. excluding boilerplate, navigation, ads, and all the other crap in an HTML page).

I think anyone who can create an inverted index (e.g. with Lucene) for their data can also publish RDF; I imagine the effort required is roughly equal.

Bill: continuous build thingy?

Jon Hanna
(March 3, 2004 04:24 PM #)

There are serious trust issues here, though. Me, I like to get matched accurately by search engines, but then I'm a non-profit, so unless someone comes across me through Google and, in a mad fit of altruism, buys me all the stuff on my wishlists, maxes out their credit card on Amazon through me, and then donates to the charities I plug from time to time, the more hits I get through Google the worse off I am. I'm happy about that if it helps anyone, but it still costs more than it makes.
The for-profits, though, have already made the earliest mechanism for metadata publishing (the meta element) untrustworthy for anyone that doesn't have a vested interest in keeping honest. Even automated parsing of HTML (à la Google) has to be paranoid. One of the main reasons PageRank works *relatively* well is that it's harder (though not of course impossible) to fake out than content.
So how do we either:
a. Keep people honest?
b. Route around the dishonest?
c. Both?

Instincts tell me that neither will be easy, but that if an elegant solution does pop into anyone's head, it'll be for routing around the dishonest rather than preventing dishonesty.
