
Over there, maybe

Danny responds quickly on "where do we go from here"

On Databases and efficiency: "I suspect the practical upper limits of scale are well above what'll be needed in practice."

That's a 640kb argument if I ever heard one :) I must say I really, really don't believe it. I'm thinking billions and billions of triples within a decade, maybe less. Ok, I'm exaggerating, and it's easy to just add another order of magnitude to score points - but having to interactively process 10s or 100s of millions of triples isn't far fetched.

Update: Danny left a great comment, I'm lifting the entire thing:

I guess I should have qualified that first sentence: "I suspect the practical upper limits of scale *for a single store* are well above what'll be needed in practice."

I generally agree with what you're saying here, but would emphasize spreading the latency question out - say you've got 1000 triples in each of a 1000 independent, remote stores, how quickly can you match a particular pattern?

I'm not sure how far the notion of response time in search engines generalises. How's this sound:

Customer: "McTodger and chips, please" [400mS]
(plastic tray appears)

- processing time 400mS, response time 400mS,

Customer: "McTodger, please"
Spotty Youth: "You want fries with that?" [100mS]
Customer: "yes" [500mS]
Spotty Youth: "You want a McCupOfTea with that?" [100mS]
Customer: "no" [500mS]
Spotty Youth: "anything McElse?" [100mS]
Customer: "no" [500mS]
(plastic tray appears)

- processing time 1800mS, *apparent* response time 100mS

Whatever, the fact that Google can do what it does is some cause for optimism. As is Elias Torres playing with Map/Reduce code.
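Danny's 1000-stores-by-1000-triples question is easy to play with in a few lines. Here's a minimal sketch - simulated in-memory stores and made-up delays, not real SPARQL endpoints - of why the apparent response time of a fan-out can stay close to one round trip, so long as you query the stores concurrently rather than one after another:

```python
import asyncio
import random
import time

# 1000 simulated stores, each holding a couple of triples. Real stores would
# be remote endpoints; these are plain lists plus a fake network delay.
STORES = [
    [("ex:doc%d" % i, "dc:creator", "ex:danny"),
     ("ex:doc%d" % i, "dc:title", "Doc %d" % i)]
    for i in range(1000)
]

def matches(triple, pattern):
    # A pattern is an (s, p, o) tuple where None acts as a wildcard.
    return all(p is None or p == t for t, p in zip(triple, pattern))

async def query_store(store, pattern):
    await asyncio.sleep(random.uniform(0.05, 0.3))  # pretend round trip
    return [t for t in store if matches(t, pattern)]

async def fan_out(pattern):
    start = time.perf_counter()
    results = await asyncio.gather(*(query_store(s, pattern) for s in STORES))
    hits = [t for rs in results for t in rs]
    # Issued concurrently, apparent latency is close to the slowest single
    # store (~0.3s here), not the sum of 1000 round trips.
    print("%d matches in %.2fs" % (len(hits), time.perf_counter() - start))

asyncio.run(fan_out((None, "dc:creator", "ex:danny")))
```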

Update: following some links, I found a paper on scaling Ingenta's storage. Leigh Dodds works for Ingenta and they use Jena+Postgres; he's been looking at the RDBMS scaling side for some time. I also found this claim by Michael Bergman: "It is truly (yes, TRULY), not uncommon to see ten-fold storage increases with semantically-aware document sets." That's more or less been my experience. So maybe we need to budget for an extra order of magnitude?
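For what it's worth, the ten-fold figure isn't hard to arrive at on a napkin. A sketch with assumed numbers (these are illustrative, not measurements from Ingenta or anyone else):

```python
# Back-of-envelope only: all three figures below are assumptions.
doc_size_bytes = 2000      # a short source document
triples_per_doc = 100      # statements extracted or inferred from it
bytes_per_triple = 200     # three URIs/literals plus index overhead in a naive store

annotation_bytes = triples_per_doc * bytes_per_triple
print(annotation_bytes / doc_size_bytes)   # -> 10.0, i.e. the ten-fold increase
```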

It's not so much a question of how much data - it's a question of how efficient triples can be compared to, say, db tables operating over domain models, or a text store operating over inverted indices (technologies that have had thousands of man-years and billions of dollars invested in making them efficient). Without that, the only way to justify a massive performance hit is a corresponding increase in functionality - one place where the semweb community needs to explain itself better.
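To make that comparison concrete, here's a toy sketch (sqlite3 in memory, with a made-up schema and data) of the same question asked over a domain-modelled table and over a generic subject/predicate/object table. The self-joins on the triples table - one per constrained property - are where the performance hit creeps in:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
  CREATE TABLE person (id TEXT PRIMARY KEY, name TEXT, city TEXT, age INTEGER);
  CREATE TABLE triples (s TEXT, p TEXT, o TEXT);
  INSERT INTO person VALUES ('p1', 'Alice', 'Dublin', 34);
  INSERT INTO triples VALUES
    ('p1', 'foaf:name', 'Alice'),
    ('p1', 'ex:city',   'Dublin'),
    ('p1', 'ex:age',    '34');
""")

# Domain model: one indexed table, one pass.
rows = db.execute(
    "SELECT name FROM person WHERE city = ? AND age > ?", ("Dublin", 30)
).fetchall()

# Triples: every constrained property becomes another self-join on (s, p, o).
rows2 = db.execute("""
    SELECT n.o FROM triples n
    JOIN triples c ON c.s = n.s AND c.p = 'ex:city' AND c.o = 'Dublin'
    JOIN triples a ON a.s = n.s AND a.p = 'ex:age'  AND CAST(a.o AS INTEGER) > 30
    WHERE n.p = 'foaf:name'
""").fetchall()

print(rows, rows2)   # both return [('Alice',)]
```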

This goes back to integration as well - where's the compelling story about how RDF can augment existing domain models? I've seen enough to say it's entirely a good idea, but I wouldn't bet a system design on it just yet. Maybe I need to catch up on the semweb engineering state of the art; I'm easily two years behind.

On scale: "One of the features of the Semantic Web is that it's distributed (just like the web), so there's no need to keep everything in one place."

That much I know, but see the point I made about it being an engineering necessity, as opposed to a feature. One word counts here - latency. If you believe the research Google have conducted recently, response speed matters to users more than anything. And I'm betting most of the time Google spend searching is due to data center latency. Of course Google, along with all the other major search engines, are heavily invested in centralised storage. Then again, I've heard this "search speed is king" argument anecdotally from time to time over the years.

I guess if anyone can pull it off, they'll have an instantly disruptive technology for searching, one that would fit naturally with the interaction models of Mobile and IM technology, which are nothing like the Web's.

On this issue of time on the wire, I did some back of the napkin stuff a few years back for a project - iirc RDF/XML was the most efficient way to represent triples; that's probably due to XML namespaces acting as a compression algorithm for URIs. I remember thinking Turtle plus namespace abbreviations would be the way to go; you get a second boost since it can be parsed faster than markup.
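That napkin comparison is easy to redo today, assuming rdflib is installed (the vocabulary and data below are invented for illustration):

```python
from rdflib import Graph, Literal, Namespace, RDF

EX = Namespace("http://example.org/vocab/")
g = Graph()
g.bind("ex", EX)
for i in range(1000):
    doc = EX["doc/%d" % i]
    g.add((doc, RDF.type, EX.Document))
    g.add((doc, EX.title, Literal("Document %d" % i)))

xml = g.serialize(format="xml")
ttl = g.serialize(format="turtle")
# Both serializations abbreviate the long URIs via the bound prefixes;
# comparing lengths gives a rough feel for bytes on the wire.
print(len(xml), len(ttl))
```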

The other option for reducing time on the wire is intelligent routing built on a network of notifications. That would give query services a means to expose what domains they can answer on, allowing you to route queries to them. The big assumption is that queries can be analysed - tough when most people only type in one or two words. But it might be useful for vertical client applications, such as music players (arguably Amarok already does this with MusicBrainz).
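A hypothetical sketch of that routing idea - services advertise which predicates or vocabularies they cover, and a router fans a query out only to the endpoints that claim coverage. The registry, endpoint names and prefixes here are all invented; no standard protocol is implied:

```python
# Map of endpoint -> predicates it advertises it can answer on (made up).
ROUTES = {
    "http://musicbrainz.example/sparql": {"mb:artist", "mb:release", "mb:track"},
    "http://books.example/sparql":       {"dc:title", "dc:creator", "isbn:number"},
    "http://geo.example/sparql":         {"geo:lat", "geo:long"},
}

def route(query_predicates):
    """Return the endpoints whose advertised predicates overlap the query's."""
    wanted = set(query_predicates)
    return [ep for ep, offered in ROUTES.items() if offered & wanted]

# A music player asking about an artist only bothers the music endpoint:
print(route({"mb:artist", "mb:release"}))
```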


November 12, 2006 02:53 PM

Comments

Danny
(November 12, 2006 07:45 PM #)

I guess I should have qualified that first sentence: "I suspect the practical upper limits of scale *for a single store* are well above what'll be needed in practice."

I generally agree with what you're saying here, but would emphasize spreading the latency question out - say you've got 1000 triples in each of a 1000 independent, remote stores, how quickly can you match a particular pattern?

I'm not sure how far the notion of response time in search engines generalises. How's this sound:

Customer: "McTodger and chips, please" [400mS]
(plastic tray appears)

- processing time 400mS, response time 400mS,

Customer: "McTodger, please"
Spotty Youth: "You want fries with that?" [100mS]
Customer: "yes" [500mS]
Spotty Youth: "You want a McCupOfTea with that?" [100mS]
Customer: "no" [500mS]
Spotty Youth: "anything McElse?" [100mS]
Customer: "no" [500mS]
(plastic tray appears)

- processing time 1800mS, *apparent* response time 100mS

Whatever, the fact that Google can do what it does is some cause for optimism. As is Elias Torres playing with Map/Reduce code.

Assaf
(November 12, 2006 09:50 PM #)

Bill, you're talking about the cost of making like-for-like queries with RDF vs relational. I think that "like" is a fallacy.

“I’m looking for a warm place to vacation and I have a budget of $3,000. Oh, and I have an 11-year-old child.”

If you could get gross data from the Web to answer this question, it would be orders of magnitude more complex than anything we've done to date.

You need to traverse an incredible number of nodes for each resource to decide whether it qualifies, multiplied by the number of resources you have, before you can return a few result sets.

Or you can pre-calculate all that information and put it in a preference system, at which point your RDF query and your RDBMS query are equally efficient - but then they're both running over the same pre-calculated set of data.

Let's say we have three ventures trying to answer this question.

RTW decides to re-engineer the Web so that all that information exists and it can pull it off, with a search engine that can traverse colossal amounts of information to answer just about any question - at a cost of about one CPU per query.

WEMU decides to do a weekend mashup, grab data from a few trusted sources, put them in the database, and slap a nice UI in front of it.

PRA decides to extend their page ranking algorithm to cover blogs, forums and other places where people talk about their experiences and findings.

Which one will be cash positive?
