
Data Parallel

More from Duncan Cragg: "It's scalable because of all the reasons I mentioned before: the cacheability of the basic data operations and their parallelisability through partitioning."

So that's partitioning of operations dealt with, and yes it's a huge feature. Now, what about data? Recently I said in a comment on Joe Gregorio's blog* that RDF can be partitioned "N ways to Sunday". RDF people don't talk up this property enough; it goes overlooked.
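To make the "N ways to Sunday" point concrete, here is a minimal sketch of splitting a set of triples by any of their three positions. The sample triples, the `shard_for` and `partition` helpers, and the shard count are all made up for illustration; no real RDF toolkit is assumed.

```python
import hashlib
from collections import defaultdict

# Hypothetical sample data: (subject, predicate, object) triples.
TRIPLES = [
    ("ex:joe", "foaf:name", "Joe"),
    ("ex:joe", "foaf:knows", "ex:duncan"),
    ("ex:duncan", "foaf:name", "Duncan"),
    ("ex:duncan", "foaf:knows", "ex:joe"),
]

POSITIONS = {"subject": 0, "predicate": 1, "object": 2}


def shard_for(term, shard_count):
    """Map a term to a shard number with a hash that is stable across runs."""
    digest = hashlib.sha1(term.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % shard_count


def partition(triples, by, shard_count):
    """Group triples into shards keyed on the chosen position."""
    index = POSITIONS[by]
    shards = defaultdict(list)
    for triple in triples:
        shards[shard_for(triple[index], shard_count)].append(triple)
    return dict(shards)


if __name__ == "__main__":
    for by in POSITIONS:
        print(by, partition(TRIPLES, by, shard_count=3))
```

Because a triple is a self-contained statement, any of the three splits is a valid partitioning of the same data; which one you pick depends on the queries you expect.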

Over there, Pete Kirkham points out that, "the problem with triples is you then have to do joins to create objects out of them". I'd go further and say the real problem is grouping them into the tidy 'domain models' that OO devs and gurus like Eric Evans and Martin Fowler insist are a good thing. But once you partition RDBMS-backed data (as most big web-based systems end up doing, especially on their user accounts), you have to do the first bit (distributed joins) anyway. It seems that GOOG and EBAY have decided to accept this as a physical design constraint, and thus are keeping data integrity constraints in the applications and replacing the RDBMS with raw storage, however barmy that sounds to those of us working at smaller scales.
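As a rough illustration of what "joins in the application" means once the data is partitioned, the sketch below rebuilds one object from triples scattered over several shards and runs a referential check in application code. The `SHARDS` layout and the `assemble` and `dangling_references` helpers are hypothetical; this is not how Google or eBay do it, just the general shape of the work.

```python
from collections import defaultdict

# Hypothetical shards, split by predicate, so one subject's triples
# are spread over several places.
SHARDS = {
    "names": [
        ("ex:joe", "foaf:name", "Joe"),
        ("ex:duncan", "foaf:name", "Duncan"),
    ],
    "knows": [
        ("ex:joe", "foaf:knows", "ex:duncan"),
    ],
}


def assemble(subject, shards):
    """Application-side join: visit every shard and collect the
    predicate/object pairs that belong to one subject."""
    obj = defaultdict(list)
    for triples in shards.values():
        for s, p, o in triples:
            if s == subject:
                obj[p].append(o)
    return dict(obj)


def dangling_references(shards, predicate="foaf:knows"):
    """Integrity check done by the application instead of the database:
    list (subject, object) pairs whose object is described nowhere."""
    described = {s for triples in shards.values() for s, _, _ in triples}
    return [
        (s, o)
        for triples in shards.values()
        for s, p, o in triples
        if p == predicate and o not in described
    ]


if __name__ == "__main__":
    print(assemble("ex:joe", SHARDS))
    print(dangling_references(SHARDS))
```

A real system would route the lookup to the relevant shards rather than scanning all of them, but the join and the constraint check still end up in application code either way.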

At that point, perhaps it becomes worth considering whether you actually need what an RDBMS gives you anymore, or whether you need a dumb store, a la BigTable. Perhaps the RDF guys should stop figuring out how to solve the RDBMS-Triple Impedance Mismatch Problem and start looking at alternative storage like Hadoop. Most RDF systems using relational databases are using them as dumb stores anyway, or at least they were last time I looked.
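For a sense of how little a "dumb store" has to offer to hold triples, here is a small sketch: a sorted set of keys with nothing but put and prefix scan, and each triple written under three key orderings so it can be found from any side. `DumbStore`, `add_triple`, `about` and `with_predicate` are invented names for illustration; this is not the BigTable or Hadoop API, and the three-ordering layout is one assumed design among several.

```python
import bisect


class DumbStore:
    """Sorted keys only: no schema, no joins, no constraints."""

    def __init__(self):
        self._keys = []

    def put(self, key):
        """Insert a key, keeping the list sorted and free of duplicates."""
        i = bisect.bisect_left(self._keys, key)
        if i == len(self._keys) or self._keys[i] != key:
            self._keys.insert(i, key)

    def scan(self, prefix):
        """Yield every key that starts with the given tuple prefix."""
        i = bisect.bisect_left(self._keys, prefix)
        while i < len(self._keys) and self._keys[i][:len(prefix)] == prefix:
            yield self._keys[i]
            i += 1


def add_triple(store, s, p, o):
    """Write one triple under three orderings (assumed layout)."""
    store.put(("spo", s, p, o))
    store.put(("pos", p, o, s))
    store.put(("osp", o, s, p))


def about(store, subject):
    """All (predicate, object) pairs for a subject."""
    return [(k[2], k[3]) for k in store.scan(("spo", subject))]


def with_predicate(store, predicate):
    """All (subject, object) pairs for a predicate."""
    return [(k[3], k[2]) for k in store.scan(("pos", predicate))]


if __name__ == "__main__":
    store = DumbStore()
    add_triple(store, "ex:joe", "foaf:name", "Joe")
    add_triple(store, "ex:joe", "foaf:knows", "ex:duncan")
    add_triple(store, "ex:duncan", "foaf:name", "Duncan")
    print(about(store, "ex:joe"))
    print(with_predicate(store, "foaf:name"))
```

Everything an RDBMS would add on top (constraints, query planning, transactions) is absent here, which is the trade being weighed.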

It also occurs to me I should do two things: 1) review the current RDF toolsets, as it's been at least 2 years, and 2) really write down what I like about RDF, as opposed to picking at its flaws, which I'm too prone to doing.




* Of late Joe is really starting to "open his shoulders", as we say in parts of Ireland. If you're not subscribed, do so.


April 8, 2007 05:45 PM

Comments

Duncan Cragg
(April 8, 2007 06:26 PM #)

Thanks for the linkage - it's very much appreciated!! :-)

Now, you would pick on one of the lines where I wasn't clear enough!!

I actually meant URI partitioning, not operation partitioning!

I've added 'URI' to the sentence...

Thanks again!


Duncan

chimezie
(April 8, 2007 07:19 PM #)
"Most RDF systems using relational databases are using them as dumb stores anyway, or at least they were last time I looked."

You should take a look at the N3 relational model I. It was intended as a 'third' generation attempt at using an RDBMS more efficiently than long skinny tables:

https://svn.rdflib.net/trunk/rdflib/store/FOPLRelationalModel/

The relational model: http://copia.ogbuji.net/files/N3RelationalModel.xml

"...stop figuring out how to solve the RDBMS-Triple Impedance Mismatch Problem and start looking at alternative storage..."

RDF on https://svn.rdflib.net/trunk/rdflib/store/BerkeleyDB.py

Morten Frederiksen
(April 10, 2007 12:02 PM #)

Please don't stop picking at flaws, I find that your picks tend to be well balanced and actually bring thoughts and "solutions" forward.
