Phat Data

Data trumps processing

For the last few years, I've been hearing that multicore will change everything. Everything. The programming world will be turned on its head because we can't piggyback on faster chipsets anymore. The hardware guys have called time on sloppy programming. We never had it so good.

We're doomed apparently.

I think that increased data volumes will impact day to day programming work far more than multicore will. A constant theme in the work I've done in the last few years has been dealing with larger and larger datasets, what Joe Gregorio calls "Megadata" (but now wishes he didn't). Large data sets are no longer esoteric concerns for a few big companies, but are becoming commonplace.

The use of RDBMSes as data backbones has to be rethought under these volumes; as a result, system designs and programming toolchains will be altered. When the likes of Adam Bosworth, Mike Stonebraker, Pat Helland and Werner Vogels are saying as much, it behooves us to listen.


The first PC I bought had a 750MB disk (I paid extra for it). One of my favorite tech books is Managing Gigabytes, which was published in 1999. Back then gigabytes were a big deal. My laptop of a few years later, which my daughter uses today, had, count 'em, *20GB* of disk. Today I have 120GB of USB storage strapped on the back of my 60GB T42p with velcro. Some time this week my new laptop with a 120GB disk will arrive and I'm already sniffing about for a 160GB USB drive, or maybe I'll strap on another 120GB unit. I have a terabyte of storage around the house.

I find it's all very hard to manage, and filling that disk space is no problem.

In less than a decade the cost of data storage has fallen through the floor, but more importantly the amount of data to store has exploded. I don't have numbers, but I suspect the world's accessible electronic data is growing faster than clock speeds, available bandwidth, or disk seek times are improving. If there's going to be another edition of Managing Gigabytes they'll have to skip terabytes altogether and call it "Managing Petabytes". Managing Gigabytes just about covers personal use these days. Our ability to generate data, especially semi-structured data, appears to be limitless.

Data Physics

The CAP theorem (Consistency. Availability. Partition tolerance. Pick two.) suggests that you can't have your data cake and eat it. Every web 2.0 scaling war story I've heard indicates RDBMS access becomes the fundamental design challenge. Google famously seems able to scale precisely because it doesn't rely on relational databases across the board. People experienced with large datasets say things like joins, 3NF, triggers, and integrity constraints have to go - in other words, key features of RDBMSes, the very reasons you'd decide to use one, get in the way. The RDBMS is reduced to an indexed filesystem.

Is this crazy talk? Maybe. Good luck explaining to data professionals and system architects that centralised relational databases are not the right place to start anymore. They work really well. There is a ridiculous amount of infrastructure and expertise invested in RDBMSes. Billions of dollars. Man-decades. Think of what you get - data integrity, query support, ORM, ACID, well understood replication and redundancy models, deep engineering knowledge. Heads nodding in agreement at your system design. Websites in 15 minutes on high productivity frameworks. Java Enterprise Edition. You'd seem to be crazy to give that up for map-reduce jobs, tuple models, and tablestores that can't even do joins, never mind that there's zero support for object mapping or constraints. It's no small ask to let go of these features. Psychologically, the really hard part seems to be giving up on consistency. The idea of inconsistent data *by design* is an odd-sounding thing to be pitching, no matter how many records you're talking about. You're in danger of sounding irresponsible or idiotic. But if CAP holds, and you have to distribute the data to deal with volumes, and want to make that data available, consistency takes a bath.
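To make "inconsistent by design" a bit more concrete, here is a minimal Python sketch of two replicas that both keep accepting writes while they can't talk to each other, and reconcile afterwards. Everything in it is an assumption made for illustration: the Replica class, the sync function, the caller-supplied integer timestamps, and the last-write-wins merge rule are not taken from any particular product, and last-write-wins is only one of several possible reconciliation policies (it silently drops the older write).

```python
class Replica:
    """Hypothetical replica: an in-memory map of key -> (timestamp, value)."""

    def __init__(self):
        self.data = {}

    def write(self, key, value, ts):
        # Accept the write only if it is newer than what we already hold.
        current = self.data.get(key)
        if current is None or ts > current[0]:
            self.data[key] = (ts, value)

    def read(self, key):
        return self.data[key][1]


def sync(a, b):
    """Exchange state in both directions; the newest timestamp wins."""
    for src, dst in ((a, b), (b, a)):
        for key, (ts, value) in list(src.data.items()):
            dst.write(key, value, ts)


# During a partition both sides stay available and drift apart.
east, west = Replica(), Replica()
east.write("cart", "book", ts=1)
west.write("cart", "book,pen", ts=2)
stale = east.read("cart")   # east still serves its older value

# Once the partition heals, the replicas converge.
sync(east, west)
```

Between the two writes and the sync, a reader hitting east gets an answer that west would disagree with. That is the trade: both sides stayed available, and agreement arrives later rather than never.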

Data as a service

The usual next step when a database approach isn't cutting it is moving files out to a SAN, probably with the metadata and access control in the RDBMS, so you can retain some of your toolchain. SANs will become very popular as they come down in price, but a SAN only solves the remote part of storage. Ultimately you'll need a distributed filesystem that allows data access to be logically untethered from block storage and mount points. The big volumes mean you need to be able to write data and not care where it went. And you need keyed lookup for reads built on top of the FS, not in the RDBMS (on the basis that an RDBMS with no joins, constraints or triggers is an indexed filesystem). That will end up looking something like Hadoop, MogileFS or S3 - a data parallel architecture.
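As a rough illustration of "write data and not care where it went", here is a toy Python sketch of keyed lookup over a handful of storage nodes. It is not how Hadoop, MogileFS or S3 are implemented; the KeyedStore class, the node names, the replica count and the rendezvous-style hash placement are all assumptions chosen to keep the example small, and each "node" is just a dict standing in for a storage server.

```python
import hashlib


class KeyedStore:
    """Toy keyed store: callers put and get by key and never name a node."""

    def __init__(self, node_names, replicas=2):
        # Each "node" is an in-memory dict standing in for a storage server.
        self.nodes = {name: {} for name in node_names}
        self.replicas = replicas

    def _placement(self, key):
        # Rank nodes by a hash of (node, key) and keep the first few, so
        # placement is a pure function of the key and the node list.
        ranked = sorted(
            self.nodes,
            key=lambda name: hashlib.sha1(f"{name}:{key}".encode()).hexdigest(),
        )
        return ranked[: self.replicas]

    def put(self, key, value):
        # Write the value to every node chosen for this key.
        for name in self._placement(key):
            self.nodes[name][key] = value

    def get(self, key):
        # Try the chosen nodes in order and return the first copy found.
        for name in self._placement(key):
            if key in self.nodes[name]:
                return self.nodes[name][key]
        raise KeyError(key)


store = KeyedStore(["node-a", "node-b", "node-c"])
store.put("photos/2007/08/14.jpg", b"...bytes...")
payload = store.get("photos/2007/08/14.jpg")
```

The caller only ever deals in keys; which nodes hold the bytes is decided by the store. There are no joins, constraints or triggers here, which is roughly what is left once those are stripped out of an RDBMS.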

On the other hand, if data needs to be distributed because there's so much of it, and managing a lot of data is consequently difficult, but not core to most business or personal operations, a data grid is a potentially huge utility market to be part of.

August 14, 2007 12:38 AM