Not even wrong

Stonebraker et al, The End of an Architectural Era (It’s Time for a Complete Rewrite):

"Because RDBMSs can be beaten by more than an order of magnitude on the standard OLTP benchmark, then there is no market where they are competitive. As such, they should be considered as legacy technology more than a quarter of a century in age, for which a complete redesign and re-architecting is the appropriate next step"

Stonebraker et al, MapReduce: A major step backwards:

"As both educators and researchers, we are amazed at the hype that the MapReduce proponents have spread about how it represents a paradigm shift in the development of scalable, data-intensive applications. MapReduce may be a good idea for writing certain types of general-purpose computations, but to the database community, it is:

  1. A giant step backward in the programming paradigm for large-scale data intensive applications
  2. A sub-optimal implementation, in that it uses brute force instead of indexing 
  3. Not novel at all -- it represents a specific implementation of well known techniques developed nearly 25 years ago
  4. Missing most of the features that are routinely included in current DBMS
  5. Incompatible with all of the tools DBMS users have come to depend on   "

The problem is that comparing MapReduce to a DBMS is comparing apples and oranges. They simply don't do the same thing. That's aside from the fact that DBMSes work on relational data whereas tools like MapReduce/Hadoop work on semi-structured data, and that (R)DBMSes don't handle the data volumes that technology like MapReduce can.

I see Joe Gregorio called the comparison a category error, but the thing to wonder is why people who know databases inside out are making basic reasoning errors. In that vein it's worth bearing in mind that Stonebraker is backing a "column-store" database architecture. If you believe Wikipedia's article that says BigTable is a column store, then ideally you'd see a comparison of DBMS technology with BigTable, or they'd just say, 'I wouldn't build MapReduce on top of a column store'*.

The one thing I agree on is that MapReduce is not new. Carriero and Gelernter's "How to Write Parallel Programs" is about research work that goes back to the 1980s:

"We can envision parallelism in terms of a program's result, a program's agenda 
of activities or of an ensemble of specialists that collectively constitute the program.
We begin with an analogy.

Suppose you want to build a house. Parallelism - using many people on the job - is
the obvious approach. But there are several different ways in which parallelism might
enter.

First, we might envision parallelism by starting with the finished product, the result. [...]
In sum each worker is assigned to produce one piece of the result [...] This is the result
parallel approach.

At the other end of the spectrum we might envision parallelism by starting with the crew
of workers who will do the building. [...] In sum, each worker is assigned to perform one
specified kind of work [...] This is the specialist parallel approach.

Finally, we might envision parallelism in terms of an agenda of activities that must be
completed in building a house. [...] In sum, each worker is assigned to pick a task from
the agenda and do that task - and repeat, until the job is done [...] This is the agenda
parallel approach.

The boundaries between the three paradigms can sometimes be fuzzy, and we will
often mix elements of several paradigms in getting a particular job done [...] It's
nonetheless an essential point that these three paradigms represent three clearly
separate ways of thinking about the problem."

In their terms, MapReduce is a design pattern for 'result parallel' programming. What is new is that it works outside a research lab on very big data sets. Since we're talking about industrial application, novelty isn't the point; utility is.
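To make the 'result parallel' reading concrete, here is a minimal sketch of the map/reduce pattern as a word count. It is only an illustration and not how Google's MapReduce or Hadoop is implemented: the function names (map_chunk, merge_counts, word_count) are made up for this example, and a local thread pool stands in for a cluster of machines.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def map_chunk(chunk):
    """Map step: count the words in one independent chunk of input."""
    return Counter(chunk.lower().split())


def merge_counts(left, right):
    """Reduce step: combine two partial counts into one."""
    return left + right


def word_count(chunks, workers=4):
    # Each worker produces one piece of the result (a partial count).
    # Threads are an assumption here; they stand in for cluster nodes.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(map_chunk, chunks))
    # The pieces are then folded together into the final answer.
    return reduce(merge_counts, partials, Counter())


counts = word_count(["the cat sat", "the dog sat", "the end"])
print(counts.most_common(2))
```

The map step never looks at another chunk and the merge is associative, which is what lets the work be spread over as many workers as you have.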

According to the article the MapReduce style is being backported into the universities so grads can be trained up:

"For example, IBM and Google have announced plans to make a 1,000 processor cluster available to a few select universities to teach students how to program such clusters using a software tool called MapReduce. Berkeley has gone so far as to plan on teaching their freshman how to program using the MapReduce framework."

This is probably the most interesting thing the article has to say. I recall Yahoo staff stating that available skills are an issue for leveraging Hadoop/HDFS. So a complete rewrite isn't enough; an entire generation of developers and data specialists has to be trained up on this paradigm, and another generation has to be convinced that an RDBMS is not the only serious option for data management. Conceivably the result parallel model becomes as dominant a paradigm for data processing for the next 20 years as RDBMS/SQL has been for the last 20, but who knows.

 

* or point at PIG

2 Comments


    I don't think Map-Reduce is intended to keep track of, say, airplane seats that are being sold by agents all over the world. The Map-Reduce system is apparently the way that Google handles the huge textual inputs from scanning the web. From this, they apparently generate huge indexes by using map-reduce, to support the massive full-text search that is the Google search product. The classical, and crucial, issues of referential integrity, ACID, and so on, simply don't come up. If Google drops a web site or two, only to recover it in a few days, who will know? If the Bank of America drops your bank account, it's a different matter.

    Different inputs, different persistence times, different outputs, different latencies, different reliability constraints.


    "If Google drops a web site or two, only to recover it in a few days, who will know? If the Bank of America drops your bank account, it's a different matter."

    Agree; most web systems prefer to be available. I like to use Cockburn's scale here; he uses it for establishing the right process formalism:

    http://en.wikipedia.org/wiki/Cockburn...

    I think you can use something similar for determining what consistency guarantees the data should give you. Nonetheless the Consistency/Availability/Partitioning theorem will always apply. So if you must distribute, and must have ACID, you cannot always be available.

