Let them eat layer cake: flexibility versus clarity in data
December 31, 2004
Adam Bosworth says users want three things from databases. In summary these are:
1) Dynamic schema so that as the business model/description of goods or services changes and evolves, this evolution can be handled seamlessly in a system running 24 by 7, 365 days a year. [...]
2) Dynamic partitioning of data across large dynamic numbers of machines. A lot of people track a lot of data these days. It is common to talk to customers tracking 100,000,000 items a day and having to maintain the information online for at least 180 days with 4K or more a pop and that adds (or multiplies) up to a 100 TB or so. [...]
3) Modern indexing. Google has spoiled the world. Everyone has learned that just typing in a few words should show the relevant results in a couple of hundred milliseconds. [...]
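As a quick check on the arithmetic in point 2, here is a minimal sketch that multiplies the quoted figures out. It assumes "4K" means 4 KiB (4096 bytes) and reports decimal terabytes; the function and constant names are made up for illustration.

```python
# Back-of-envelope check of the storage figure quoted in point 2.
# Assumption: "4K" is read as 4 KiB per item, the stated minimum.
ITEMS_PER_DAY = 100_000_000
RETENTION_DAYS = 180
BYTES_PER_ITEM = 4 * 1024


def retained_bytes(items_per_day: int, days: int, bytes_per_item: int) -> int:
    """Total bytes kept online for the whole retention window."""
    return items_per_day * days * bytes_per_item


total = retained_bytes(ITEMS_PER_DAY, RETENTION_DAYS, BYTES_PER_ITEM)
total_tb = total / 10**12  # decimal terabytes

print(f"{total_tb:.1f} TB at the 4K floor")
```

At the 4K floor this comes to roughly 74 TB, so "100 TB or so" is plausible once the "or more" per item and any indexing overhead are counted.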
Danny Ayers suggests RDF triple stores as an option, with some caveats:
Focussing on RDF's features, it has a solid logical base which offers pretty much the same capabilities as Codd's relational model but with additional support for ontology-based reasoning. The interchange language is usable and tool support is pretty good. The Sparql language may not quite be finished but it's certainly comparable to SQL. These are all big advantages on top of Adam's checklist.
Overall this would suggest that RDF stores are potentially good DBs, on many points potentially much better than regular RDBMSs (or XML DBs) because of the more flexible model. But for this to be practicable it assumes the performance can be brought to a comparable level as RDBMSs, which if it hasn't already been done would, I think, only be a small matter of programming.
Performance, no doubt, is an important concern. It's easy to imagine that much of the effort in building a modern RDBMS is sunk into query optimization. And it's easy to argue that much (too much) enterprise programming is, when you get down to it, implementing cache management strategies for data. Yet while performance inside the database may be a small matter of programming, at the application layer caching and performance strategies tend to take on architectural significance, which is one reason why good performance metrics are critical before diving in.
It's not just an issue with database technology. Assuming you could achieve Adam's requirements for data storage, you would then need application software to stay current with an ever-changing business domain. It's difficult to imagine a world with infinitely flexible data forms and data flows which did not result in even more change pressure on applications. Changing a database schema is not that difficult - the trick is keeping applications reading and writing from breaking under change.
Another possibility is that application software might have too much built-in domain knowledge. This may sound counter-intuitive, but arguably the less domain smarts the code has, the less change you'll have to contend with.
Project management notions of software projects as industrial manufacture, and pre-1960s industrial manufacture at that, are also a concern. In that paradigm, change in what is to be built is, by our standards, glacial, and things that are built are not expected to have to change at all. That's generally not how things actually happen on a software project. [The alternative agile approaches are still formative - Kent Beck is still learning how to get things done, going by the diffs between the 1st and 2nd editions of Extreme Programming.]
Flexibility v Clarity.
Being able to serve a planet is one thing, but as pertinent as any performance or scale issue is the trade-off implicit in Adam's requirements between Flexibility and Clarity.
The case for flexibility is well understood. Yet anyone who thinks that conceptual clarity is not important in software isn't maintaining enough of it. Business domain models implemented as relational schemata tend to be clear enough (the exception being when the relations are managed in application software). Object Oriented approaches are slightly less obvious but many claim they are more flexible than relational ones and represent a good tradeoff over making a database the primary representation of a domain.
Clarity comes with a cost. The problem with explicit domain models is twofold. First of all they're models and are at best a facsimile of the world or your part of it; at worst they are actively misleading and constraining. Second is that they are a snapshot of reality. For a while after going live the model will be relevant. After that it will need to be altered to fit the new reality or begin to rot with respect to that new reality. Even where the model is accurate and well formalized to begin with it still runs the risk of inducing arbitrary constraints as it falls out of step. Applications tend to crystallize around data structures - so while the system may have a strong articulation of the domain to begin with, and that is valuable for all kinds of reasons, it will tend to have difficulty staying relevant as the world modelled in the domain changes. In production scenarios it seems most of the effort is expended in solving the first issue (accuracy) - the second issue (relevance) is often let slide.
Infinite flexibility is a tough ask.
One thing you can say about RDF is that it is highly flexible. Another thing you can say is that it consequently lacks clarity, especially when you have a lot of it. Never mind domain specific entities, just picking out domain archetypes (persons, items, invoices, events) is difficult enough. Even where there is a good knowledge of the vocabularies involved, for all its formality, collated RDF data tends to be messy.
Recently I've had some experience at the non-planetary scale of trading off between the extreme flexibility of technology like RDF versus a domain model that a person coming after could reasonably be expected to understand. This was a system designed to trap and collect events from various software and servers at various layers (operating system services, servers, application code, networks, business-process events, you name it). It used XMPP (Jabber) instant messaging as the backbone. The system ostensibly required two domain models, one for events and one for the other system/cluster that was being monitored (an eGovernment multi-channel messaging broker).
One design goal was to allow new components, services, message types, physical infrastructure and so on to be added to the broker without requiring upgrades to the monitoring system - anything else would not be cost effective. An interesting design pressure was not being able to predict the nature of the sources or the events down the line or the questions that might be asked of them. That made a system of classification (i.e. a domain model) nigh impossible to establish in the short schedule allowed for by the project. For this, using RDF was ideal as it allowed us to name the things of interest and state that they bore relation to each other without worrying overly about their detailed types and properties. RDF also provided a neutral interlingua; once something could emit events into the XMPP network it could be monitored.
Thus we came up with a hybrid approach - models of events, messages, agencies and the like were allowed to develop on top of the primitive RDF data as and when needed. Where necessary RDF was mapped onto database tables and objects so they could be manipulated. While some of the domain items look like they're going to have a long life, others can be regarded as disposable. The upside to this is that you have clarity where it's needed. In one sense it was determined that flexibility and clarity requirements varied within the system, and we implemented accordingly.
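To make the hybrid idea concrete, here is a minimal sketch of projecting one slice of raw triples onto a relational table while leaving everything else as triples. It is an illustration under assumptions, not the system described above: the `mon:` vocabulary, the `event` table and the `load_events` helper are all hypothetical, and triples are plain Python tuples rather than a real RDF library's objects.

```python
import sqlite3

# Hypothetical raw data: (subject, predicate, object) tuples.
EVENT_TRIPLES = [
    ("ev:1", "rdf:type", "mon:Event"),
    ("ev:1", "mon:source", "host:broker-01"),
    ("ev:1", "mon:severity", "warning"),
    ("ev:2", "rdf:type", "mon:Event"),
    ("ev:2", "mon:source", "host:broker-02"),
    ("ev:2", "mon:severity", "error"),
    ("host:broker-01", "mon:rack", "r7"),  # not modelled; stays as a triple
]


def load_events(conn, triples):
    """Project subjects typed mon:Event onto a flat table; ignore the rest."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS event "
        "(id TEXT PRIMARY KEY, source TEXT, severity TEXT)"
    )
    subjects = {s for s, p, o in triples if p == "rdf:type" and o == "mon:Event"}
    props = {s: {} for s in subjects}
    for s, p, o in triples:
        if s in subjects:
            props[s][p] = o
    for s in sorted(subjects):
        conn.execute(
            "INSERT OR REPLACE INTO event VALUES (?, ?, ?)",
            (s, props[s].get("mon:source"), props[s].get("mon:severity")),
        )
    conn.commit()


conn = sqlite3.connect(":memory:")
load_events(conn, EVENT_TRIPLES)
errors = conn.execute(
    "SELECT id, source FROM event WHERE severity = 'error'"
).fetchall()
print(errors)
```

The table can be dropped and rebuilt from the triples at any time, which is what makes a model like this disposable.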
As far as possible classification smarts are left over to the queries against the data. It doesn't matter that much whether something is a server or a transportation component until someone has a question to ask. This is another area where RDF falls down. Yes, there is Sparql and before that other SQL-like languages, but again you're left iterating over raw RDF graph result sets, which is not always ideal.
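A minimal sketch of what leaving classification to the queries can look like, assuming triples held as plain tuples. The predicates (`mon:listensOn`, `mon:forwardsTo`) and the rules that turn them into "server" or "transport" are hypothetical; the point is only that nothing in the stored data says what kind of thing a node is.

```python
# Hypothetical untyped observations about nodes in the monitored cluster.
TRIPLES = [
    ("node:a", "mon:listensOn", "5222"),
    ("node:a", "mon:emitted", "ev:1"),
    ("node:b", "mon:forwardsTo", "node:a"),
    ("node:b", "mon:emitted", "ev:2"),
    ("node:c", "mon:listensOn", "8080"),
]


def match(triples, s=None, p=None, o=None):
    """Return triples matching the pattern; None acts as a wildcard."""
    return [
        t for t in triples
        if s in (None, t[0]) and p in (None, t[1]) and o in (None, t[2])
    ]


def servers(triples):
    """Query-time rule (an assumption): anything listening on a port."""
    return sorted({s for s, _, _ in match(triples, p="mon:listensOn")})


def transports(triples):
    """Query-time rule (an assumption): anything that forwards to a node."""
    return sorted({s for s, _, _ in match(triples, p="mon:forwardsTo")})


print(servers(TRIPLES))
print(transports(TRIPLES))
```

Changing what counts as a server means changing one query, not migrating data. The cost is visible too: every answer comes back as raw tuples that the caller has to pick apart.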
It turns out that RDF is surprisingly cheap stuff to generate. The downside was that for purposes of communicating intent, ongoing maintenance and adding functionality against the collected data, RDF is not very pleasant to work with, at least not compared to SQL, Objects or XML. This is especially so at the presentation layer. It's also a different paradigm, and by using it you're technologically committed to yet another data model, directed graphs, alongside the usual suspects - objects, markup and relations. The cost of introducing a new model should not be underestimated. As a result RDF has been useful but not as cheap to manipulate as one would like.
Arguably we could have used an RDF store such as Kowari, the built-in persistence mappings of Jena, or even XQuery, along with the RDF interchange. The reality is there's only so much new technology you can apply in one go without taking on too much risk, especially in a short time frame, whereas we had a good idea of what we were getting into with a relational store.
Infinite in all directions
The conclusion I'm coming to with RDF in production scenarios, and this applies to any technology that provides high flexibility, is that we will want to be able to apply domain models on demand. Domain models, where possible, should be late bound. In other words you start with collections of relatively unstructured data and progressively filter it to what us programmer types typically call a Model (the M in MVC). It's not clear if that can work out in the enterprise or on the web, but it is clear that is how biological intelligence seems to work. Ordered layers have proven a workable approach in a variety of areas, such as robotics, machine learning, telemetry, log analysis and intrusion detection. It also seems to be the direction that Event Driven Architectures and Grid-like technologies based on lattices are taking.
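A minimal sketch of a late-bound model, assuming the unstructured layer is a list of triples. The `Invoice` class, the `ex:` predicates and the two filtering functions are hypothetical names chosen for illustration; each function is one layer, and the typed Model only exists once the last layer has run.

```python
from collections import defaultdict
from dataclasses import dataclass

# Layer 0: relatively unstructured data, collected without a schema.
RAW = [
    ("inv:1", "ex:payer", "agency:revenue"),
    ("inv:1", "ex:amount", "120.50"),
    ("inv:2", "ex:payer", "agency:transport"),  # incomplete: no amount yet
    ("ev:9", "ex:source", "host:broker-01"),    # unrelated to invoices
]


@dataclass(frozen=True)
class Invoice:
    """The Model, defined by the consumer rather than by the data store."""
    id: str
    payer: str
    amount: float


def group_by_subject(triples):
    """Layer 1: collect each subject's properties into a dict."""
    grouped = defaultdict(dict)
    for s, p, o in triples:
        grouped[s][p] = o
    return grouped


def bind_invoices(triples):
    """Layer 2: bind only subjects that carry what an Invoice needs."""
    invoices = []
    for subject, props in sorted(group_by_subject(triples).items()):
        if "ex:payer" in props and "ex:amount" in props:
            invoices.append(
                Invoice(subject, props["ex:payer"], float(props["ex:amount"]))
            )
    return invoices


invoices = bind_invoices(RAW)
print(invoices)
```

Data that does not fit the model is skipped rather than rejected, so it remains available to some other model bound later.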
I'd go so far as to argue that this base layer of unstructured data and the layers or services that filter and weed are missing from the Semantic Web stack and that without them it is a non-runner outside the labs, much the same way symbolic AI was until people realized that the more primitive stuff underneath is a feature not a bug.
For RDF, that seems to imply a need for Object Graph Mapping, or Relation Graph Mapping (or both) to integrate RDF graphs with Everything Else and provide domain clarity when it is needed. Better support is needed for query and storage technology for RDF as keeping domain smarts in queries is a key strategy. Kowari is a good start - it's very much an RDF tool for grownups. I just don't know if it's ready to be imposed on customers yet or can be compared in infrastructure terms to an RDBMS. There's no point in having all that flexibility if your ROI is burnt up on maintenance and knowledge transfer. Even so, there's nothing out there I know of that addresses concerns about whether we can start working against TBs of RDF short of a fully decentralized P2P query network. I normally dislike generic arguments from scale as they're often a device to close off options rather than think about the problem. In this case however, I think scale is a valid concern.
Another issue will be process. A substantial chunk of software process and architecture is predicated on establishing a domain model. After a decade of pain bringing software testing up front to become a design technique instead of an afterthought, it's not clear we're ready to move domain modelling away from the initial phases. Once you get past the infrastructure side, the thread managers and transaction monitors, all middleware really is, is an articulation of a domain in code. Arguing for some type of dynamically generated middleware or for the death of middleware itself seems dubious if not foolish.
December 31, 2004 02:45 PM