Vocabulary Design and Integration
April 08, 2007 | co.mments
There are two schools of thought on vocabulary design. The first says you should always reuse terms from existing vocabularies if you have them. The second says you should always create your own terms when given the chance.
The problem with the first is you are beholden to someone else's sensibilities should they change the meaning of terms from under you (if you think the meaning of terms is fixed, there are safer games for you to play than vocabulary design). The problem with the second is term proliferation, which leads to a requirement for data integration between systems (if you think defining the meaning of terms is not coveted, there are again safer games for you to play than vocabulary design).
What's good about the first approach is macroscopic - there are fewer terms on the whole. What's good about the second approach is microscopic - terms have local stability and coherence. Both of these approaches are wrong insofar as neither represents a complete solution. They also transcend technology issues, such as arguments over RDF versus XML. And, at differing rates, both will produce a need to integrate vocabularies.
In XML, the notion of vocabulary is introduced via namespaces and schema languages, but XML doesn't do anything interesting for integration by itself - you need the transformations. The upside of the transformation approach is that it deals well with the psychology of term ownership - wanting to control the meaning of a word is almost instinctive - which lends itself to the vocabulary design approach of term creation.
The downside is that you will have to write the transformations, and test that they do what you intended in terms of the data. Once you have a transformation between two formats, it serves as an implicit specification of the canonical form of the two formats, although that could give some formalists cause for indigestion. "It's ok, we have regression tests" offers limited comfort to said formalists.
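To make the transformation approach concrete, here is a minimal sketch in Python, assuming two hypothetical employee formats (real projects would more likely use XSLT, but the shape of the work - write the mapping, then test it against the data - is the same):

```python
import xml.etree.ElementTree as ET

def transform(source_xml: str) -> str:
    """Map a hypothetical <staff> format onto a hypothetical <employees> format."""
    staff = ET.fromstring(source_xml)
    employees = ET.Element("employees")
    for member in staff.findall("member"):
        # Each mapping rule below is an implicit claim about what the
        # two formats mean - which is why the transform needs testing.
        emp = ET.SubElement(employees, "employee")
        ET.SubElement(emp, "name").text = member.get("name")
        ET.SubElement(emp, "dept").text = member.get("dept")
    return ET.tostring(employees, encoding="unicode")

source = '<staff><member name="Ada" dept="Eng"/></staff>'
print(transform(source))
# <employees><employee><name>Ada</name><dept>Eng</dept></employee></employees>
```

Note that the function body is the "implicit specification" mentioned above: nowhere else is the correspondence between the two formats written down.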
Unfortunately, the RDF approach is often mischaracterised, so let's try and rectify that. The key to understanding RDF lies in what is meant by the term "data model". The term needs calling out because the RDF meaning isn't the same as the (more commonly used) meaning in IT and software circles. In RDF, the data model implies a formal mathematical underpinning, literally "a model of data"*.
While it's hard to discern what others mean by "data model" outside the technical definition used by RDF, the point is that RDF does not work in terms of local canonical agreements for a problem space, ie the domains of discourse for vocabularies. It works by defining a canonical semantics for all data, represented as graph structures. Thus you're welcome to represent some class of thing, say employee details, or some domain, say patient records, in any number of variant** ways in RDF, but they'll all share the data model. Whereas in XML the data models are arbitrary and typically unknown - a declaration is made that the markup and schemata are about some domain, and the programmers are expected to get on with it.
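A sketch of what a shared data model buys you, using made-up URIs: two authors describe the same employee with different terms, yet because everything reduces to (subject, predicate, object) triples, the two descriptions combine by simple set union, with no transform in sight.

```python
# Two variant descriptions of the same resource, using different
# vocabulary terms (all URIs here are illustrative, not real vocabularies).
doc_a = [
    ("http://ex.org/emp/1", "http://ex.org/vocab/name", "Ada"),
]
doc_b = [
    ("http://ex.org/emp/1", "http://other.org/terms/fullName", "Ada"),
]

# Different vocabularies, one data model: merging is just set union,
# because both documents are already graphs of triples.
graph = set(doc_a) | set(doc_b)
print(len(graph))  # 2
```

The union is well defined precisely because the representation is canonical; in XML the equivalent merge would first require agreeing what the two tree shapes mean.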
OWL also has a formal data model - arguably it has three such models, each more powerful than RDF's, and all somewhat tenuously linked to RDF via the notion of a class. RDF/OWL will allow you to make statements about the relative likeness of things that you would otherwise state imperatively using a programming language. To manage differing vocabularies, you'd use constructs such as sameAs from OWL that allow you to say that one thing relates to another in some way - indeed sameAs is probably the best known relation of this kind.
The main value of this approach is easy warehousing and data linking. Transformation code is replaced with declarations of term equivalence. While OWL can go further, and express notions other than term equivalence (such as classhood), how it manages term mapping is of most interest to integrators.
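A sketch of declaration-driven integration, in the spirit of OWL's equivalence constructs (the predicate URIs are hypothetical): instead of writing a transform, we declare two terms equivalent and normalise triples through a union-find over those declarations.

```python
# Declared equivalences between vocabulary terms - the declarative
# replacement for transformation code. URIs are illustrative only.
SAME = [
    ("http://ex.org/vocab/name", "http://other.org/terms/fullName"),
]

parent = {}

def find(term):
    """Return the canonical representative of a term's equivalence class."""
    parent.setdefault(term, term)
    if parent[term] != term:
        parent[term] = find(parent[term])  # path compression
    return parent[term]

def union(a, b):
    """Declare two terms equivalent."""
    parent.setdefault(a, a)
    parent.setdefault(b, b)
    parent[find(a)] = find(b)

for a, b in SAME:
    union(a, b)

triples = {
    ("emp:1", "http://ex.org/vocab/name", "Ada"),
    ("emp:1", "http://other.org/terms/fullName", "Ada"),
}
# Rewriting each predicate to its canonical term collapses the two
# variant statements into one.
canonical = {(s, find(p), o) for s, p, o in triples}
print(len(canonical))  # 1
```

Real OWL reasoners do far more than this, but for the integrator's narrow concern - term mapping - the effect is as above: equivalence declarations stand in for transformation code.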
Notions of Vocabulary
This produces a counterintuitive result - RDF's and OWL's notion of "vocabulary" is very weak compared to XML's, and arguably it doesn't exist at all. That's unusual, because RDF is more strongly associated with heavyweight vocabulary design approaches such as taxonomies and ontologies. What RDF has are groups of terms that happen to be managed by differing communities, and how terms relate is governed by a uniform semantics and processing model. All the focus is on how terms can relate globally, not on how they are modularised and organised for a domain. Thus it's common to see formats that reuse one or more terms from other vocabularies.
XML-based vocabularies, on the other hand, exhibit wide variation in processing and semantics; often this is seen as a feature of using markup. XML documents are also isolated despite the shared syntax: the number of XML formats that mix and match vocabularies is small and reuse is infrequent. Perhaps the most notable counter-example is the Open Office file format, now standardised as ODF, which reuses other specialised vocabularies such as XHTML and SVG.
The Atom format allows and encourages the use of 'foreign markup' from non-Atom namespaces, which is a more flexible approach than previous XML standards took. While we should not read too much into the naming of things, 'foreign markup' betrays a definite bias to vocabulary integration, never mind that a notion such as "foreign RDF" wouldn't make any sense ***.
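A sketch of how a consumer might spot foreign markup in an Atom entry: anything outside the Atom namespace is foreign by definition. The Atom namespace URI is the real one; the geo namespace below is an illustrative stand-in, not a real vocabulary.

```python
import xml.etree.ElementTree as ET

ATOM = "http://www.w3.org/2005/Atom"

entry = ET.fromstring(
    '<entry xmlns="http://www.w3.org/2005/Atom" '
    '       xmlns:geo="http://example.org/geo">'
    '  <title>Hello</title>'
    '  <geo:lat>53.3</geo:lat>'
    '</entry>'
)

# ElementTree encodes the namespace into the tag as {uri}localname,
# so filtering on the Atom namespace separates native from foreign.
foreign = [el.tag for el in entry
           if not el.tag.startswith("{" + ATOM + "}")]
print(foreign)  # ['{http://example.org/geo}lat']
```

Crucially, the consumer doesn't need to understand the foreign terms to carry them along, which is exactly the flexibility the format is after.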
The reason "RDF v XML" or "XML v Microformats" debates don't explain why transforms are more widely adopted than inference as an integration technique is that adoption has nothing to do with the relative technical value of the approaches - clearly you can use various approaches to handle vocabularies and data integration. The reasons are primarily economic, and there are two factors worth considering. First, a transform is the shortest critical path to integrating any two formats, and most people typically only have to care about two formats at a given time; indeed, on many projects teams won't have the time, scope or budget to consider broader concerns. That the individual case is almost always optimised at the expense of the general case on a project should be no secret. Second, a transformation will be most familiar to integrators, in terms of approach, figuring out the risks, available toolchains, and costs. It is integrators who are typically tasked with this work, the majority of which is actually better understood as data migration rather than unification. Irrespective of whether a non-transform approach might in principle produce greater overall value, the transformation approach will tend to have more predictable local outcomes.
* Having a data model is valuable in terms of understanding formal properties and expressive power, but most people can and do get away without caring about the details day to day, in much the same way the working programmer isn't overly focused on Turing machines or relational algebra.
** note that variance here also includes syntax
*** Incidentally, the Atom Working Group's consensus was that the second approach, term creation, was the lesser of two weevils.