« Database type switching | Main | Apples and Oranges »

More database type switching

I got some quick feedback on my question about how to treat disjoint types in an RDBMS, but it seems I left out some detail and I might have posed the wrong question altogether. To recap, there's an event structure as follows:

  class event:
    def __init__(self, what, where, when)
      self.what=what
      self.where=where
      self.when=when

whose 'what' value can be a string, a URI or an XML document. The thing is that these 3 types are disjoint and I was wondering what people thought the idiomatic way to deal with this issue was in an RDBMS.

Bill Seitz mentioned sparse tables where one of the 3 possible 'what' columns is populated for each row

"Or, maybe I'd make whatType, whatString, whatUri, and whatBlob fields in a single (sparse) table."

Aristotle Pagaltzis described a normalised approach and its potential runtime inefficiency:

"The clean, minimally redundant approach would be to use four tables, of which one is the 'event', table which holds only when/where pairs and a primary key, and of which the other three are 'what' tables whose the primary keys are simultaenously foreign keys to the event table. This way each datum can be stored in a properly typed column, without storing boatloads of NULLs as you’d have to if you did this with a single table having one column per type of value....Unfortunately, this is stupidly costly to query – you need three left joins in every single statement.Worse, you need the primary key from the event table before you can update any of the what tables, so you have to chatter back and forth with the database instead of dumping bulk statements on it."

Adam Vandenberg asked:

"Are you going to query against "what", or just process them when they come up in a query?"

So it seems to be the case that instead of thinking about a generified RDBMS setup here for disjoint types, we need to think about what needs to be done with the data. Ok, so of the 3 possible types (text, URIs, XML) two of them are candidates for querying against interactively:

  • The 'well-known' XML format is a candidate to be queried on as it has has a standard header set; that would be much more useful to capture as a table than as a blob. That way we can ask question like: "show me all the events where the XML header whose foo element (now a column) is 'bar'".
  • The URI is a candidate to be queried on since it is a name of some class of events: "show me all the events with a URI of 'X' since this date".
  • The text data I wouldn't expect to query against, just render.

That would tend to lead to them being kept in their own tables. In terms of volumes, I'd imagine we'd be seeing around 25,000 of these events each week, where about 80% are 'well-known' XML and for the sake of argument let's say we could flush the database annually leaving a running total of about 1,300,000 records.

Incidentally, Jimmy Cerra mentioned that in RDF:

"I'd use a NODE type and have the object be either a URIRef or a typed literal (definitely not rdf:type though)."

and if any of the RDF community are reading, you can multiple the above figures by 10 - each one of these events results in approximately 10 RDF statements. I gather that 6-10M triples is the state of the art in RDF storage but here we'd be talking about 13M statements, which I think would argue for partitioning the data in separate graphs.


August 11, 2005 03:42 PM

Comments

Jimmy Cerra
(August 11, 2005 06:01 PM #)

I am not a database expert. In fact I am just a lowly engineering-physics undergraduate, so hopefully nobody in the RDF community blames me! Everything I know came about from "chapter this/tutorial that" on the web for solving specific needs I have. Perhaps I should study it more... Does anyone know of any good books so I can be more aware of the issues and theory behind RDBS?

I am a little lost why an RDF-based solution is probably a lot more expensive. I'm probably being dense and need everything spelled out to me. If you could explain I'd appreciate it.

To me, RDF is a way of thinking. No RDF-specific backend is needed (although helpful sometimes). So keeping things in tables... a row is a group of statements about a single blank node, a colume header is a predicate, and the value of the colume is the object. If the value is a key, then it could be the database's version of a URIRef or functional property. You can then just use SQL to query the RDF data. So I brought RDF up to compare with how other ways of thinking about the problem.

Here's an idea. Probably a stupid idea though. ;-) How are the events being fired? Could you specialize the events into XMLEvents, URIEvents, or GenericEvents? Or... What about storing just "what" and "how" (where null=default format=well-known xml: the 80% common format) fields or columes? I am not a database expert, so why are these bad ideas?

Bill de hOra
(August 11, 2005 06:59 PM #)

Jimmy,

"Does anyone know of any good books so I can be more aware of the issues and theory behind RDBS?"

Chris Date's new O'Reilly book would be a good start. he's a hardcore relational man who thinks above the RDBMS tools as deployed. After that you need to pick an RDBMS - I like O'Reilly's oracle and mysql books.

"I am a little lost why an RDF-based solution is probably a lot more expensive."

Well supposing you had a URI event, it would look a bit like this in RDF pseudo-statements:

evtURI, type, "urn:foo:evt"
evtURI, what, "urn:foo:something"
evtURI, when, "2005-08-11 00:00:00Z"
evtURI, where, "urn:foo:there"

in an RDBMS that can be represented as one row at the expense of being domain and type specific - you will have a table called events that constrains the form of an event:

| name | what | when | where |

Ditto for an object oriented approach; you'll have one object.

Most persistent RDF backends use an RDBMS, so the above one row in your domain specific table becomes 4 rows because the columns are generic SPO types, not domain specific (event) types:

| Subject | Property | Object |

Modelling purity aside, that has a definite space and time impact. While it might not be a make or break difference, if your data storage results in a order of magnitude increase in volume, it's important to be aware and measure whether this impacts scalability (in this case we go from 1M items to 10M). This gets worse when you use blank nodes since you'll end up with duplicate statement data (a lot of RDF in the wild uses bnodes).

" How are the events being fired? "

When something happens the event is created as RDF. Shipping the stuff around is not the imagined bottleneck however - it's persisting and querying the event data.

"Could you specialize the events into XMLEvents, URIEvents, or GenericEvents? "

Not really - that would be like classifying bank transactions based on whether they involved RoundDollarNumbers, DollarCentNumbers, or DollarCentFractions. Since these event are coming from arbitrary computer systems and application layers constraining the event sources to preppare data according to type information like that would be brittle (imo). A large point of an EDA (event driven architecture) is to delay specialisation and have things listen or watch for interesting patterns in the data irrespective of their details. Thta's exactly why RDF is so attractive for this use case - it enables processing uniformity over arbitrary data without having to do much transformation or data cleansing.

The key thing is not so much RDF v Realtional, it's that the disjointness of values forces you into having an argument with existing toolsets. I think this will become more and more common as we send more and more semi-structured data about and expect systems to synthesize and make sense of it.

I am actually starting to compare RDF stores with relational ones to see if the difference is significant.

Rick Thomas
(August 12, 2005 05:06 PM #)

Why not detect the type when the event is made and route the events to three different tables - eventText, eventURI, eventXML? Then do separate queries and merge the results. Seems like the queries are going to differ by data type anyway.

Rick Thomas
(August 12, 2005 05:15 PM #)

Oops. I suggested the same thing as Jimmy. But I don't understand why that is brittle. If the events are uncorrelated there's no loss of structure or new implied structure with separate tables. If the events are correlated then maybe you need to generalize to capture higher-level events upstream.

Mark Cidade
(August 13, 2005 05:16 AM #)

I would use a total of 5 tables for this:

  • Event (EventID [PK], WhatID [FK to What], Where, When)
  • What (WhatID [PK], WhatType)
  • Text (WhatID [PK, FK to What], TextData)
  • XML (WhatID [PK, FK to What], XmlData)
  • URI (WhatID [PK, FK to What], URI)

For querying, the What.WhatType column would be used to tell which child table to get the data from (and what type of data it is). Also, instead of making the child tables' WhatID PKs, I'd probably use an InternalID column but still make WhatID a foreign key

Gary Bery
(August 13, 2005 06:19 AM #)

Use a single table since that fulfills most of your query requirements. Use 5 columns:

eventID | what | when | where | whatType |

where "what" is a blob. [You may not need the eventID field - that isn't clear from your description.] Here "whatType" is the type (URI, string, or XML format) of the "what" blob.

Write user-defined functions in your relational DBMS to access "what"'s internal data internal. Pseudo-code for a function to get a field value in an XML format blob might be:

function getXML(what, XMLfieldname)
  XMLGET(what,XMLfieldname,fieldvalue);
  RETURN fieldvalue;
EXIT

which then could be used in SQL queries:

SELECT eventID, what, whatType
FROM  myTable 
WHERE (whatType                = "XML"  ) AND
      (getXML(what,"lastname") = "SMITH");
So you're using the newer capabilities of SQL to inspect the internals of objects (blobs here).

Post a comment

(you may use HTML tags for style)




Remember Me?

Trackback Pings

TrackBack URL for this entry:
http://www.dehora.net/mt/mt-tb.cgi/1630