Saturday, August 11, 2012

"What is a Data Model" ...

... triggered by the following question, to which I could no longer resist the itch to write down my own take :

< quote >
I've read a blog post about what really is a data model, as used in the term "relational data model" (RM). It made the following points:

1. The implementation of a data model is a programming language.

2. The RM is not necessary. It is not necessary for developing software solutions, maintaining large shared databases, or any other purpose in the world of software development. Any software solutions that can be developed while employing the RM could be written without it, using other data models.


The first conclusion came from the analysis of Chris Date's definition of a data model:
A data model is an abstract, self-contained, logical definition of the objects, operators, and so forth, that together constitute the abstract machine with which users interact. The objects allow us to model the structure of data. The operators allow us to model its behavior. --C. J. Date, AN INTRODUCTION TO DATABASE SYSTEMS, Addison Wesley, 8th ed., 2003, p 15-16)
It concluded from this that the implementation of a data model is a programming language, whether a general purpose programming language or not.

I'm not sure if data languages (e.g., SQL and Tutorial D) qualify as programming languages. But maybe in a broader sense, we can say that data languages are also programming languages in the sense that we use them  to "program" (i.e., declare and manipulate) our data. So if only the relational data model had been implemented correctly, then the industry would have produced better data languages (i.e., the D languages). Am I right?

I don't know how #2 derives from #1, and the consequences of not using RM were not stated.
< /quote>



Some comments may be helpful.  Let's start with the beginning, and that's the given formal definition :

A data model is an abstract, self-contained, logical definition of the objects, operators, and so forth, that together constitute the abstract machine with which users interact. The objects allow us to model the structure of data. The operators allow us to model its behavior.

This is best illustrated by explaining how the relational model of data is indeed a "data model" (model of data) in this sense.  For example, how is the relational model of data "an abstract, self-contained, logical definition of 'objects and operators' " ?  Well, the essence of the relational model of data is summarized in that famous sentence known as the "Information Principle", at current times usually referred to as the more elaborate, but perhaps more accurate, "Principle of Uniformity of Representation" : "All information in the system shall be represented as values of attributes in tuples in relations.".  What this means is that "The relational model of data shall have relations as its sole building bricks.", or iow, the 'objects', in the case of the relational model, are relations.

Note carefully that the term "relations" is to be understood in its mathematical sense, i.e. a subset of a Cartesian product of domains.  (Also note that mathematics typically deals with binary relations specifically (i.e. subsets of a Cartesian product of exactly two domains), but this is not a necessary restriction and in the case of information/data management even a quite obstructive one.).  Observe that the term "self-contained" must be taken with a bit of a stretch here, since obviously the definition of what the relational model is, relies on concepts borrowed from mathematics.

It was not extremely accurate where I said "shall have relations as its sole building bricks".  In Codd's first papers, he referred to this thing he called "time-varying relations".  He recognized that information in a database varies over time (not a very world-shocking of course).  With the hindsight we may now call ours, it is easy to observe that there is an analogy with what the world of programming calls variables.  For some reason, that observation was only made a bit belatedly.  To cut a long story short : the 'objects' that the relational model is built on are variables that contain relations as their current values.  And just as in programming, where a variable that is declared to contain integers as its current value, is often called an "integer variable", those database variables of the relational model are commonly labeled "relation variables", "relvars" for short.

That covers the 'objects' part.  What about the 'operators' part ?  Well, there the relational model has the following to say :  The operators that the user shall use to manipulate his relational database, are variable assignment for updating, and the operators of the relational algebra for querying.

Wait a second.  I have no other means to update my relational database than through assignment to a variable ?  Surely, that is not true ...  Surely when I do an INSERT, it's really the INSERT that is carried out, and not some kind of 'assignment' of a brand new and completely enumerated relation ???  Well, at the implementation level, surely no DBMS programmer in his right mind is ever going to deal with INSERTs by first erasing a million rows (should be using the correct term "tuples" here though ...), to then re-add them along with the newly inserted row/tuple ...  But at the model level, putting it like that is indeed an accurate way of stating what goes on with database updating.  What does a computer do when it has to carry out the command 'i := i + 1' (i being an integer variable) ?  The machine takes the current value of i into a hardware register, executes an ADD or INCR machine instruction, and stores the new value in the memory location where i resides.  In relationland, (and always speaking at the model level) INSERT <tuples> INTO <relvar> is exactly the same thing as the assignment <relvar> := <relvar> UNION <tuples>.  Take the current value of the variable (which in this case is a relation), compute the relational UNION (analogous to the integer addition) of that with the given <tuples> (the 'increment' value in the integer counterpart), and arrange for that new value to be stored in the place where <relvar> resides.  The fact that the strategies for achieving this result (i.e. the algorithms, the _implementation_) looks totally different than in the integer counterpart, does not matter.  That's the implementation level, not the model level.

As for the querying part, the relational model thus says that the "querying" ("information retrieval") portion of the "manupulation" aspect, shall be done using the operators of the relational algebra (I presume the reader is familiar enough with this, at least as far as the subject of "What is a Data Model" is concerned.  So no further comments, although some people who know me would expect me to have a lot more to say on the subject of RA, and they are right :-) ).

All of these taken together (variables that hold relations as their current values, an assignment operator to update the current value of those variables, and an algebra to do computations on the relations held in those variables) are sufficient to provide the structure and the manipulation aspect that indeed "constitutes an abstract machine with which the user interacts".  Note that it is of course still an "abstract" machine, in that without a concrete implementation, there is of course nothing 'real' or 'material' for any user to interact with ...  (SQL could have been/become such an implementation, but alas it turned out more to be anything but a faithful implementation of this abstract "relational" machine ...)

So far for how the relational model of data is indeed a "model of data" in the given sense.  Let's contrast this with a few others.  The Entity-Relationship So Called Model, for example (ER for short).  What does ER has to offer in the area of structure ?  The answer to that is pretty obvious.  Entities and Relationships.  What does ER have to offer in the area of manipulation ?  Pretty obvious too : nothing at all.  That is the principal reason why ER does not deserve to be called a Data Model, according to the definition given.  That is the principal reason why Date says that ER is just a bunch of rectangles interconnected by lines with fluffy stuff on their end points.  The same is also true of most if not all of those other graphical modeling languages (Object-Role, Dimensional, ...)  Their "value" (and just so it be clear : I'm not one of those who claim that that "value" is equal to zero) is in how well they manage to illustrate structure, but they have nothing to offer in the area of manipulation.

A little side-note is warranted here on what would happen if "models" such as ER did indeed try to add something to address the manipulation feature of a "true" data model.  Well, presumably for an "ER database", it would be equally true that the information it holds will vary over time, and that there is therefore an inevitable need for some kind of variables.  And since there are two distinct "kinds" of structure (entities and relationships), it is not unrealistic to assume that an ER model for manipulation would have to provide variables of type "Entity" as well as variables of type "Relationship".  Updating the database would require both an assignment operation for Entity variables as an assignment operation for Relationship variables.  Information retrieval would, probably inevitably, require operators [making up an algebra] for computations on Entities, as well as operators [making up an algebra] for computations on Relationships.  Implementers would have double trouble.  This is one of the prime reasons why the relational model was such a stroke of genius : all the power that the user needs, comes from a single type of structure (the relation), and a single set of operators (the relational algebra).  Any model that wants to do better, starts with a light years handicap if it wants to have more distinct types of structure than the relational model has.  Implementers have more work to do, more opportunity to make mistakes, testers have more work to do, users have more to learn, longer learning curve, the list doesn't end.  It is quite unlikely that any benefit (if one exists at all) that could derive from departing from Relationland's Information Principle, could ever be big enough to offset these inevitable downsides.

That brings us to that other question "are other data models possible".  An alert reader will already have made up from the foregoing that the answer is indeed "yes".  For example, it is quite possible for there to exist a "Hierarchical model of data", or a "Network Model of data".  What would it take for such Data Models to be legitimally called a Data Model, according to the definition given ?  Always the same answer : they'd have to formally define constructs that say what kinds of structure the database user will be dealing with, and they'd have to formally define constructs that say what kinds of manipulation the database user will have to his avail for working with [values of] those structures.

For the remainder of the discussion, I'll not make any distinction between "hierarchical" and "network" data models, observing that hierarchical database technology was based on [mathematical] trees, that trees are a special class of mathematical graphs in general, and that "network" database technology was (perhaps a bit loosely) based on [mathematical] graph theory in general.  In what little still follows here, I'll call such a model the "Graphical" Model of Data.

Well, what does a graph consist of, structurally ?  It consists of Nodes and Edges.  Only if a graph happens to be, or is known to be, "connected", do we not need to spell out the Nodes explicitly, because we can infer the set of existing Nodes by inspecting all the Edges.  But that's a special case, and of course a Data Model that is supposed to have general applicability, must cover the general case.

And there you have it.  Like any possibly conceivable "Entity Relationship Model of Data", any kind of Graphical Model of Data is inevitably bound to offer two distinct kinds of structure : Nodes and Edges.  The very same disadvantage of double (implementer) trouble and double (user learning) trouble that applies to any "Entity Relationship Model of Data", should one ever come into existence, necessarily also applies to any "Graphical Model of Data", should one ever come into existence.

In fact I'm reminded of the kind of database programming I did day in, day out with the IDMS system, yesterday when I was young (and I'm being "slightly" dishonest calling that 'yesterday' - feels more like hundred years or so :-) ).  It offered indeed two distinct kinds of structure : the "Record" (node) and the "Set" (edge).  Adding a Record required using the STORE keyword.  Adding an edge required using the CONNECT keyword.  Or if the CONNECT was issued automagically, then it required controlling the CURRENT of Record.  Deleting a Record was ERASE, deleting an edge (Set) was DISCONNECT.

Anyway.  Regarding that "should one ever come into existence".  There is one angle, looking from which it might be claimed that the "Graphical" Model of Data exists already.  It might be claimed to exist in the heads of the authors of (various ?) graph-based data management systems that seem to be gaining some popularity these days.  But that's not a match to the vast building of academic study that has been carried out, and spelled out and published, on the Relational Model of Data.  And it's unlikely that any model will any time soon get to match the level of scrutinous inspection the RM has undergone, let alone match the quality and the usefulness of the RM.

And that latter bit remains true even of that crippled implementation of the RM that is known as SQL.



For completeness, I should also be giving my take here on the following statements :

I'm not sure if data languages (e.g., SQL and Tutorial D) qualify as programming languages. But maybe in a broader sense, we can say that data languages are also programming languages in the sense that we use them  to "program" (i.e., declare and manipulate) our data. So if only the relational data model had been implemented correctly, then the industry would have produced better data languages (i.e., the D languages). Am I right?

SQL was originally intended as a mere Data Manipulation Language, not a programming language.  And in fact, the very very first origins built on an assumption that there would even be no commands for database updating in the language !!!  Hence the 'Q' in the name, and also in the name of its predecessor SEQUEL, "Structured English QUEry Language".  One of the very first original ideas around the relational model was that the relational approach was [useful] to create a "relational view" of existing databases (which could perfectly be hierarchical or otherwise in nature), and the updating should be left to the existing technologies as they were ...

Anyway, in later decades, and in particular with the advent of SQL/PSM in the standard, SQL evolved more towards being a full-fledged programming language.  Not that it has been very successful in that respect in practice, but that doesn't change the fact that SQL has indeed actually become much more of a "complete" programming language than it was originally.



I don't know how #2 derives from #1, and the consequences of not using RM were not stated.

#2 (other data models are possible) does in fact not derive in any way from #1 (implementing a Data Model implies creating a programming language), imo.  The two are just two totally independent truths.

No comments:

Post a Comment