Friday, August 24, 2012

"A difference that doesn't matter"

This question was touched on lightly in an aside of a previous installment, leaving the question mostly unanswered.  The previous post mentioned this to illustrate the concept of a relation as a set :

 +------------------+-------------+--------------
I:INT X C:CHAR = {(I:1,C:'A') , (I:1,C:'B') , ... }
         +--------------+-------------+----------

And then it went on to conclude that for any of those individual tuples, e.g.  (I:1,C:'A')  ,

"We can also think of a (single) tuple as itself being a set, namely a set of ordered pairs in which the first thing is an attribute name and the second thing is "some value drawn from some domain".

And then a sidenote was made observing that other sources ("Introduction to Database Systems") "say that a tuple is a set of ordered triplets", and it was added that "The difference between the two approaches is not [so important]".

A little post to expand on that "difference" and to justify the claim that indeed this difference "doesn't matter".

Indeed, "Introduction to Database Systems" speaks of "the triple <Ai,Ti,vi> being a component of [some tuple]".  We can recognize the Ai and vi parts, that's the attribute name part and the value part, respectively, and the "additional" Ti part is the name of the corresponding domain for Ai.

Thus, the "difference" is essentially one between

 +------------------+----------------------+--------------
I:INT X C:CHAR = {(I:INT:1,C:CHAR:'A') , (I:INT:1,C:CHAR:'B') , }
         +------------------+----------------------+------

and

 +------------------+-------------+--------------
I:INT X C:CHAR = {(I:1,C:'A') , (I:1,C:'B') , ... }
         +--------------+-------------+----------

Is that "difference" of great significance ?  Not really.  In both "alternatives", the LHS of the equation is a statement of the heading that the resulting tuples conform to.  In both alternatives, there is a "mechanical" way to compute the RHS, given the LHS (and given the precise definition of the domains that are used in that LHS, of course).  That "mechanical" way consists of :
  • process all members of the "first" domain
  • for each member of the domain, process all members of the "second" domain
  • for each member of the second domain, process all members of the "third" domain
  • ...
  • for each member of the "next-to-last" domain, process all members of the "last" domain
  • for each member of the last domain, spell out the tuple consisting of those "current" domain members.
It should be obvious that the only difference between the two is in the details of the "spell out" part.  The first approach includes "spell out the domain name", the second example doesn't.

The "mechanical way" just described is, in a sense, the way to go from the LHS of the equations to the RHS.  One could also, somewhat philosophically, contemplate things that could happen if one ever wanted to "go from the RHS of those equations to the LHS", so to speak.  "Derive the heading from an enumerated cartesian product", so to speak.  In that case, there seems to be a more important difference.  The second example simply doesn't allow us to do that ... at first sight.  Indeed, no domain names are mentioned in the RHS of the second example.  So how do we obtain them ?  Rather simply : generate them.  Name them "1", "2", "3", ...  Those names don't have to be meaningful when dealing with questions that were only philosophical to begin with !

Anyway, it should be clear that both approaches are able to "cover the same ground", and that therefore whatever difference might be perceived between them is indeed only perception, superficial, and is indeed a difference that doesn't matter.

Friday, August 17, 2012

"Relations of nothing at all"

In a previous installment, it was explained how relations, in the sense of the Relational Model of Data, are a generalization of the binary relations that are typically studied in mathematics.  Obviously, that "generalization" implies that such relations consist of "pairs" that have more than two values (drawn from the domains that were used to form the cartesian product), but the possibility was also alluded at to have "pairs" with only one value for its members, and even the possibility of such "pairs" with zero values for its members.

How does that make sense ?

First, let's revisit the "trick" that the Relational Model applied in "replacing" the concept of "ordinal" pairs with that of "unordered" pairs, "decomposing" that trick and its consequences step-by-step.

Start with the cartesian product of domains INT and CHAR :

 +-------------+---------+---------------+--------
INT X CHAR = {(1,'A') , (1,'B') , ... , (2,'A') , ...}
        +---------+---------+---------------+-----

What those lines seek to express is that we "know" that the numbers in those pairs derive from domain INT, precisely because they are the first ones in the pairs, and likewise for the quoted characters.

(Aside : computer languages that support 'tuples' natively typically use this 'ordered' version.  Haskell is a case in point.  There is a "first" operator to "retrieve" the "first" value from a tuple.  I digress.)

But for reasons of practicality, we do not want "addressability" (of the values within the tuples) by ordinal position (well at least not in settings where we are dealing with "large shared databases"), we want addressibility by attribute name.  The way to do this is by prepending attribute names to each and every thingy in that equation :

 +------------------+-------------+--------------
I:INT X C:CHAR = {(I:1,C:'A') , (I:1,C:'B') , ... }
         +--------------+-------------+----------

Aside : The equation as given states the equality between (some kind of) "predicative" way of defining a set (left side of the equation) and the "enumerative" way of defining a set (right side of the equation).  An "intensional" versus an "extensional" way of defining a type if you will.  But pls don't proliferate that latter terminology as it may be grotesquely off-base with more common usage of those terms.  Anyways.  Going in detail on that would be digressing from my main point here.  Maybe something for another installment.  End-of-aside.

Do we now still need those "pairs" (n-tuples) to be ordered ?  Well obviously, no we don't.  As long as we "know of the association" of I to domain INT and of C to domain CHAR, there is no longer any meaningful difference between (writing down that first tuple as)   (I:1,C:'A')   and (writing down that first tuple as)   (C:'A',I:1).


As a consequence, under this notation, we can also think of a (single) tuple as itself being a set, namely a set of these things that are separated by the comma's in each tuple of the enumeration.  And what exactly are those "things that are separated by comma's" ?  Eurhm, well, they are ordered pairs in which the first thing is an attribute name and the second thing is "some value drawn from some domain".

Another aside : "Introduction to Database Systems" says that a tuple is a set of ordered triplets.  The difference between the two approaches is not germane to the main point being made in this post (which is tuples and relations of nothing at all).  Suffice it to say here that the difference plays on ( / is a consequence of ) that presumption that was spelled out "As long as we "know of the association" of I to domain INT and of C to domain CHAR".  Maybe something for another installment.  End-of-another-aside.

Okay let's see what happens if we explicitly re-cast that tuple as a set of ordered pairs.
That gives us   { (I,1) , (C,'A') }.   Braces in bold for emphasis.
It should be clear that if we take a cartesian product of n domains, we get tuples that are themselves a set of cardinality n.  And since set theory is not limited to cardinalities >=2, this implies that, at least theoretically, we could also consider tuples of cardinality one and tuples of cardinality zero.

(Side remark on terminology : the number of attributes in a tuple is more commonly referred to as its degree.  I use the term "cardinality" here because that term is more closely related to set theory, and I am stressing the concept of tuples-as-sets here.  It should be clear that my "cardinality" of a tuple and the literature's more common "degree" of a tuple are the very same thing.)

We'll look briefly at tuples of cardinality one, because there is little unintuitive about them, and they will show us how tuples (and relations) of nothing at all, indeed make sense, and how they even are useful.  A tuple of cardinality one could be something like   TUPLE{ (I,1) }.   They are useful whenever the information that we want to store or retrieve, is the logical counterpart of a unary predicate, i.e. a predicate in which there is exactly one "fill-in-the-gaps" place.  For example, "The average temperature tomorrow will be §I§ degrees centigrade.".  Or "the highest customer number assigned is currently §I§".  Or "A §make of car§ has passed through our street today.".  (This latter one merely to illustrate that one must beware of confusing unary predicates with "predicates of which only one single instantiation can be true.  The talk is of unary predicates.)

Okay.  One step further down the line of decreasing tuple cardinalities (and we're finally where the title suggested we'd end up).  A tuple of cardinality zero, is, then, "a set of attribute values that has zero members".  Iow, it is the empty set of attribute values.  Iow, it is the empty set.   TUPLE{ }, in our usual notation.

How many such tuples are there ?  That's pretty obvious : exactly as many as there are "empty sets" in mathematics ... exactly one, thus.  Because the zero-tuple is unique, we can give it a name.  TUPLE_DEE in what follows.  Now where can TUPLE_DEE be useful in data management ?  That should by now be obvious too : it will be useful in any place where the information we want to store or retrieve, is logically defined by a nilary predicate.  I.e. whenever we deal with information, the defining predicate of which has no fill-in-the-gaps spots at all.  These predicates are sometimes labeled "degenerate predicates", and actually they correspond to propositions.  The following examples are drawn from "Databases, Types and the Relational Model", appendix E : "The door is open", or "The alarm is set".  No §§ marks, no fill-in-the-gaps.

Suppose you'd have to design a database to record the current state of affairs for those two predicates.  In SQL.  Would you create a "type" that can represent open/closed ?  One that can represent set/not set ?  What would you be doing to prevent your tables from containing two distinct rows, one saying that the door is "open" and the other row saying that the door is "closed" ?  Or would you design a boolean attribute "true" for open and "false" for closed ?  Yet again, how would you interpret such a table if it is empty ?  How would you enforce the rule that there must always be exactly one row in your table ?  All of that involves much more complication than is needed or warranted.

All you really need for such a database, is a relvar (/table) with no attributes (columns) at all.  If I want to record the information that "the door is closed", then I just ensure the table is empty.  How does this work ?  Well, any tuple in a relvar represents a true instantiation of the relvar's predicate.  If that predicate is a niladic, "degenerate" one, then the only instantiation possible is, the predicate itself (which already is a proposition by and of itself).  If there are no tuples at all in such a relvar, then this means that there are no true instantiations (/no true instances) of the relvar's proposition, meaning the relvar's proposition is FALSE.  If there is indeed a tuple in such a relvar, then that means that there does exist a true instantiation of the relvar's predicate, and since the only possible instantiation of that predicate is the predicate (proposition) itself, it means that the relvar's proposition is true.

Observe that there are thus two distinct possible values that such a "niladic" relvar can take on : one possible value in which no tuple is present at all (that's just the empty relation), and a value in which the empty tuple is present.  The difference between these two is a bit akin to the difference (in mathematics) between   {} and {{}}.  These values (relation values) are often assigned a name too, most often "TABLE_DUM" and "TABLE_DEE", other times more briefly "DUM" and "DEE".

I coloured the "outer" braces red to illustrate that these are the braces "associated with" a relvar and its corresponding predicate (proposition in our niladic case) and the "inner" ones green to illustrate that these correspond to a tuple that makes an instantiation of a predicate (proposition in our niladic case) true.

When querying a database, it is often claimed that DUM and DEE represent "false and true, respectively", or "no and yes, respectively".  That is a little bit sloppy.  Here's why :

If a database relvar's predicate is "The door is open", and we query that relvar, then we get back either DUM or DEE.  If it is DEE, then TUPLE_DEE appears in it, but the 'meaning' that TUPLE_DEE carries in this case is "The door is open".  It is only because of human interpretation that we "know" that this means a "yes" answer to the question "is the door open".  If we ask a question "is at least one student enrolled on any course", then we could inquire an ENROLMENT relvar and project away all of its attributes.  Once again, the result we get is either DUM or DEE.  If it is DEE, then TUPLE_DEE within it carries the meaning of the projection, which is "there exists some student and there exists some course such that that student is enrolled on that course".  It is only because of human interpretation that we know we can answer the question asked with a firm "yes" ...

Tuesday, August 14, 2012

"What relations 'mean' " ...

In a previous post, it was established what relations really are (mere mathematical things, i.e. objects that play a role in some mathematical theory).  And the relational model of data uses exactly those things as a building brick for our databases.  And those databases convey meaningful information for their users.  Databases can contain information such as "An order for five screwdrivers was placed on 2012-08-01 13:45:56 by a person named John Doe for an order price of 7 local currency units, and this order was delivered on 2012-08-14.'.  That's certainly a very "meaningful" statement.  It leads many people to believe that relations must have some kind of meaning.  Alas, they are mistaken ...


Lets consider a relation that consists of the 2-tuple (DAY1:Saturday, DAY2:Sunday).   In tabular form, it could be depicted as
DAY1 DAY2
Saturday Sunday
and that's about everything that "characterizes" this relation.

But by and of itself, does this "mean" anything ?  Well no, it doesn't.  Claiming the opposite would be similar to claiming that "the number one 'means something', by and of itself".  No it doesn't.

Relations (and the tuples they consist of) do not have any meaning of themselves.  If the opposite were true, then any reader would be in the possibility of concluding from the foregoing table that some firm statement is true, say, for example, "Saturday is the week-end day that comes before Sunday, which is also considered a week-end day.".  It would also be true that all readers would come to the very same conclusion.  And there would not be any reader who would come to the conclusion that this relations means "On Saturday, He created man, and on Sunday, He took a break to admire His own works thus far." ...

None of that is the case, of course.  If relations carry any meaning, then it is because they appear in a certain given context.  Note the choice of the word 'carry', as distinct from 'have'.  It's an important distinction.  Relations and the tuples they contain are merely carriers of "meaning", and the same relation and/or tuple might carry very different meanings in different contexts.

(For more stuff on why it is a good thing that "relations don't mean a thing", see, e.g., here.)

When it comes to context, two distinct cases apply which correspond to whether a user is updating the database or whether he is querying (inquiring) it.

In the case of updating, "what a relation means" (and/or "what a tuple means") is determined by what database relvar the update is targeting.  And in particular by the external predicate that has been associated with that relvar.

If that predicate happens to be (*) "§DAY1§ is a week-end day that comes immediately before §DAY2§, which is itself also a week-end day.", then a user inserting the 2-tuple (DAY1:Saturday, DAY2:Sunday) in that relvar is in effect making the assertion that "Saturday is a week-end day that comes immediately before Sunday, which is also a week-end day.".

(*) The things enclosed in § marks are obviously intended as placeholders for some value-to-be-filled-in.  Which value that is, is, obviously, an attribute from a tuple that [in this case of updating a relvar] is being inserted into the relvar.  This process of filling in "concrete" attribute values in the places where such placeholders appear in a predicate, is commonly called " instantiating the predicate".  Instantiating a predicate (such that all placeholders are effectively replaced) yields a proposition, and inserting a tuple means asserting that that proposition is a true one.

If that predicate happens to be "On §DAY1§, He created man, and on §DAY2§, He took a break to admire His own works thus far.", then a user inserting the 2-tuple (DAY1:Saturday, DAY2:Sunday) in that relvar is in effect making the assertion that "On Saturday, He created man, and on Sunday, He took a break to admire His own works thus far. ".

Observe how the inserted tuple itself is the very same in the two cases, but how the "meaning" that the tuple carries in either case is entirely different.

Back to (a relvar with) the predicate "§DAY1§ is a week-end day that comes immediately before §DAY2§, which is itself also a week-end day.".  Now suppose some user wants to record the assertion that "Sunday is a week-end day that comes immediately before Monday.".  Would it be correct of this user to insert a tuple (DAY1:Sunday, DAY2:Monday) in that relvar ?  Obviously no, it wouldn't, as that would mean that this user would also be asserting the part that says that "Monday is itself also a week-end day" !!!

Documenting the external predicate of each relvar in a database is about the most important aspect there is to database design.  And of course as always, it is important to be precise !  Even if differences or variations seem extremely subtle, they can ultimately still give rise to major misunderstandings and mistakes !

On to the meaning of relations/tuples when inquiring a database.  When that tuple (DAY1:Saturday, DAY2:Sunday) was inserted into the relvar, that was taken to be an assertion to the effect that the corresponding proposition was a true one.  That 'meaning' is of course never altered merely by the tuple "staying where it is (in the relvar)".  Thus, if we inquire this relvar and get back that same tuple in the result, that still tells the inquiring user that "Saturday is a week-end day that comes immediately before Sunday, which is also a week-end day.".

Note once again that if we got back the very same tuple from inquiring a totally different relvar (one with a totally different external predicate), then this very same tuple would mean something entirely different.  Nothing changes to the fact that relations and the tuples within them are mere carriers of meaning.

But when we inquire a database, we typically use constructs that are much more complex than just "naming a relvar" (and getting back their full contents).  Those constructs are the possible expressions of the relational algebra.  What matters about that in the context of 'meaning', is that each possible expression of the relational algebra, also has its very own "external predicate" that "defines the meaning" carried by the tuples appearing in the result of the query.

For example, if we apply a RESTRICTion to some relvar, then that expression has for its external predicate the external predicate of that relvar, ANDed together (logically conjuncted) with a rule that expresses/defines the restriction condition applied.  For example, applying to our week-end days relvar a restriction condition 'DAY1 = Sunday', gives us the external predicate §DAY1§ is a week-end day that comes immediately before §DAY2§, which is itself also a week-end day, AND [it is the case that] §DAY1§ is Sunday.".  The empty result we'd be getting back from that query, is a manifestation of the fact that there simply are no true instantiations of this predicate.

Note that the querying too indicates why it is so extremely important to document the external predicates of our database relvars : without that, there just is no formal way for us to tell what our query results actually mean (and without such a formal way to tell that, all that's left for us to do is to make mere assumptions - but as the saying goes, assumption is the mother of all screw-ups).

Monday, August 13, 2012

"What is a relational database"

Various sources (see, for example, "An Introduction to Relational Theory") define a "database" as an
  • organized
  • machine-readable
  • collection of data.
Many alternative formulations exist with various degrees of difference in style or covered detail, but the essence is typically always the same.

From such definitions, we can derive (proceeding in random order) :
  1. A database is a collection of data.   Note in particular that a "collection of data" is nowhere near the same thing as "a software system that can be used to manage data" !!!  A mere collection of data is something notoriously passive, a collection of data does not ever actively do anything, but a software system is, contrariwise, a component that can indeed be very active.  What I'm getting at here is of course the distinction between the terms "database" and "DBMS".  A database is a collection of data, a DBMS is a software system that is used to manage that collection of data.  Not respecting that distinction can irritate some people quite heavily, so anyone reading this can now, once and for all, solemnly swear that they "shall never ever again use the word 'database' when what they actually mean is 'DBMS' ".  If the batteries of your camera have run down, then you don't say that it's the batteries of your photograph that need recharging either, do you ?
    Should you think this is unimportant or needless nitpicking, then think again.  It's important to be precise.  Why is it important to be precise ?  Because we deal with machines that do exactly as they're told.  And we programmers and software developers in general are the ones who are supposed to formulate the instructions that those machines will be following.  To be skilled at that, it is an absolute and inevitable requirement that the person doing that is, indeed, capable of being precise.  Lethally precise.  The more capable we are of achieving that "lethal degree of preciseness", the more skilled we will be at writing software that is actually correct.
  2. Back to what a database is.  It is an organized collection of data.  Meaning there must be some kind of structure to the database.  If there is no structure in a database, then what it contains is not data, but just garbage, and nothing or no one will ever be able to make any sense out of what it contains.  The term "unstructured data", "unstructured databases", "schema-less databases", "schema-less data", "schema-less data model", and what have you of that ilk, are all contradictions of terms, and nothing more than utter nonsense.  There always is some kind of structure.  It may be a very high-level kind of structure, it may be that it doesn't really express anything that comes comparatively close to the kind of facts that the end-user is actually dealing with in his world, it may rise skyhigh in its level of "abstraction", in some sense, but there always is at least some kind of structure, some kind of organization.
  3. And finally, it is machine-readable.  Some alternative formulations are about a database being an "electronic registration" of some sort, the essential point is the same.  (And comparatively unimportant, I'm tempted to add.  The sole purpose is to rule out the paper cards in a filing cabinet of elder times, as being a kind of database that should be taken into consideration when discussing database technology in computerized environments.)
That established, we can state what it means to be a relational database :
  1. First and foremost, it means of course being a database, meaning it satisfies the three criteria mentioned earlier,
  2. And two, it is a collection of data that is organized according to the relational model of data, meaning that it is an organized collection of relations.  More precisely, it is an organized collection of relation variables, variables that take on relations as values.  The "organization" takes on the form of each such relation variable (relvar for short) being assigned a unique name, owing to which it is identifiable in the database.
    Also to be noted is that each relvar in the database is accompanied by an intended interpretation.  It is precisely this "intended interpretation" that will allow us users to make any sense of what the database contains.  This intended interpretation is of no concern to the machinery (the DBMS) that manages the data, it is all the more important for us, users, who are "external" to that machinery.  That is why this intended interpretation is often called the "external predicate" associated with the relvar.
(And consequentially, at this point it can also be stated what it means to be a relational DBMS, namely in the first place to be a DBMS, i.e. a software system that can be used to manage data and/or databases, and in the second place one that has the purpose of managing a relational database in particular.)

"What is a relational table"

It's the kind of question that always tempts me to respond with brief answers like "a contradiction of terms", or "a misnomer" (and then leave it at that).

But that's not very helpful of course.  Hence a more elaborate attempt here to write down how I would personally answer the question.

First of all, the word "table" is a bit of an overloaded term in the world of relational, or pseudo-relational (SQL) data management.  In SQL, the word stands for two things, that are fundamentally very very distinct, but where SQL, mainly for historical reasons, failed to recognise that distinction.  Those two things are :

(a) an updatable object managed by the database.  (Creating such an updatable object, in its quality as 'updatable object', is done by CREATE TABLE ...)  In terms of modern relational theory, this could better be called a "TABLE VARIABLE".  The qualifier "VARIABLE" refers to the quality of being an updatable object, i.e. a thing whose state (aka current value) can change.
(b) the actual content contained in such an updatable object.  Those are the things that are usually depicted in the relational (and pseudo-relational) textbooks as cells holding a value (but sometimes not) and separated by some lines :

+---------+----------+
! NAME    !    SCORE !
+---------+----------+
! JOHN    !        1 !
+---------+----------+

In relationland however, "tables" do not really exist.  What does exist is what relational theory calls relations.  Tables only enter the picture (pun intended) when relations need to be represented on paper.  And when they do, tables are a very convenient way to depict, to render, to display, a relation on some device, e.g. paper or LCD flatscreen.  (But note that they are not the only possible way !).  That is why Fabian Pascal always says that a table is a picture of a relation.
Also in relationland, the distinction [that SQL failed to make] between 'updatable object' and 'the contents thereof', is properly made.  The 'updatable objects' are referred to as "RELATION VARIABLES" (RELVARS for short), and the 'contents thereof' are referred to as "RELATION VALUES" (RELATIONS for short).

All of this just to illustrate that the "proper" term to use for denoting that concept of "relational table", is " relation".


Now onto "What is a relation", which was presumably the intent of the question originally asked.  I have some stuff on the very same subject on my website, but the treatment there might come across as overly pedantic, and if you can't stand that, well, then it's not helpful.  So I'll be a bit more brief and a bit "looser" in what follows.  Here goes.
Take some sets of values that you deem useful and assign them a name.  For example WE_DAYS is the set {saturday, sunday} and BOOLEAN is the set {true, false}.  From mathematics, you are supposed to know that we can form a cartesian product with those sets.  For example, the cartesian product WE_DAYS X WE_DAYS gives us a set of ordered pairs , where each element of the pair is a value that appears in WE_DAYS : { (saturday,saturday),(sunday,sunday),(saturday,sunday),(sunday,saturday) } .  The cartesian product BOOLEAN X WE_DAYS gives us : { (true,saturday),(true,sunday),(false,sunday),(false,saturday) } .

Two points to note : 
  1. Mathematics typically stays within the bounds of binary relations, i.e. based on the cartesian product of exactly two domains.  For purposes of data management, this is to be generalized to cartesian product of "any arbitrary number of domains", thus giving rise not only to pairs, but also to triplets, quadruplets, quintuplets, etc.  The "poshy" name for this is n-tuples, where n stands for any number such as 2, 3, 4, 5, ... but also possibly for the number 1, and even the number zero !
  2. There is a sense of "ordering" to those n-tuples, and for purposes of data management, this too is undesirable.  We don't want to fill our programs with references to "the second attribute" or some such.  We want to be able to use attribute names instead, in those references, and we do not want to have to bother with any alleged "meaningfulness" of any kind of "ordering".  Any such "meaningfulness" only brings about additional maintenance problems (if a column is removed, all subsequent columns would be "renumbered", and all references to them must be revised), the costs of which are not offset by any material benefit.  And certain laws of capitalist economy dictate that matters be organized so as not to incur those costs :-)
Back to mathematics.  After having defined what a cartesian product is, that discipline then goes on to define what a relation is : a relation is some subset of such a cartesian product.  The two examples [of a cartesian product] previously spelled out, both give rise to sixteen possible relations :
  1. First of all, the cartesian product itself (a cartesian product is itself also a set [of ordered pairs], and any set is by definition a subset of itself).
  2. Four possible relations that have exactly three of the ordered pairs
  3. Six possible relations that have exactly two of the ordered pairs
  4. Four possible relations that have exactly one of the ordered pairs
  5. And finally, the empty set, which is by definition a subset of all sets, thus also a subset of any cartesian product, thus by defintion also always a relation.
A relation in the sense of the relational model of data, is actually the very same thing : it is a subset of a cartesian product of domains, but always with the amendment in mind that the (so-called) "ordered pairs" of such a cartesian product, are not necessarily pairs, and are certainly not "ordered" (but its constituent components are addressable by attribute name instead).
A number of consequences of this definition deserve to be spelled out.
  1. Is there any way for a cartesian product to contain a "pair" (/n-tuple) whose number of constituent components is not equal to the number of domains that the cartesian product was taken with ?  No there isn't.  As a consequence, it there any way for there being any "pair" (/n-tuple) in a relation that does not have exactly n attributes ?  Once again, no, there isn't.  As a consequence, there is no room for such a thing as SQL's NULL in a relation, as defined by the relational model.  (There would be, if a thing such as NULL were a member of one of those sets that the cartesian product was taken over, but that would mean that that kind of thing would actually be a value, meaning that at the very least, such a thing would have to be considered equal to itself.  (Contrast that with SQL.)
  2. Is there a way for a cartesian product to contain "the same pair twice", so to speak ?  No there isn't.  As a consequence, is there any way for a relation to "contain the same n-tuple twice or more, so to speak" ?  Once again, no there isn't.  If a component of a relational engine produces relations as its output, then that means that there cannot possibly be "duplicate rows".

And that really is all there is to say on "what is a relation" ...  The following posts will be related to "what do relations mean", and that very curious possibility of "a subset of a cartesian product of no domains at all" ...

Saturday, August 11, 2012

"What is a Data Model" ...

... triggered by the following question, to which I could no longer resist the itch to write down my own take :

< quote >
I've read a blog post about what really is a data model, as used in the term "relational data model" (RM). It made the following points:

1. The implementation of a data model is a programming language.

2. The RM is not necessary. It is not necessary for developing software solutions, maintaining large shared databases, or any other purpose in the world of software development. Any software solutions that can be developed while employing the RM could be written without it, using other data models.


The first conclusion came from the analysis of Chris Date's definition of a data model:
A data model is an abstract, self-contained, logical definition of the objects, operators, and so forth, that together constitute the abstract machine with which users interact. The objects allow us to model the structure of data. The operators allow us to model its behavior. --C. J. Date, AN INTRODUCTION TO DATABASE SYSTEMS, Addison Wesley, 8th ed., 2003, p 15-16)
It concluded from this that the implementation of a data model is a programming language, whether a general purpose programming language or not.

I'm not sure if data languages (e.g., SQL and Tutorial D) qualify as programming languages. But maybe in a broader sense, we can say that data languages are also programming languages in the sense that we use them  to "program" (i.e., declare and manipulate) our data. So if only the relational data model had been implemented correctly, then the industry would have produced better data languages (i.e., the D languages). Am I right?

I don't know how #2 derives from #1, and the consequences of not using RM were not stated.
< /quote>



Some comments may be helpful.  Let's start with the beginning, and that's the given formal definition :

A data model is an abstract, self-contained, logical definition of the objects, operators, and so forth, that together constitute the abstract machine with which users interact. The objects allow us to model the structure of data. The operators allow us to model its behavior.

This is best illustrated by explaining how the relational model of data is indeed a "data model" (model of data) in this sense.  For example, how is the relational model of data "an abstract, self-contained, logical definition of 'objects and operators' " ?  Well, the essence of the relational model of data is summarized in that famous sentence known as the "Information Principle", at current times usually referred to as the more elaborate, but perhaps more accurate, "Principle of Uniformity of Representation" : "All information in the system shall be represented as values of attributes in tuples in relations.".  What this means is that "The relational model of data shall have relations as its sole building bricks.", or iow, the 'objects', in the case of the relational model, are relations.

Note carefully that the term "relations" is to be understood in its mathematical sense, i.e. a subset of a Cartesian product of domains.  (Also note that mathematics typically deals with binary relations specifically (i.e. subsets of a Cartesian product of exactly two domains), but this is not a necessary restriction and in the case of information/data management even a quite obstructive one.).  Observe that the term "self-contained" must be taken with a bit of a stretch here, since obviously the definition of what the relational model is, relies on concepts borrowed from mathematics.

It was not extremely accurate where I said "shall have relations as its sole building bricks".  In Codd's first papers, he referred to this thing he called "time-varying relations".  He recognized that information in a database varies over time (not a very world-shocking of course).  With the hindsight we may now call ours, it is easy to observe that there is an analogy with what the world of programming calls variables.  For some reason, that observation was only made a bit belatedly.  To cut a long story short : the 'objects' that the relational model is built on are variables that contain relations as their current values.  And just as in programming, where a variable that is declared to contain integers as its current value, is often called an "integer variable", those database variables of the relational model are commonly labeled "relation variables", "relvars" for short.

That covers the 'objects' part.  What about the 'operators' part ?  Well, there the relational model has the following to say :  The operators that the user shall use to manipulate his relational database, are variable assignment for updating, and the operators of the relational algebra for querying.

Wait a second.  I have no other means to update my relational database than through assignment to a variable ?  Surely, that is not true ...  Surely when I do an INSERT, it's really the INSERT that is carried out, and not some kind of 'assignment' of a brand new and completely enumerated relation ???  Well, at the implementation level, surely no DBMS programmer in his right mind is ever going to deal with INSERTs by first erasing a million rows (should be using the correct term "tuples" here though ...), to then re-add them along with the newly inserted row/tuple ...  But at the model level, putting it like that is indeed an accurate way of stating what goes on with database updating.  What does a computer do when it has to carry out the command 'i := i + 1' (i being an integer variable) ?  The machine takes the current value of i into a hardware register, executes an ADD or INCR machine instruction, and stores the new value in the memory location where i resides.  In relationland, (and always speaking at the model level) INSERT <tuples> INTO <relvar> is exactly the same thing as the assignment <relvar> := <relvar> UNION <tuples>.  Take the current value of the variable (which in this case is a relation), compute the relational UNION (analogous to the integer addition) of that with the given <tuples> (the 'increment' value in the integer counterpart), and arrange for that new value to be stored in the place where <relvar> resides.  The fact that the strategies for achieving this result (i.e. the algorithms, the _implementation_) looks totally different than in the integer counterpart, does not matter.  That's the implementation level, not the model level.

As for the querying part, the relational model thus says that the "querying" ("information retrieval") portion of the "manupulation" aspect, shall be done using the operators of the relational algebra (I presume the reader is familiar enough with this, at least as far as the subject of "What is a Data Model" is concerned.  So no further comments, although some people who know me would expect me to have a lot more to say on the subject of RA, and they are right :-) ).

All of these taken together (variables that hold relations as their current values, an assignment operator to update the current value of those variables, and an algebra to do computations on the relations held in those variables) are sufficient to provide the structure and the manipulation aspect that indeed "constitutes an abstract machine with which the user interacts".  Note that it is of course still an "abstract" machine, in that without a concrete implementation, there is of course nothing 'real' or 'material' for any user to interact with ...  (SQL could have been/become such an implementation, but alas it turned out more to be anything but a faithful implementation of this abstract "relational" machine ...)

So far for how the relational model of data is indeed a "model of data" in the given sense.  Let's contrast this with a few others.  The Entity-Relationship So Called Model, for example (ER for short).  What does ER has to offer in the area of structure ?  The answer to that is pretty obvious.  Entities and Relationships.  What does ER have to offer in the area of manipulation ?  Pretty obvious too : nothing at all.  That is the principal reason why ER does not deserve to be called a Data Model, according to the definition given.  That is the principal reason why Date says that ER is just a bunch of rectangles interconnected by lines with fluffy stuff on their end points.  The same is also true of most if not all of those other graphical modeling languages (Object-Role, Dimensional, ...)  Their "value" (and just so it be clear : I'm not one of those who claim that that "value" is equal to zero) is in how well they manage to illustrate structure, but they have nothing to offer in the area of manipulation.

A little side-note is warranted here on what would happen if "models" such as ER did indeed try to add something to address the manipulation feature of a "true" data model.  Well, presumably for an "ER database", it would be equally true that the information it holds will vary over time, and that there is therefore an inevitable need for some kind of variables.  And since there are two distinct "kinds" of structure (entities and relationships), it is not unrealistic to assume that an ER model for manipulation would have to provide variables of type "Entity" as well as variables of type "Relationship".  Updating the database would require both an assignment operation for Entity variables as an assignment operation for Relationship variables.  Information retrieval would, probably inevitably, require operators [making up an algebra] for computations on Entities, as well as operators [making up an algebra] for computations on Relationships.  Implementers would have double trouble.  This is one of the prime reasons why the relational model was such a stroke of genius : all the power that the user needs, comes from a single type of structure (the relation), and a single set of operators (the relational algebra).  Any model that wants to do better, starts with a light years handicap if it wants to have more distinct types of structure than the relational model has.  Implementers have more work to do, more opportunity to make mistakes, testers have more work to do, users have more to learn, longer learning curve, the list doesn't end.  It is quite unlikely that any benefit (if one exists at all) that could derive from departing from Relationland's Information Principle, could ever be big enough to offset these inevitable downsides.

That brings us to that other question "are other data models possible".  An alert reader will already have made up from the foregoing that the answer is indeed "yes".  For example, it is quite possible for there to exist a "Hierarchical model of data", or a "Network Model of data".  What would it take for such Data Models to be legitimally called a Data Model, according to the definition given ?  Always the same answer : they'd have to formally define constructs that say what kinds of structure the database user will be dealing with, and they'd have to formally define constructs that say what kinds of manipulation the database user will have to his avail for working with [values of] those structures.

For the remainder of the discussion, I'll not make any distinction between "hierarchical" and "network" data models, observing that hierarchical database technology was based on [mathematical] trees, that trees are a special class of mathematical graphs in general, and that "network" database technology was (perhaps a bit loosely) based on [mathematical] graph theory in general.  In what little still follows here, I'll call such a model the "Graphical" Model of Data.

Well, what does a graph consist of, structurally ?  It consists of Nodes and Edges.  Only if a graph happens to be, or is known to be, "connected", do we not need to spell out the Nodes explicitly, because we can infer the set of existing Nodes by inspecting all the Edges.  But that's a special case, and of course a Data Model that is supposed to have general applicability, must cover the general case.

And there you have it.  Like any possibly conceivable "Entity Relationship Model of Data", any kind of Graphical Model of Data is inevitably bound to offer two distinct kinds of structure : Nodes and Edges.  The very same disadvantage of double (implementer) trouble and double (user learning) trouble that applies to any "Entity Relationship Model of Data", should one ever come into existence, necessarily also applies to any "Graphical Model of Data", should one ever come into existence.

In fact I'm reminded of the kind of database programming I did day in, day out with the IDMS system, yesterday when I was young (and I'm being "slightly" dishonest calling that 'yesterday' - feels more like hundred years or so :-) ).  It offered indeed two distinct kinds of structure : the "Record" (node) and the "Set" (edge).  Adding a Record required using the STORE keyword.  Adding an edge required using the CONNECT keyword.  Or if the CONNECT was issued automagically, then it required controlling the CURRENT of Record.  Deleting a Record was ERASE, deleting an edge (Set) was DISCONNECT.

Anyway.  Regarding that "should one ever come into existence".  There is one angle, looking from which it might be claimed that the "Graphical" Model of Data exists already.  It might be claimed to exist in the heads of the authors of (various ?) graph-based data management systems that seem to be gaining some popularity these days.  But that's not a match to the vast building of academic study that has been carried out, and spelled out and published, on the Relational Model of Data.  And it's unlikely that any model will any time soon get to match the level of scrutinous inspection the RM has undergone, let alone match the quality and the usefulness of the RM.

And that latter bit remains true even of that crippled implementation of the RM that is known as SQL.



For completeness, I should also be giving my take here on the following statements :

I'm not sure if data languages (e.g., SQL and Tutorial D) qualify as programming languages. But maybe in a broader sense, we can say that data languages are also programming languages in the sense that we use them  to "program" (i.e., declare and manipulate) our data. So if only the relational data model had been implemented correctly, then the industry would have produced better data languages (i.e., the D languages). Am I right?

SQL was originally intended as a mere Data Manipulation Language, not a programming language.  And in fact, the very very first origins built on an assumption that there would even be no commands for database updating in the language !!!  Hence the 'Q' in the name, and also in the name of its predecessor SEQUEL, "Structured English QUEry Language".  One of the very first original ideas around the relational model was that the relational approach was [useful] to create a "relational view" of existing databases (which could perfectly be hierarchical or otherwise in nature), and the updating should be left to the existing technologies as they were ...

Anyway, in later decades, and in particular with the advent of SQL/PSM in the standard, SQL evolved more towards being a full-fledged programming language.  Not that it has been very successful in that respect in practice, but that doesn't change the fact that SQL has indeed actually become much more of a "complete" programming language than it was originally.



I don't know how #2 derives from #1, and the consequences of not using RM were not stated.

#2 (other data models are possible) does in fact not derive in any way from #1 (implementing a Data Model implies creating a programming language), imo.  The two are just two totally independent truths.