Wednesday, September 19, 2012

"Converting SQL to Relational Algebra"

It's that time of the year again.  Introductory courses in relational data management (I'm a bit reluctant to call them "courses in relational theory") are starting again, all over the world, and consequently, students seeking advice and assistance from "professionals" start posing "Converting SQL to Relational Algebra" questions again on various database-related discussion fora, when they are not able to solve assignments given to them in class, or when they are uncertain their solution is correct.

I would actually like any of the teachers who give such assignments to try and explain to me what purpose they hope to be achieving by giving such assignments.  What do students learn from this ?  I recently got involved in such a "Converting SQL to Relational Algebra" question, and one remark the student made, in the discussion that ensued, confirms my feeling regarding this : they learn nothing at all, and "Converting SQL to Relational Algebra" is completely pointless, only adding to the confusion instead of resolving it.  (The remark was, literally, : "we do not either understand why we are learning this".)

Why is this so ?  Well, it's because Relational Algebra is foundational, and SQL is not.  That's because Relational Algebra is completely abstract, and SQL is not.  Not in the same sense that Relational Algebra is abstract.  SQL is abstract with respect to the procedures and algorithms that the implementing engine executes in order to carry out a given SQL command.  But that is not quite as "profound" or "foundational" a level of abstraction as the one the Relational Algebra can be said to have.

SQL is just a syntax for expressing formulae of the RA.  One possible syntax for it.  Of the plethora of distinct possible syntaxes that can be conceived to do the same.  And SQL is an awful syntax for that purpose, come to that.  That's because initially, SQL was firmly rooted in predicate calculus, not in relational algebra.  That SQL can express relational algebra, is only a "coincidental" consequence of the fact that predicate calculus and relational algebra were later proven equivalent.  That historical reason is why, for example, relational difference had to be expressed using WHERE NOT EXISTS (...), and EXCEPT was only introduced in the standard decades later.

Granted, if you teach RA first and only then SQL, you are headed for a shit load of questions, "why did they make that syntax so quirky if the algebra is so simple" ?  The answer is always the same : poor historical reasons.  BTW there's a great book forthcoming that fits this line of teaching perfectly.  But at least they could understand.

And the fact still remains that if students can first manage to understand RA, then they can understand any language that implements or supports it.  They will then understand that, as far as RA operations are concerned, the difference between SQL and, say, Tutorial D, is a mere matter of syntax.  They might understand that the Tutorial D expression 'R1 JOIN R2' is the very same thing as the SQL expressions 'SELECT * FROM R1,R2 WHERE R1.<attr> = R2.<attr>', 'SELECT * FROM R1 JOIN R2 ON ...' and 'SELECT * FROM R1 NATURAL JOIN R2'.  And the reason they would understand, is precisely that their thinking would be in RA, not in SQL or any other specific language.

We urgently need to learn our database designers to think of data in terms of sets (sets of tuples, where those sets represent the extension of a logical predicate, and each tuple represents a proposition), and to think of manipulations on that data in terms of relational algebra.  Not in terms of tables and SQL.

And it means that the teaching should no longer be about "Converting SQL to relational algebra", but instead about "Converting relational algebra to SQL".  The latter, not the former, is the sensible way of doing things.

Tuesday, September 18, 2012

"Why EAV sucks"

We've already seen this one before as an illustration of a cartesian product establishing a tuple type :

I:INT X C:CHAR = {(I:1,C:'A') , (I:1,C:'B') , ...}

and one possible relation deriving from this :

R = { (I:1,C:'A') , (I:1,C:'B') }

or, rewriting those tuples as sets of pairs mapping attribute names to values of arbitrary domains :

R = { {(I,1),(C,'A')} , {(I,1),(C,'B')}}

In tabular form (a fairly typical form for representing relation), this could become something like

IC
1A
2B

Well, tables consist of rows and columns, and with this particular view of a relation, both rows and columns turn out to have an interesting property.

Beginning with the rows.  It is obvious that one row contains precisely one tuple of the relation.  Nothing more and nothing less.  And one tuple corresponds to exactly one logical proposition that it represents.  That logical proposition being a sentence, a statement of fact.  For example, "Yesterday, the average temperature in room A was 1 degrees centigrade.".

Then the columns.  What is common for all the things mentioned in the column labeled 'I' ?  They are all numbers originating from domain INT.  Likewise for the things mentioned in the column labeled 'C'.  In our data manipulation language, this means that if we have a reference to column (/attribute) 'C', we just know that what we're referring to is a CHAR value.  We don't need to typecheck this at runtime.  It's guaranteed by the typechecks that occurred when the tuple was inserted.

Remember these two.  They're important.

A closer look at one of these two tuples themselves now.

{(I,1),(C,'A')}

That's a set of ordered pairs.  The first member of each pair coming from the domain of 'valid attribute names', the second coming from ...   Well from what exactly ?  We can't say that the second member of all pairs here comes from INT, nor from CHAR, nor in fact from any domain we used to build tuple types and relations with.  Rather, all we can say is that it's just some value from some domain, but we don't know which it is.  (We can know which one it is, by inspecting the value of the first member (the attribute name), and then linking that somehow to the original way in which we defined our cartesian product/tuple type.)

Anyway.  A "set of ordered pairs" is a relation, thus a tuple is itself also a relation.  (But do keep the distinction in mind between "relation over the domains INT and CHAR" and "relation over the domains ATTRIBUTENAME and SET_UNION_OF_ALL_DOMAINS").

Anf if a tuple is itself also a relation, we could try out what we get if we wanted to represent it tabularly using the usual technique :

????????????
I1
CA

Compare with the "interesting properties" of the (tabular representation of) the relation.  Does the interesting property of the column still survive ?  No it doesn't.  Unless we find the fact of possibly getting back just anything from any expression in our data manipulation language an "interesting" property.  Does the interesting property of the rows (tuples of a relation) still stand ?  No it doesn't.  It just tells us that there has been a temperature of 1 degrees centigrade yesterday, in some further unspecified room, for example.  (But note that in order to derive so much information, we need to inspect the value of that "first column", and iterate over a set of used attribute names : "if it's C then that value is a room identifier.  No it's not C.  If it's I then that value is an average temperature.  Ah yes that's the one.")

(Aside.  The question marks are there for a reason.  What would you put in their place ?  Can the user of a system based on this paradigm of tuples-as-relations be given the freedom to choose his own preferred replacements for the question marks, in a sense similar to how the designer of a database is at freedom to choose his attribute names, when defining cartesian products aka tuple types ?  End-of-aside.)

One step further in the predictable direction.  Take a look at

??????ATTRIBUTENAMEVALUE
GUID-WXCVBN-AXIOI1
GUID-WXCVBN-AXIOCA
GUID-WXCVBN-AX1OI2
GUID-WXCVBN-AX1OCB

This table was obtained by prepending something very akin to a "tuple id", "entity id" or "object id" to the table, and then using those values for identifying "which attribute values belong together to form a tuple".

Looks like anything you know ?  Rhetorical question.  This is EAV.  Strip the RM from its connection to predicate logic (that is, remove the connection with 'meaning') and strip the RM from the static type checking that is offered by its most foundational building brick, the tuple (that is, remove the most fundamental data integrity feature of all), and what is left is EAV.  As another commenter put it : "disassemble the tuple and amputate the predicate".

Some thought exercises.  Assume you want to design a database like this.  Assume the name in place of the question marks is 'OID'.  Write all the needed constraints on this three-column table to effect the following :

- values corresponding to the ATTRIBUTENAME 'I' must be of domain INT.
- values corresponding to the ATTRIBUTENAME 'C' must be of domain CHAR.
- only complete tuples can be inserted.  That is, if an I or a C is inserted, then so must the corresponding C or I be.
- only complete tuples can be deleted.  That is, if an I or a C is deleted, then so must the corresponding C or I be.


These four constraints all derive naturally from a relational declaration as simple as (two examples)

VAR RELATION {I:INT C:CHAR} IC ;
CREATE TABLE IC (I INT, C CHAR) ;

Which of the two approaches is the easiest and the simplest ?

Further thought exercises.

A constraint must be introduced to the effect that "the I value must be less or equal than the position of the C value in the alphabet."  Write a query to check whether this constraint is currently satisfied by your EAV database, and spell out the strategy you're going to follow to enforce this rule.  Is it simpler or easier or less work involved than the following ? (two examples)

CONSTRAINT IC_RULE1 ALL(IC, I<=INDEXOF(C,"ABCDEFG...XYZ")) ;
ALTER TABLE IC CHECK (I<=INDEXOF(C,"ABCDEFG...XYZ")) ;

A constraint must be introduced to the effect that the I values are unique identifiers for any tuple.  Write a query to check whether this constraint is currently satisfied by your EAV database, and spell out the strategy you're going to follow to enforce this rule, or write out the declarative constraint that will enforce this in your EAV database.  Is it simpler or easier or less work involved than the following ? (two examples)

VAR RELATION {I:INT C:CHAR} IC KEY {I};
ALTER TABLE IC KEY (I) ;

Further thought exercise.  Read this sentence carefully.  Now look at the following EAV structure.

??????ATTRIBUTENAMEVALUE
GUID-WXCVBN-AXIOPOS1
GUID-WXCVBN-AXIOWORDRead
GUID-WXCVBN-AXOOPOS3
GUID-WXCVBN-AXOOWORDsentence
GUID-WXCVBN-AXO0POS4
GUID-WXCVBN-AXO0WORDcarefully
GUID-WXCVBN-AX1OPOS2
GUID-WXCVBN-AX1OWORDthis

I could write that same sentence, in cryptic form, in four lines :

Word 1 in the sentence is Read.
Word 3 in the sentence is sentence.
Word 4 in the sentence is carefully.
Word 2 in the sentence is this.

And EAV even goes beyond that in turning that into 8 lines !  How much work and computation does it involve to reassemble the original sentence from this ?



All of that should be sufficient to illustrate how much veracity there is to the claim that "EAV makes database maintenance easier".  Instead :

If you do not care about the data type of (the values in) your database fields, then EAV is your thing.
If you do not care about the meaning of the contents of your database, then EAV is your thing.
If you do not care about the integrity of your data, then EAV is your thing.

(In each of these cases, "do not care" is supposed to also include the notion that "the user doesn't mind paying all those extra hours you put in solving problems that the DBMS has already solved for you".)

Tuesday, September 11, 2012

"Types, Sets, Tuples, Headings, Relations, a set mess ???"

That's a long title, and it's intended to illustrate how all of that stuff that actually underpins the Relational Model of Data, mathematically speaking, may come across as "one big tangled mess of almost-but-not-quite similar-or-equal concepts all thrown on a big pile".

And the body of this installment is about illustrating how it really is not that "perceived mess".

The following example has been used more than once already in this little corner of the internet.


I:INT X C:CHAR = {(I:1,C:'A') , (I:1,C:'B') , ...}


At this point, I need to take one more step back to basics.  The foregoing definition assumed it to be "known" or "agreed-on" what those domains INT and CHAR precisely were, respectively.  I.e. what its constituent members are.  The thing to note is that those "domains" are sets, and their members are values :


INT = {1 , 2 , ...}
CHAR = {'A' , 'B' , ...}

And that's really all it takes to be a type.  A type is a set of values.  And a value can be whatever we want it to be.  If I have a good use for the set of guitar chords, I can choose that to be a type.  If I have a good use for the set {cloud, oak, armchair} or {scissors, stone, net}, I can choose that to be a type.

(Aside : of course, in a scenario of information exchange, the parties participating in the exchange need to be in full and perfect agreement on what the types are (which values are in it and which are not).  Exchange without such a full agreement is unpleasant surprises in the air and accidents waiting to happen.)


Back to

I:INT X C:CHAR = {(I:1,C:'A') , (I:1,C:'B') , ...}


Obviously, by definition in fact, such Cartesian products are themselves a set.  Are they then also a type ?  The predictable answer is of course 'yes'.  A Cartesian Product is a set of tuples, and if we regard these tuples as being values, then any Cartesian Product of domains constitutes a tuple type.


One step further.  Relations were defined to be any subset of this cartesian product.  Any set is a subset of itself, thus any cartesian product is also a subset of itself, thus any cartesian product is a relation, and since each cartesian product constitutes a tuple type, at least this "maximal" relation ("maximal" subset of the cartesian product of the domains) can also be said to constitute a tuple type.  But all other relations derived from the same cartesian product are also "just sets of tuples".  Likewise, really just any relation can be thought of as "constituting a tuple type".  This holds true even of the empty relation, even though it's fairly counterintuitive to imagine a type that has no values ...  (What could we possibly be using that for ???)

Is all of that important ?  The view of "really just any relation" as itself being, or constituting, a tuple type ?  Not really.  Except perhaps for this.  One could "subdivide" the set of all possible subsets of our cartesian product in three distinct "classes".  The first class consisting of just the subset that is the original cartesian product itself.  The second class being those subsets that can be defined "predicatively", by stating a constraining additional predicate such as, e.g. "I equals 1".  And the third class being "the others".  Pragmatically speaking, this third class can only be defined "by enumeration".  (Formally, there is no real distinction between "by enumeration" and "by predicate", because stating the enumeration is, from the perspective of logic, the very same thing as stating a predicate.)

The third class of subsets is thus characterized by the fact that its definition "looks like", e.g. "It's the tuple type I:INT C:CHAR where either I equals 1 and C equals 'Z', or I equals 5 and C equals 'W'".  Which is a predicate, but one that looks suspiciously much like a full enumeration.

The second class of subsets is thus characterized by the fact that its definition "looks like", e.g. "It's the tuple type I:INT C:CHAR where I equals 1".  A reasonably concise way to define possibly useful subsets/types.

The first class is thus characterized by the fact that its definition is just simply, "It's the tuple type I:INT C:CHAR".  The 'I:INT X C:CHAR' portion is the heading.  The (only) tuple type in this class can thus be said to be characterized by its heading alone.  All other conceivable tuple types with the same heading need an additional constraining predicate to define which tuples are and are not member of the type.  This is the reason why in practice, this particular tuple type is often called "the" tuple type corresponding to some heading (as if there is only one such type).  Formally, this is rather incorrect.  Many tuple types have the same heading, but the one that has all possible tuples, is the most useful, in practice.

The fact that relations are themselves also tuple types, is the source of quite a bit of confusion. The "design dilemma", for example.  (That's the question of whether structures such as 'customer' and 'address' should be conveyed in their own type or not.)  It's not a dilemma at all.  Defining a Customer or Address class in an OO system constitutes defining a type.  The tuple type that corresponds to it is a type too.  Whether or not to give the type a name, is another question, and one that will mostly be moot at that.

On to relations.  If relations are themselves values, then what is their type ?  Well, a type is the set of all its contained values.  Hence, a relation type is a set of relations.  The construction of such a set of relations can be done using the mathematical concept of powersets.  A powerset is the set of all possible subsets of a given set.  So if we :
- start with a tuple type such as the one in our running example,
- enumerate all the possible subsets of that tuple type (that's equivalent to enumerating all the possible relations with that heading)
- and collect all those subsets in a new set,
then this new set is our relation type.

This relation type too is characterized by its heading (and its heading alone).  As with tuple types, this relation type is not the only relation type of this heading, though.  Any subset of the tuple type (e.g. all the tuples that have I>=1, thus ruling out negative numbers), when subjected to the same procedure, will in turn give rise to some relation type (relations with the specified heading and where all its tuples satisfy the predicate that (e.g.) I>=1).  Likewise, subsetting the relation type using some predicate (for example, retaining only "the relations that have at least three tuples") gives rise to some other new relation type.

Typically, however, relational systems do not exploit this property of relation and tuple types.  Instead they give the user the possibility to specify predicates that constrain which relations and tuples can validly appear in which places.  But a system that offered named tuple/relation types, with a constraining predicate, would fundamentally (mathematically) not really be different (in what it allows the user to achieve).