Tuesday, January 29, 2013

"Converting SQL to Relational Algebra", part II

When I wrote the first posts in this blog, it seemed to generate traffic from places I found rather curious, until it occurred to me that search engines are not entirely unlikely to present "Relational Model" as a useful search result to people who are searching for a Relation with a Model ...

Likewise, the former post on the same subject as this one, gave quite some traffic, and it seems not unreasonable to assume that most of that traffic was from students who were looking for info on how to solve "Convert SQL to RA" problems, and who were not looking for my philosophical agonies on why these problems are even being taught ...

As somewhat of an apology to all those disappointed students, and to prevent further disappointments in the future, a concise set of guidelines, in the form of SQL_KEYWORD to RA_OPERATOR mappings, and whatever additional comments I think might be useful.

SELECT maps to (any combination of) PROJECT, RENAME and EXTEND.

If the select clause involves exclusively column names of the table defined in the FROM clause, then PROJECT is all that is involved.

If the select clause involves constructs such as <columnname> AS <othercolumnname>, then a RENAME is involved as well.

If the select clause involves scalar expressions (quantity+1, HOUR(<colname>), ...), then an EXTEND is involved.  Note that if such scalar expressions are not followed by an AS <colname> construct, then there is no relational equivalent for this SQL, because the SQL yields an unnamed column, and the relational model and/or the relational algebra do not allow unnamed attributes.  Note also that if the expression "quantity + 1" is used in a SELECT clause (and quantity is a column of the table in the FROM clause), then the EXTEND operation by itself will not "project away" the quantity attribute.  Don't forget to explicitly PROJECT away such attributes in the RA formulation.

In all cases, observe that Relational Algebra does not allow duplicate rows, but SQL does.  If the SQL

SELECT QUANTITY FROM ...

could possibly give rise to the same quantity appearing multiple times in the result table, then there simply isn't any relational-algebra equivalent of this query.  In such cases, for the query to even have a relational-algebra equivalent, it is required that the SELECT is a SELECT DISTINCT.


FROM maps to CARTESIAN PRODUCT, NATURAL JOIN, or just nothing at all.

It maps to nothing at all if (and I think only if) its comma-separated list of table expressions has cardinality one.  In that case, the table expression simply defines the input argument to the "surrounding" [combination of] PROJECT/RENAME/EXTEND.

It maps to NATURAL JOIN between pairs of table expressions (say, T1 and T2), if and only if, for all the columns that have the same name in T1 and T2, there is an equality condition in the WHERE clause (WHERE T1.<colx> = T2.<colx>).

In all other cases, it maps to CARTESIAN PRODUCT.  Note that in CARTESIAN PRODUCT, the tables involved are not allowed to have any column names in common.  You might need to introduce additional RENAMEs in order to resolve/do away with any such column name overlappings.  Also note that whatever you write in front of a dot, does not magically become part of the column name itself.  Not in SQL, and, well, relying on it in some RA notation isn't a very attractive idea either.



WHERE maps to RESTRICT, or SEMIMINUS (aka ANTIJOIN), or SEMIJOIN.

It maps to RESTRICT if the WHERE clause involves boolean expressions comparing columns of the table against each other, or against literal values.  E.g. WHERE <colname1> != <colname2>, WHERE HOUR(<colname>) BETWEEN 8 AND 18, ...

It maps to SEMIJOIN if the WHERE clause involves an EXISTS(...) construct.  The nature of the inner SELECT embedded in that clause will give you the second argument for the SEMIJOIN.  Note that this inner SELECT might contain a WHERE clause itself, and that the nature of this WHERE clause may dictate that additional RENAMEs might have to be applied to this second argument.  For example,

SELECT ... FROM T1 WHERE EXISTS ( SELECT * FROM T2 WHERE T2.A = T1.B)

requires a RENAME to be applied to T2, renaming A to B, in order for RA's SEMIJOIN to expose the same behaviour as the SQL EXISTS(...).

And finally, WHERE maps to an ANTIJOIN if the WHERE clause involves a NOT EXISTS(...) construct.  All remarks for SEMIJOIN apply here as well, mutatis mutandis.

If >1 of these constructs are involved in the WHERE clause, then these constructs will have a logical connective between them.  Split the overall query in as many parts as are needed (one for the scalar conditions, one for each EXISTS and one for each NOT EXISTS).  Each logical connective so eliminated becomes a NATURAL JOIN or INTERSECTION if the connective was AND, and becomes a UNION if the connective was OR.  The nesting of these INTERSECTIONs and UNIONs is as dictated by the parentheses and/or the precedence rules between the AND/ORed expressions in the SQL.



GROUP BY ...  HAVING ... maps to nothing at all.  Or more precisely, if you are studying for an exam, do as your course tells you, and try to forget it as soon as possible after you've passed the exam, because whatever you were told in your course/classes was probably about the biggest crap one could imagine.

The same applies if you don't see any GROUP BY ..., but your SELECT clause has an aggregate function in it, like, say, SELECT SUM(QUANTITY) FROM MYTABLE.  That's logically the same as having an "empty" GROUP BY clause.

There do exist definitions/versions of Relational Algebra that can properly handle all this kind of aggregation-like functionality, but it is unlikely that you have been taught these.  So, do as your course tells you and forget it ever after.



UNION maps, unsurprisingly, to UNION.  But do note that basically the same remark applies as with SELECT/projection : Relational Algebra does not allow duplicate rows.  So if your SQL UNION could possibly give rise to duplicate rows, then this UNION query simply does not have any relational-algebra equivalent.



Modern SQL also has operators such as EXCEPT, INTERSECT, NATURAL JOIN, JOIN ...ON ...  These map to relational difference, relational intersection, natural join, or some combination of natural joins and cartesian products, mostly in obvious ways.



FROM clause missing ...  That's not standard SQL, but some products allow it (and as a consequence, some courses may even teach it).  In principle, SELECT statements without a FROM clause are used to denote table literals.  Use whatever notation your RA course used for denoting relation literals.

Tuesday, January 8, 2013

A letter by Carl Hewitt

Recently, an article by Erik Meijer entitled "all your databases are belong to us" created a bit of commotion among Relationlanders.  One Carl Hewitt subsequently wrote a letter in support of Meijer's article.  A few observations and personal thoughts about Hewitt's letter.

Hewitt writes :

"Relational databases have been very useful in practice but are increasingly an obstacle to progress due to the following limitations:"

This clearly implies that mr. Hewitt believes that at least once upon some time, a relational database has actually ever existed somewhere, and he has been able to observe, directly or indirectly, that that database was "useful in practice".  Considering the overwhelming evidence that SQL is not relational, I wonder what "relational database" mr. Hewitt has so observed to be "useful in practice".  Anyway.  Is mr. Hewitt unaware of the difference between "relational technology" and "SQL technology" ?  Or does he consider the difference (and its consequences) between the two so meaningless and futile that it does no harm at all to gloss over that difference and speak of SQL as if it were indeed relational ?

Besides.  Looking at his arguments in support of his claim (inexpressiveness of the RA, for example), one cannot help but wonder if mr. Hewitt is even aware of the difference between a "database" (that's the word he used) and that thing that most knowledgeable people usually use the term "DBMS" for ...  Can "inexpressiveness of the RA" possibly a shortcoming of "an organized collection of data" (if so, how ?), or could it only possibly be a shortcoming of "the software system, the foundations of which are in RA, used to manage the collection of data" ?

What does that tell you about how thorough, accurate and meticulously precise mr. Hewitt tries/bothers to be in his published writings ?



Hewitt writes :

"Inexpressiveness. Relational algebra cannot conveniently express negation or disjunction, much less the generalization/specialization connective required for ontologies;"

Codd's Relational Algebra was proven equivalent to predicate calculus, no ?  So that means that Codd's RA can express both negation and disjunction, right ?  And subsequent definitions of RA that emerged over time (think, for example, of the addition of a transitive closure operation) did not exactly remove the MINUS operator from the existing algebra, right ?  So that indicates that the key operative word in the claim is that vague qualification "conveniently", right ?  Using such a word without being precise about its intended meaning, is just cheap handwaving.

Anyway, the RA has MINUS (and its nephew SEMIMINUS, aka antijoin), and Relationlanders have known for over 40 years that this operator perfectly suits the purpose of expressing any predicate like "... and it is not the case that ...".  It remains an open question what mr. Hewitt thinks is "inconvenient" about writing things such as "<xpr1> MINUS <xpr2>" in, say, Tutorial D code.

Also, there is nothing to stop anyone from defining an algebra that has a "complement" operation (well, so long as all the domains/types are finite, presumably).  This algebraic operation by itself is the exact relational counterpart of the logical operation of negation, taken by itself.  Having to actually compute complements will be contraindicated in most circumstances, as it will typically involve actual computation of what is known in relationland as the "universal relation" for a given heading.  All of that is probably exactly the reason why Codd did not want to include such a "complement" operation in his algebra.

At any rate, I'm still left wondering what mr. Hewitt's problem is here.



Hewitt writes :

"Inconsistency non-robustness. Inconsistency robustness is information-system performance ..."

Note very carefully that Hewitt's complaint here is that the RM lacks "inconsistency robustness", which he then defines to be a performance characteristic.  Performance is not a characteristic of the model.  Anyway. Once writers start going down this alley, readers can already suspect the kind of , eurhm, "talk" that is about to follow ...



"... in the face of continually pervasive inconsistencies, a shift from the once-dominant paradigms of inconsistency denial and inconsistency elimination attempting to sweep inconsistencies under the rug. In practice, it is impossible to meet the requirement of the Relational Model that all information must be consistent, but the Relational Model does not process inconsistent information correctly. ..."

"Inconsistent" in the context of data[base] management, means the presence of information that is in violation of some stated rule that is supposed to apply to the database.  Or iow, that accepting the "inconsistent" information in the database, makes the proposition that states that the violated rule holds, a false one.  Or iow, the proposition that states that the rule holds in the database, is in contradiction with the "inconsistent" information.  Or iow, accepting inconsistent information in the database is tantamount to accepting contradictions.

And I've been told that it is a proven property that you can prove really just anything from a contradiction.

In RM, it is "possible" to consider "inconsistent information", just like in 2-valued propositional and/or predicate logic, it is "possible" to consider contradictory propositions/predicates.  Querying an RM system that holds "inconsistent information" is like applying the rules of logical reasoning to a set of contradictory axioms/premisses.  And blaming the RM for "not processing inconsistent information correctly", is like blaming logic for "not dealing with contradictions correctly" (where by 'correctly', it is implied that it should be something other than just 'false', of course).




"... Attempting to use transactions to remove contradictions from, say, relational medical information is tantamount to a distributed-denial-of-service attack due to the locking required to prevent introduction of new inconsistencies even as contradictions are being removed in the presence of numerous interdependencies;"

Anyone who is familiar with the RM, and also with how it is typically criticized, will recognize this immediately as criticizing the model on grounds of implementation issues (locking and transactions), which are orthogonal to the model.  Typical.



Hewitt continues :

"Information loss. Once information is known, it should be known thereafter;"

Should it really ?  I dispute that.  A requirement to never ever ERASE or DELETE or REMOVE just anything, will inevitably bring us to the point where planet earth does not have enough atoms to store all our stuff.

At any rate, there is absolutely nothing in the RM that prevents any database designer from defining structures that keep a record of "which information was known to the database owner during which period of time" and/or "at which point in time the database owner regarded this particular piece of information as no longer relevant and removed it from his operational system".  Even SQL has included features to support such stuff in the new 2011 standard.

And in the end, when to DELETE a piece of information, should be at the user's discretion (if regulatory bounds apply, then that user should of course be staying within those bounds, if that wasn't obvious), not at the model's discretion.



Hewitt still hasn't finished :

"Lack of provenance. All information stored or derived should have provenance;"

There is absolutely nothing in the RM to stop a DBMS user from defining structures that record exactly the kind of "provenance" information that Hewitt is talking about (whatever that may be), there is nothing in the RM to stop a DBMS designer from building facilities to automagically populate such structures, and there is nothing in the RM to stop a DBMS user from using such facilities.

Nor should there be any such thing in the RM.



And on and on it goes :

"Inadequate performance and modularity. SQL lacks performance ..."

So once again he unambiguously states that he thinks that "The RM is obsolete" (that's what the title says) because "SQL lacks performance".  OMFG.



"... because it has parallelism but no concurrency abstraction. Needed therefore are languages based on the Actor Model (http://www.robust11.org) to achieve performance, operational expressiveness, and inconsistency robustness. To promote modularity, a programming language type should be an interface that does not name its implementations contra to SQL, which requires taking dependencies on internals."

Is the term "programming language type" used here with the same meaning as the term "data type" in relationland ?  If so, then the last sentence seems to demand nothing else than that which relational advocates have been demanding for decades already : that the relational model should and must be "orthogonal to type", that is, that it is not up to the model to prescribe which data types should exist/be supported, and which shouldn't.

Of course, relationlanders have known for a long time already that SQL basically flouts that idea, and that its attempts at supporting -in full- the notion of abstract data types are quite crippled and half-baked.  But apparently it does indeed seem to be the case that mr. Hewitt mistakenly equates "the relational model" with SQL.



And Hewitt concludes :

"There is no practical way to repair the Relational Model to remove these limitations."

Well, this is the first time I can agree with something Hewitt says.  Sadly, as far as I can tell, "these limitations" are not limitations that derive from the Relational Model, rather they seem to derive from his limited understanding thereof.  And indeed there is no repairing a piano when the problem is in its player.