Tuesday, January 29, 2013

"Converting SQL to Relational Algebra", part II

When I wrote the first posts in this blog, it seemed to generate traffic from places I found rather curious, until it occurred to me that search engines are not entirely unlikely to present "Relational Model" as a useful search result to people who are searching for a Relation with a Model ...

Likewise, the former post on the same subject as this one, gave quite some traffic, and it seems not unreasonable to assume that most of that traffic was from students who were looking for info on how to solve "Convert SQL to RA" problems, and who were not looking for my philosophical agonies on why these problems are even being taught ...

As somewhat of an apology to all those disappointed students, and to prevent further disappointments in the future, a concise set of guidelines, in the form of SQL_KEYWORD to RA_OPERATOR mappings, and whatever additional comments I think might be useful.

SELECT maps to (any combination of) PROJECT, RENAME and EXTEND.

If the select clause involves exclusively column names of the table defined in the FROM clause, then PROJECT is all that is involved.

If the select clause involves constructs such as <columnname> AS <othercolumnname>, then a RENAME is involved as well.

If the select clause involves scalar expressions (quantity+1, HOUR(<colname>), ...), then an EXTEND is involved.  Note that if such scalar expressions are not followed by an AS <colname> construct, then there is no relational equivalent for this SQL, because the SQL yields an unnamed column, and the relational model and/or the relational algebra do not allow unnamed attributes.  Note also that if the expression "quantity + 1" is used in a SELECT clause (and quantity is a column of the table in the FROM clause), then the EXTEND operation by itself will not "project away" the quantity attribute.  Don't forget to explicitly PROJECT away such attributes in the RA formulation.

In all cases, observe that Relational Algebra does not allow duplicate rows, but SQL does.  If the SQL

SELECT QUANTITY FROM ...

could possibly give rise to the same quantity appearing multiple times in the result table, then there simply isn't any relational-algebra equivalent of this query.  In such cases, for the query to even have a relational-algebra equivalent, it is required that the SELECT is a SELECT DISTINCT.


FROM maps to CARTESIAN PRODUCT, NATURAL JOIN, or just nothing at all.

It maps to nothing at all if (and I think only if) its comma-separated list of table expressions has cardinality one.  In that case, the table expression simply defines the input argument to the "surrounding" [combination of] PROJECT/RENAME/EXTEND.

It maps to NATURAL JOIN between pairs of table expressions (say, T1 and T2), if and only if, for all the columns that have the same name in T1 and T2, there is an equality condition in the WHERE clause (WHERE T1.<colx> = T2.<colx>).

In all other cases, it maps to CARTESIAN PRODUCT.  Note that in CARTESIAN PRODUCT, the tables involved are not allowed to have any column names in common.  You might need to introduce additional RENAMEs in order to resolve/do away with any such column name overlappings.  Also note that whatever you write in front of a dot, does not magically become part of the column name itself.  Not in SQL, and, well, relying on it in some RA notation isn't a very attractive idea either.



WHERE maps to RESTRICT, or SEMIMINUS (aka ANTIJOIN), or SEMIJOIN.

It maps to RESTRICT if the WHERE clause involves boolean expressions comparing columns of the table against each other, or against literal values.  E.g. WHERE <colname1> != <colname2>, WHERE HOUR(<colname>) BETWEEN 8 AND 18, ...

It maps to SEMIJOIN if the WHERE clause involves an EXISTS(...) construct.  The nature of the inner SELECT embedded in that clause will give you the second argument for the SEMIJOIN.  Note that this inner SELECT might contain a WHERE clause itself, and that the nature of this WHERE clause may dictate that additional RENAMEs might have to be applied to this second argument.  For example,

SELECT ... FROM T1 WHERE EXISTS ( SELECT * FROM T2 WHERE T2.A = T1.B)

requires a RENAME to be applied to T2, renaming A to B, in order for RA's SEMIJOIN to expose the same behaviour as the SQL EXISTS(...).

And finally, WHERE maps to an ANTIJOIN if the WHERE clause involves a NOT EXISTS(...) construct.  All remarks for SEMIJOIN apply here as well, mutatis mutandis.

If >1 of these constructs are involved in the WHERE clause, then these constructs will have a logical connective between them.  Split the overall query in as many parts as are needed (one for the scalar conditions, one for each EXISTS and one for each NOT EXISTS).  Each logical connective so eliminated becomes a NATURAL JOIN or INTERSECTION if the connective was AND, and becomes a UNION if the connective was OR.  The nesting of these INTERSECTIONs and UNIONs is as dictated by the parentheses and/or the precedence rules between the AND/ORed expressions in the SQL.



GROUP BY ...  HAVING ... maps to nothing at all.  Or more precisely, if you are studying for an exam, do as your course tells you, and try to forget it as soon as possible after you've passed the exam, because whatever you were told in your course/classes was probably about the biggest crap one could imagine.

The same applies if you don't see any GROUP BY ..., but your SELECT clause has an aggregate function in it, like, say, SELECT SUM(QUANTITY) FROM MYTABLE.  That's logically the same as having an "empty" GROUP BY clause.

There do exist definitions/versions of Relational Algebra that can properly handle all this kind of aggregation-like functionality, but it is unlikely that you have been taught these.  So, do as your course tells you and forget it ever after.



UNION maps, unsurprisingly, to UNION.  But do note that basically the same remark applies as with SELECT/projection : Relational Algebra does not allow duplicate rows.  So if your SQL UNION could possibly give rise to duplicate rows, then this UNION query simply does not have any relational-algebra equivalent.



Modern SQL also has operators such as EXCEPT, INTERSECT, NATURAL JOIN, JOIN ...ON ...  These map to relational difference, relational intersection, natural join, or some combination of natural joins and cartesian products, mostly in obvious ways.



FROM clause missing ...  That's not standard SQL, but some products allow it (and as a consequence, some courses may even teach it).  In principle, SELECT statements without a FROM clause are used to denote table literals.  Use whatever notation your RA course used for denoting relation literals.

No comments:

Post a Comment