Friday, January 4, 2019

Some further analysis and observations on a statement about CHECK constraints

"It is not a-priori knowledgeable whether and which of the other rules are expressible with the CHECK syntax in a given SQL dialect"

Of course it's never answerable with full 100% accuracy, but it is possible to come pretty damn close.

(1) From the SQL standard, book 1, "Framework" :

"A table check constraint specifies a search condition. The constraint is violated if the result of the search condition is false for any row of the table ( ... irrelevant amendment for NULL cases here ...)."

Considering that the CHECK constraint is supposed to be satisfied by every single row in isolation,
that all currently present rows satisfy the CHECK because they have been CHECKed when they were inserted (and re-checked when updated),

it seems we can conclude that the CHECK expression defines the condition to be satisfied by all newly inserted rows (where an UPDATE is to be considered the semantic equivalent of a "simultaneous" DELETE (of the old row value) and INSERT (of the new row value) ).  And subsequently, that SQL will execute CHECK constraints only for INSERT-type updates (to repeat, UPDATE is also an INSERT-type update because some INSERT is semantically implied - last time I'll explicitly repeat this).   This conclusion is factually wrong unless certain conditions are satisfied - see last additional note - but it is one that the SQL engineers have indeed drawn and hence SQL implementations indeed behave as described above.

(2) Ceri/Widom have shown ( https://pdfs.semanticscholar.org/850b/e943c3b45f32de203a277dfbf556ffbbda21.pdf - "deriving invalidating operations") that it is possible to compute from a formally stated database constraint, the set of types of update operation that can potentially violate the constraint.  "Type of update operation" standing for : INSERT INTO X, DELETE FROM Y, ... which can be formalized as (update-type, relvar) pairs (equate "relvar" with "table" if you're not familiar with the term - it's close enough for the purposes of this post).

Meaning that if that set includes one single pair in which update-type is DELETE, then CHECK constraints alone cannot possibly suffice to enforce the database constraint (business rule) entirely.

(Aside : this is the very reason why SQL has FKs as a distinguished concept and "REFERENCES" syntax to define them.  FKs can indeed be violated by a DELETE-type operation and there was nothing else in the language to fire integrity checks execution from a DELETE-type operation.  So the SQL creators addressed the problem by introducing a dedicated concept that the run-time integrity check could be "attached to".  Date/Darwen have demonstrated at length that a truly relational system that supports *all* types of constraint has no need for dedicated "REFERENCES" syntax, because in their own words, "the longhand often turns out shorter than the shorthand".)

(3) That leaves only the database constraints which can only ever be violated by some INSERT-type update.  This is to be split in two cases :
(3a) the only variables in the search condition in the CHECK constraint, are attribute/column values of the newly inserted tuple/row specifically (that is, the outcome of the CHECK is determined by the newly inserted tuple/row value exclusively), then in SQL this constraint will be supported by the engine, unless the engine does not offer CHECK support altogether (a minority I expect).
(3b) there is at least one variable in the search condition other than just attributes/columns from the newly inserted tuple/row.  Something related to other tuples/rows from the same relvar/table, or tuples/rows from other relvars/tables altogether.  These cases are addressed by SQL feature F671, "subqueries in CHECK constraints" so whether your given engine supports these cases is determined by whether your given engine supports SQL standard feature F671 (even to this very day, I believe none of the big dogs has it let alone the smaller ones).

(4) (Just to give an indication of how big the percentages of each case are in a real-world db :)  The SIRA_PRISE catalog effectively contains all applicable business rules as a declared constraint.  Disregarding type constraints (which "derive" directly from the declaration of a heading and which limits the set of values to those within a known data type), the counts are as follows :

Tuple constraints (single-row CHECK constraints / 1OP 2OP properties in FP parlance) :  TBC
Relvar constraints (single-table constraints / 3OP properties in FP parlance) :  TBC
Database constraints (multi-table constraints / 4OP properties in FP parlance) :  TBC
Database constraints that are DELETE-sensitive (i.e. not supportable by CHECK) :  TBC
Database constraints that are not DELETE-sensitive (i.e. supportable by CHECK) :  TBC

Relvar constraints that are keys :  TBC
Relvar constraints that are not keys :  TBC

Database constraints that are FKs :  TBC
Database constraints that are not FKs :  TBC



Additional notes :

(2)
There is some give and take, stretch if you will, in the analysis mentioned in (2).

Take for example a constraint "total amount must at all times exceed (or be equal to) 5".  Where "total amount" is the summation of the values in some attribute/column of some [subset of] rows in some relvar/table.  On the face of it one might figure that INSERT-type updates are never going to fail that check.  But that is only so if the rows are guaranteed never to contain negative values !  So the DBMS' constraint analysis procedure taking this conclusion is predicated on the system being able to detect that guarantee (presumably from some other declared constraint, but potentially also from a type definition), and that's nowhere near trivial for a DBMS to do (because *this* is the point where "the DBMS genuinely understanding the data" enters the picture - for some sense of "genuinely understanding").

(3)
Some designers try to bypass the absence of F671 by hiding the access to other tables in a user-defined function so that their search condition looks more like "WHERE ... MY_UDF(inserted_row) ..." which on the face of it makes things look like this is a (3a) case, but *semantically* of course it still isn't.  And they often find themselves trapped in / caught by unanticipated timing-and-sequence-related dependencies that the DBMS keeps for itself (rightly so at that).  If/when that happens that's usually when another new question appears on Stackoverflow :-) .

(3b)
Even if you have a system that supports F671, there is still extremely sensitive transformation to be performed by the database designer.  You have one single business rule, but you have >1 relvar/table, the INSERT to each of which might violate it.  As a database designer, you're still obliged to rewrite the single business rule into the right expression to check for when the INSERT happens to this table, or to that other table, and so on for each table involved (to be precise : each table in the set of (INSERT, table) pairs you got from the analysis phase in (2) - and note that this therefore requires a tool that will perform that analysis in (2) !).  These transformations are so tedious, difficult (literally a minefield full of traps) and error-prone that I bluntly wrote in my NOCOUG article on the subject [something to the effect] that "Ordinary humans can't do it, period".

(last note)
The original text of this post also had "CHECK constraints can only ever be violated for newly INSERTed rows at the time of their insertion" as an argument leading to the conclusion that "the CHECK expression defines the condition to be satisfied by all newly inserted rows".  We already mentioned that this conclusion is factually wrong, and here's why.

Suppose a row r1 is inserted in table T1, and it satisfies the CHECK constraint.  But the CHECK constraint yielding TRUE is predicated/dependent on the existence of some particular row r2 in some table T2 (T2 may be the same as T1 or different).  That is, the CHECK constraint is a least a cross-row constraint and possibly a cross-table one.  In FP parlance, it corresponds to a 3OP or a 4OP.  Removing all rows from T2 [that make the particular predicate for r1's CHECK constraint yield TRUE] will obviously cause r1 in T1 to suddenly fail its CHECK constraint, which should obviously require the removal -the DELETE- to be rejected by the DBMS.  So it is not always true that "CHECK constraints for a row rx can only ever be violated upon insertion of that very row value rx".  This is only so if the CHECK constraint's outcome does not depend on the either-or-not existence of any other row in any table.  That is why F671 is a separate SQL feature, and moreover one that is not supported [producing correct and reliable results] by any SQL engine I know : the analysis [of all the distinct ways in which r1's CHECK constraint outcome is dependent on any of the involved other rows in any of the involved other tables in the database] is just too hard to make, especially so in the calculus-based language that SQL is.