Wednesday, August 28, 2019

Codd's normal form debunked

The title says it all.  Following is a bit of commentary on what Codd wrote in section 1.4 of his 1970 paper, "A Relational Model of Data for Large Shared Data Banks".  That particular section introduces the notion of a "Normal Form", the benefits it allegedly achieves, and why it is allegedly always achievable.  The "Normal Form" Codd introduced here later continued its life as "First Normal Form", when functional dependency theory emerged and led to additional normal forms.



Codd begins the section with

"A relation whose domains are all simple can be represented in storage by a two-dimensional column-homogeneous array of the kind discussed above. Some more complicated data structure is necessary for a relation with one or more nonsimple domains. For this reason (and others to be cited below) the possibility of eliminating nonsimple domains appears worth investigating."

The observations Codd makes about "representation in storage" are correct, if one considers the timeframe this was set in (1970, to repeat), the vendor he was working for at the time, the kind of IT segment that vendor's customers were mainly situated in (the 'B' in IBM stands for 'Business'), and the kind of languages that reigned supreme in that particular IT segment (e.g. COBOL and PL/I).

But "representation in storage" is clearly an implementation detail, a consideration at the physical/implementation level, and for this very reason should never ever have been given the weight Codd gives it here as an argument for "striving for normal form" (which is a concern situated at the logical level).  Basically, all he's saying here is "we must always strive for normal form because then we can keep the implementation simpler".  That is in fact highly curious, given that in the paragraph immediately preceding this whole section, Codd acknowledges the possibility of "nonsimple domains" :

"Nonatomic values can be discussed within the relational framework. Thus, some domains may have relations as elements. These relations may, in turn, be defined on nonsimple domains, and so on."

and concludes that paragraph with the observation that

"Much of the confusion in present terminology is due to failure to distinguish between ... and between components of a user model of the data on the one hand and their machine representation counterparts on the other hand".

To see that -worthwhile- observation followed by a recommendation to strive for some "normal form" in "the user model of the data", while that recommendation is supported only by arguments regarding "machine representation counterparts" (we'll get back to that "and others to be cited below" below), is highly ironic, or dubious, to say the least.



The explanatory employee example with nested relations jobhistory and salaryhistory is clear and correct enough, but the really interesting part is the few paragraphs immediately following it :

"If normalization as described above is to be applicable, the unnormalized collection of relations must satisfy the following conditions:
(1) The graph of interrelationships of the nonsimple domains is a collection of trees.
(2) No primary key has a component domain which is nonsimple."

The normalization procedure Codd describes is essentially based on applications of what in the TTM world has come to be known (more or less) as the UNGROUP operator.  (Caveat : TTM defines UNGROUP as an algebraic operator operating on relation values and yielding other relation values; in the context of the Normal Form the operation must be seen as operating on ***relational schemas*** and yielding ***other relational schemas***.)

TTM's UNGROUP is known not to be information-preserving if one of the RVAs (relation-valued attributes) being UNGROUPed is part of the key of the relation subjected to UNGROUP.  (Information-preserving in the sense that there is some other operator invocation (or nesting thereof) that is guaranteed to return the original from the UNGROUP result, in all possible cases.)  Codd correctly understood that, and thus correctly identified (2) as a precondition for his normalization procedure to be applicable : you cannot "UNGROUP" or "normalize" an RVA/nonsimple domain away if the concerned attribute is part of the key.

And the most interesting part is the concluding sentence :

"The writer knows of no application which would require any relaxation of these conditions."

and this is clearly presented as an argument to the effect that "we need not worry that this is going to pose any real restriction in any real-world scenario".  (In fact, he's all but saying "I never heard of or saw any such application, therefore they must not exist".)  Well, too bad for Codd, but this writer does "know of such applications", and here's the most compelling one :

Suppose we wanted to put all the information we now get from javadoc in a relational database.  That would, for example, give us a simple means to answer questions such as "give me all the classes X that have a superclass Y where that superclass has a static method with the same method signature as any nonstatic method of X".  (Just a matter of writing the query - a sketch follows right below - whereas try achieving the same with the javadoc-generated documentation as it stands right now.)
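
Purely as an illustration of "just writing the query", here is a sketch in SQL against a hypothetical flattened schema (all table and column names are invented for this example, and 'signature' is treated here as a directly comparable value - which is exactly the simplification examined next) :

-- Assumed, invented schema :
--   CLASSES (classname, superclassname)
--   METHODS (classname, methodname, signature, isstatic)
SELECT c.classname
FROM   CLASSES c
WHERE  EXISTS
       (SELECT *
        FROM   METHODS sup, METHODS sub
        WHERE  sup.classname  = c.superclassname   -- method of the (direct) superclass
        AND    sub.classname  = c.classname        -- method of the class itself
        AND    sup.isstatic   = 'Y'
        AND    sub.isstatic   = 'N'
        AND    sup.methodname = sub.methodname
        AND    sup.signature  = sub.signature);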

Observe that in java, "method signature" (within a given class) consists of the method name plus the full list of the method's parameter type declarations, the latter being representable in the first instance as an array of type declarations, e.g. an array [int, Date, boolean].  So in our relational schema for "methods" we'd have an attribute 'methodName' (say of type String), and an attribute 'signature' of a type that has to be at least semantically isomorphic to "array of type declarations".  The canonical way to turn arrays into relations is to introduce an attribute 'ordinal' and denote each array element as a tuple holding the correct value for 'ordinal' plus the array element itself, under some attribute name chosen appropriately to reflect the content :

REL{TUP{ordinal 3; typename "boolean"}TUP{ordinal 1; typename "int"}TUP{ordinal 2; typename "Date"}TUP{...}}

In tabular form :

+------------+-----------------------------------------------+
! methodname ! signature                                     !
+------------+-----------------------------------------------+
! "doStuff"  ! +---------+--------------------+              !
!            ! ! ordinal ! typename           !              !
!            ! +---------+--------------------+              !
!            ! !       3 ! "boolean"          !              !
!            ! +---------+--------------------+              !
!            ! !       1 ! "int"              !              !
!            ! +---------+--------------------+              !
!            ! !       2 ! "Date"             !              !
!            ! +---------+--------------------+              !
+------------+-----------------------------------------------+
! "doStuff"  ! +---------+--------------------+              !
!            ! ! ordinal ! typename           !              !
!            ! +---------+--------------------+              !
!            ! !       1 ! "Date"             !              !
!            ! +---------+--------------------+              !
+------------+-----------------------------------------------+


Using this structure to keep our javadoc in a relational database, it's clear the key of the relvar for methods information must be {methodname, signature}.  And thus it violates precondition (2) of Codd's normalization procedure.

(In recent java there's also the "varargs" feature that would have to be covered, but we've left that out of our example because it only adds further complication without being material to the point we're making about the normalization procedure.)
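
To make the information loss concrete, here is a sketch (in SQL, names invented) of what "normalizing the signature away" would leave us with for the two doStuff methods shown above :

CREATE TABLE METHOD_SIGNATURE_ELEMENTS (
  methodname VARCHAR(100) NOT NULL,
  ordinal    INTEGER      NOT NULL,
  typename   VARCHAR(100) NOT NULL
  -- note : no column is left that identifies which overload a row belongs to
);

INSERT INTO METHOD_SIGNATURE_ELEMENTS VALUES ('doStuff', 1, 'int');
INSERT INTO METHOD_SIGNATURE_ELEMENTS VALUES ('doStuff', 2, 'Date');
INSERT INTO METHOD_SIGNATURE_ELEMENTS VALUES ('doStuff', 3, 'boolean');
INSERT INTO METHOD_SIGNATURE_ELEMENTS VALUES ('doStuff', 1, 'Date');

These four rows are all that remains of the two overloads.  Nothing records which rows jointly constituted one signature, so it is impossible to tell whether the original was {(1 int), (2 Date), (3 boolean)} plus {(1 Date)}, or {(1 Date), (2 Date), (3 boolean)} plus {(1 int)} : the ungrouping was not information-preserving, precisely because 'signature' was part of the key.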



The final remaining bits and pieces in the 1970 paper relate to that "and others to be cited below" (and thus allegedly lay out more reasons why "normal form" is something to be striven for) :

"The simplicity of the array representation which becomes feasible when all relations are cast in normal form is not only an advantage for storage purposes but also for communication of bulk data between systems which use widely different representations of the data"

Once again we must remind ourselves of the hardware speeds of the day to put this remark in its proper context : things like serialization streams weren't used for "bulk communication" because they inherently create a need for "parsing" activity on the receiving end, and as we all know, "parsing" has always been and still is a comparatively costly operation.  40 years of Moore's law have brought that cost down to acceptable levels, so it's only rarely still an issue these days, but in the early seventies that cost was just prohibitive.  For example, these days it's typically nowhere near an issue to communicate stuff like our javadoc methods relations in the form of an XML string

<method><name>doStuff</name><signature><relation><tuple><ord>1</ord><type>Date</type></tuple></relation></signature></method>

but in Codd's day, proposing anything like this might even have been likely to get you fired.  At any rate, once again the motivation for "normal form" rests on considerations that pertain to the physical implementation level rather than the logical one.  And in this case, even those considerations have since been obsoleted by Moore's law doing its job all that time.

The final motivation presented for the normal form has to do with "the simplicity of names" :

"If the user's relational model is set up in normal form, names of items of data in the data bank can take a simpler form than would otherwise be the case. A general name would take a form such as   R (g).r.d   where R is a relational name; g is a generation identifier (optional); r is a role name (optional); d is a domain name.  Since g is needed only when several generations of a given relation exist, or are anticipated to exist, and r is needed only when the relation R has two or more domains named d, the simple form R.d will often be adequate".

This argument is merely about the ergonomics of programmers typing names in their IDE.  Once again it is important to note the time context and consider that in Codd's day, editors were deprived of the autocomplete features that modern IDE users have grown accustomed to, and as a consequence long chains of qualifiers in a name (x.y.z.p.q.r and the like) were considered programmer-unfriendly (since the programmer mostly had to type that entire name character by character) and therefore very undesirable in computing languages.  These days no one stumbles anymore over lengthy constructs such as getX().getY().getZ().getP().getQ().getR().

And while the grounds for this motivation have thus been obsoleted by time, as it were, two more things can be remarked here even so.

First, Codd notes that "the simple form R.d will often [in most cases] be adequate", acknowledging that if some domain is not unique in R, then the longer form R.r.d will be needed.  If the issue were really that important, the obvious solution was plain to see : just mandate the role name, which would ensure that the simple form R.r will ***always*** be adequate.

Second, he was thinking of this naming scheme in comparison to what he thought we'd be getting with "nonsimple domains", which he probably suspected would turn out to be things like methods.signature.ordinal or employee.jobhistory.salaryhistory.salary.  See the note above about how the thinking on such longwinded naming constructs has evolved over time.



In the subsequent section, "linguistic aspects", the following is observed as an advantage of normal form :

"The adoption of a relational model of data permits the development of a universal data sublanguage based on an applied predicate calculus. A firstorder
predicate calculus suffices if the collection of relations is in normal form."

The observation as it stands is of course correct.  The trap to stumble into is setting it up as an argument to blanket rule out "nonsimple domains".  Blanket mandating "normal form" (by providing only a data sublanguage restricted to first order) just means the RM becomes unsuitable for handling certain [classes of] problems.  You have such a problem ?  Tough luck, look elsewhere for a solution.

The right way to reason with this correct observation is of course exactly the other way round : can we simply assume that, in general, all problems will lend themselves to being modeled with only relations in normal form ?  No ?  Then we can't afford to restrict the data sublanguage offered by the DBMS to first order.

Friday, January 4, 2019

Some further analysis and observations on a statement about CHECK constraints

"It is not a-priori knowledgeable whether and which of the other rules are expressible with the CHECK syntax in a given SQL dialect"

Of course it's never answerable with full 100% accuracy, but it is possible to come pretty damn close.

(1) From the SQL standard, book 1, "Framework" :

"A table check constraint specifies a search condition. The constraint is violated if the result of the search condition is false for any row of the table ( ... irrelevant amendment for NULL cases here ...)."

Considering that the CHECK constraint is supposed to be satisfied by every single row in isolation, and that all currently present rows satisfy the CHECK because they were checked when they were inserted (and re-checked when updated), it seems we can conclude that the CHECK expression defines the condition to be satisfied by all newly inserted rows (where an UPDATE is to be considered the semantic equivalent of a "simultaneous" DELETE (of the old row value) and INSERT (of the new row value) ).  And subsequently, that SQL will execute CHECK constraints only for INSERT-type updates (to repeat, UPDATE is also an INSERT-type update because some INSERT is semantically implied - last time I'll explicitly repeat this).  This conclusion is factually wrong unless certain conditions are satisfied - see the last additional note - but it is one that the SQL engineers have indeed drawn, and hence SQL implementations indeed behave as described above.
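
A minimal illustration of that behaviour, with an invented table :

CREATE TABLE ACCOUNT (
  account_id INTEGER       NOT NULL PRIMARY KEY,
  balance    DECIMAL(12,2) NOT NULL,
  CONSTRAINT balance_not_negative CHECK (balance >= 0)
);

INSERT INTO ACCOUNT VALUES (1, 100.00);                   -- CHECKed, accepted
UPDATE ACCOUNT SET balance = -50 WHERE account_id = 1;    -- the implied INSERT of the
                                                          -- new row value is CHECKed, rejected
DELETE FROM ACCOUNT WHERE account_id = 1;                 -- no CHECK is evaluated at all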

(2) Ceri/Widom have shown ( https://pdfs.semanticscholar.org/850b/e943c3b45f32de203a277dfbf556ffbbda21.pdf - "deriving invalidating operations") that it is possible to compute, from a formally stated database constraint, the set of types of update operation that can potentially violate the constraint.  "Type of update operation" standing for : INSERT INTO X, DELETE FROM Y, ... which can be formalized as (update-type, relvar) pairs (equate "relvar" with "table" if you're not familiar with the term - it's close enough for the purposes of this post).

Meaning that if that set includes one single pair in which update-type is DELETE, then CHECK constraints alone cannot possibly suffice to enforce the database constraint (business rule) entirely.

(Aside : this is the very reason why SQL has FKs as a distinguished concept and "REFERENCES" syntax to define them.  FKs can indeed be violated by a DELETE-type operation, and there was nothing else in the language to trigger integrity-check execution from a DELETE-type operation.  So the SQL creators addressed the problem by introducing a dedicated concept that the run-time integrity check could be "attached to".  Date/Darwen have demonstrated at length that a truly relational system that supports *all* types of constraint has no need for dedicated "REFERENCES" syntax, because in their own words, "the longhand often turns out shorter than the shorthand".)
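
To illustrate the aside with invented table names : the dedicated shorthand versus what the same rule looks like as a general constraint.  (The CREATE ASSERTION form is in the SQL standard but, as far as I know, not available in any mainstream engine; it is shown purely for contrast.)

-- Invalidating operations for the rule "every order references an existing customer" :
-- (INSERT, ORDERS), (UPDATE, ORDERS), (DELETE, CUSTOMERS), (UPDATE, CUSTOMERS).

-- Shorthand : the dedicated FK concept, to which the DELETE-time check can be attached.
-- (Assumes a CUSTOMERS table with key customer_id already exists.)
CREATE TABLE ORDERS (
  order_id    INTEGER NOT NULL PRIMARY KEY,
  customer_id INTEGER NOT NULL REFERENCES CUSTOMERS (customer_id)
);

-- Longhand : the same business rule stated as a general database constraint.
CREATE ASSERTION orders_have_customers CHECK (
  NOT EXISTS (SELECT *
              FROM   ORDERS o
              WHERE  NOT EXISTS (SELECT *
                                 FROM   CUSTOMERS c
                                 WHERE  c.customer_id = o.customer_id))
);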

(3) That leaves only the database constraints that can only ever be violated by some INSERT-type update.  This splits into two cases (both are sketched right after this item) :
(3a) If the only variables in the search condition of the CHECK constraint are attribute/column values of the newly inserted tuple/row specifically (that is, the outcome of the CHECK is determined exclusively by the newly inserted tuple/row value), then in SQL this constraint will be supported by the engine, unless the engine offers no CHECK support at all (a minority, I expect).
(3b) If there is at least one variable in the search condition other than attributes/columns of the newly inserted tuple/row - something relating to other tuples/rows of the same relvar/table, or to tuples/rows of other relvars/tables altogether - then the case is addressed by SQL feature F671, "subqueries in CHECK constraints", so whether a given engine supports it is determined by whether that engine supports F671 (even to this very day, I believe none of the big dogs has it, let alone the smaller ones).
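
With invented tables : the first definition below is ordinary SQL (case 3a); the second needs F671 and will be rejected outright by engines that lack it (case 3b).

-- (3a) : the search condition uses only columns of the row being inserted.
CREATE TABLE SHIPMENT (
  shipment_id   INTEGER NOT NULL PRIMARY KEY,
  ship_date     DATE    NOT NULL,
  arrival_date  DATE    NOT NULL,
  max_weight_kg DECIMAL(8,2) NOT NULL,
  CONSTRAINT arrival_not_before_shipping CHECK (arrival_date >= ship_date)
);

-- (3b) : the search condition also refers to rows of another table (F671 territory).
CREATE TABLE SHIPMENT_LINE (
  shipment_id INTEGER NOT NULL,
  line_no     INTEGER NOT NULL,
  weight_kg   DECIMAL(8,2) NOT NULL,
  PRIMARY KEY (shipment_id, line_no),
  CONSTRAINT line_within_shipment_weight_limit CHECK (
    weight_kg <= (SELECT s.max_weight_kg
                  FROM   SHIPMENT s
                  WHERE  s.shipment_id = SHIPMENT_LINE.shipment_id))
);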

(4) (Just to give an indication of how big the percentages of each case are in a real-world db :)  The SIRA_PRISE catalog effectively contains all applicable business rules as declared constraints.  Disregarding type constraints (which "derive" directly from the declaration of a heading and which limit the set of values to those within a known data type), the counts are as follows :

Tuple constraints (single-row CHECK constraints / 1OP 2OP properties in FP parlance) :  TBC
Relvar constraints (single-table constraints / 3OP properties in FP parlance) :  TBC
Database constraints (multi-table constraints / 4OP properties in FP parlance) :  TBC
Database constraints that are DELETE-sensitive (i.e. not supportable by CHECK) :  TBC
Database constraints that are not DELETE-sensitive (i.e. supportable by CHECK) :  TBC

Relvar constraints that are keys :  TBC
Relvar constraints that are not keys :  TBC

Database constraints that are FKs :  TBC
Database constraints that are not FKs :  TBC



Additional notes :

(2)
There is some give and take, stretch if you will, in the analysis mentioned in (2).

Take for example a constraint "total amount must at all times exceed (or be equal to) 5", where "total amount" is the sum of the values in some attribute/column over some [subset of] rows in some relvar/table.  On the face of it one might figure that INSERT-type updates are never going to fail that check.  But that is only so if the rows are guaranteed never to contain negative values !  So a DBMS constraint-analysis procedure drawing that conclusion is predicated on the system being able to detect that guarantee (presumably from some other declared constraint, but potentially also from a type definition), and that's nowhere near trivial for a DBMS to do (because *this* is the point where "the DBMS genuinely understanding the data" enters the picture - for some sense of "genuinely understanding").
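
In SQL assertion form (names invented, and the assertion syntax is again standard-but-unimplemented), the example and its sensitivity to negative values look like this :

-- "Total amount must at all times be at least 5", over an assumed PAYMENT (..., amount) table :
CREATE ASSERTION total_amount_at_least_5 CHECK (
  (SELECT COALESCE(SUM(p.amount), 0) FROM PAYMENT p) >= 5
);

-- INSERT of a row with amount 10  : can never lower the total, so it cannot violate the rule;
-- INSERT of a row with amount -10 : can violate the rule, so "INSERTs are safe" only holds
--                                   if something like CHECK (amount >= 0) is also declared;
-- DELETE of any row               : can always violate the rule.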

(3)
Some designers try to bypass the absence of F671 by hiding the access to other tables in a user-defined function, so that their search condition looks more like "CHECK ( ... MY_UDF(inserted_row) ... )", which on the face of it makes this look like a (3a) case, but *semantically* of course it still isn't (a sketch follows below).  And they often find themselves trapped by unanticipated timing-and-sequence-related dependencies that the DBMS keeps for itself (rightly so, at that).  If/when that happens, that's usually when another new question appears on Stackoverflow :-) .
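
That workaround, sketched in a PostgreSQL-like dialect (all names invented) : it compiles and looks like a (3a)-style row-local check, but the function body still reads another table, with all the caveats just mentioned.

CREATE FUNCTION client_exists(p_client_id INTEGER) RETURNS BOOLEAN AS $$
  SELECT EXISTS (SELECT * FROM CLIENT c WHERE c.client_id = p_client_id);
$$ LANGUAGE SQL;

CREATE TABLE INVOICE (
  invoice_id INTEGER NOT NULL PRIMARY KEY,
  client_id  INTEGER NOT NULL,
  -- Looks row-local, but semantically it still depends on the CLIENT table;
  -- nothing re-evaluates INVOICE rows when CLIENT rows disappear, and the
  -- moment at which the function's read happens is engine-defined.
  CONSTRAINT invoice_has_client CHECK (client_exists(client_id))
);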

(3b)
Even if you have a system that supports F671, there is still an extremely sensitive transformation to be performed by the database designer.  You have one single business rule, but you have more than one relvar/table, the INSERT into each of which might violate it.  As a database designer, you're still obliged to rewrite the single business rule into the right expression to check when the INSERT happens to this table, or to that other table, and so on for each table involved (to be precise : each table in the set of (INSERT, table) pairs you got from the analysis phase in (2) - and note that this therefore requires a tool that will perform the analysis in (2) !).  These transformations are so tedious, difficult (a veritable minefield of traps) and error-prone that I bluntly wrote in my NOCOUG article on the subject [something to the effect] that "Ordinary humans can't do it, period".
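
To give a taste of that rewriting burden, take the invented tables PROJECT (project_id, budget_hours) and WORKLOG (project_id, worker_id, work_date, hours), an engine that does have F671, and the single rule "the sum of a project's logged hours must not exceed its budgeted hours".  It has to be restated once per (INSERT, table) pair :

-- Fired when a WORKLOG row is inserted :
ALTER TABLE WORKLOG ADD CONSTRAINT project_not_over_budget CHECK (
  (SELECT COALESCE(SUM(w.hours), 0)
   FROM   WORKLOG w
   WHERE  w.project_id = WORKLOG.project_id)
  <=
  (SELECT p.budget_hours
   FROM   PROJECT p
   WHERE  p.project_id = WORKLOG.project_id));

-- Fired when a PROJECT row is (re)inserted, e.g. by an UPDATE that lowers the budget :
ALTER TABLE PROJECT ADD CONSTRAINT budget_covers_logged_hours CHECK (
  budget_hours >= (SELECT COALESCE(SUM(w.hours), 0)
                   FROM   WORKLOG w
                   WHERE  w.project_id = PROJECT.project_id));

Two syntactically unrelated expressions, one business rule; scale that up to rules involving three or four tables and the "minefield" qualification above should be obvious.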

(last note)
The original text of this post also had "CHECK constraints can only ever be violated for newly INSERTed rows at the time of their insertion" as an argument leading to the conclusion that "the CHECK expression defines the condition to be satisfied by all newly inserted rows".  We already mentioned that this conclusion is factually wrong, and here's why.

Suppose a row r1 is inserted into table T1, and it satisfies the CHECK constraint.  But the CHECK constraint yielding TRUE is predicated on the existence of some particular row r2 in some table T2 (T2 may be the same as T1 or different).  That is, the CHECK constraint is at least a cross-row constraint and possibly a cross-table one.  In FP parlance, it corresponds to a 3OP or a 4OP.  Removing from T2 all rows [that make the particular predicate for r1's CHECK constraint yield TRUE] will obviously cause r1 in T1 to suddenly fail its CHECK constraint, which should obviously require the removal - the DELETE - to be rejected by the DBMS.  So it is not always true that "CHECK constraints for a row rx can only ever be violated upon insertion of that very row value rx".  It is only so if the CHECK constraint's outcome does not depend on the existence or non-existence of any other row in any table.  That is why F671 is a separate SQL feature, and moreover one that is not supported [producing correct and reliable results] by any SQL engine I know of : the analysis [of all the distinct ways in which r1's CHECK constraint outcome depends on any of the other rows involved, in any of the other tables involved, anywhere in the database] is just too hard to make, especially so in the calculus-based language that SQL is.
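
A concrete sketch of that scenario (invented tables; note that the CHECK below uses a subquery and hence already requires F671) :

CREATE TABLE T2 (
  group_id INTEGER NOT NULL PRIMARY KEY,
  is_open  CHAR(1) NOT NULL
);

CREATE TABLE T1 (
  entry_id INTEGER NOT NULL PRIMARY KEY,
  group_id INTEGER NOT NULL,
  CONSTRAINT group_must_be_open CHECK (
    EXISTS (SELECT * FROM T2
            WHERE  T2.group_id = T1.group_id AND T2.is_open = 'Y'))
);

INSERT INTO T2 VALUES (1, 'Y');
INSERT INTO T1 VALUES (10, 1);    -- r1 : its CHECK evaluates to TRUE, the row is accepted

DELETE FROM T2 WHERE group_id = 1;
-- No CHECK on T1 is re-evaluated for this DELETE, yet r1 now sits in T1
-- in violation of its own declared CHECK constraint.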