Wednesday, August 28, 2019

Codd's normal form debunked

The title says it all.  Following is a bit of commentary on what Codd wrote in section 1.4 of his 1970 paper, "A Relational Model of Data for large shared data banks".  That particular section introduces the notion of a "Normal Form", what benefits it allegedly achieves, and why it is allegedly always achievable.  The "Normal Form" introduced by Codd here, later continued its life as "First Normal Form", when functional dependency theory emerged/evolved and led to additional normal forms.



Codd begins the section with

"A relation whose domains are all simple can be represented in storage by a two-dimensional column-homogeneous array of the kind discussed above. Some more complicated data structure is necessary for a relation with one or more nonsimple domains. For this reason (and others to be cited below) the possibility of eliminating nonsimple domains appears worth investigating."

The observations Codd makes about "representation in storage" are correct, if one considers the timeframe this was set in (1970, to repeat), the vendor he was working for at the time and the kind of IT segment that vendor's customers were mainly situated in (the 'B' in IBM stands for 'Business'), and the kind of languages that reigned supreme in that particular IT segment (COBOL and PL/1 e.g.).

But "representation in storage" is clearly an implementation detail, a consideration at the physical/implementation level, and for this very reason should never ever have been given the weight Codd gives it here as an argument for "striving for normal form" (which is a concern situated at the logical level).  Basically, all he's saying here is "we must always strive for normal form because then we can keep the implementation simpler".  That is in fact highly curious, given that in the paragraph immediately preceding this whole section, Codd acknowledges the possibility of "nonsimple domains" :

"Nonatomic values can be discussed within the relational framework. Thus, some domains may have relations as elements. These relations may, in turn, be defined on nonsimple domains, and so on."

and concludes that paragraph with the observation that

"Much of the confusion in present terminology is due to failure to distinguish between ... and between components of a user model of the data on the one hand and their machine representation counterparts on the other hand".

To see that -worthwhile- observation followed by a recommendation to strive for some "normal form" in "the user model of the data" but to see that recommendation supported only by arguments regarding "machine representation counterparts" (we'll get back to that "and others to be cited below" below) is highly ironic/dubious to say the least.



The explanatory employee example with nested relations jobhistory and salaryhistory is clear and correct enough, but the really interesting part are the few paragraphs immediately following that :

"If normalization as described above is to be applicable, the unnormalized collection of relations must satisfy the following conditions:
(1) The graph of interrelationships of the nonsimple domains is a collection of trees.
(2) No primary key has a component domain which is nonsimple."

The normalization procedure Codd described is essentially based on applications of what in the TTM world has come to be known (+-) as the UNGROUP operator.  (Caveat : TTM defines UNGROUP as an algebraic operator operating on relation values and yielding other relation values, in the context of the Normal Form the operation must be seen as operating on ***relational schemas*** and yielding ***other relational schemas***.)

TTM's UNGROUP is known not to be information-preserving if one of the RVA's being UNGROUPed is part of the key of the relation subjected to UNGROUP.  (Information-preserving in the sense that there is some other operator invocation (or nesting thereof) that is guaranteed to return the original from the UNGROUP result, in all possible cases.)  Codd correctly understood that and thus correctly identifies (2) as a precondition for his normalization procedure to be applicable : you cannot "UNGROUP" or "normalize" an RVA/nonsimple domain away if the concerned attribute is part of the key.

And the most interesting part is the concluding sentence :

"The writer knows of no application which would require any relaxation of these conditions."

and this is clearly presented as an argument to the effect that "we need not worry that this is going to pose any real restriction in any real-world scenario".  (In fact, he's almost plain saying "I never heard about or saw any such application, therefore it must be that they don't exist".)  Well too bad for Codd, but this writer does "know of such applications", and here's the most compelling one :

Suppose we wanted to put all the information we now get from javadoc in a relational database.  That would e.g. provide us with a simple means to have questions answered such as "give me all the classes X that have a superclass Y and where that superclass has a static method with the same method signature as any nonstatic method of X".  (Just a matter of writing the query - try to achieve same using javadoc-generated doco as it stands right now.)

Observe that in java, "method signature" (within a given class) consists of the method name plus the total of all the method parameter type declarations, the latter being representible at first instance as an array of type declarations, e.g. an array [int, Date, boolean].  So in our relational schema for "methods" we'd have an attribute 'methodName' (say of type String), and an attribute 'signature' of a type that has to be at least semantically isomorphic to "array of type declarations".  The canonical way to turn arrays into relations is by introducing an attribute 'ordinal' and denoting each array element as a tuple with (a) the correct value for 'ordinal' and the array element itself for some attributename chosen appropriately to reflect the content :

REL{TUP{ordinal 3; typename "boolean"}TUP{ordinal 1; typename "int"}TUP{ordinal 2; typename "Date"}TUP{...}}

In tabular form :

+------------+-----------------------------------------------+
! methodname ! signature                                     !
+------------+-----------------------------------------------+
! "doStuff"  ! +---------+--------------------+              !
!            ! ! ordinal ! typename           !              !
!            ! +---------+--------------------+              !
!            ! !       3 ! "boolean"          !              !
!            ! +---------+--------------------+              !
!            ! !       1 ! "int"              !              !
!            ! +---------+--------------------+              !
!            ! !       2 ! "Date"             !              !
!            ! +---------+--------------------+              !
+------------+-----------------------------------------------+
! "doStuff"  ! +---------+--------------------+              !
!            ! ! ordinal ! typename           !              !
!            ! +---------+--------------------+              !
!            ! !       1 ! "Date"             !              !
!            ! +---------+--------------------+              !
+------------+-----------------------------------------------+


Using this structure to keep our javadoc in a relational database, it's clear the key of the relvar for methods information must be {methodname signature}.  And thus it doesn't satisfy the preconditions for Codd's normalization procedure.

(In recent java, there's also this "varargs" feature that would have to be covered, but we've made abstraction of that in our example because it only gives rise to further complication without being material to the point we're making about the normalization procedure.)



The final remaining bits and pieces in the 1970 paper relate to that "and others cited below" (and thus allegedly lay out more reasons why "normal form" is something to be strived for) :

"The simplicity of the array representation which becomes feasible when all relations are cast in normal form is not only an advantage for storage purposes but also for communication of bulk data between systems which use widely different representations of the data"

Once again we must remind ourselves of the hardware speeds of the day to put this remark in its proper context : things like serialization streams weren't used for "bulk communication" because it inherently creates a need for "parsing" activity on the receiving end, and as we all know, "parsing" has always been and is still a comparatively costly operation.  40 years of Moore's law has brought that cost down to acceptable levels by now, so it's only rarely still an issue these days, but in the early seventies that cost was just prohibitive.  For example, these days it's typically nowhere near an issue to communicate stuff like our javadoc methods relations in the form of an XML string

<method><name>doStuff</name><signature><relation><tuple><ord>1</ord><type>Date</type></tuple></relation></signature></method>

but in Codd's day proposing anything like this might even have been likely to get you fired.  At any rate, once again the motivation for "normal form" is considerations that pertain to the physical implementation level rather than the logical.  And in this case, even those considerations are now obsoleted by Moore's law doing its job all that time.

The final motivation presented for the normal form has to do with "the simplicity of names" :

"If the user's relational model is set up in normal form, names of items of data in the data bank can take a simpler form than would otherwise be the case. A general name would take a form such as   R (g).r.d   where R is a relational name; g is a generation identifier (optional); r is a role name (optional); d is a domain name.  Since g is needed only when several generations of a given relation exist, or are anticipated to exist, and r is needed only when the relation R has two or more domains named d, the simple form R.d will often be adequate".

This argument is merely about the ergonomics of programmers typing names in their IDE.  Once again it is important to note the time context and consider that in Codd's day, editors were deprived of the autocomplete features that modern IDE users have grown accustomed to, and as a consequence long chains of qualifiers in a name (x.y.z.p.q.r and the like) were considered programmer-unfriendly (since the programmer mostly had to type that entire name character by character) and therefore very undesirable in computing languages.  These days no one stumbles anymore over lengthy constructs such as getX().getY().getZ().getP().getQ().getR().

And while the grounds for this motivation have thus been obsoleted by time, as it were, even then two more things can be remarked here.

First, Codd notes that "the simple form R.d will often [in most cases] be adequate", acknowledging that if some domain is not unique in R, then the longer form R.r.d will be needed.  If the issue was really that important, the obvious solution was plain to see : just mandate the role name, which would ensure that "the simple form R.r" will ***always*** be adequate.

Second, he was thinking of this name usage scheme in comparison to what he thought we'd be getting with "nonsimple domains", which he probably suspected would turn out to things like methods.signature.ordinal or employee.jobhistory.salaryhistory.salary.  See the note about how the thinking about such longwinded naming constructs has evolved over time.



In the subsequent section, "linguistic aspects", the following is observed as an advantage of normal form :

"The adoption of a relational model of data permits the development of a universal data sublanguage based on an applied predicate calculus. A firstorder
predicate calculus suffices if the collection of relations is in normal form."

The observation as it stands is of course correct.  The trap to stumble into is in setting it up as an argument to blanket rule out "nonsimple domains".  Blanket mandating "normal form" (in the form of providing only a data sublanguage restricted to first-order) just means the RM becomes unsuitable for handling certain [classes of] problems.  You have such a problem ?  Tough luck, look elsewhere for a solution.

The right way to reason with this correct observation is of course exactly the other way round : can we simply assume that in general, all problems will lend themselves to be modeled with only relations in normal form ?  No ?  Then we can't afford to restrict the data sublanguage offered by the DBMS to first order.