Wednesday, August 28, 2019

Codd's normal form debunked

The title says it all.  Following is a bit of commentary on what Codd wrote in section 1.4 of his 1970 paper, "A Relational Model of Data for large shared data banks".  That particular section introduces the notion of a "Normal Form", what benefits it allegedly achieves, and why it is allegedly always achievable.  The "Normal Form" introduced by Codd here, later continued its life as "First Normal Form", when functional dependency theory emerged/evolved and led to additional normal forms.

Codd begins the section with

"A relation whose domains are all simple can be represented in storage by a two-dimensional column-homogeneous array of the kind discussed above. Some more complicated data structure is necessary for a relation with one or more nonsimple domains. For this reason (and others to be cited below) the possibility of eliminating nonsimple domains appears worth investigating."

The observations Codd makes about "representation in storage" are correct, if one considers the timeframe this was set in (1970, to repeat), the vendor he was working for at the time and the kind of IT segment that vendor's customers were mainly situated in (the 'B' in IBM stands for 'Business'), and the kind of languages that reigned supreme in that particular IT segment (COBOL and PL/1 e.g.).

But "representation in storage" is clearly an implementation detail, a consideration at the physical/implementation level, and for this very reason should never ever have been given the weight Codd gives it here as an argument for "striving for normal form" (which is a concern situated at the logical level).  Basically, all he's saying here is "we must always strive for normal form because then we can keep the implementation simpler".  That is in fact highly curious, given that in the paragraph immediately preceding this whole section, Codd acknowledges the possibility of "nonsimple domains" :

"Nonatomic values can be discussed within the relational framework. Thus, some domains may have relations as elements. These relations may, in turn, be defined on nonsimple domains, and so on."

and concludes that paragraph with the observation that

"Much of the confusion in present terminology is due to failure to distinguish between ... and between components of a user model of the data on the one hand and their machine representation counterparts on the other hand".

To see that -worthwhile- observation followed by a recommendation to strive for some "normal form" in "the user model of the data" but to see that recommendation supported only by arguments regarding "machine representation counterparts" (we'll get back to that "and others to be cited below" below) is highly ironic/dubious to say the least.

The explanatory employee example with nested relations jobhistory and salaryhistory is clear and correct enough, but the really interesting part are the few paragraphs immediately following that :

"If normalization as described above is to be applicable, the unnormalized collection of relations must satisfy the following conditions:
(1) The graph of interrelationships of the nonsimple domains is a collection of trees.
(2) No primary key has a component domain which is nonsimple."

The normalization procedure Codd described is essentially based on applications of what in the TTM world has come to be known (+-) as the UNGROUP operator.  (Caveat : TTM defines UNGROUP as an algebraic operator operating on relation values and yielding other relation values, in the context of the Normal Form the operation must be seen as operating on ***relational schemas*** and yielding ***other relational schemas***.)

TTM's UNGROUP is known not to be information-preserving if one of the RVA's being UNGROUPed is part of the key of the relation subjected to UNGROUP.  (Information-preserving in the sense that there is some other operator invocation (or nesting thereof) that is guaranteed to return the original from the UNGROUP result, in all possible cases.)  Codd correctly understood that and thus correctly identifies (2) as a precondition for his normalization procedure to be applicable : you cannot "UNGROUP" or "normalize" an RVA/nonsimple domain away if the concerned attribute is part of the key.

And the most interesting part is the concluding sentence :

"The writer knows of no application which would require any relaxation of these conditions."

and this is clearly presented as an argument to the effect that "we need not worry that this is going to pose any real restriction in any real-world scenario".  (In fact, he's almost plain saying "I never heard about or saw any such application, therefore it must be that they don't exist".)  Well too bad for Codd, but this writer does "know of such applications", and here's the most compelling one :

Suppose we wanted to put all the information we now get from javadoc in a relational database.  That would e.g. provide us with a simple means to have questions answered such as "give me all the classes X that have a superclass Y and where that superclass has a static method with the same method signature as any nonstatic method of X".  (Just a matter of writing the query - try to achieve same using javadoc-generated doco as it stands right now.)

Observe that in java, "method signature" (within a given class) consists of the method name plus the total of all the method parameter type declarations, the latter being representible at first instance as an array of type declarations, e.g. an array [int, Date, boolean].  So in our relational schema for "methods" we'd have an attribute 'methodName' (say of type String), and an attribute 'signature' of a type that has to be at least semantically isomorphic to "array of type declarations".  The canonical way to turn arrays into relations is by introducing an attribute 'ordinal' and denoting each array element as a tuple with (a) the correct value for 'ordinal' and the array element itself for some attributename chosen appropriately to reflect the content :

REL{TUP{ordinal 3; typename "boolean"}TUP{ordinal 1; typename "int"}TUP{ordinal 2; typename "Date"}TUP{...}}

In tabular form :

! methodname ! signature                                     !
! "doStuff"  ! +---------+--------------------+              !
!            ! ! ordinal ! typename           !              !
!            ! +---------+--------------------+              !
!            ! !       3 ! "boolean"          !              !
!            ! +---------+--------------------+              !
!            ! !       1 ! "int"              !              !
!            ! +---------+--------------------+              !
!            ! !       2 ! "Date"             !              !
!            ! +---------+--------------------+              !
! "doStuff"  ! +---------+--------------------+              !
!            ! ! ordinal ! typename           !              !
!            ! +---------+--------------------+              !
!            ! !       1 ! "Date"             !              !
!            ! +---------+--------------------+              !

Using this structure to keep our javadoc in a relational database, it's clear the key of the relvar for methods information must be {methodname signature}.  And thus it doesn't satisfy the preconditions for Codd's normalization procedure.

(In recent java, there's also this "varargs" feature that would have to be covered, but we've made abstraction of that in our example because it only gives rise to further complication without being material to the point we're making about the normalization procedure.)

The final remaining bits and pieces in the 1970 paper relate to that "and others cited below" (and thus allegedly lay out more reasons why "normal form" is something to be strived for) :

"The simplicity of the array representation which becomes feasible when all relations are cast in normal form is not only an advantage for storage purposes but also for communication of bulk data between systems which use widely different representations of the data"

Once again we must remind ourselves of the hardware speeds of the day to put this remark in its proper context : things like serialization streams weren't used for "bulk communication" because it inherently creates a need for "parsing" activity on the receiving end, and as we all know, "parsing" has always been and is still a comparatively costly operation.  40 years of Moore's law has brought that cost down to acceptable levels by now, so it's only rarely still an issue these days, but in the early seventies that cost was just prohibitive.  For example, these days it's typically nowhere near an issue to communicate stuff like our javadoc methods relations in the form of an XML string


but in Codd's day proposing anything like this might even have been likely to get you fired.  At any rate, once again the motivation for "normal form" is considerations that pertain to the physical implementation level rather than the logical.  And in this case, even those considerations are now obsoleted by Moore's law doing its job all that time.

The final motivation presented for the normal form has to do with "the simplicity of names" :

"If the user's relational model is set up in normal form, names of items of data in the data bank can take a simpler form than would otherwise be the case. A general name would take a form such as   R (g).r.d   where R is a relational name; g is a generation identifier (optional); r is a role name (optional); d is a domain name.  Since g is needed only when several generations of a given relation exist, or are anticipated to exist, and r is needed only when the relation R has two or more domains named d, the simple form R.d will often be adequate".

This argument is merely about the ergonomics of programmers typing names in their IDE.  Once again it is important to note the time context and consider that in Codd's day, editors were deprived of the autocomplete features that modern IDE users have grown accustomed to, and as a consequence long chains of qualifiers in a name (x.y.z.p.q.r and the like) were considered programmer-unfriendly (since the programmer mostly had to type that entire name character by character) and therefore very undesirable in computing languages.  These days no one stumbles anymore over lengthy constructs such as getX().getY().getZ().getP().getQ().getR().

And while the grounds for this motivation have thus been obsoleted by time, as it were, even then two more things can be remarked here.

First, Codd notes that "the simple form R.d will often [in most cases] be adequate", acknowledging that if some domain is not unique in R, then the longer form R.r.d will be needed.  If the issue was really that important, the obvious solution was plain to see : just mandate the role name, which would ensure that "the simple form R.r" will ***always*** be adequate.

Second, he was thinking of this name usage scheme in comparison to what he thought we'd be getting with "nonsimple domains", which he probably suspected would turn out to things like methods.signature.ordinal or employee.jobhistory.salaryhistory.salary.  See the note about how the thinking about such longwinded naming constructs has evolved over time.

In the subsequent section, "linguistic aspects", the following is observed as an advantage of normal form :

"The adoption of a relational model of data permits the development of a universal data sublanguage based on an applied predicate calculus. A firstorder
predicate calculus suffices if the collection of relations is in normal form."

The observation as it stands is of course correct.  The trap to stumble into is in setting it up as an argument to blanket rule out "nonsimple domains".  Blanket mandating "normal form" (in the form of providing only a data sublanguage restricted to first-order) just means the RM becomes unsuitable for handling certain [classes of] problems.  You have such a problem ?  Tough luck, look elsewhere for a solution.

The right way to reason with this correct observation is of course exactly the other way round : can we simply assume that in general, all problems will lend themselves to be modeled with only relations in normal form ?  No ?  Then we can't afford to restrict the data sublanguage offered by the DBMS to first order.

Friday, January 4, 2019

Some further analysis and observations on a statement about CHECK constraints

"It is not a-priori knowledgeable whether and which of the other rules are expressible with the CHECK syntax in a given SQL dialect"

Of course it's never answerable with full 100% accuracy, but it is possible to come pretty damn close.

(1) From the SQL standard, book 1, "Framework" :

"A table check constraint specifies a search condition. The constraint is violated if the result of the search condition is false for any row of the table ( ... irrelevant amendment for NULL cases here ...)."

Considering that the CHECK constraint is supposed to be satisfied by every single row in isolation,
that all currently present rows satisfy the CHECK because they have been CHECKed when they were inserted (and re-checked when updated),

it seems we can conclude that the CHECK expression defines the condition to be satisfied by all newly inserted rows (where an UPDATE is to be considered the semantic equivalent of a "simultaneous" DELETE (of the old row value) and INSERT (of the new row value) ).  And subsequently, that SQL will execute CHECK constraints only for INSERT-type updates (to repeat, UPDATE is also an INSERT-type update because some INSERT is semantically implied - last time I'll explicitly repeat this).   This conclusion is factually wrong unless certain conditions are satisfied - see last additional note - but it is one that the SQL engineers have indeed drawn and hence SQL implementations indeed behave as described above.

(2) Ceri/Widom have shown ( - "deriving invalidating operations") that it is possible to compute from a formally stated database constraint, the set of types of update operation that can potentially violate the constraint.  "Type of update operation" standing for : INSERT INTO X, DELETE FROM Y, ... which can be formalized as (update-type, relvar) pairs (equate "relvar" with "table" if you're not familiar with the term - it's close enough for the purposes of this post).

Meaning that if that set includes one single pair in which update-type is DELETE, then CHECK constraints alone cannot possibly suffice to enforce the database constraint (business rule) entirely.

(Aside : this is the very reason why SQL has FKs as a distinguished concept and "REFERENCES" syntax to define them.  FKs can indeed be violated by a DELETE-type operation and there was nothing else in the language to fire integrity checks execution from a DELETE-type operation.  So the SQL creators addressed the problem by introducing a dedicated concept that the run-time integrity check could be "attached to".  Date/Darwen have demonstrated at length that a truly relational system that supports *all* types of constraint has no need for dedicated "REFERENCES" syntax, because in their own words, "the longhand often turns out shorter than the shorthand".)

(3) That leaves only the database constraints which can only ever be violated by some INSERT-type update.  This is to be split in two cases :
(3a) the only variables in the search condition in the CHECK constraint, are attribute/column values of the newly inserted tuple/row specifically (that is, the outcome of the CHECK is determined by the newly inserted tuple/row value exclusively), then in SQL this constraint will be supported by the engine, unless the engine does not offer CHECK support altogether (a minority I expect).
(3b) there is at least one variable in the search condition other than just attributes/columns from the newly inserted tuple/row.  Something related to other tuples/rows from the same relvar/table, or tuples/rows from other relvars/tables altogether.  These cases are addressed by SQL feature F671, "subqueries in CHECK constraints" so whether your given engine supports these cases is determined by whether your given engine supports SQL standard feature F671 (even to this very day, I believe none of the big dogs has it let alone the smaller ones).

(4) (Just to give an indication of how big the percentages of each case are in a real-world db :)  The SIRA_PRISE catalog effectively contains all applicable business rules as a declared constraint.  Disregarding type constraints (which "derive" directly from the declaration of a heading and which limits the set of values to those within a known data type), the counts are as follows :

Tuple constraints (single-row CHECK constraints / 1OP 2OP properties in FP parlance) :  TBC
Relvar constraints (single-table constraints / 3OP properties in FP parlance) :  TBC
Database constraints (multi-table constraints / 4OP properties in FP parlance) :  TBC
Database constraints that are DELETE-sensitive (i.e. not supportable by CHECK) :  TBC
Database constraints that are not DELETE-sensitive (i.e. supportable by CHECK) :  TBC

Relvar constraints that are keys :  TBC
Relvar constraints that are not keys :  TBC

Database constraints that are FKs :  TBC
Database constraints that are not FKs :  TBC

Additional notes :

There is some give and take, stretch if you will, in the analysis mentioned in (2).

Take for example a constraint "total amount must at all times exceed (or be equal to) 5".  Where "total amount" is the summation of the values in some attribute/column of some [subset of] rows in some relvar/table.  On the face of it one might figure that INSERT-type updates are never going to fail that check.  But that is only so if the rows are guaranteed never to contain negative values !  So the DBMS' constraint analysis procedure taking this conclusion is predicated on the system being able to detect that guarantee (presumably from some other declared constraint, but potentially also from a type definition), and that's nowhere near trivial for a DBMS to do (because *this* is the point where "the DBMS genuinely understanding the data" enters the picture - for some sense of "genuinely understanding").

Some designers try to bypass the absence of F671 by hiding the access to other tables in a user-defined function so that their search condition looks more like "WHERE ... MY_UDF(inserted_row) ..." which on the face of it makes things look like this is a (3a) case, but *semantically* of course it still isn't.  And they often find themselves trapped in / caught by unanticipated timing-and-sequence-related dependencies that the DBMS keeps for itself (rightly so at that).  If/when that happens that's usually when another new question appears on Stackoverflow :-) .

Even if you have a system that supports F671, there is still extremely sensitive transformation to be performed by the database designer.  You have one single business rule, but you have >1 relvar/table, the INSERT to each of which might violate it.  As a database designer, you're still obliged to rewrite the single business rule into the right expression to check for when the INSERT happens to this table, or to that other table, and so on for each table involved (to be precise : each table in the set of (INSERT, table) pairs you got from the analysis phase in (2) - and note that this therefore requires a tool that will perform that analysis in (2) !).  These transformations are so tedious, difficult (literally a minefield full of traps) and error-prone that I bluntly wrote in my NOCOUG article on the subject [something to the effect] that "Ordinary humans can't do it, period".

(last note)
The original text of this post also had "CHECK constraints can only ever be violated for newly INSERTed rows at the time of their insertion" as an argument leading to the conclusion that "the CHECK expression defines the condition to be satisfied by all newly inserted rows".  We already mentioned that this conclusion is factually wrong, and here's why.

Suppose a row r1 is inserted in table T1, and it satisfies the CHECK constraint.  But the CHECK constraint yielding TRUE is predicated/dependent on the existence of some particular row r2 in some table T2 (T2 may be the same as T1 or different).  That is, the CHECK constraint is a least a cross-row constraint and possibly a cross-table one.  In FP parlance, it corresponds to a 3OP or a 4OP.  Removing all rows from T2 [that make the particular predicate for r1's CHECK constraint yield TRUE] will obviously cause r1 in T1 to suddenly fail its CHECK constraint, which should obviously require the removal -the DELETE- to be rejected by the DBMS.  So it is not always true that "CHECK constraints for a row rx can only ever be violated upon insertion of that very row value rx".  This is only so if the CHECK constraint's outcome does not depend on the either-or-not existence of any other row in any table.  That is why F671 is a separate SQL feature, and moreover one that is not supported [producing correct and reliable results] by any SQL engine I know : the analysis [of all the distinct ways in which r1's CHECK constraint outcome is dependent on any of the involved other rows in any of the involved other tables in the database] is just too hard to make, especially so in the calculus-based language that SQL is.

Wednesday, August 8, 2018

Why mandating primary keys is a mistake in defining the RM and in RDBMS design.

The relational model of data was intended as a general-purpose model.  General-purpose in there means : suitable for addressing every conceivable business scenario and/or subject matter and/or problem type.  Stress : _every conceivable_.

Here's a relation schema plus FDs :

(A1, A2) with {A1->A2 , A2->A1}

That's perfectly conceivable.  Seasoned modelers will recognize this as a "bidirectional translation table".  Stress _bidirectional_.  Meaning that there *will* be users wanting to query this relation to find A2 value given an A1 value, and there *will* be users wanting to query this relation to find A1 value given an A2 value.  Perfectly conceivable.

Normalization theory informs us that the set of keys for this relation schema is {{A1} {A2}}.

If the RM wants to meet its stated objective of being "general-purpose", it must sensibly support this use case.

Now, either you believe that some key being "primary" is absolutely foundational and MUST be a part of the data model and therefore MUST be an aspect of any logical database design and therefore any DBMS *must force* the designer to make certain choices about this.  And then the consequence is that those involved in [studying] the business of deriving logical database models from conceptual [business] models MUST [also be able to] provide an "algorithm" and/or list of checkpoints or some such that will allow a designer to at least make this choice on well-founded grounds.  If such well-founded grounds do not exist, then any choice is obviously entirely arbitrary, meaning the choice itself does not carry any real meaning, meaning the choice shouldn't have to be made.  Any and all supporters of the idea of mandating primary keys are invited to state their solution/approach for my (A1, A2) case.

Or you believe that "primality" of a key is meaningless and irrelevant and then, well, simply no one has any problem.  Just declare all the keys there are and use any one that suits your purposes as the identifier needed for the business use case at hand.  Observe that if "primality" of a key is not meaningless, then it must be possible to find some aspect of the behaviour of software/data systems that can be supported by "systems-with-primary-keys" but cannot be supported by systems without.  That is, if "primality" is meaningful, there must be some added value somewhere for the software/data system.  If that added value exists, it can be demonstrated.  I have never seen it.  And until it is demonstrated, the only reasonable/rational option is to treat the "primary" in "primary key" as the mere psychological distraction that it is.

Tuesday, May 15, 2018

Why 3VL is unusable in computing for humans.

The following post was triggered by the discussion at and in particular, by the response I got when I made the observation that 3VL has 19683 distinct binary logical operators.  I'll quote the relevant portion of that response here for purposes of retaining context :

Could it be that just as with the 16 binary operators, many of which have relations to one another (e.g. inverses and complements, among others) that the trinary operators could fall into similar groups, which, making the 3^9 number you mentioned seem a whole lot less complex? Could that be why it's neither necessary nor customary to work with all the operators in either sort of logic?

Well yes, they do, and I was already very much aware of that when I wrote my message that this was in reply to, but I wasn't aware of the actual numbers.  So I decided to go do the maths (note in passing that the commenter could have chosen to do that just as well, and that actually without effectively doing that, whatever he says doesn't even constitute an argument but stays at the level of gratuitious handwaving, which is often the only thing such commenters are capable of) and here are the results of that exercise.

First of all, let's inspect in some deeper detail how come there are 16 binary logical operators but when we're asked to sum them op we often get no farther than AND,OR, euhhhhhhhhhhhhhh implication ?

In order to answer that I first want to look at the monadic operators in 2VL.  There are in total 4 such "theoretically possible" operators (I'll name them "T", "F", "I" and "N" respectively) :

IN \ OPER  ! T ! F ! I ! N !
T          ! T ! F ! T ! F !
F          ! T ! F ! F ! T !

The operator named "T" returns 'true' no matter what, i.e. regardless of its input.  Likewise for the operator named "F" which returns 'false' no matter what.  At least from a programming language point of view, there is not much point in actually having these operators in the language, since invoking these is the equivalent of writing the corresponding literal.  So these can certainly be eliminated if all we want to retain is the set of "useful" operators.

The remaining two, "I" and "N" have the property that they constitute *permutations* of the applicable set of truth values : each truth value gets mapped to *some* truth value and no two truth values get mapped to the same truth value.  So these two are both total functions that are also bijections.

But of these two, "I" is not terribly useful either, because it constitutes the identity mapping : each value just gets mapped onto itself.  In programming language terms, there would be little point in ever writing code such as I(<bool xpr>) if we can just as well just write <bool xpr>.

So of these four theoretically possible monadic logical 2VL operators, exactly and only one is actually useful : the one we know as "NOT".  And we'll be applying this one in the next step.

Now onto the binary logical 2VL operators.

As already stated, theoretically 16 such "operators" are possible :

IN \ OPER  ! 1 ! 2 ! 3 ! 4 ! 5 ! 6 ! 7 ! 8 ! 9 ! A ! B ! C ! D ! E ! F ! 0 !
T,T        ! T ! T ! T ! T ! T ! T ! T ! T ! F ! F ! F ! F ! F ! F ! F ! F !
T,F        ! T ! T ! T ! T ! F ! F ! F ! F ! F ! F ! F ! F ! T ! T ! T ! T !
F,T        ! T ! T ! F ! F ! T ! T ! F ! F ! F ! F ! T ! T ! F ! F ! T ! T !
F,F        ! T ! F ! T ! F ! T ! F ! T ! F ! F ! T ! F ! T ! F ! T ! F ! T !

Now 16 names is already getting a bit much to remember all of them, so we go looking for ways to reduce this set of 16 operators to a smaller one that is manageable to remember.

As we did with the monadic operators, there are operators like monadic "T" and "F" to be eliminated (binary "1" and "9").  There are also those who "just retain the value of the first IN argument" and "just retain the value of the second IN argument" (binary "4" and "6").

But let's first try something else.  One column is one operator definition.  Let's consider column named "8" (the one we know as "AND").  We could characterize this one as "TFFF" (the result values for the 4 possible input combinations chained together).

Using a "useful monadic operator", we could conceive an operation of "applying the monadic operator to this binary operator characterization" so that from "TFFF" we obtain "FTTT", and that's a characterization for some other operator (the one listed under "0").

In fact, that operation could be carried out for each of the operator characterizations "1"-"8", and we'd obtain one from "9"-"0" in each case.  What this means in language terms is that an invocation of "0" on argument values (x,y) is demonstrably equivalent to invoking "8" on argument values (x,y) and then invoking monadic "N" on that.  (I'm carefully avoiding using NOT here too much because it will matter when we get to the 3VL counterparts).

So we are able to reduce the set of 16 to the set of just "1"-"8" by observing that we can achieve the effects of "9"-"0" also by just using an extra invocation of monadic "N".

And only now is the point where we wish to eliminate the "degenerate" operators that just return a fixed value or just one of its input arguments, unchanged.  We then retain the following 5 binary operators :

IN \ OPER  ! 2 ! 3 ! 5 ! 7 ! 8 !
T,T        ! T ! T ! T ! T ! T !
T,F        ! T ! T ! F ! F ! F !
F,T        ! T ! F ! T ! F ! F !
F,F        ! F ! T ! T ! T ! F !

We immediately recognize columns "2" and "8" as being the ones commonly known as "OR" and "AND" respectively.  But those are typically the *only* two we think of readily and immediately upon seeing the term "binary logical operator".  So what about the other three ?

First, column "7".  Upon inspection, we can see that this is in fact the definition for an operator that could be labeled "boolean equality" : it returns true iff the two input arguments are the same.  We don't usually think of that one as a "logical operator" (and in programming the need doesn't arise all that often for comparing boolean values for equality/being the same) but mathematically it is in fact very much so.  A slightly different light is shed on the situation if we consider the "negated" version, which is "logical inequality", so to speak, which is more commonly known as XOR.  That one *does* get included in some languages as a primitive !  So in fact column "7" reminds us of a useful logical binary operator we often instinctively tend to "forget".

Next, column "5".  Upon inspection, we can see that this is in fact the definition for the operator usually labeled "[material] implication".  Ah yes, the one we usually write as "not(x) or y".  Well, fair enough.  There is an odd thing about material implication ("how can an implication be true if its antecedent is false meaning it can't be tested ?") that probably prohibits syntax such as "X IMPLIES Y" or "IMPLIES(X,Y)" to be equally self-documenting/self-explaining as "X AND Y" or "AND(X,Y)".  Fair enough.

Lastly, column "3".  Here, we can see that it is in some sense "symmetric" with column 5 in that for all x,y : "3"(x,y) === "5"(y,x).  That is, in language terms, if we need to invoke operator "3" we can also achieve that effect by invoking operator "5" and just swapping the arguments.  So usually we don't bother to give this operator its own name (e.g. "IMPLIEDBY") and just let programmers invoke the other one, either through its assigned name (e.g. "IMPLIES") or its equivalent NOT/OR combo.

This ends our survey of how large sets of logical operators are reduced to a much smaller, more manageable set with a very limited number of names to remember.

We will *definitely* need that when we switch to 3VL.

In 3VL, when considering all the theoretically possible monadic operators, we end up having 27 (!!!) of those :

IN \ OPER  ! 0 ! 1 ! 2 ! 3 ! 4 ! 5 ! 6 ! 7 ! 8 ! ... ! 26 !
T          ! T ! T ! T ! T ! T ! T ! T ! T ! T !     !  F !
U          ! T ! T ! T ! U ! U ! U ! F ! F ! F !     !  F !
F          ! T ! U ! F ! T ! U ! F ! T ! U ! F !     !  F !

Of these, just like in 2VL, the ones can be discarded that return T, U and F regardless of the input, already reducing the set to be considered to 24 (!) pieces.

Of these, there are 6 that have a similar characteristic as the two ones that were remaining in 2VL : namely that they constitute permutations (total functions that are bijective) on the set of applicable truth values.

Note that this does not mean that the remaining 18 ones are insignificant or irrelevant.
For example, looking at column "8", we see that this is the definition for an operator that could be named TREAT_U_AS_F, which is the operator that is effectively applied by SQL (tacitly) to any predicate that appears in an SQL WHERE clause (if SQL finds a WHERE clause that evaluates to 'unknown', it will *not* include the row in the result set).
For another example, looking at column "2", we see that this is the definition for an operator that could be named TREAT_U_AS_T, which is the operator that is effectively applied by SQL (tacitly) to any predicate that appears in an SQL CHECK clause (if SQL finds a CHECK clause that evaluates to 'unknown', it *will* consider that CHECK constraint as "satisfied" and not reject an update on behalf of *that* CHECK constraint).
There are plenty of those.  For example, SQL "IS NULL" is the operator that (+-) maps U to T and both T and F to F.  "+-" because SQL "IS NULL" also applies to scalars and here we're purely talking truth values.

Anyway, in a first step and for our present purposes we will only consider the 6 permutation-operators (I've labeled four of them "A"-"D" because I am too lazy to figure out what their number would have been in the "0"-"26" scheme) :

IN \ OPER  ! 5 ! 7 ! A ! B ! C ! D !
T          ! T ! T ! U ! U ! F ! F !
U          ! U ! F ! T ! F ! U ! T !
F          ! F ! U ! F ! T ! T ! U !

Of these, column "C" is the one that is usually defined as the 3VL equivalent of (2VL) "NOT", for the "attractive" property that for all x, "C"("C"(x)) === x which is then the 3VL counterpart of the 2VL tautology NOT(NOT(x)) === x.  (But note that operators "7" and "A" have this property too.)

Now in 2VL at this point we were only left with 1 single operator.  So there was no question of whether that set could be further reduced.  We are not that lucky here, and we would certainly like to not have to continue with 6 distinct names (and corresponding definitions) to remember and this just for the monadic operators alone.  Remember that in 2VL we ended up with a set of at most 5 (NOT OR AND XOR IMPLIES) and in 3VL at this point we already have a larger number than those.

More unfortunately, the operators corresponding to the columns "C", "7" and "A" having the property just described (of "reverting their own effect" just like 2VL NOT does), it means that for all these three operators, no amount of chaining/nesting it will ever make us end up with the operator definition for any of the 4 others (excluding here column "5", which is the identity operator at hand, and which I'll refer to as "I" from this point on).

So if we want to reduce this set of 6 operators by showing that some of them are equivalent to some chaining of invocations of some of the others, we need to look at "B" and "D".

It turns out that
"D"("D"(x)) === "B"(x)
"B"("B"(x)) === "D"(x)
"D"("D"("D"(x))) === "I"(x)
"B"("B"("B"(x))) === "I"(x)
 And no chaining/nesting of these ever ends up at the definition for either of "7", "A" and "C".  That makes sense if one observes that both "B" and "D" "cycle through" the values in a F->U->T->F or T->U->F->T sense, and the three others "swap two values while preserving the third".

For readability, we'll introduce the names "PROM" for "D" - because it "promotes" F to U, U to T and T round to F - and "DEM" for "B" - because it "demotes" truth values in a similar sense.  Likewise, we'll refer to the monadic operator "C" as NOT.

These observations show that a minimal set to express all of these six monadic 3VL operators must at least consist of one of {PROM DEM} and NOT (or one of the "7" and "A" operators, but that choice would be highly questionable from an intuitiveness point of view).  We have not proved that the set {PROM NOT} suffices to arrive at the operators "7" and "A" but we believe this to be the case and we proceed on the mere conjecture of this.

Now here's an interesting experiment.  Given any random chaining/nesting of invocations of PROM/NOT, point in the table with the 6 operator definitions which particular one the nesting is equivalent to.  Do this *WITHIN THE SAME TIMEFRAME* that you would need to just count the NOTs and tell whether it is odd or even (amounting to NOT or identity).  Do you think *ANYONE* would be capable of this ?  I don't.  Just try it.


Timed it ?  And it can get worse.  We already observed that the 6 permutation operators are not the only _relevant_ ones (for being used in truth value computations).  Do the same exercise in the 27-column table for


Which one of number 0-26 is it ?  If one is operating in an environment where these operators can *all of them* actually be used, it seems relatively important to be just as proficient in this as in just counting NOTs and assessing whether the count is even or odd.  But that's a tall order, and I believe it to be beyond any of us humans.

This should already be giving you a start of a sense why 3VL is, actually, totally unmasterable for the human mind.  But it gets still worse.  3VL too has binary operators.

19683 of them, theoretically speaking, to be precise.  I'm not going to provide them all in tabular form.  And if there is any set we want to see reduced for memorizability, it's this one.

Well we sure can.  There's the three "degenerate" operators that always return T, U, F, respectively, regardless of the inputs.  Discard those from the set to retain 19680.  (BTW Reason I'm doing this now is that I can't reduce these three operators to half a case by applying the six distinct monadic permutation operators to their TTTTTTTTT, UUUUUUUUU, FFFFFFFFF characterizations.

To these remaining 19680, apply the six distinct characterization transforms to have only 3560 left, hoping that the six distinct characterization transforms applied to these 3560 will effectively cover all of the 19680 operators there are (this has *NOT* been proved).

Subsequently, inspect that set of 3560 for individual pairs that expose characteristics of "symmetry" just like IMPLIES/IMPLIEDBY did in 2VL.  (Commutative operators (like AND, OR and EQV in 2VL) are "symmetric" in this sense to themselves, so no reduction opportunities there).

I'll have to take a guess at the actual number left here, but the next task is then to assign useful and meaningful names to the remaining 2- to 3000.  And hope memorizing them all will be doable for anyone having to study it.

So at this point I'm just going to be charitable to my critic.  Yes that set of 19683 operators can be significantly reduced.  By somewhere roundabouts 84%, which is even a higher percentage than the roundabouts 68% we could achieve in 2VL.  So I'm going to let him study/invent 3000 operator names while I'm building useful stuff using {AND OR NOT XOR IMPLIES}.

Friday, January 12, 2018

Afterthoughts on a data architects meetup

Visited a meetup of data architects yesterday. Main topic for me was the presentation with thoughts on our practices of data modeling, provokingly presented under the title “data modeling must die”. It was a very good talk. It defended ideas that have been mine as well for as long as I can remember. However this post is about a point of disagreement. And another one.
Disagreement 1.
It was claimed that when Codd invented the relational model of data, he also made some serious mistakes. Fair enough, he has. (It may have been the case that many of those mistakes actually only crept in during the later years for reasons and circumstances that were more political than anything else, and that early Codd was even “purer” than the fiercest relational fundamentalist still walking around these days, but that’s another discussion.)
But the mistake being referred to was “inventing the relational model of data on an island”, by which it was meant that his “mistake” was to invent the RM in isolation from other phases of the process of data systems development, such as conceptual modeling.
True, the inventing happened in isolation. But dressing that up as a “mistake” he made is, eurhm, itself a mistake. One that exposes a lack of understanding of the circumstances of the day.
One, it is not even certain imo that “conceptual modeling” as a thing in its own right already existed at the time. Codd’s RM is 1969, Chen ER is 1974 ("An Introduction to Database Systems" even dates it 1976). So how *could* he have included any such thing in his thinking. Here are two quotes from "An Introduction to Database Systems" that are most likely to illustrate accurately how Codd probably even never have come up with the RM if he *truly, genuinely* was "working on an island, separated from any and all of those developer concerns as they typically manifest themselves while working at the conceptual level".
"It is probably obvious to you that the ideas of the E/R approach, or something very close to those ideas, MUST HAVE BEEN (emphasis mine) the informal underpinnings in Codd's mind when he first developed the formal relational model."
"In other words, in order for Codd to have constructed the (formal) relational model in the first place, he MUST HAVE HAD (emphasis mine) some (informal) "useful semantic concepts" in his mind, and those concepts MUST BASICALLY HAVE BEEN (emphasis mine) those of the E/R model, or something very like them."
Readers wanting to read more are referred to chapter 14 of said book, and pg 425 in particular, for the full discussion by Chris Date.
So why did Codd not bother with the stuff at the conceptual level ?  My answer : because he was a mathematician not an engineer.  And as a mathematician, his mindset always led him to want to be able to PIN THINGS DOWN PRECISELY, with "precisely" here carrying the meaning it has when present in the mind of a PhD in mathematics.  Which is quite different from the meaning the word might have in the mind of the average reader of this post.
And at the conceptual level, you never get to "pin things down precisely" AND THAT'S DELIBERATE.
In those days, there was analysis and there was programming. With a *very* thick Chinese Wall between the two, and often even between the people engaging in one of those two activities (at the time it was typically considered outright impossible for any person to be proficient in both). Analysis was done *on paper* and that paperwork got stored in physical binders ending up in a dust-collecting locker. I even doubt Codd ever got to see any such paper analysis work. He did get to see programs written in the “programming” side of things. Because that’s where his job was : in an environment whose prime purpose was to [develop ‘systems’ software to] support programmers in their “technical” side of the story.
Two, Codd never pretended to address the whole of the data systems development process with his RM. The RM was targeted at a very specific and narrow problem he perceived in that process, as it typically went in those days : that of programmers writing procedural code to dig out the data from where it is stored. He just aimed for a system that would permit *programmers* to do their data manipulation *more declaratively* and *less procedurally/mechanically*. Physical data independence. Nothing more than that. And the environmentals that would make such a thing conceivable and feasible in real life. Codd was even perfectly OK with not even considering how the data got into the database ! His first proposal for a data language, Alpha, *did not have INSERT/DELETE/UPDATE* ! He was perfectly fine leaving all those IMS shops as they were and do nothing but add a “mapping layer” so what came out of the mapping layer was just a relational view of data that was internally still “hierarchical”. I could go on and on about this, but my point here is : calling it a “mistake” that someone doesn’t do something he never intended to do in the first place (and possibly even didn’t have any way of knowing that doing it could be useful), is a bit over the edge.
Disagreement 2
It was claimed that “model translations MUST be automatic”. (The supporting argument being something of the ilk “otherwise it won’t happen anyway”.)
True and understandable (that otherwise it won’t happen), but reality won’t adapt itself so easily to management desiderata (“automatic” is management speak for “cheap” and that’s the only thing that matters) merely because management is management.  Humans do if they're not the manager, reality doesn't.  And the reality is that the path from highly conceptual, highly abstract, highly informal to fully specced out to the very last detail, is achieved by *adding stuff*. And *adding stuff* is design decisions taken along the way. And automated processes are very inappropriate for making *design decisions*. (By *adding stuff* I merely mean *add new design information to the set of already available design information*, I do not mean, add new symbols or tokens to an already existing schema or drawing that is already made up in some syntax.)
When can automated systems succeed in making this kind of design decisions ? When very rigid conventions are followed. E.g. when it is okay that *every entity* modeled at the conceptual level eventually also becomes a table in the logical model/database. But that goes entirely counter to the actual purpose of modeling at the *conceptual* level ! If you take such conventions into account at the time you’re doing conceptual-level modeling, then you are deluding yourself because in fact you are actually already modeling at the logical level. Because you are already thinking of the consequences at the logical level of doing things this way or that way. The purpose of conceptual-level modeling is to be able to *communicate*. You want to express the notion that *somewhere somehow* the system is aware of a notion of, say, “customer” that is in some way related to, say, a notion of “order” that our business is about. You *SHOULD NOT NEED TO WORRY* about the *logical details* of that notion of a “customer” if all you want to do is express the fact that these notions exist and are related.
So relatively opposite to the undoubtedly wise people in front of the audience, I’m rather inclined to conjecture that if you try to do those “model translations” automatically, you are depriving yourself of the freedom to take those design decisions that are the “right” ones for the context at hand, because the only design decisions that *can* still be taken are *[hardcoded] in [the implementation of]* the translation process. And such a translation process can *never* understand the context (central bank vs. small shop on the corner of the street kind of aspects), let alone take it into account, in the same way that a human designer can indeed. That is, you are depriving yourself of the opportunity to come up with the “right” designs.
A third point.
I was also surprised to find how easily even the data architects of the current generation who are genuinely motivated to improve things, seem to have this kind of association that “Codd came up with SQL”. He didn’t and he’d actively turn around in his grave hearing such nonsense (he might also just have given up turning around because it never ends). He came up with the relational model. The *data language* he proposed himself was called Alpha. Between Alpha and SQL, several query languages have seen the light of day, the most notable among them probably being QUEL. SQL is mostly due to what good old Larry did roundabouts 1980. It is relatively safe to assume that, once SQL was out, Codd felt about it much the same way that Dijkstra felt about BASIC and COBOL : that it was the most horrendous abomination ever conceived by a human. But that (neither the fact that the likes of Codd *have* such a denigrating opinion, nor the fact that they’re right) won’t stop adoption.

Monday, July 14, 2014

Conceptual vs. Logical modeling, part IV & conclusion

Other kinds of constraint

Still referring to the example model at

I now want to draw your attention to that very peculiar construct close to the center of the image.  Thing is, I have never seen such a symbol before in a conceptual data diagram, and I suspect the same will hold for most of you.

So what does it express ?  The question alone illustrates an important property of using (any) language to express oneself : if you use a word or a symbol that the audience doesn't know/understand, they won't immediately get your meaning, and they'll have to resort to either guessing your intended meaning from context, or else asking you to clarify explicitly.  And if your models and/or drawings end up just being stored in some documentation library, there's a good chance that the readers won't have you around anymore for directly asking further clarification from the original author.  Leaving "guessing the meaning from context" as the only remaining option.  (As said before, guesswork always has its chance of inducing errors, no matter how unlikely or improbable.)

So, since the original author isn't available for asking questions to, let's just do the guesswork.  I guess that this curious symbol intends to express something that might be termed "exclusive subtyping" (I'm not a great fan of the word "subtyping" in contexts of conceptual modeling but never mind).  It expresses that an asset can be a "Financial" asset, or a "Physical" asset, or an "Information" asset, but never two or more of those three simultaneously.  We already touched on this, slightly, in the discussion of referential integrity : the lines from the three subtypes can be seen as relationships, of which only a single one can exist.  I'm pretty sure at one point or other, you've already run into the following notation to denote this :
! ASSETS          !
!...              !
   |   |   |
   |   |   |
   |   |   +-----------------------+
   |   |                           |
   |   +---------------+           |
   |                   |           |
+------------------+ +-------+ +--------+
! FINANCIAL_ASSETS ! ! ...   ! ! ...    !
+------------------+ +-------+ +--------+
!...               ! ! ...   ! ! ...    !
+------------------+ +-------+ +--------+

And this more or less begs the question, "then what about the cardinalities of those relationships ?".  The point being, the example model doesn't seem to give us any information about this.  Can there be only one or can there be more FINANCIAL_ASSETS entries for each ASSSET ?  Can there be zero FINANCIAL_ASSETS associated with a given ASSET (even if the attributes at the ASSETS level tell us it is indeed a "financial" asset) ?  Can there be only one ASSETS associated with each FINANCIAL_ASSET, or can there be more, or even zero ?  Strictly speaking, no answer to be found in the drawing.  Guesswork needed !

Arbitrarily complex constraints

And beyond these, there are of course the "arbitrarily complex" constraints.

"The percentages of the ingredients for a recipe must sum up to 100 for each individual recipe".  No graphical notation I know of allows to specify that kind of thing.  Nevertheless, rules such as these are indeed also a part of a logical information model.


In general, for each information system that manages (and persists) data for its users, there exists some set of "metadata" (I'm not a big fan of that term, just using it for lack of better here and hence in scare quotes) that conveys ***every*** little bit of information about :

(a) the structure of the user's data being managed
(b) the integrity rules that apply to the user's data being managed

Such a "complete" set of information is what I call a "logical information model".

Anything less than that, i.e. any set of information that could have to leave unanswered any possible developer question concerning either structure of or integrity in some database, necessarily becomes a "conceptual information model" by that logic.  Note that such a definition makes the term "conceptual information model" cover the entire range of information models, from those that "leave out almost nothing compared to a fully specced logical one", to those that "leave out almost everything compared to a fully specced logical one".  Some of the existing notations are deliberately purposed for very high levels of abstraction, some of them are deliberately purposed for allowing to document much finer details.

Thing is, all of the existing notations are purposed for _some_ level of abstraction.  Even ORM with its rich set of symbols that allows for way more expressiveness than plain simple basic ER, has more than enough kinds of things it cannot express (perhaps more on that in a next post).  And hence any information drawn in any of the available notations, must then be regarded as a conceptual model.

Some of them will expose more of the nitty gritty details, and thus perhaps come somewhat closer to the "truly logical" kinds of model, others will expose less detail, thus deliberately staying at the higher abstraction levels, and thus staying closer to what might be considered "truly conceptual".  But none of them allow to specify every last little detail.  And the latter is what a true logical informatin model is all about.

Monday, July 7, 2014

Conceptual vs. logical modeling, part III

Conceptual vs logical modeling of Referential Integrity

In the previous post, a somewhat detailed inspection was made of :

(a) how popular modeling approaches support the documenting of a very common class of database constraints, categorizable as "uniqueness constraints"
(b) how those modeling approaches get into problems when the task is to document _all_ the uniqueness rules that might apply to some component of some design (and how in practice this effectively leads to under-documenting the information model)
(c) how those same modeling approaches also get into problems when the task is to document certain other database constraints that also define "uniqueness" of some sort, just not the sort usually thought of in connection with the term "uniqueness".

This post will do the same for another class of constraints of which it is commonly believed that they can be documented reasonably well, i.e. the class of "foreign key" constraints.  Whether that belief is warranted or not, depends of course on your private notion of "reasonable".

Reverting to the "Assets" example model referenced in the initial post of this little series, we see the "Assets" entity has two relationships to "parent" entities.  They respectively express that each Asset "is of some category", and "is of some Asset_type".  (Aside : the explanations in the "Asset_Categories" entity are suspiciously much alike the three "Asset Detail" entities at the bottom, betraying a highly probable redundancy in this model.  But it will make for an interesting consideration wrt views.  End of aside.)

What is _not_ expressed in models such as the one given.

Assets has two distinct relationships to "parent" entities.  That means that there will be _some_ attribute identifying, for each occurrence of an Asset, which Asset_Category and which Asset_Type it belongs to (it was already observed that some dialects omit these attributes from their rectangles, this changes the precise nature of the "problem" here only very slightly).  But what is _not_ formally and explicitly documented here, is _which_ attribute is for expressing _which_ relationship.

Now, in this particular example, of course it is obvious what the true state of affairs is, because the names of the attributes are identical between "child" and "parent" entity.  But this rule/convention as such is certainly not tenable in all situations, most notably in bill-of-material structures :

+---------------+   +----------------------------+
+---------------+   +----------------------------+
! ThingID    ID !--<! ContainingThingID       ID !
+---------------+   ! ContainedThingID        ID !

So _conventions_ will obviously have to be agreed upon to help/guarantee full understanding, or else guesswork may be needed by the reader of such models, and that guesswork constitutes tacit assumptions of some sort, even if there is 99.9999% likelihood the guesswork won't be incorrect.

We've stated that it cannot be documented "which attribute is for expressing which relationship".  A slight moderation is warranted.  It is perfectly possible to convey this information by making the connecting lines "meet the rectangles" precisely at the place where the attribute in question is mentioned inside the rectangle (that is, of course, if it is mentioned there at all).  Kind of like :

+---------------+    +----------------------------+
+---------------+    +----------------------------+
! ThingID    ID !-+-<! ContainingThingID       ID !
+---------------+ +-<! ContainedThingID        ID !

This technique documents reasonably well that the relevant attribute pairs are (ContainingThingID, ThingID) and (ContainedThingID, ThingID).

It will be clear that this will work well only for "singular" (non-composite) FKs, and at any rate even then any possibility of crossing lines or so might lead to some degree of obfuscation.  (Once again, I leave it for you to ponder whether the popular belief that composite keys aren't such a very good idea, is due precisely to these notational problems.  "You shouldn't do that because you can't document it in the drawings.")

Reverting to the original Assets example model, another thing not formally expressed (to the fullest) is the _optionality_ of the relationship.  If the vertical crossing bar at the "Asset_Categories" side of the relationship means, "ALWAYS EXACTLY 1", then there isn't a problem.  But what if the relationship were actually that each "Asset" CAN be of at most one Asset_Category, but perhaps as well OF NONE AT ALL ?  In "almost-relational" systems, this could perhaps at the logical level be addressed by making the corresponding FK nullable, but with truly relational systems, this isn't an option.  In that case, the same technique would have to be applied at the database level as is done for many-to-many relationships : an "intersection" thing would have to be defined that "materializes" the relationship.  But if we do that, where and how do we document the name of this structure that achieves this materialization ?  We could give it its separate rectangle, but this technique is sometimes criticized for creating entity bloat, and is sometimes considered undesirable, claiming it "obfuscates" to some extent the things that the drawn model is _mainly_ intended to convey to the user/reader.

Referential integrity bis

Just as was the case with uniqueness constraints, Foreign Key constraints get their bit of extra "twists" when the Temporal (historical data) dimension is added.  To begin, "temporal foreign key constraints" will almost always, and of necessity, be composite.  Whatever serves as an identifier to identify whatever thingy is relevant, PLUS a range of time.  The aforementioned problems with documenting composite foreign keys apply almost by definition.

Second, (this too is analogous to the situation with uniqueness constraints) for the range attributes that participate in a foreign key, there is a possible distinction to be made between the "traditional", equality-based treatment, and a treatment "for every individual point implied by the range".  And just as was the case with uniqueness constraints, it is even possible for a temporal foreign key to comprise >1 range attribute, and for some of those to have the equality-based treatment (matching rows in the referenced table must have exactly the same range value), while others have the "individual implied points" treatment (matching rows in the referenced table are only required to "cover" the implied points with a range value of their own).  A notation for documenting the distinction is needed, but difficult to conceive.

Referential integrity ter
Referential integrity to a view.  We've mentioned the "Categories of assets" as an aside earlier on.  Presumably, If an Asset is a "Financial_Asset", then its corresponding Asset_Category_Code must have a certain particular value.  Iow, the FK implied by that line from Financial_Asset to Assets, is not really one from the Financial_Assets table to the Assets table, but rather, it is an FK from the Financial_Asets table to a subsetting(/restriction) view of the Assets table, something akin to ... REFERENCES (SELECT ... FROM ASSETS WHERE Asset_Category_Code = ...) ...

The notational problem with our rectangles and connecting lines is completely analogous to the case with "keys on views" : we could document such things by giving the view its own rectangle, and then we've shifted the problem to documenting the definitional connect between two distinct rectangles in the schema, or we can settle for not documenting it at all and hope the info will not be lost in communication, meaning the subsequent readership of our model will have to make all the necessary assumptions, and not a single one more, and make them all correctly.  Might be a tall order.

So once again, it seems we can conclude that for certain classes of referential rule on the data, our graphical ways of modeling can suffice to document such a rule, but they certainly don't suffice to document, clearly, all possible cases of referential rule on data.  The more bits and pieces abstracted away by a graphical modeling language (= the smaller the symbol set of the language), the more conventions will need to be assumed in order to get effective and accurate communication between modeler and reader, and the more cases there will be that are, formally speaking, not documentable because the language's symbols set doesn't allow expressing some nitty gritty detail that "deviates from the usual conventions".

(Still to be continued...)