Monday, July 7, 2014

Conceptual vs. logical modeling, part III

Conceptual vs logical modeling of Referential Integrity

In the previous post, a somewhat detailed inspection was made of :

(a) how popular modeling approaches support the documenting of a very common class of database constraints, categorizable as "uniqueness constraints"
(b) how those modeling approaches get into problems when the task is to document _all_ the uniqueness rules that might apply to some component of some design (and how in practice this effectively leads to under-documenting the information model)
(c) how those same modeling approaches also get into problems when the task is to document certain other database constraints that also define "uniqueness" of some sort, just not the sort usually thought of in connection with the term "uniqueness".

This post will do the same for another class of constraints of which it is commonly believed that they can be documented reasonably well, i.e. the class of "foreign key" constraints.  Whether that belief is warranted or not, depends of course on your private notion of "reasonable".

Reverting to the "Assets" example model referenced in the initial post of this little series, we see the "Assets" entity has two relationships to "parent" entities.  They respectively express that each Asset "is of some category", and "is of some Asset_type".  (Aside : the explanations in the "Asset_Categories" entity are suspiciously much alike the three "Asset Detail" entities at the bottom, betraying a highly probable redundancy in this model.  But it will make for an interesting consideration wrt views.  End of aside.)

What is _not_ expressed in models such as the one given.

Assets has two distinct relationships to "parent" entities.  That means that there will be _some_ attribute identifying, for each occurrence of an Asset, which Asset_Category and which Asset_Type it belongs to (it was already observed that some dialects omit these attributes from their rectangles, this changes the precise nature of the "problem" here only very slightly).  But what is _not_ formally and explicitly documented here, is _which_ attribute is for expressing _which_ relationship.

Now, in this particular example, of course it is obvious what the true state of affairs is, because the names of the attributes are identical between "child" and "parent" entity.  But this rule/convention as such is certainly not tenable in all situations, most notably in bill-of-material structures :

+---------------+   +----------------------------+
+---------------+   +----------------------------+
! ThingID    ID !--<! ContainingThingID       ID !
+---------------+   ! ContainedThingID        ID !

So _conventions_ will obviously have to be agreed upon to help/guarantee full understanding, or else guesswork may be needed by the reader of such models, and that guesswork constitutes tacit assumptions of some sort, even if there is 99.9999% likelihood the guesswork won't be incorrect.

We've stated that it cannot be documented "which attribute is for expressing which relationship".  A slight moderation is warranted.  It is perfectly possible to convey this information by making the connecting lines "meet the rectangles" precisely at the place where the attribute in question is mentioned inside the rectangle (that is, of course, if it is mentioned there at all).  Kind of like :

+---------------+    +----------------------------+
+---------------+    +----------------------------+
! ThingID    ID !-+-<! ContainingThingID       ID !
+---------------+ +-<! ContainedThingID        ID !

This technique documents reasonably well that the relevant attribute pairs are (ContainingThingID, ThingID) and (ContainedThingID, ThingID).

It will be clear that this will work well only for "singular" (non-composite) FKs, and at any rate even then any possibility of crossing lines or so might lead to some degree of obfuscation.  (Once again, I leave it for you to ponder whether the popular belief that composite keys aren't such a very good idea, is due precisely to these notational problems.  "You shouldn't do that because you can't document it in the drawings.")

Reverting to the original Assets example model, another thing not formally expressed (to the fullest) is the _optionality_ of the relationship.  If the vertical crossing bar at the "Asset_Categories" side of the relationship means, "ALWAYS EXACTLY 1", then there isn't a problem.  But what if the relationship were actually that each "Asset" CAN be of at most one Asset_Category, but perhaps as well OF NONE AT ALL ?  In "almost-relational" systems, this could perhaps at the logical level be addressed by making the corresponding FK nullable, but with truly relational systems, this isn't an option.  In that case, the same technique would have to be applied at the database level as is done for many-to-many relationships : an "intersection" thing would have to be defined that "materializes" the relationship.  But if we do that, where and how do we document the name of this structure that achieves this materialization ?  We could give it its separate rectangle, but this technique is sometimes criticized for creating entity bloat, and is sometimes considered undesirable, claiming it "obfuscates" to some extent the things that the drawn model is _mainly_ intended to convey to the user/reader.

Referential integrity bis

Just as was the case with uniqueness constraints, Foreign Key constraints get their bit of extra "twists" when the Temporal (historical data) dimension is added.  To begin, "temporal foreign key constraints" will almost always, and of necessity, be composite.  Whatever serves as an identifier to identify whatever thingy is relevant, PLUS a range of time.  The aforementioned problems with documenting composite foreign keys apply almost by definition.

Second, (this too is analogous to the situation with uniqueness constraints) for the range attributes that participate in a foreign key, there is a possible distinction to be made between the "traditional", equality-based treatment, and a treatment "for every individual point implied by the range".  And just as was the case with uniqueness constraints, it is even possible for a temporal foreign key to comprise >1 range attribute, and for some of those to have the equality-based treatment (matching rows in the referenced table must have exactly the same range value), while others have the "individual implied points" treatment (matching rows in the referenced table are only required to "cover" the implied points with a range value of their own).  A notation for documenting the distinction is needed, but difficult to conceive.

Referential integrity ter
Referential integrity to a view.  We've mentioned the "Categories of assets" as an aside earlier on.  Presumably, If an Asset is a "Financial_Asset", then its corresponding Asset_Category_Code must have a certain particular value.  Iow, the FK implied by that line from Financial_Asset to Assets, is not really one from the Financial_Assets table to the Assets table, but rather, it is an FK from the Financial_Asets table to a subsetting(/restriction) view of the Assets table, something akin to ... REFERENCES (SELECT ... FROM ASSETS WHERE Asset_Category_Code = ...) ...

The notational problem with our rectangles and connecting lines is completely analogous to the case with "keys on views" : we could document such things by giving the view its own rectangle, and then we've shifted the problem to documenting the definitional connect between two distinct rectangles in the schema, or we can settle for not documenting it at all and hope the info will not be lost in communication, meaning the subsequent readership of our model will have to make all the necessary assumptions, and not a single one more, and make them all correctly.  Might be a tall order.

So once again, it seems we can conclude that for certain classes of referential rule on the data, our graphical ways of modeling can suffice to document such a rule, but they certainly don't suffice to document, clearly, all possible cases of referential rule on data.  The more bits and pieces abstracted away by a graphical modeling language (= the smaller the symbol set of the language), the more conventions will need to be assumed in order to get effective and accurate communication between modeler and reader, and the more cases there will be that are, formally speaking, not documentable because the language's symbols set doesn't allow expressing some nitty gritty detail that "deviates from the usual conventions".

(Still to be continued...)

No comments:

Post a Comment