My personal take on ...: July 2014

Other kinds of constraint

Still referring to the example model at

http://www.databaseanswers.org/data_models/assets/index.htm

I now want to draw your attention to that very peculiar construct close to the center of the image. Thing is, I have never seen such a symbol before in a conceptual data diagram, and I suspect the same will hold for most of you.

So what does it express ? The question alone illustrates an important property of using (any) language to express oneself : if you use a word or a symbol that the audience doesn't know/understand, they won't immediately get your meaning, and they'll have to resort to either guessing your intended meaning from context, or else asking you to clarify explicitly. And if your models and/or drawings end up just being stored in some documentation library, there's a good chance that the readers won't have you around anymore for directly asking further clarification from the original author. Leaving "guessing the meaning from context" as the only remaining option. (As said before, guesswork always has its chance of inducing errors, no matter how unlikely or improbable.)

So, since the original author isn't available for asking questions to, let's just do the guesswork. I guess that this curious symbol intends to express something that might be termed "exclusive subtyping" (I'm not a great fan of the word "subtyping" in contexts of conceptual modeling but never mind). It expresses that an asset can be a "Financial" asset, or a "Physical" asset, or an "Information" asset, but never two or more of those three simultaneously. We already touched on this, slightly, in the discussion of referential integrity : the lines from the three subtypes can be seen as relationships, of which only a single one can exist. I'm pretty sure at one point or other, you've already run into the following notation to denote this :

+-----------------+
! ASSETS          !
+-----------------+
!...              !
+-----------------+
   |   |   |
\-o---o---o-/
   |   |   |
   |   |   +-----------------------+
   |   |                           |
   |   +---------------+           |
   |                   |           |
+------------------+ +-------+ +--------+
! FINANCIAL_ASSETS ! ! ...   ! ! ...    !
+------------------+ +-------+ +--------+
!...               ! ! ...   ! ! ...    !
+------------------+ +-------+ +--------+

And this more or less begs the question, "then what about the cardinalities of those relationships ?". The point being, the example model doesn't seem to give us any information about this. Can there be only one or can there be more FINANCIAL_ASSETS entries for each ASSSET ? Can there be zero FINANCIAL_ASSETS associated with a given ASSET (even if the attributes at the ASSETS level tell us it is indeed a "financial" asset) ? Can there be only one ASSETS associated with each FINANCIAL_ASSET, or can there be more, or even zero ? Strictly speaking, no answer to be found in the drawing. Guesswork needed !

Arbitrarily complex constraints

And beyond these, there are of course the "arbitrarily complex" constraints.

"The percentages of the ingredients for a recipe must sum up to 100 for each individual recipe". No graphical notation I know of allows to specify that kind of thing. Nevertheless, rules such as these are indeed also a part of a logical information model.

Conclusion

In general, for each information system that manages (and persists) data for its users, there exists some set of "metadata" (I'm not a big fan of that term, just using it for lack of better here and hence in scare quotes) that conveys ***every*** little bit of information about :

(a) the structure of the user's data being managed
(b) the integrity rules that apply to the user's data being managed

Such a "complete" set of information is what I call a "logical information model".

Anything less than that, i.e. any set of information that could have to leave unanswered any possible developer question concerning either structure of or integrity in some database, necessarily becomes a "conceptual information model" by that logic. Note that such a definition makes the term "conceptual information model" cover the entire range of information models, from those that "leave out almost nothing compared to a fully specced logical one", to those that "leave out almost everything compared to a fully specced logical one". Some of the existing notations are deliberately purposed for very high levels of abstraction, some of them are deliberately purposed for allowing to document much finer details.

Thing is, all of the existing notations are purposed for _some_ level of abstraction. Even ORM with its rich set of symbols that allows for way more expressiveness than plain simple basic ER, has more than enough kinds of things it cannot express (perhaps more on that in a next post). And hence any information drawn in any of the available notations, must then be regarded as a conceptual model.

Some of them will expose more of the nitty gritty details, and thus perhaps come somewhat closer to the "truly logical" kinds of model, others will expose less detail, thus deliberately staying at the higher abstraction levels, and thus staying closer to what might be considered "truly conceptual". But none of them allow to specify every last little detail. And the latter is what a true logical informatin model is all about.

Conceptual vs logical modeling of Referential Integrity

In the previous post, a somewhat detailed inspection was made of :

(a) how popular modeling approaches support the documenting of a very common class of database constraints, categorizable as "uniqueness constraints"
(b) how those modeling approaches get into problems when the task is to document _all_ the uniqueness rules that might apply to some component of some design (and how in practice this effectively leads to under-documenting the information model)
(c) how those same modeling approaches also get into problems when the task is to document certain other database constraints that also define "uniqueness" of some sort, just not the sort usually thought of in connection with the term "uniqueness".

This post will do the same for another class of constraints of which it is commonly believed that they can be documented reasonably well, i.e. the class of "foreign key" constraints. Whether that belief is warranted or not, depends of course on your private notion of "reasonable".

Reverting to the "Assets" example model referenced in the initial post of this little series, we see the "Assets" entity has two relationships to "parent" entities. They respectively express that each Asset "is of some category", and "is of some Asset_type". (Aside : the explanations in the "Asset_Categories" entity are suspiciously much alike the three "Asset Detail" entities at the bottom, betraying a highly probable redundancy in this model. But it will make for an interesting consideration wrt views. End of aside.)

What is _not_ expressed in models such as the one given.

Assets has two distinct relationships to "parent" entities. That means that there will be _some_ attribute identifying, for each occurrence of an Asset, which Asset_Category and which Asset_Type it belongs to (it was already observed that some dialects omit these attributes from their rectangles, this changes the precise nature of the "problem" here only very slightly). But what is _not_ formally and explicitly documented here, is _which_ attribute is for expressing _which_ relationship.

Now, in this particular example, of course it is obvious what the true state of affairs is, because the names of the attributes are identical between "child" and "parent" entity. But this rule/convention as such is certainly not tenable in all situations, most notably in bill-of-material structures :

+---------------+   +----------------------------+
! THINGY        !--<! CONTAINMENT_OF_THINGIES    !
+---------------+   +----------------------------+
! ThingID    ID !--<! ContainingThingID       ID !
+---------------+   ! ContainedThingID        ID !
                    +----------------------------+
So _conventions_ will obviously have to be agreed upon to help/guarantee full understanding, or else guesswork may be needed by the reader of such models, and that guesswork constitutes tacit assumptions of some sort, even if there is 99.9999% likelihood the guesswork won't be incorrect.

We've stated that it cannot be documented "which attribute is for expressing which relationship". A slight moderation is warranted. It is perfectly possible to convey this information by making the connecting lines "meet the rectangles" precisely at the place where the attribute in question is mentioned inside the rectangle (that is, of course, if it is mentioned there at all). Kind of like :

+---------------+    +----------------------------+
! THINGY        !    ! CONTAINMENT_OF_THINGIES    !
+---------------+    +----------------------------+
! ThingID    ID !-+-<! ContainingThingID       ID !
+---------------+ +-<! ContainedThingID        ID !
                     +----------------------------+

This technique documents reasonably well that the relevant attribute pairs are (ContainingThingID, ThingID) and (ContainedThingID, ThingID).

It will be clear that this will work well only for "singular" (non-composite) FKs, and at any rate even then any possibility of crossing lines or so might lead to some degree of obfuscation. (Once again, I leave it for you to ponder whether the popular belief that composite keys aren't such a very good idea, is due precisely to these notational problems. "You shouldn't do that because you can't document it in the drawings.")

Reverting to the original Assets example model, another thing not formally expressed (to the fullest) is the _optionality_ of the relationship. If the vertical crossing bar at the "Asset_Categories" side of the relationship means, "ALWAYS EXACTLY 1", then there isn't a problem. But what if the relationship were actually that each "Asset" CAN be of at most one Asset_Category, but perhaps as well OF NONE AT ALL ? In "almost-relational" systems, this could perhaps at the logical level be addressed by making the corresponding FK nullable, but with truly relational systems, this isn't an option. In that case, the same technique would have to be applied at the database level as is done for many-to-many relationships : an "intersection" thing would have to be defined that "materializes" the relationship. But if we do that, where and how do we document the name of this structure that achieves this materialization ? We could give it its separate rectangle, but this technique is sometimes criticized for creating entity bloat, and is sometimes considered undesirable, claiming it "obfuscates" to some extent the things that the drawn model is _mainly_ intended to convey to the user/reader.

Referential integrity bis

Just as was the case with uniqueness constraints, Foreign Key constraints get their bit of extra "twists" when the Temporal (historical data) dimension is added. To begin, "temporal foreign key constraints" will almost always, and of necessity, be composite. Whatever serves as an identifier to identify whatever thingy is relevant, PLUS a range of time. The aforementioned problems with documenting composite foreign keys apply almost by definition.

Second, (this too is analogous to the situation with uniqueness constraints) for the range attributes that participate in a foreign key, there is a possible distinction to be made between the "traditional", equality-based treatment, and a treatment "for every individual point implied by the range". And just as was the case with uniqueness constraints, it is even possible for a temporal foreign key to comprise >1 range attribute, and for some of those to have the equality-based treatment (matching rows in the referenced table must have exactly the same range value), while others have the "individual implied points" treatment (matching rows in the referenced table are only required to "cover" the implied points with a range value of their own). A notation for documenting the distinction is needed, but difficult to conceive.

Referential integrity ter
Referential integrity to a view. We've mentioned the "Categories of assets" as an aside earlier on. Presumably, If an Asset is a "Financial_Asset", then its corresponding Asset_Category_Code must have a certain particular value. Iow, the FK implied by that line from Financial_Asset to Assets, is not really one from the Financial_Assets table to the Assets table, but rather, it is an FK from the Financial_Asets table to a subsetting(/restriction) view of the Assets table, something akin to ... REFERENCES (SELECT ... FROM ASSETS WHERE Asset_Category_Code = ...) ...

The notational problem with our rectangles and connecting lines is completely analogous to the case with "keys on views" : we could document such things by giving the view its own rectangle, and then we've shifted the problem to documenting the definitional connect between two distinct rectangles in the schema, or we can settle for not documenting it at all and hope the info will not be lost in communication, meaning the subsequent readership of our model will have to make all the necessary assumptions, and not a single one more, and make them all correctly. Might be a tall order.

So once again, it seems we can conclude that for certain classes of referential rule on the data, our graphical ways of modeling can suffice to document such a rule, but they certainly don't suffice to document, clearly, all possible cases of referential rule on data. The more bits and pieces abstracted away by a graphical modeling language (= the smaller the symbol set of the language), the more conventions will need to be assumed in order to get effective and accurate communication between modeler and reader, and the more cases there will be that are, formally speaking, not documentable because the language's symbols set doesn't allow expressing some nitty gritty detail that "deviates from the usual conventions".

(Still to be continued...)

My personal take on ...

Monday, July 14, 2014

Conceptual vs. Logical modeling, part IV & conclusion

Monday, July 7, 2014

Conceptual vs. logical modeling, part III