Friday, June 6, 2014

Conceptual vs. Logical modeling (once again)

http://www.databaseanswers.org/data_models/assets/index.htm

was presented to me as an example case in a discussion on [data] modeling at

https://www.linkedin.com/groups/data-model-SAP-Bill-Material-2357895.S.270612382

(group membership is required to view) and in particular as a case for fleshing out the distinctions between what I call 'conceptual' and 'logical' modeling.  That distinction being that "conceptual is informal, logical is formal", or "conceptual is typically incomplete, logical is always complete".  "Complete" in the sense of "full disclosure of all relevant information".  This post intends to clarify further what I mean by that, exactly.

One model being incomplete and another one being complete means there are differences in "informational value".  What & where are those differences ?  The discussion will be split covering two distinct aspects : one of structure and one of integrity (and the integrity part will be kept for a later post).

Structure.

Take a look at a single rectangle in the example model, say "Assets".  Disregard the mentions of PK/FK for the time being, they're related to integrity, not to structure per se.

What information is present in the rectangle ?
  • a rectangle heading telling us that the rest of the rectangle informs us further of the nature of a concept called "Assets".
  • a rectangle body telling us that the concept "Assets" is defined to have properties named "Asset_ID", "Asset_Name", ...
What information is _not_ present in the rectangle ?
  • Most notably, the type information for each property.  Now, I've seen many similar models that _did_ include this information.  Usually limited to the set of type names that were known to be supported by the DBMS that the system was known to be going to be implemented on (so far for DBMS agnostic modeling !).  Sometimes the set of type names used would include things such as "COORDINATE".  Indicating that some domain/type of that name is supposed to exist, and supposed to be known and understood correctly by the reader.  And it's the double "supposed" in there that makes such models informal/incomplete still.
  • A very nasty one, and very hideous : optionality !!!  Take a look at the Date_Disposed_of property.  Is that property going to have an "assigned value" for each and every occurrence/instance of an "Assets" type ???  Presumably not.  While it is not invalid per se to introduce some kind of concept of "nullability" at the conceptual level (of entity attributes), the thing is : the logical level's "full disclosure" requirement implies that the diagrams must then show that information !!!  (I've seen at least one dialect that used '#'/'*' symbols on the left side in the rectangles to achieve this.)  And the second thing is : diagramming languages such as E/R and UML already have a notion of optionality (albeit not for attributes but between entities).  And of necessity, it will introduce multiple ways of saying the same thing.  Singling out the 'disposed_of' attribute in its own "Assets_Disposed" entity (with the obvious one-to-at-most-one relationship) will do the job, but is often considered "poor practice" because of the "entity bloat" it creates (and the corresponding reduction of opportunity to "inspect just once and get the whole picture").  Otoh, it is precisely what would _need_ to be done to achieve a relational version of the logical information model, since the relational model does not allow [values for] attributes to be missing.
  • Also not incorporated in the model shown by this example, but the notion does exist : the distinction between "weak" E/R entities and "strong" E/R entities.  There are modeling dialects that would _NOT_ include the "Asset_ID" attribute in the "Asset_Valuations" entity, and the reader is supposed to infer, somehow, that "Asset_Valuation" is indeed a "weak" entity and additionally requires its "parent entity"'s primary key before "an occurrence of it can come into existence".  This particular approach induces interpretation ambiguities in two cases : (a) the parent entity has >1 identifying key (solvable only by introducing yet other artefacts such as distinctions between "primary" and "secondary" keys), and (b) the child entity has two relationships (think bill-of-material structures) to the same parent entity (there will have to be two distinctly named attributes, so you can't assume "same name as the primary key attributes in the parent", but the modeling approach of leaving them unmentioned means you can't specify which will be which ...  This actually belongs more in the discussion on constraint specification, so I'll pick the subject up there again if I don't forget it.
  • Also quite hideous as well as nasty : the meaning of the damn thing !!!  Will drawing a rectangle and labeling it "Assets" ensure that the understanding derived from it by some individual, will be _exactly_ the same as that derived by some other individual ?  _sufficiently_ the same ?  And even if so, how will you know any differences in understanding ?  Posing these questions is answering them.  Looking at the drawing, all readers will just be nodding yes because with sufficient details abstracted away, your beautiful drawing simply matches their perception of reality.  I used to have a catchphrase "After you've abstracted away all the differences, whatever remains is identical."
Conclusion : while it is OK for conceptual models to leave out information such as the things mentioned (note that I do not claim completeness for the list of things I mentioned), a fully formal logical model will always have to include _all_ the pieces of the "puzzle" :
  • All the type information.  To begin, that's a complete and fully specced inventory of all the data types that will be used in the rest of the model.  And "fully specced" really means "fully specced" here, e.g., just saying "INTEGER" will _not_ be sufficient if there is any risk that any reader might be misinterpreting the range of numbers "covered" by this name.  Sometimes it _is_ interesting for a user to know that 100000 is not an INTEGER (because >32767), or to know that -1 is not an INTEGER (because only positive numbers were anticipated).  For a central bank deciding to introduce negative interest rates, it might be interesting to know that some of their IT systems had not anticipated this and defined the domain for interest rates something like the range from 0.0000 to 99.9999 ...  And for types such as "coordinate", there is nothing in the name to suggest whether these are 2D or 3D coordinates ( (x,y) pairs vs (x,y,z) triples ).  Formal completeness requires one to state something like :

    COORDINATE : {(x,y,z) | x IN FLOAT && y IN FLOAT && z IN FLOAT}

    This definition itself depends on a definition for a thing called FLOAT.  This one in turn could be defined as

    FLOAT : { (m,e) | m IN NN && e in NN && e>=-128 && e<=127 && m>=... && m<=...}

    Now we depend on a definition for NN.  It will be clear that somewhere somehow, something inevitably has "got to be given".  Fortunately, that something can be as simple as "the set of Natural Numbers" that everyone should know from 2nd grade math classes, or thereabouts.  And if misunderstandings and/or communication problems boil down to a lack of agreement/common understanding of what the set of natural numbers is, well then there will be very little any modeling language/methodology could possibly to address that.
  • Assuming we are defining a fully formal logical structure according to the relational model (as distinct from "the graph-based model", which may be conceivable/imaginable, but has unfortunately never been elaborated/formally spelled out the same way the RM has been), _all_ the attributes of the relational structures, plus the type they're of (those types having been formally defined in the previous step).

    concrete example :

    ASSETS : { (Asset_ID, Asset_Category_Code, Asset_Type_Code, Asset_Name, Asset_Description, Date_Acquired, Date_Disposed_of, Other_Details) |
            Asset_ID in ... &&
            Asset_Category_Code IN ... &&
            Asset_Type_Code IN ... &&
            Asset_Name IN ... &&
            Asset_Description IN ... &&
            Date_Acquired IN GREG_CAL_DATE &&
            Date_Disposed_of IN GREG_CAL_DATE &&
            Other_Details IN ... }
  • Still assuming we are defining a fully formal logical structure according to the relational model (such that attributes cannot ever be null), the relational structures will be split out in separate parts whenever some attributes of a conceptual entity are optional.

    concrete example :

    ASSETS : { (Asset_ID, Asset_Category_Code, Asset_Type_Code, Asset_Name,
                        Asset_Description, Date_Acquired, Date_Disposed_of, Other_Details) |
                Asset_ID in ... &&
                Asset_Category_Code IN ... &&
                Asset_Type_Code IN ... &&
                Asset_Name IN ... &&
                Asset_Description IN ... &&
                Date_Acquired IN GREG_CAL_DATE &&
                Other_Details IN ... }
    ASSETS_DISPOSED : { (Asset_ID, Date_Disposed_of) |
                 Asset_ID in ... &&
                 Date_Disposed_of IN GREG_CAL_DATE }
  • And it will also have to include a precise statement of the so-called "external predicate" for each relational structure so defined.  Don't underestimate the importance of this.  An SQL table has a very precise intended meaning, and too often I've seen maintenance developers think "oh I can reuse this table for my current purpose", looking exclusively at its structure and ignoring/denying/disregarding the precise, current intended meaning completely.  It is in fact because of this associated intended meaning that "re-using existing database tables" is, in principle, the WORST POSSIBLE IDEA a db designer can have.  Except if he is 100% certain that the current external predicate matches _exactly_ with the "external predicate" he has to "introduce" into the database for achieving "his current purpose".  This is most unlikely to be the case, except if the table is an EAV table, and I've already dealt with why that approach sucks (in 99.9999% of the cases).

    (Incidentally, if you were struck by a remarkable resemblance between the way the data type definitions were stated in the foregoing point, and the way the relational structures were stated in the last, it is exactly this aspect of their being related or not to such an "external predicate" that makes the difference.  Data type definitions are just that, formal ways to define _values_ that are usable in the _relational structures that will make up the database_ to represent _meaning_.  Values in data types do not carry meaning, the relational structures that make up the database do.  E.g. "The asset identified by §Asset_ID§ has been disposed of on §Date_Disposed_of§".  Note the placeholders in between §§ marks, and that the placeholders correspond 1-1 with the attribute names.  Such a phrase could be the "external predicate" for the ASSETS_DISPOSED relational structure, and since it defines the logical meaning of the content that will be held in the concerned relational structure, it should always be an integral part of the logical model defining the database.)
Next question : and what kind of modeling language supports all of this ?  There are several.  Or none at all.  Depending on what you're willing to call a "modeling language".

To begin with a language of pure maths.  One such as used in "Applied Mathematics for Database Professionals".  Very commendable reading, BTW, even if I have to add the footnote that the book doesn't really bother with formal type definitions, contenting itself to rely on a type system such as that offered by Oracle (this had mainly to do with the 9 different DBMS's the authors had been using the most throughout their careers : Oracle 4, Oracle 5, Oracle 6, Oracle 7, Oracle 8, Oracle 9, Oracle 10, Oracle 11 & Oracle 12 - end of jocular sidenote).

Anyways.  Math-like language offers everything we need to express precise type definitions and precise definitions of relational structures such as :

FLOAT : { (m,e) | m IN NN && e in NN && e&gt;=-128 && e&lt;=127 && m&gt;=... && m&lt;=...}
COORDINATE : {(x,y,z) | x IN FLOAT && y IN FLOAT && z IN FLOAT}

ASSETS_DISPOSED : { (Asset_ID, Date_Disposed_of) | ... }

But we're left in a bit of trouble when wanting to express external predicates for our relational constructs in such a language.  Of necessity, of course, the thing is termed "external" for good reason (external = external to the system of mathematical computation that is the DBMS, hence it's a bit contradictive to expect them to be expressible in math language) !



And those not well versed in using math formulae, will of course quibble that they don't see themselves manipulating models expressed in such a language, and that they want an alternative.  Such an alternative exists in the form of a subset of the statements of a programming language such as Tutorial D, in particular, the set of statements in that language that most developers will be inclined to label "declarative" statements (following examples are only loosely inspired by, and not 100% valid, Tutorial D) :

TYPE FLOAT (M INT, E INT) CONSTRAINT E>=-128 && E<=127 && M>=... && M<=... ;
TYPE GREG_CAL_DATE (D INT, M INT, Y INT) CONSTRAINT ........................................... ;

VAR ASSETS_DISPOSED RELATION {Asset_ID ... , Date_Disposed_of GREG_CAL_DATE } ;

Documenting the external predicate for all the VARs (= the relational structures to make up the database, "tables" in SQL) is a matter of adding comments, or (better), some kind of PREDICATE subclause in syntaxes such as the one used here as an example.

The nice thing about such an approach is that these kinds of formal spec are parseable.  In a language such as Tutorial D, it means the logical definition of the database structure could be made known to any program by a mere "import logical_model" directive.  In environments using other languages, it means that stuff such as Hibernate classes can be generated 100% automagically.

And just to show that the syntax (the "how") of the language is, actually, completely irrelevant, and the only thing that matters is the meaning of the information it conveys (the "what"), here's an example of how the same information could be expressed in a (hypothetical) language that is much more OO-like :

immutable class FLOAT {
  int m;
  int e;
  constraint e&gt;=-128 && e&lt;=127 && m&gt;=... && m&lt;=... ;
}
immutable class COORDINATE {
  FLOAT x;
  FLOAT y;
  FLOAT z;
}
database class ASSETS_DISPOSED {
  ... Asset_ID;
  GREG_CAL_DATE Date_Disposed_of;
  predicate "The asset identified by §Asset_ID§ has been disposed of on §Date_Disposed_of§";
}

(If you don't recognize this immediately as an OO-style syntax, imagine "public" in front of every text line in there.)  Some of the "weird-looking" stuff ("immutable class"/"database class"/"constraint"/"predicate") in this example is deliberately intended to illustrate the kind of things that currently existing OO languages are typically still lacking in order to make them more suitable for management of data that resides in a DB, or at least for cleaner and better integration between the programming side of things and the data side of things.



To conclude on the matter of "structure" :

Any language can exist to epxress an information model.  The degree in which such a language allows to express EVERY POSSIBLE RELEVANT ASPECT of [the content of] an information model, is what determines its degree of suitability for expressing information models that can legitimately be regarded as fully formal logical models.  That scale of suitability is a continuum, but no existing modeling languages/dialects actually achieve the full 100% of the expressiveness that is needed.



I'll be leaving anything to do with integrity for a next post (hopefully).

No comments:

Post a Comment