Friday, June 20, 2014
Integrity
In the previous post, I inspected in some detail the possible approaches for specifying a database structure in a modeling language, highlighting the differences in informational value between those modeling languages, and pointing out some pieces of relevant information whose specification is typically left completely unsupported by all of them.
But a full formal spec of a database is not only about its structure. It is also about any additional rules that apply to the constituent components of that structure. This observation holds regardless of whether the database is a relational one (and its constituent components are what TTM calls "relation variables", "tables" in SQL) or a graph-based or hierarchical one (and its constituent components are nodes and edges). I'll be speaking of relvars (the TTM abbreviation for relation variables) in what follows, but keep in mind that the same should apply as well, mutatis mutandis, to hierarchical and "graph-ical" models.
While the aspect of 'structure' can be modeled reasonably well in "graphical" languages (such as the various ER dialects and UML), that is much less the case for the aspect of integrity : the constraints that hold between the components of that structure. How come ?
The essential reason is that the nature of an integrity rule/constraint can really be just anything at all. Its "structure" is constrained only by the fact that it must be expressed exclusively in terms of the relvars that make up the database structure. At the logical level, where all the formal details of the relvars have been fully specced out in some given language, this is achieved "easily" enough using some language based on/inspired by mathematics. Just spell out the predicate that makes a violation a violation. But how to devise a language that supports expressing "anything at all" at the conceptual level ? The answer is you can't. The only thing you can do is try to taxonomize the set of all possible constraints into certain well-defined "classes" that might indeed be expressible. That is (sort of) exactly what has happened in database modeling land (*). From ER modeling through IDEF1X to Halpin's ORM : the set of all possible constraints is subsetted according to certain chosen criteria, and for each "identifiable" subset, a notation is devised to facilitate documenting constraints belonging to that subset. A modeling language such as ER leaves it at that ; Halpin's "Big Brown Book" explicitly adds a (fourteenth, I believe) category "others" : the leftovers that still aren't expressible using the symbols the modeling language makes available.
Anyway. The fact alone that a powerful modeling approach such as ORM still has this category of "leftovers" in the constraints realm should already suffice to show that _complete_ and fully formal specifications are in fact unachievable without a mathematical language. For the typical categories of constraints other than those "leftovers", however, the belief seems to be fairly widespread, and firm, that current mainstream notations suffice to document all the stuff we need to know/convey about our databases. That belief is not entirely warranted, imo, and in this post I'll be illustrating a couple of issues relating to the (common) class of uniqueness constraints. A subsequent post will do the same for referential integrity/foreign key constraints.
(*) the sad byproduct of this state of affairs is that if one uses the word "constraint" in a database discussion, SQL practitioners will often think you must be talking either of a UNIQUE constraint or a foreign key, overlooking the fact that "there are more types of constraint between heaven and earth than are expressible in ER, and supported by SQL".
Uniqueness
The example case from the previous post is not very well suited to illustrating issues with documenting uniqueness rules : none of its entities lend themselves to the points I want to make. So for the sake of this discussion, I'm somewhat forced to resort to a totally different example, which is likely to look ridiculous in the eyes of business modelers, but I won't mind that for the time being.
Let's say we want to model the operation of numeric addition - if you really can't bear the thought, imagine you are Euclid or Pythagoras, that arithmetic has not been invented yet and you are in the process of doing exactly that, using the latest database design technology (and with apologies upfront for my ascii modeling) :
+---------------+
! ADDITION !
+---------------+
! N1 number !
! N2 number !
! SUM number !
+---------------+
(Yes, and the table is indeed like
+----+----+-----+
! N1 ! N2 ! SUM !
+----+----+-----+
! 1 ! 1 ! 2 !
! 1 ! 2 ! 3 !
! ... !
! 2 ! 1 ! 3 !
! 2 ! 2 ! 4 !
! ... !
+----+----+-----+
)
I'm pretty sure that if I asked you what the key is here, you'd reply "N1 and N2 combined, of course". You get 33% from me for that answer. There are three keys here : {N1 N2}, {N1 SUM}, and {N2 SUM}. Granted, of course, there is also the important question of which ones you ACTUALLY WANT ENFORCED (that's what you're referring to if you want to argue that the latter two "do not identify an addition"). But if we wanted to "enforce" the obvious consistencies/equivalences between expressions of addition and expressions of subtraction, we would indeed need to model and enforce all three (hehe. settled that.).
Now how would you document all three of them in your model of "annotated rectangles" ? You're in trouble ! (In fact, I think it is precisely _because_ of this notational problem of documenting multiple composite keys inside one single rectangle that the notion of "primary" key (as distinct from "secondary / ternary / auxiliary / ..." key ????) has become as widespread as it has, and furthermore that the practice of ID-ifying really just everything has become as popular and widespread as it has. I leave it for you to ponder whether that's a case of putting the cart before the horse or not - or one of redefining/retrofitting the problem such as to suit the most desirable solution.)
Anyway. Supposing we do want to enforce all three keys. How can we document this in our drawing ? Observe in particular that each key consists of >1 attribute and each attribute effectively participates in >1 key. The only way I can imagine to convey all the key information in our rectangle is like this :
+---------------+
! ADDITION !
+---------------+-------+
! N1 number ! K1,K2 !
! N2 number ! K1,K3 !
! SUM number ! K2,K3 !
+---------------+-------+
Very much like the approach of putting a P in front of the attributes, but it takes attentive and careful deciphering to read the drawing and capture the key information correctly ! (And the sad byproduct of omitting the full key information, e.g. for readability's sake, in diagrams such as these, is indeed that typically not all keys are properly identified, let alone enforced.)
In fact, the most readable way to convey all of this information about uniqueness rules seems to be simply to use syntax very similar to declarative DDL, or the declarative portion of a D language :
UNIQUE {N1,N2} , UNIQUE {N1,SUM} , UNIQUE {N2,SUM} or
UNIQUE { {N1,N2} {N1,SUM} {N2,SUM} }
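By way of comparison, here is what those three keys might look like in plain SQL DDL (a sketch ; the column types are placeholders, and "sum" may need quoting as a column name in some dialects) :

CREATE TABLE addition (
  n1  numeric NOT NULL,
  n2  numeric NOT NULL,
  sum numeric NOT NULL,
  UNIQUE (n1, n2),   -- K1
  UNIQUE (n1, sum),  -- K2
  UNIQUE (n2, sum)   -- K3
);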
And here again we are seemingly headed toward a similar conclusion : if you want to be precise _AND_ complete in what you are stating about the nature of the database that you are documenting, then of necessity you MUST resort to a language that has a much higher expressiveness than the ones you typically have available when modeling at a "higher" level of abstraction.
Incidentally, in notations such as Data Vault (a "hub" to represent the entity and separate rectangles connected to the "hub" for each attribute), the problem of documenting keys is even worse. The only graphical way(s) I can imagine to document the existence of some meaningful "grouping" of attributes, such as their belonging to the same key, will invariably make the diagram equally unreadable because of extraneous line bloat. Whether you try to do it by surrounding them with dotted lines or so, or by creating a new symbol for documenting the existence of a key (three extra symbols for the ADDITION Data vault) and connecting each attribute with them as appropriate (six extra connections on the diagram), it's always going to turn your beautiful neatly organized DV diagram into more of a spider web. Fortunately, Data Vault diagrams are typically used only in DW contexts, not to document keys in the operational source systems they're concerned with, but it still goes to show that whatever conceptual notation you use, it only goes as far as it goes and _something_ will always be "missing" from it.
Uniqueness bis
Managing temporal data is somewhat of a long-standing problem in database land. A thorough analysis of the _nature_ of the problem (and of what is needed to address it) can be found in "Temporal Data & the Relational Model", pages 1-857, so I'm not going to reiterate all of that here, but one particular problem dealt with there is "temporal uniqueness". (Aside : if you haven't yet read the book but are interested in doing so, don't go ordering or searching for it now. An updated and revised edition is to appear within a couple of months.)
Say you have
+-----------------+
! ASSET_VALUATION !
+-----------------+
! ASSET_ID ID !
! FROM date !
! TO date !
! VALUE number !
+-----------------+
or
+-----------------+
! MARRIED_TO !
+-----------------+
! PERSON1_ID ID !
! PERSON2_ID ID !
! FROM date !
! TO date !
+-----------------+
and you want to enforce a constraint like "no single date gives >1 distinct VALUE for the same ASSET_ID", or "no one is married to >1 other person on the same date".
The "traditional" interpretation of what a key is, will not allow you to express this. No "traditional", equality-based, key will ever prevent overlaps between various FROM-TO combinations for the same person/asset/... So perhaps you might be inclined to conclude "that is not a real key". Interestingly, The "Temporal Data" book proposes to rearrange matters a bit so that expressing the constraint does become possible, and indeed in the form of "specifying a key" at that :
+-------------------+
! ASSET_VALUATION !
+-------------------+
! ASSET_ID ID !
! DURING date_range !
! VALUE number !
+-------------------+
or
+-------------------+
! MARRIED_TO !
+-------------------+
! PERSON1_ID ID !
! PERSON2_ID ID !
! DURING date_range !
+-------------------+
... WHEN UNPACKED ON (DURING) THEN KEY {ASSET_ID DURING} ...
... WHEN UNPACKED ON (DURING) THEN KEY {PERSON1_ID DURING} KEY {PERSON2_ID DURING} ...
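As an aside : some SQL products can approximate such a "temporal key" directly. PostgreSQL, for instance, combines range types with exclusion constraints (a sketch ; the btree_gist extension is needed to mix the = and && operators in one constraint) :

CREATE EXTENSION IF NOT EXISTS btree_gist;

CREATE TABLE asset_valuation (
  asset_id integer   NOT NULL,
  during   daterange NOT NULL,
  value    numeric   NOT NULL,
  -- no two rows for the same asset may have overlapping DURING ranges
  EXCLUDE USING gist (asset_id WITH =, during WITH &&)
);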
The graphical languages such as ER that we typically use for information modeling still let us down, somewhat, in case we want to specify that level of detail. In addition to the composite nature of the key, we'd also need to express the semantics of the "WHEN UNPACKED ON" part : namely, that the values for the DURING attribute must be interpreted in a "for each individual date value covered by the range" kind of way. The closest we could come to denoting that might be something like this :
+-------------------+
! MARRIED_TO !
+-------------------+-----------+
! PERSON1_ID ID ! K1 !
! PERSON2_ID ID ! K2 !
! DURING date_range ! K1_T,K2_T !
+-------------------+-----------+
Of course, the notational problem of denoting multiple keys on the relvar has not disappeared, nor has the possible participation of a single attribute in >1 key ; just an extra little bit of codification has been added (suffixing _T to a key name on the lines for the attributes where it applies) to denote the extra bit of semantics covered. It will be clear that while such tricks are indeed possible, and potentially helpful in denoting modeled solutions to a problem that is indeed very common, such solutions once again only go as far as they go, and taking things further will ultimately result only in making the models we draw unreadable. Specifically in the context of temporal data management and temporal keys, observe for example that it is not necessarily the case that all range-valued attributes will _always_ have the _T "interpretation" for _all_ the keys in which they participate :
+----------------------------------+
! NOT_BEST_OF_EXAMPLES_BUT_ANYWAY !
+----------------------------------+------+
! PRESIDENTIAL_TERM year_range ! K1 !
! DURING date_range ! K1_T !
! PRESIDENT_NAME ... ! !
+----------------------------------+------+
( think 70-74 : 70-73 : NIXON && 70-74 : 74-74 : FORD )
Once again the conclusion seems warranted that extending expressiveness/notational support beyond current common practices, will quickly result in making the models more unreadable and thus less informative, rather than more informative.
Uniqueness ter
Another variation on the theme of uniqueness rules is the problem of enforcing uniqueness on only a (proper) subset of all occurrences of an entity type. Say you have
+----------------------------------+
! CAR_LICENSE_PLATE !
+----------------------------------+------+
! CAR_CHASSIS_ID ... ! K1 !
! TAX_LICENSE_PLATE ... ! K2 !
! CAR_STILL_IN_ACTIVE_USE bool ! !
+----------------------------------+------+
and for the purpose of re-using license plate numbers, you want to enforce license plate uniqueness (key K2) only for those cars that are still in active use (there are admittedly better solutions to this problem than the one modeled here, which I sometimes call the ultra-poor man's historical database, but it does serve to illustrate my point).
The aspect of the problem that makes the "key" "not documentable" is precisely the subsetting rule, i.e. the fact that the "key" is not to be enforced on the whole CAR_LICENSE_PLATE entity, but just on the subset of it that, once the database is implemented in SQL, could be found by issuing SELECT ... FROM CAR_LICENSE_PLATE WHERE CAR_STILL_IN_ACTIVE_USE = true;
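For what it's worth, some SQL products can enforce precisely this subsetted uniqueness with a partial (filtered) unique index ; shown here in PostgreSQL syntax, as a sketch :

CREATE TABLE car_license_plate (
  car_chassis_id          text    PRIMARY KEY,  -- K1
  tax_license_plate       text    NOT NULL,
  car_still_in_active_use boolean NOT NULL
);

-- "K2", enforced only on the subset of rows still in active use
CREATE UNIQUE INDEX active_plate_unique
  ON car_license_plate (tax_license_plate)
  WHERE car_still_in_active_use;

Enforcing it is one thing, though ; documenting it in a conceptual diagram is another.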
If we absolutely wanted to be able to document the existence of this key, using the available means of adding a "Kn" annotation in the rectangles, we'd have to add a separate rectangle for the subsetted CAR_LICENSE_PLATE entity, and then we'd have shifted the problem to documenting the definitional connect/dependency of this new rectangle with/on the "original" one, the "full" entity. That is, we've transformed the problem into one of conceptually documenting "view definitions" (and that very idea is probably seriously questionable in itself already because including the "definitional connect" smacks quite a bit of conflating conceptual/logical). Once again, our modeling language will let us go only as far as it goes.
(still to be continued)
Friday, June 6, 2014
Conceptual vs. Logical modeling (once again)
http://www.databaseanswers.org/data_models/assets/index.htm
was presented to me as an example case in a discussion on [data] modeling at
https://www.linkedin.com/groups/data-model-SAP-Bill-Material-2357895.S.270612382
(group membership is required to view) and in particular as a case for fleshing out the distinctions between what I call 'conceptual' and 'logical' modeling. The distinction is that "conceptual is informal, logical is formal", or "conceptual is typically incomplete, logical is always complete", where "complete" is meant in the sense of "full disclosure of all relevant information". This post intends to clarify further what, exactly, I mean by that.
One model being incomplete and another one being complete means there are differences in "informational value". What & where are those differences ? The discussion will be split into two distinct aspects : one of structure and one of integrity (the integrity part will be kept for a later post).
Structure.
Take a look at a single rectangle in the example model, say "Assets". Disregard the mentions of PK/FK for the time being ; they're related to integrity, not to structure per se. What information is present in the rectangle ?
- a rectangle heading telling us that the rest of the rectangle informs us further of the nature of a concept called "Assets".
- a rectangle body telling us that the concept "Assets" is defined to have properties named "Asset_ID", "Asset_Name", ...
And what information is _not_ present in the rectangle ?
- Most notably, the type information for each property. Now, I've seen many similar models that _did_ include this information. Usually it was limited to the set of type names known to be supported by the DBMS the system was going to be implemented on (so much for DBMS-agnostic modeling !). Sometimes the set of type names used would include things such as "COORDINATE", indicating that some domain/type of that name is supposed to exist, and supposed to be known and understood correctly by the reader. And it's the double "supposed" in there that makes such models informal/incomplete still.
- A very nasty one, and very hideous : optionality !!! Take a look at the Date_Disposed_of property. Is that property going to have an "assigned value" for each and every occurrence/instance of an "Assets" type ??? Presumably not. While it is not invalid per se to introduce some kind of concept of "nullability" at the conceptual level (of entity attributes), the thing is : the logical level's "full disclosure" requirement implies that the diagrams must then show that information !!! (I've seen at least one dialect that used '#'/'*' symbols on the left side in the rectangles to achieve this.) And the second thing is : diagramming languages such as E/R and UML already have a notion of optionality (albeit not for attributes but between entities), so adding a second notion of it, of necessity, introduces multiple ways of saying the same thing. Singling out the 'disposed_of' attribute in its own "Assets_Disposed" entity (with the obvious one-to-at-most-one relationship) will do the job, but is often considered "poor practice" because of the "entity bloat" it creates (and the corresponding reduction of opportunity to "inspect just once and get the whole picture"). Otoh, it is precisely what would _need_ to be done to achieve a relational version of the logical information model, since the relational model does not allow [values for] attributes to be missing.
- Also not incorporated in the model shown by this example, but the notion does exist : the distinction between "weak" E/R entities and "strong" E/R entities. There are modeling dialects that would _NOT_ include the "Asset_ID" attribute in the "Asset_Valuations" entity, and the reader is supposed to infer, somehow, that "Asset_Valuations" is indeed a "weak" entity and additionally requires its "parent entity"'s primary key before "an occurrence of it can come into existence". This particular approach induces interpretation ambiguities in two cases : (a) the parent entity has >1 identifying key (solvable only by introducing yet other artefacts such as distinctions between "primary" and "secondary" keys), and (b) the child entity has two relationships (think bill-of-material structures) to the same parent entity (there will have to be two distinctly named attributes, so you can't assume "same name as the primary key attributes in the parent", but the modeling approach of leaving them unmentioned means you can't specify which will be which). This actually belongs more in the discussion on constraint specification, so I'll pick the subject up there again if I don't forget it.
- Also quite hideous as well as nasty : the meaning of the damn thing !!! Will drawing a rectangle and labeling it "Assets" ensure that the understanding derived from it by some individual will be _exactly_ the same as that derived by some other individual ? _sufficiently_ the same ? And even if so, how will you know of any differences in understanding ? Posing these questions is answering them. Looking at the drawing, all readers will just be nodding yes, because with sufficient details abstracted away, your beautiful drawing simply matches their perception of reality. I used to have a catchphrase : "After you've abstracted away all the differences, whatever remains is identical."
- All the type information. To begin with, that's a complete and fully specced inventory of all the data types that will be used in the rest of the model. And "fully specced" really means "fully specced" here : e.g., just saying "INTEGER" will _not_ be sufficient if there is any risk that any reader might misinterpret the range of numbers "covered" by this name. Sometimes it _is_ interesting for a user to know that 100000 is not an INTEGER (because >32767), or to know that -1 is not an INTEGER (because only positive numbers were anticipated). For a central bank deciding to introduce negative interest rates, it might be interesting to know that some of their IT systems had not anticipated this and defined the domain for interest rates as something like the range from 0.0000 to 99.9999 ... And for types such as "coordinate", there is nothing in the name to suggest whether these are 2D or 3D coordinates ( (x,y) pairs vs (x,y,z) triples ). Formal completeness requires one to state something like :
COORDINATE : {(x,y,z) | x IN FLOAT && y IN FLOAT && z IN FLOAT}
This definition itself depends on a definition for a thing called FLOAT. This one in turn could be defined as
FLOAT : { (m,e) | m IN NN && e in NN && e>=-128 && e<=127 && m>=... && m<=...}
Now we depend on a definition for NN. It will be clear that somewhere, somehow, something inevitably has "got to be given". Fortunately, that something can be as simple as "the set of Natural Numbers" that everyone should know from 2nd grade math classes, or thereabouts. And if misunderstandings and/or communication problems boil down to a lack of agreement/common understanding of what the set of natural numbers is, well, then there will be very little any modeling language/methodology could possibly do to address that.
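In SQL terms, the nearest thing to such a fully specced type is a domain with an explicit constraint. A sketch (CREATE DOMAIN is standard SQL, though support varies by product ; the names and bounds here are just illustrations) :

-- an interest rate type that does anticipate negative rates
CREATE DOMAIN interest_rate AS numeric(7,4)
  CHECK (VALUE >= -99.9999 AND VALUE <= 99.9999);

-- a "small integer" whose range is spelled out rather than assumed
CREATE DOMAIN int16 AS integer
  CHECK (VALUE >= -32768 AND VALUE <= 32767);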
- Assuming we are defining a fully formal logical structure according to the relational model (as distinct from "the graph-based model", which may be conceivable/imaginable, but has unfortunately never been elaborated/formally spelled out the same way the RM has been), _all_ the attributes of the relational structures, plus the type they're of (those types having been formally defined in the previous step).
concrete example :
ASSETS : { (Asset_ID, Asset_Category_Code, Asset_Type_Code, Asset_Name, Asset_Description, Date_Acquired, Date_Disposed_of, Other_Details) |
Asset_ID IN ... &&
Asset_Category_Code IN ... &&
Asset_Type_Code IN ... &&
Asset_Name IN ... &&
Asset_Description IN ... &&
Date_Acquired IN GREG_CAL_DATE &&
Date_Disposed_of IN GREG_CAL_DATE &&
Other_Details IN ... }
- Still assuming we are defining a fully formal logical structure according to the relational model (such that attributes cannot ever be null), the relational structures will be split out in separate parts whenever some attributes of a conceptual entity are optional.
concrete example :
ASSETS : { (Asset_ID, Asset_Category_Code, Asset_Type_Code, Asset_Name,
Asset_Description, Date_Acquired, Date_Disposed_of, Other_Details) |
Asset_ID IN ... &&
Asset_Category_Code IN ... &&
Asset_Type_Code IN ... &&
Asset_Name IN ... &&
Asset_Description IN ... &&
Date_Acquired IN GREG_CAL_DATE &&
Other_Details IN ... }
ASSETS_DISPOSED : { (Asset_ID, Date_Disposed_of) |
Asset_ID IN ... &&
Date_Disposed_of IN GREG_CAL_DATE }
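The same split, rendered as SQL DDL (a sketch ; types abbreviated, and the REFERENCES clause anticipates the integrity discussion deferred to a later post) :

CREATE TABLE assets (
  asset_id      integer PRIMARY KEY,
  asset_name    text    NOT NULL,
  date_acquired date    NOT NULL
  -- ... the remaining mandatory attributes ; note the absence of a
  -- nullable date_disposed_of column
);

CREATE TABLE assets_disposed (
  asset_id         integer PRIMARY KEY REFERENCES assets,
  date_disposed_of date    NOT NULL
);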
- And it will also have to include a precise statement of the so-called "external predicate" for each relational structure so defined. Don't underestimate the importance of this. An SQL table has a very precise intended meaning, and too often I've seen maintenance developers think "oh I can reuse this table for my current purpose", looking exclusively at its structure and ignoring/denying/disregarding the precise, current intended meaning completely. It is in fact because of this associated intended meaning that "re-using existing database tables" is, in principle, the WORST POSSIBLE IDEA a db designer can have. Except if he is 100% certain that the current external predicate matches _exactly_ with the "external predicate" he has to "introduce" into the database for achieving "his current purpose". This is most unlikely to be the case, except if the table is an EAV table, and I've already dealt with why that approach sucks (in 99.9999% of the cases).
(Incidentally, if you were struck by a remarkable resemblance between the way the data type definitions were stated in the foregoing point, and the way the relational structures were stated in the last, it is exactly this aspect of their being related or not to such an "external predicate" that makes the difference. Data type definitions are just that, formal ways to define _values_ that are usable in the _relational structures that will make up the database_ to represent _meaning_. Values in data types do not carry meaning, the relational structures that make up the database do. E.g. "The asset identified by §Asset_ID§ has been disposed of on §Date_Disposed_of§". Note the placeholders in between §§ marks, and that the placeholders correspond 1-1 with the attribute names. Such a phrase could be the "external predicate" for the ASSETS_DISPOSED relational structure, and since it defines the logical meaning of the content that will be held in the concerned relational structure, it should always be an integral part of the logical model defining the database.)
To begin with a language of pure maths. One such as used in "Applied Mathematics for Database Professionals". Very commendable reading, BTW, even if I have to add the footnote that the book doesn't really bother with formal type definitions, contenting itself to rely on a type system such as that offered by Oracle (this had mainly to do with the 9 different DBMS's the authors had been using the most throughout their careers : Oracle 4, Oracle 5, Oracle 6, Oracle 7, Oracle 8, Oracle 9, Oracle 10, Oracle 11 & Oracle 12 - end of jocular sidenote).
Anyways. Math-like language offers everything we need to express precise type definitions and precise definitions of relational structures such as :
FLOAT : { (m,e) | m IN NN && e in NN && e>=-128 && e<=127 && m>=... && m<=...}
COORDINATE : {(x,y,z) | x IN FLOAT && y IN FLOAT && z IN FLOAT}
ASSETS_DISPOSED : { (Asset_ID, Date_Disposed_of) | ... }
But we're left in a bit of trouble when wanting to express external predicates for our relational constructs in such a language. Of necessity so, of course : the thing is termed "external" for good reason (external = external to the system of mathematical computation that is the DBMS, hence it's a bit contradictory to expect them to be expressible in math language) !
And those not well versed in using math formulae will of course quibble that they don't see themselves manipulating models expressed in such a language, and that they want an alternative. Such an alternative exists in the form of a subset of the statements of a programming language such as Tutorial D : in particular, the set of statements in that language that most developers will be inclined to label "declarative" statements (the following examples are only loosely inspired by, and not 100% valid, Tutorial D) :
TYPE FLOAT (M INT, E INT) CONSTRAINT E>=-128 && E<=127 && M>=... && M<=... ;
TYPE GREG_CAL_DATE (D INT, M INT, Y INT) CONSTRAINT ........................................... ;
VAR ASSETS_DISPOSED RELATION {Asset_ID ... , Date_Disposed_of GREG_CAL_DATE } ;
Documenting the external predicate for all the VARs (= the relational structures that make up the database, "tables" in SQL) is a matter of adding comments, or (better) some kind of PREDICATE subclause in syntaxes such as the one used here as an example.
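Mainstream SQL has no PREDICATE subclause, but several products at least let you attach the text to a table as a comment. For a table like the ASSETS_DISPOSED one above, that could look like this (a sketch in PostgreSQL/Oracle style ; the DBMS stores the text but attaches no semantics to it) :

-- the external predicate, recorded as documentation only
COMMENT ON TABLE assets_disposed IS
  'The asset identified by §Asset_ID§ has been disposed of on §Date_Disposed_of§';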
The nice thing about such a declarative approach is that these kinds of formal spec are parseable. In a language such as Tutorial D, it means the logical definition of the database structure could be made known to any program by a mere "import logical_model" directive. In environments using other languages, it means that stuff such as Hibernate classes can be generated 100% automagically.
And just to show that the syntax (the "how") of the language is, actually, completely irrelevant, and the only thing that matters is the meaning of the information it conveys (the "what"), here's an example of how the same information could be expressed in a (hypothetical) language that is much more OO-like :
immutable class FLOAT {
int m;
int e;
constraint e>=-128 && e<=127 && m>=... && m<=... ;
}
immutable class COORDINATE {
FLOAT x;
FLOAT y;
FLOAT z;
}
database class ASSETS_DISPOSED {
... Asset_ID;
GREG_CAL_DATE Date_Disposed_of;
predicate "The asset identified by §Asset_ID§ has been disposed of on §Date_Disposed_of§";
}
(If you don't recognize this immediately as an OO-style syntax, imagine "public" in front of every text line in there.) Some of the "weird-looking" stuff ("immutable class"/"database class"/"constraint"/"predicate") in this example is deliberately intended to illustrate the kind of things that currently existing OO languages are typically still lacking in order to make them more suitable for management of data that resides in a DB, or at least for cleaner and better integration between the programming side of things and the data side of things.
To conclude on the matter of "structure" :
Any number of languages can exist to express an information model. The degree to which such a language allows one to express EVERY POSSIBLE RELEVANT ASPECT of [the content of] an information model is what determines its degree of suitability for expressing information models that can legitimately be regarded as fully formal logical models. That scale of suitability is a continuum, but no existing modeling languages/dialects actually achieve the full 100% of the expressiveness that is needed.
I'll be leaving anything to do with integrity for a next post (hopefully).
Tuesday, January 29, 2013
"Converting SQL to Relational Algebra", part II
When I wrote the first posts in this blog, it seemed to generate traffic from places I found rather curious, until it occurred to me that search engines are not entirely unlikely to present "Relational Model" as a useful search result to people who are searching for a Relation with a Model ...
Likewise, the former post on the same subject as this one drew quite some traffic, and it seems not unreasonable to assume that most of that traffic was from students who were looking for info on how to solve "Convert SQL to RA" problems, and who were not looking for my philosophical agonies over why these problems are even being taught ...
As somewhat of an apology to all those disappointed students, and to prevent further disappointments in the future, here is a concise set of guidelines, in the form of SQL_KEYWORD to RA_OPERATOR mappings, with whatever additional comments I think might be useful.
SELECT maps to (any combination of) PROJECT, RENAME and EXTEND.
If the select clause involves exclusively column names of the table defined in the FROM clause, then PROJECT is all that is involved.
If the select clause involves constructs such as <columnname> AS <othercolumnname>, then a RENAME is involved as well.
If the select clause involves scalar expressions (quantity+1, HOUR(<colname>), ...), then an EXTEND is involved. Note that if such scalar expressions are not followed by an AS <colname> construct, then there is no relational equivalent for this SQL, because the SQL yields an unnamed column, and the relational model and/or the relational algebra do not allow unnamed attributes. Note also that if the expression "quantity + 1" is used in a SELECT clause (and quantity is a column of the table in the FROM clause), then the EXTEND operation by itself will not "project away" the quantity attribute. Don't forget to explicitly PROJECT away such attributes in the RA formulation.
In all cases, observe that Relational Algebra does not allow duplicate rows, but SQL does. If the SQL
SELECT QUANTITY FROM ...
could possibly give rise to the same quantity appearing multiple times in the result table, then there simply isn't any relational-algebra equivalent of this query. In such cases, for the query to even have a relational-algebra equivalent, it is required that the SELECT is a SELECT DISTINCT.
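To pull the above together, a small example (the shipments table and its columns are hypothetical, and the RA rendering in the comments uses one common textbook style ; adapt it to whatever notation your course uses) :

SELECT DISTINCT city AS location,   -- RENAME city AS location
       quantity + 1 AS qplus        -- EXTEND with qplus := quantity + 1
FROM   shipments;                   -- then PROJECT over {location, qplus},
                                    -- which also projects away quantity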
FROM maps to CARTESIAN PRODUCT, NATURAL JOIN, or just nothing at all.
It maps to nothing at all if (and I think only if) its comma-separated list of table expressions has cardinality one. In that case, the table expression simply defines the input argument to the "surrounding" [combination of] PROJECT/RENAME/EXTEND.
It maps to NATURAL JOIN between pairs of table expressions (say, T1 and T2), if and only if, for all the columns that have the same name in T1 and T2, there is an equality condition in the WHERE clause (WHERE T1.<colx> = T2.<colx>).
In all other cases, it maps to CARTESIAN PRODUCT. Note that in CARTESIAN PRODUCT, the tables involved are not allowed to have any column names in common. You might need to introduce additional RENAMEs in order to resolve/do away with any such column name overlaps. Also note that whatever you write in front of a dot does not magically become part of the column name itself. Not in SQL, and, well, relying on it in some RA notation isn't a very attractive idea either.
WHERE maps to RESTRICT, or SEMIMINUS (aka ANTIJOIN), or SEMIJOIN.
It maps to RESTRICT if the WHERE clause involves boolean expressions comparing columns of the table against each other, or against literal values. E.g. WHERE <colname1> != <colname2>, WHERE HOUR(<colname>) BETWEEN 8 AND 18, ...
It maps to SEMIJOIN if the WHERE clause involves an EXISTS(...) construct. The nature of the inner SELECT embedded in that clause will give you the second argument for the SEMIJOIN. Note that this inner SELECT might contain a WHERE clause itself, and that the nature of this WHERE clause may dictate that additional RENAMEs might have to be applied to this second argument. For example,
SELECT ... FROM T1 WHERE EXISTS ( SELECT * FROM T2 WHERE T2.A = T1.B)
requires a RENAME to be applied to T2, renaming A to B, in order for RA's SEMIJOIN to expose the same behaviour as the SQL EXISTS(...).
And finally, WHERE maps to an ANTIJOIN if the WHERE clause involves a NOT EXISTS(...) construct. All remarks for SEMIJOIN apply here as well, mutatis mutandis.
If >1 of these constructs are involved in the WHERE clause, then these constructs will have a logical connective between them. Split the overall query in as many parts as are needed (one for the scalar conditions, one for each EXISTS and one for each NOT EXISTS). Each logical connective so eliminated becomes a NATURAL JOIN or INTERSECTION if the connective was AND, and becomes a UNION if the connective was OR. The nesting of these INTERSECTIONs and UNIONs is as dictated by the parentheses and/or the precedence rules between the AND/ORed expressions in the SQL.
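An example of such a split (tables t1 and t2 and their columns are hypothetical, and the RA line is schematic) :

SELECT DISTINCT * FROM t1
WHERE  t1.x > 0                      -- scalar condition : a RESTRICT
AND NOT EXISTS (SELECT * FROM t2     -- NOT EXISTS : an ANTIJOIN
                WHERE t2.a = t1.b);
-- RA sketch, assuming b is the only attribute name t1 shares with the
-- renamed t2 :
-- (t1 WHERE x > 0) INTERSECT (t1 ANTIJOIN (t2 RENAME a AS b))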
GROUP BY ... HAVING ... maps to nothing at all. Or more precisely, if you are studying for an exam, do as your course tells you, and try to forget it as soon as possible after you've passed the exam, because whatever you were told in your course/classes was probably about the biggest crap one could imagine.
The same applies if you don't see any GROUP BY ..., but your SELECT clause has an aggregate function in it, like, say, SELECT SUM(QUANTITY) FROM MYTABLE. That's logically the same as having an "empty" GROUP BY clause.
There do exist definitions/versions of Relational Algebra that can properly handle all this kind of aggregation-like functionality, but it is unlikely that you have been taught these. So, do as your course tells you and forget it ever after.
UNION maps, unsurprisingly, to UNION. But do note that basically the same remark applies as with SELECT/projection : Relational Algebra does not allow duplicate rows. Plain SQL UNION eliminates duplicates by itself, but if your query uses UNION ALL and could possibly give rise to duplicate rows, then it simply does not have any relational-algebra equivalent.
Modern SQL also has operators such as EXCEPT, INTERSECT, NATURAL JOIN, JOIN ...ON ... These map to relational difference, relational intersection, natural join, or some combination of natural joins and cartesian products, mostly in obvious ways.
FROM clause missing ... That's not standard SQL, but some products allow it (and as a consequence, some courses may even teach it). In principle, SELECT statements without a FROM clause are used to denote table literals. Use whatever notation your RA course used for denoting relation literals.
Tuesday, January 8, 2013
A letter by Carl Hewitt
Recently, an article by Erik Meijer entitled "All Your Database Are Belong to Us" created a bit of commotion among Relationlanders. One Carl Hewitt subsequently wrote a letter in support of Meijer's article. A few observations and personal thoughts about Hewitt's letter.
Hewitt writes :
"Relational databases have been very useful in practice but are increasingly an obstacle to progress due to the following limitations:"
This clearly implies that mr. Hewitt believes that, at some point in time, a relational database has actually existed somewhere, and that he has been able to observe, directly or indirectly, that that database was "useful in practice". Considering the overwhelming evidence that SQL is not relational, I wonder what "relational database" mr. Hewitt has so observed to be "useful in practice". Anyway. Is mr. Hewitt unaware of the difference between "relational technology" and "SQL technology" ? Or does he consider the difference (and its consequences) between the two so meaningless and futile that it does no harm at all to gloss over that difference and speak of SQL as if it were indeed relational ?
Besides. Looking at his arguments in support of his claim (inexpressiveness of the RA, for example), one cannot help but wonder if mr. Hewitt is even aware of the difference between a "database" (that's the word he used) and that thing that most knowledgeable people usually use the term "DBMS" for ... Can "inexpressiveness of the RA" possibly be a shortcoming of "an organized collection of data" (if so, how ?), or could it only possibly be a shortcoming of "the software system, the foundations of which are in RA, used to manage the collection of data" ?
What does that tell you about how thorough, accurate and meticulously precise mr. Hewitt tries/bothers to be in his published writings ?
Hewitt writes :
"Inexpressiveness. Relational algebra cannot conveniently express negation or disjunction, much less the generalization/specialization connective required for ontologies;"
Codd's Relational Algebra was proven equivalent to predicate calculus, no ? So that means that Codd's RA can express both negation and disjunction, right ? And subsequent definitions of RA that emerged over time (think, for example, of the addition of a transitive closure operation) did not exactly remove the MINUS operator from the existing algebra, right ? So that indicates that the key operative word in the claim is that vague qualification "conveniently", right ? Using such a word without being precise about its intended meaning is just cheap handwaving.
Anyway, the RA has MINUS (and its nephew SEMIMINUS, aka antijoin), and Relationlanders have known for over 40 years that this operator perfectly suits the purpose of expressing any predicate like "... and it is not the case that ...". It remains an open question what mr. Hewitt thinks is "inconvenient" about writing things such as "<xpr1> MINUS <xpr2>" in, say, Tutorial D code.
Also, there is nothing to stop anyone from defining an algebra that has a "complement" operation (well, so long as all the domains/types are finite, presumably). This algebraic operation by itself is the exact relational counterpart of the logical operation of negation, taken by itself. Having to actually compute complements will be contraindicated in most circumstances, as it will typically involve actual computation of what is known in relationland as the "universal relation" for a given heading. All of that is probably exactly the reason why Codd did not want to include such a "complement" operation in his algebra.
At any rate, I'm still left wondering what mr. Hewitt's problem is here.
Hewitt writes :
"Inconsistency non-robustness. Inconsistency robustness is information-system performance ..."
Note very carefully that Hewitt's complaint here is that the RM lacks "inconsistency robustness", which he then defines to be a performance characteristic. Performance is not a characteristic of the model. Anyway. Once writers start going down this alley, readers can already suspect the kind of, eurhm, "talk" that is about to follow ...
"... in the face of continually pervasive inconsistencies, a shift from the once-dominant paradigms of inconsistency denial and inconsistency elimination attempting to sweep inconsistencies under the rug. In practice, it is impossible to meet the requirement of the Relational Model that all information must be consistent, but the Relational Model does not process inconsistent information correctly. ..."
"Inconsistent" in the context of data[base] management, means the presence of information that is in violation of some stated rule that is supposed to apply to the database. Or iow, that accepting the "inconsistent" information in the database, makes the proposition that states that the violated rule holds, a false one. Or iow, the proposition that states that the rule holds in the database, is in contradiction with the "inconsistent" information. Or iow, accepting inconsistent information in the database is tantamount to accepting contradictions.
And I've been told that it is a proven property (logicians call it ex falso quodlibet, the "principle of explosion") that you can prove really just anything from a contradiction : from p and not-p, or-introduction gives p-or-q, and disjunctive syllogism with not-p then yields q, for any q whatsoever.
In RM, it is "possible" to consider "inconsistent information", just like in 2-valued propositional and/or predicate logic, it is "possible" to consider contradictory propositions/predicates. Querying an RM system that holds "inconsistent information" is like applying the rules of logical reasoning to a set of contradictory axioms/premisses. And blaming the RM for "not processing inconsistent information correctly", is like blaming logic for "not dealing with contradictions correctly" (where by 'correctly', it is implied that it should be something other than just 'false', of course).
"... Attempting to use transactions to remove contradictions from, say, relational medical information is tantamount to a distributed-denial-of-service attack due to the locking required to prevent introduction of new inconsistencies even as contradictions are being removed in the presence of numerous interdependencies;"
Anyone who is familiar with the RM, and also with how it is typically criticized, will recognize this immediately as criticizing the model on grounds of implementation issues (locking and transactions), which are orthogonal to the model. Typical.
Hewitt continues :
"Information loss. Once information is known, it should be known thereafter;"
Should it really ? I dispute that. A requirement to never ever ERASE or DELETE or REMOVE just anything will inevitably bring us to the point where planet earth does not have enough atoms to store all our stuff.
At any rate, there is absolutely nothing in the RM that prevents any database designer from defining structures that keep a record of "which information was known to the database owner during which period of time" and/or "at which point in time the database owner regarded this particular piece of information as no longer relevant and removed it from his operational system". Even SQL has included features to support such stuff in the new 2011 standard.
And in the end, when to DELETE a piece of information, should be at the user's discretion (if regulatory bounds apply, then that user should of course be staying within those bounds, if that wasn't obvious), not at the model's discretion.
Hewitt still hasn't finished :
"Lack of provenance. All information stored or derived should have provenance;"
There is absolutely nothing in the RM to stop a DBMS user from defining structures that record exactly the kind of "provenance" information that Hewitt is talking about (whatever that may be), there is nothing in the RM to stop a DBMS designer from building facilities to automagically populate such structures, and there is nothing in the RM to stop a DBMS user from using such facilities.
Nor should there be any such thing in the RM.
And on and on it goes :
"Inadequate performance and modularity. SQL lacks performance ..."
So once again he unambiguously states that he thinks that "The RM is obsolete" (that's what the title says) because "SQL lacks performance". OMFG.
"... because it has parallelism but no concurrency abstraction. Needed therefore are languages based on the Actor Model (http://www.robust11.org) to achieve performance, operational expressiveness, and inconsistency robustness. To promote modularity, a programming language type should be an interface that does not name its implementations contra to SQL, which requires taking dependencies on internals."
Is the term "programming language type" used here with the same meaning as the term "data type" in relationland ? If so, then the last sentence seems to demand nothing else than that which relational advocates have been demanding for decades already : that the relational model should and must be "orthogonal to type", that is, that it is not up to the model to prescribe which data types should exist/be supported, and which shouldn't.
Of course, relationlanders have known for a long time already that SQL basically flouts that idea, and that its attempts at supporting -in full- the notion of abstract data types are quite crippled and half-baked. But apparently it does indeed seem to be the case that mr. Hewitt mistakenly equates "the relational model" with SQL.
And Hewitt concludes :
"There is no practical way to repair the Relational Model to remove these limitations."
Well, this is the first time I can agree with something Hewitt says. Sadly, as far as I can tell, "these limitations" are not limitations that derive from the Relational Model, rather they seem to derive from his limited understanding thereof. And indeed there is no repairing a piano when the problem is in its player.
Hewitt writes :
"Relational databases have been very useful in practice but are increasingly an obstacle to progress due to the following limitations:"
This clearly implies that mr. Hewitt believes that at least once upon some time, a relational database has actually existed somewhere, and that he has been able to observe, directly or indirectly, that that database was "useful in practice". Considering the overwhelming evidence that SQL is not relational, I wonder which "relational database" mr. Hewitt has observed to be "useful in practice". Anyway. Is mr. Hewitt unaware of the difference between "relational technology" and "SQL technology" ? Or does he consider the difference (and its consequences) between the two so meaningless and futile that it does no harm at all to gloss over that difference and speak of SQL as if it were indeed relational ?
Besides. Looking at his arguments in support of his claim (inexpressiveness of the RA, for example), one cannot help but wonder if mr. Hewitt is even aware of the difference between a "database" (that's the word he used) and that thing that most knowledgeable people usually use the term "DBMS" for ... Can "inexpressiveness of the RA" possibly be a shortcoming of "an organized collection of data" (if so, how ?), or could it only possibly be a shortcoming of "the software system, the foundations of which are in RA, used to manage the collection of data" ?
What does that tell you about how thorough, accurate and meticulously precise mr. Hewitt tries/bothers to be in his published writings ?
Hewitt writes :
"Inexpressiveness. Relational algebra cannot conveniently express negation or disjunction, much less the generalization/specialization connective required for ontologies;"
Codd's Relational Algebra was proven equivalent to predicate calculus, no ? So that means that Codd's RA can express both negation and disjunction, right ? And subsequent definitions of RA that emerged over time (think, for example, of the addition of a transitive closure operation) did not exactly remove the MINUS operator from the existing algebra, right ? So that indicates that the key operative word in the claim is that vague qualification "conveniently", right ? Using such a word without being precise about its intended meaning, is just cheap handwaving.
Anyway, the RA has MINUS (and its cousin SEMIMINUS, aka antijoin), and Relationlanders have known for over 40 years that this operator perfectly suits the purpose of expressing any predicate like "... and it is not the case that ...". It remains an open question what mr. Hewitt thinks is "inconvenient" about writing things such as "<xpr1> MINUS <xpr2>" in, say, Tutorial D code.
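By way of a concrete sketch (in SQL, over two hypothetical union-compatible tables R1 and R2, each with a single column X ; all names here are mine, not Hewitt's) : negation and disjunction come out as one operator each.
-- "in R1 and it is not the case that it is in R2" : relational difference
SELECT X FROM R1
EXCEPT
SELECT X FROM R2 ;
-- "in R1 or in R2" : relational union
SELECT X FROM R1
UNION
SELECT X FROM R2 ;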
Also, there is nothing to stop anyone from defining an algebra that has a "complement" operation (well, so long as all the domains/types are finite, presumably). This algebraic operation by itself is the exact relational counterpart of the logical operation of negation, taken by itself. Having to actually compute complements will be contraindicated in most circumstances, as it will typically involve actual computation of what is known in relationland as the "universal relation" for a given heading. All of that is probably exactly the reason why Codd did not want to include such a "complement" operation in his algebra.
At any rate, I'm still left wondering what mr. Hewitt's problem is here.
Hewitt writes :
"Inconsistency non-robustness. Inconsistency robustness is information-system performance ..."
Note very carefully that Hewitt's complaint here is that the RM lacks "inconsistency robustness", which he then defines to be a performance characteristic. Performance is not a characteristic of the model. Anyway. Once writers start going down this alley, readers can already suspect the kind of, eurhm, "talk" that is about to follow ...
"... in the face of continually pervasive inconsistencies, a shift from the once-dominant paradigms of inconsistency denial and inconsistency elimination attempting to sweep inconsistencies under the rug. In practice, it is impossible to meet the requirement of the Relational Model that all information must be consistent, but the Relational Model does not process inconsistent information correctly. ..."
"Inconsistent" in the context of data[base] management, means the presence of information that is in violation of some stated rule that is supposed to apply to the database. Or iow, that accepting the "inconsistent" information in the database, makes the proposition that states that the violated rule holds, a false one. Or iow, the proposition that states that the rule holds in the database, is in contradiction with the "inconsistent" information. Or iow, accepting inconsistent information in the database is tantamount to accepting contradictions.
And I've been told that it is a proven property (logicians call it ex falso quodlibet, the principle of explosion) that you can prove really just anything from a contradiction.
In RM, it is "possible" to consider "inconsistent information", just like in 2-valued propositional and/or predicate logic, it is "possible" to consider contradictory propositions/predicates. Querying an RM system that holds "inconsistent information" is like applying the rules of logical reasoning to a set of contradictory axioms/premisses. And blaming the RM for "not processing inconsistent information correctly", is like blaming logic for "not dealing with contradictions correctly" (where by 'correctly', it is implied that it should be something other than just 'false', of course).
"... Attempting to use transactions to remove contradictions from, say, relational medical information is tantamount to a distributed-denial-of-service attack due to the locking required to prevent introduction of new inconsistencies even as contradictions are being removed in the presence of numerous interdependencies;"
Anyone who is familiar with the RM, and also with how it is typically criticized, will recognize this immediately as criticizing the model on grounds of implementation issues (locking and transactions), which are orthogonal to the model. Typical.
Hewitt continues :
"Information loss. Once information is known, it should be known thereafter;"
Should it really ? I dispute that. A requirement to never ever ERASE or DELETE or REMOVE just anything will inevitably bring us to the point where planet earth does not have enough atoms to store all our stuff.
At any rate, there is absolutely nothing in the RM that prevents any database designer from defining structures that keep a record of "which information was known to the database owner during which period of time" and/or "at which point in time the database owner regarded this particular piece of information as no longer relevant and removed it from his operational system". Even SQL has included features to support exactly such stuff (system-versioned and application-time temporal tables) in the SQL:2011 standard.
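For illustration, here is a hedged sketch of such a structure using the system-versioned tables of SQL:2011 (syntax per the standard ; the table and column names are made up, and actual products deviate in details) :
CREATE TABLE room_temp (
  room CHAR(1),
  avg_temp INTEGER,
  sys_start TIMESTAMP(6) GENERATED ALWAYS AS ROW START,
  sys_end TIMESTAMP(6) GENERATED ALWAYS AS ROW END,
  PERIOD FOR SYSTEM_TIME (sys_start, sys_end)
) WITH SYSTEM VERSIONING ;
-- "what did the database hold about room A on new year's day ?"
SELECT avg_temp
FROM room_temp FOR SYSTEM_TIME AS OF TIMESTAMP '2012-01-01 00:00:00'
WHERE room = 'A' ;
Nothing ever needs to be forgotten, and yet DELETE remains available to the user.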
And in the end, when to DELETE a piece of information, should be at the user's discretion (if regulatory bounds apply, then that user should of course be staying within those bounds, if that wasn't obvious), not at the model's discretion.
Hewitt still hasn't finished :
"Lack of provenance. All information stored or derived should have provenance;"
There is absolutely nothing in the RM to stop a DBMS user from defining structures that record exactly the kind of "provenance" information that Hewitt is talking about (whatever that may be), there is nothing in the RM to stop a DBMS designer from building facilities to automagically populate such structures, and there is nothing in the RM to stop a DBMS user from using such facilities.
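To make that concrete, a minimal sketch (all names hypothetical) of a relvar design that carries its own provenance :
CREATE TABLE room_temp_prov (
  room CHAR(1),
  avg_temp INTEGER,
  source VARCHAR(100),      -- e.g. sensor id, document reference, informant
  recorded_by VARCHAR(100), -- the user or process that asserted the fact
  recorded_at TIMESTAMP(6)  -- when the fact was asserted
) ;
Whether such columns are populated by hand or "automagically" by some DBMS facility is an implementation matter, not a model matter.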
Nor should there be any such thing in the RM.
And on and on it goes :
"Inadequate performance and modularity. SQL lacks performance ..."
So once again he unambiguously states that he thinks that "The RM is obsolete" (that's what the title says) because "SQL lacks performance". OMFG.
"... because it has parallelism but no concurrency abstraction. Needed therefore are languages based on the Actor Model (http://www.robust11.org) to achieve performance, operational expressiveness, and inconsistency robustness. To promote modularity, a programming language type should be an interface that does not name its implementations contra to SQL, which requires taking dependencies on internals."
Is the term "programming language type" used here with the same meaning as the term "data type" in relationland ? If so, then the last sentence seems to demand nothing else than that which relational advocates have been demanding for decades already : that the relational model should and must be "orthogonal to type", that is, that it is not up to the model to prescribe which data types should exist/be supported, and which shouldn't.
Of course, relationlanders have known for a long time already that SQL basically flouts that idea, and that its attempts at supporting -in full- the notion of abstract data types are quite crippled and half-baked. But apparently it does indeed seem to be the case that mr. Hewitt mistakenly equates "the relational model" with SQL.
And Hewitt concludes :
"There is no practical way to repair the Relational Model to remove these limitations."
Well, this is the first time I can agree with something Hewitt says. Sadly, as far as I can tell, "these limitations" are not limitations that derive from the Relational Model, rather they seem to derive from his limited understanding thereof. And indeed there is no repairing a piano when the problem is in its player.
Wednesday, September 19, 2012
"Converting SQL to Relational Algebra"
It's that time of the year again. Introductory courses in relational data management (I'm a bit reluctant to call them "courses in relational theory") are starting again, all over the world, and consequently, students seeking advice and assistance from "professionals" start posing "Converting SQL to Relational Algebra" questions again on various database-related discussion fora, when they are not able to solve assignments given to them in class, or when they are uncertain their solution is correct.
I would actually like any of the teachers who give such assignments to try and explain to me what purpose they hope to be achieving by giving such assignments. What do students learn from this ? I recently got involved in such a "Converting SQL to Relational Algebra" question, and one remark the student made, in the discussion that ensued, confirms my feeling regarding this : they learn nothing at all, and "Converting SQL to Relational Algebra" is completely pointless, only adding to the confusion instead of resolving it. (The remark was, literally : "we do not either understand why we are learning this".)
Why is this so ? Well, it's because Relational Algebra is foundational, and SQL is not. That's because Relational Algebra is completely abstract, and SQL is not. Not in the same sense that Relational Algebra is abstract. SQL is abstract with respect to the procedures and algorithms that the implementing engine executes in order to carry out a given SQL command. But that is not quite as "profound" or "foundational" a level of abstraction as the one the Relational Algebra can be said to have.
SQL is just a syntax for expressing formulae of the RA. One possible syntax for it. Of the plethora of distinct possible syntaxes that can be conceived to do the same. And SQL is an awful syntax for that purpose, come to that. That's because initially, SQL was firmly rooted in predicate calculus, not in relational algebra. That SQL can express relational algebra, is only a "coincidental" consequence of the fact that predicate calculus and relational algebra were later proven equivalent. That historical reason is why, for example, relational difference had to be expressed using WHERE NOT EXISTS (...), and EXCEPT was only introduced in the standard decades later.
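To see the point, here is one and the same relational difference in both spellings (a sketch ; R1 and R2 are hypothetical union-compatible tables with a single column X, and the two forms coincide only give or take SQL's NULL and duplicate quirks) :
-- the calculus-flavoured spelling, the only option in early SQL :
SELECT X FROM R1
WHERE NOT EXISTS (SELECT * FROM R2 WHERE R2.X = R1.X) ;
-- the algebra-flavoured spelling :
SELECT X FROM R1
EXCEPT
SELECT X FROM R2 ;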
Granted, if you teach RA first and only then SQL, you are headed for a shitload of questions : "why did they make that syntax so quirky if the algebra is so simple ?". The answer is always the same : poor historical reasons. BTW there's a great book forthcoming that fits this line of teaching perfectly. But at least the students would understand.
And the fact still remains that if students can first manage to understand RA, then they can understand any language that implements or supports it. They will then understand that, as far as RA operations are concerned, the difference between SQL and, say, Tutorial D, is a mere matter of syntax. They might understand that the Tutorial D expression 'R1 JOIN R2' is (give or take SQL's duplicated join columns in the first two forms) the very same thing as the SQL expressions 'SELECT * FROM R1,R2 WHERE R1.<attr> = R2.<attr>', 'SELECT * FROM R1 JOIN R2 ON ...' and 'SELECT * FROM R1 NATURAL JOIN R2'. And the reason they would understand, is precisely that their thinking would be in RA, not in SQL or any other specific language.
We urgently need to teach our database designers to think of data in terms of sets (sets of tuples, where those sets represent the extension of a logical predicate, and each tuple represents a proposition), and to think of manipulations on that data in terms of relational algebra. Not in terms of tables and SQL.
And it means that the teaching should no longer be about "Converting SQL to relational algebra", but instead about "Converting relational algebra to SQL". The latter, not the former, is the sensible way of doing things.
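One hedged example of that direction : start from the algebra, say the Tutorial D-style expression ( R1 JOIN R2 ) { A , B } (join, then project on A and B ; R1, R2 and the attribute names are hypothetical), and only then render it in SQL, remembering that SQL needs DISTINCT to restore set semantics after a projection :
SELECT DISTINCT A, B
FROM R1 NATURAL JOIN R2 ;
The algebraic expression is the thought ; the SQL is mere transcription.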
Tuesday, September 18, 2012
"Why EAV sucks"
We've already seen this one before as an illustration of a cartesian product establishing a tuple type :
I:INT X C:CHAR = {(I:1,C:'A') , (I:1,C:'B') , ...}
and one possible relation deriving from this :
R = { (I:1,C:'A') , (I:2,C:'B') }
or, rewriting those tuples as sets of pairs mapping attribute names to values of arbitrary domains :
R = { {(I,1),(C,'A')} , {(I,2),(C,'B')} }
In tabular form (a fairly typical form for representing a relation), this could become something like
I | C |
---|---|
1 | A |
2 | B |
Well, tables consist of rows and columns, and with this particular view of a relation, both rows and columns turn out to have an interesting property.
Beginning with the rows. It is obvious that one row contains precisely one tuple of the relation. Nothing more and nothing less. And one tuple corresponds to exactly one logical proposition that it represents. That logical proposition being a sentence, a statement of fact. For example, "Yesterday, the average temperature in room A was 1 degree centigrade.".
Then the columns. What is common for all the things mentioned in the column labeled 'I' ? They are all numbers originating from domain INT. Likewise for the things mentioned in the column labeled 'C'. In our data manipulation language, this means that if we have a reference to column (/attribute) 'C', we just know that what we're referring to is a CHAR value. We don't need to typecheck this at runtime. It's guaranteed by the typechecks that occurred when the tuple was inserted.
Remember these two. They're important.
A closer look at one of these two tuples themselves now.
{(I,1),(C,'A')}
That's a set of ordered pairs. The first member of each pair coming from the domain of 'valid attribute names', the second coming from ... Well from what exactly ? We can't say that the second member of all pairs here comes from INT, nor from CHAR, nor in fact from any domain we used to build tuple types and relations with. Rather, all we can say is that it's just some value from some domain, but we don't know which it is. (We can know which one it is, by inspecting the value of the first member (the attribute name), and then linking that somehow to the original way in which we defined our cartesian product/tuple type.)
Anyway. A "set of ordered pairs" is a relation, thus a tuple is itself also a relation. (But do keep the distinction in mind between "relation over the domains INT and CHAR" and "relation over the domains ATTRIBUTENAME and SET_UNION_OF_ALL_DOMAINS").
And if a tuple is itself also a relation, we could try out what we get if we wanted to represent it tabularly using the usual technique :
?????? | ?????? |
---|---|
I | 1 |
C | A |
Compare with the "interesting properties" of the (tabular representation of) the relation. Does the interesting property of the column still survive ? No it doesn't. Unless we find the fact of possibly getting back just anything from any expression in our data manipulation language an "interesting" property. Does the interesting property of the rows (tuples of a relation) still stand ? No it doesn't. It just tells us that there has been a temperature of 1 degree centigrade yesterday, in some further unspecified room, for example. (But note that in order to derive even that much information, we need to inspect the value of that "first column", and iterate over a set of used attribute names : "if it's C then that value is a room identifier. No it's not C. If it's I then that value is an average temperature. Ah yes that's the one.")
(Aside. The question marks are there for a reason. What would you put in their place ? Can the user of a system based on this paradigm of tuples-as-relations be given the freedom to choose his own preferred replacements for the question marks, in a sense similar to how the designer of a database is free to choose his attribute names when defining cartesian products aka tuple types ? End-of-aside.)
One step further in the predictable direction. Take a look at
?????? | ATTRIBUTENAME | VALUE |
---|---|---|
GUID-WXCVBN-AXIO | I | 1 |
GUID-WXCVBN-AXIO | C | A |
GUID-WXCVBN-AX1O | I | 2 |
GUID-WXCVBN-AX1O | C | B |
This table was obtained by prepending something very akin to a "tuple id", "entity id" or "object id" to the table, and then using those values for identifying "which attribute values belong together to form a tuple".
Looks like anything you know ? Rhetorical question. This is EAV. Strip the RM of its connection to predicate logic (that is, remove the connection with 'meaning') and strip the RM of the static type checking that is offered by its most foundational building block, the tuple (that is, remove the most fundamental data integrity feature of all), and what is left is EAV. As another commenter put it : "disassemble the tuple and amputate the predicate".
Some thought exercises. Assume you want to design a database like this. Assume the name in place of the question marks is 'OID'. Write all the needed constraints on this three-column table to effect the following :
- values corresponding to the ATTRIBUTENAME 'I' must be of domain INT.
- values corresponding to the ATTRIBUTENAME 'C' must be of domain CHAR.
- only complete tuples can be inserted. That is, if an I or a C is inserted, then so must the corresponding C or I be.
- only complete tuples can be deleted. That is, if an I or a C is deleted, then so must the corresponding C or I be.
These four constraints all derive naturally from a relational declaration as simple as (two examples)
VAR RELATION {I:INT C:CHAR} IC ;
CREATE TABLE IC (I INT, C CHAR) ;
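For contrast, here is a hedged sketch of what just rules three and four (tuple completeness) come down to on the EAV side, assuming the three-column table is named EAV with columns OID, ATTRIBUTENAME and VALUE : a query that must return no rows, to be re-checked on every insert and delete.
-- OIDs violating "every tuple has exactly one I and exactly one C" :
SELECT OID
FROM EAV
GROUP BY OID
HAVING COUNT(CASE WHEN ATTRIBUTENAME = 'I' THEN 1 END) <> 1
    OR COUNT(CASE WHEN ATTRIBUTENAME = 'C' THEN 1 END) <> 1 ;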
Which of the two approaches is the easier and the simpler ?
Further thought exercises.
A constraint must be introduced to the effect that "the I value must be less than or equal to the position of the C value in the alphabet." Write a query to check whether this constraint is currently satisfied by your EAV database, and spell out the strategy you're going to follow to enforce this rule. Is it simpler or easier or less work involved than the following ? (two examples)
CONSTRAINT IC_RULE1 ALL(IC, I<=INDEXOF(C,"ABCDEFG...XYZ")) ;
ALTER TABLE IC ADD CONSTRAINT IC_RULE1 CHECK (I <= POSITION(C IN 'ABCDEFGHIJKLMNOPQRSTUVWXYZ')) ;
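And a hedged sketch of checking that same rule on the EAV table (VALUE assumed to be stored as text, hence the cast, which itself assumes the earlier typing rule already holds) :
-- rows violating the rule ; must return no rows :
SELECT i.OID
FROM EAV i JOIN EAV c ON c.OID = i.OID
WHERE i.ATTRIBUTENAME = 'I'
  AND c.ATTRIBUTENAME = 'C'
  AND CAST(i.VALUE AS INTEGER) > POSITION(c.VALUE IN 'ABCDEFGHIJKLMNOPQRSTUVWXYZ') ;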
A constraint must be introduced to the effect that the I values are unique identifiers for any tuple. Write a query to check whether this constraint is currently satisfied by your EAV database, and spell out the strategy you're going to follow to enforce this rule, or write out the declarative constraint that will enforce this in your EAV database. Is it simpler or easier or less work involved than the following ? (two examples)
VAR RELATION {I:INT C:CHAR} IC KEY {I};
ALTER TABLE IC ADD PRIMARY KEY (I) ;
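The corresponding EAV-side check, once more a query that must stay empty (a sketch, same assumed table names as above) :
-- I values used by more than one "tuple" :
SELECT VALUE
FROM EAV
WHERE ATTRIBUTENAME = 'I'
GROUP BY VALUE
HAVING COUNT(*) > 1 ;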
Further thought exercise. Read this sentence carefully. Now look at the following EAV structure.
?????? | ATTRIBUTENAME | VALUE |
---|---|---|
GUID-WXCVBN-AXIO | POS | 1 |
GUID-WXCVBN-AXIO | WORD | Read |
GUID-WXCVBN-AXOO | POS | 3 |
GUID-WXCVBN-AXOO | WORD | sentence |
GUID-WXCVBN-AXO0 | POS | 4 |
GUID-WXCVBN-AXO0 | WORD | carefully |
GUID-WXCVBN-AX1O | POS | 2 |
GUID-WXCVBN-AX1O | WORD | this |
I could write that same sentence, in cryptic form, in four lines :
Word 1 in the sentence is Read.
Word 3 in the sentence is sentence.
Word 4 in the sentence is carefully.
Word 2 in the sentence is this.
And EAV even goes beyond that in turning that into 8 lines ! How much work and computation does it involve to reassemble the original sentence from this ?
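As a hedged answer to that question : reassembling the sentence from that EAV structure takes a self-join, a cast and a sort (names as in the sketches above) :
SELECT w.VALUE AS word
FROM EAV p JOIN EAV w ON w.OID = p.OID
WHERE p.ATTRIBUTENAME = 'POS'
  AND w.ATTRIBUTENAME = 'WORD'
ORDER BY CAST(p.VALUE AS INTEGER) ;
Compare that with simply reading four (position, word) rows out of a proper two-column relvar.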
All of that should be sufficient to illustrate how much veracity there is to the claim that "EAV makes database maintenance easier". Instead :
If you do not care about the data type of (the values in) your database fields, then EAV is your thing.
If you do not care about the meaning of the contents of your database, then EAV is your thing.
If you do not care about the integrity of your data, then EAV is your thing.
(In each of these cases, "do not care" is supposed to also include the notion that "the user doesn't mind paying all those extra hours you put in solving problems that the DBMS has already solved for you".)
Tuesday, September 11, 2012
"Types, Sets, Tuples, Headings, Relations, a set mess ???"
That's a long title, and it's intended to illustrate how all of that stuff that actually underpins the Relational Model of Data, mathematically speaking, may come across as "one big tangled mess of almost-but-not-quite similar-or-equal concepts all thrown on a big pile".
And the body of this installment is about illustrating how it really is not that "perceived mess".
The following example has been used more than once already in this little corner of the internet.
I:INT X C:CHAR = {(I:1,C:'A') , (I:1,C:'B') , ...}
At this point, I need to take one more step back to basics. The foregoing definition assumed it to be "known" or "agreed-on" what those domains INT and CHAR precisely were, respectively. I.e. what their constituent members are. The thing to note is that those "domains" are sets, and their members are values :
INT = {1 , 2 , ...}
CHAR = {'A' , 'B' , ...}
And that's really all it takes to be a type. A type is a set of values. And a value can be whatever we want it to be. If I have a good use for the set of guitar chords, I can choose that to be a type. If I have a good use for the set {cloud, oak, armchair} or {scissors, stone, net}, I can choose that to be a type.
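In SQL terms, declaring such a set as a type could look like this (a sketch ; CREATE DOMAIN is standard SQL, though product support varies) :
CREATE DOMAIN RPS AS VARCHAR(8)
  CHECK ( VALUE IN ('scissors', 'stone', 'net') ) ;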
(Aside : of course, in a scenario of information exchange, the parties participating in the exchange need to be in full and perfect agreement on what the types are (which values are in it and which are not). Exchange without such a full agreement is unpleasant surprises in the air and accidents waiting to happen.)
Back to
I:INT X C:CHAR = {(I:1,C:'A') , (I:1,C:'B') , ...}
Obviously, by definition in fact, such a Cartesian product is itself a set. Is it then also a type ? The predictable answer is of course 'yes'. A Cartesian Product is a set of tuples, and if we regard these tuples as being values, then any Cartesian Product of domains constitutes a tuple type.
One step further. Relations were defined to be any subset of this cartesian product. Any set is a subset of itself, thus any cartesian product is also a subset of itself, thus any cartesian product is a relation, and since each cartesian product constitutes a tuple type, at least this "maximal" relation ("maximal" subset of the cartesian product of the domains) can also be said to constitute a tuple type. But all other relations derived from the same cartesian product are also "just sets of tuples". Likewise, really just any relation can be thought of as "constituting a tuple type". This holds true even of the empty relation, even though it's fairly counterintuitive to imagine a type that has no values ... (What could we possibly be using that for ???)
Is all of that important ? The view of "really just any relation" as itself being, or constituting, a tuple type ? Not really. Except perhaps for this. One could "subdivide" the set of all possible subsets of our cartesian product in three distinct "classes". The first class consisting of just the subset that is the original cartesian product itself. The second class being those subsets that can be defined "predicatively", by stating a constraining additional predicate such as, e.g. "I equals 1". And the third class being "the others". Pragmatically speaking, this third class can only be defined "by enumeration". (Formally, there is no real distinction between "by enumeration" and "by predicate", because stating the enumeration is, from the perspective of logic, the very same thing as stating a predicate.)
The third class of subsets is thus characterized by the fact that its definition "looks like", e.g. "It's the tuple type I:INT C:CHAR where either I equals 1 and C equals 'Z', or I equals 5 and C equals 'W'". Which is a predicate, but one that looks suspiciously much like a full enumeration.
The second class of subsets is thus characterized by the fact that its definition "looks like", e.g. "It's the tuple type I:INT C:CHAR where I equals 1". A reasonably concise way to define possibly useful subsets/types.
The first class is thus characterized by the fact that its definition is just simply, "It's the tuple type I:INT C:CHAR". The 'I:INT X C:CHAR' portion is the heading. The (only) tuple type in this class can thus be said to be characterized by its heading alone. All other conceivable tuple types with the same heading need an additional constraining predicate to define which tuples are and are not members of the type. This is the reason why in practice, this particular tuple type is often called "the" tuple type corresponding to some heading (as if there is only one such type). Formally, this is rather incorrect. Many tuple types have the same heading, but the one that has all possible tuples is the most useful in practice.
The fact that relations are themselves also tuple types, is the source of quite a bit of confusion. The "design dilemma", for example. (That's the question of whether structures such as 'customer' and 'address' should be given a type of their own or not.) It's not a dilemma at all. Defining a Customer or Address class in an OO system constitutes defining a type. The tuple type that corresponds to it is a type too. Whether or not to give the type a name, is another question, and one that will mostly be moot at that.
On to relations. If relations are themselves values, then what is their type ? Well, a type is the set of all its contained values. Hence, a relation type is a set of relations. The construction of such a set of relations can be done using the mathematical concept of powersets. A powerset is the set of all possible subsets of a given set. So if we :
- start with a tuple type such as the one in our running example,
- enumerate all the possible subsets of that tuple type (that's equivalent to enumerating all the possible relations with that heading)
- and collect all those subsets in a new set,
then this new set is our relation type.
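A tiny worked example : take a tuple type holding just two tuples, t1 and t2. Its powerset, and thus the relation type it gives rise to, has 2^2 = 4 members :
POWERSET( {t1 , t2} ) = { {} , {t1} , {t2} , {t1 , t2} }
A tuple type with N tuples thus yields a relation type with 2^N relations, which is one more reason such types get characterized by their heading rather than by enumeration.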
This relation type too is characterized by its heading (and its heading alone). As with tuple types, this relation type is not the only relation type of this heading, though. Any subset of the tuple type (e.g. all the tuples that have I>=1, thus ruling out negative numbers), when subjected to the same procedure, will in turn give rise to some relation type (relations with the specified heading and where all its tuples satisfy the predicate that (e.g.) I>=1). Likewise, subsetting the relation type using some predicate (for example, retaining only "the relations that have at least three tuples") gives rise to some other new relation type.
Typically, however, relational systems do not exploit this property of relation and tuple types. Instead they give the user the possibility to specify predicates that constrain which relations and tuples can validly appear in which places. But a system that offered named tuple/relation types, with a constraining predicate, would fundamentally (mathematically) not really be different (in what it allows the user to achieve).