My personal take on ...: Afterthoughts on a data architects meetup

I attended a meetup of data architects yesterday. The main topic for me was a presentation with thoughts on our data modeling practices, provocatively presented under the title “data modeling must die”. It was a very good talk. It defended ideas that have been mine as well for as long as I can remember. However, this post is about a point of disagreement. And another one.

Disagreement 1

It was claimed that when Codd invented the relational model of data, he also made some serious mistakes. Fair enough, he did. (It may well be that many of those mistakes only crept in during the later years, for reasons and circumstances that were more political than anything else, and that early Codd was even “purer” than the fiercest relational fundamentalist still walking around these days, but that’s another discussion.)

But the mistake being referred to was “inventing the relational model of data on an island”, by which it was meant that his “mistake” was to invent the RM in isolation from other phases of the process of data systems development, such as conceptual modeling.

True, the inventing happened in isolation. But dressing that up as a “mistake” he made is, eurhm, itself a mistake. One that exposes a lack of understanding of the circumstances of the day.

One, it is not even certain imo that “conceptual modeling” as a thing in its own right already existed at the time. Codd’s RM is 1969, Chen ER is 1974 ("An Introduction to Database Systems" even dates it 1976). So how *could* he have included any such thing in his thinking ? Here are two quotes from "An Introduction to Database Systems" that most likely illustrate accurately how Codd would probably never even have come up with the RM if he had *truly, genuinely* been "working on an island, separated from any and all of those developer concerns as they typically manifest themselves while working at the conceptual level".

"It is probably obvious to you that the ideas of the E/R approach, or something very close to those ideas, MUST HAVE BEEN (emphasis mine) the informal underpinnings in Codd's mind when he first developed the formal relational model."

"In other words, in order for Codd to have constructed the (formal) relational model in the first place, he MUST HAVE HAD (emphasis mine) some (informal) "useful semantic concepts" in his mind, and those concepts MUST BASICALLY HAVE BEEN (emphasis mine) those of the E/R model, or something very like them."

Readers wanting to read more are referred to chapter 14 of said book, and pg 425 in particular, for the full discussion by Chris Date.

So why did Codd not bother with the stuff at the conceptual level ? My answer : because he was a mathematician, not an engineer. And as a mathematician, his mindset always led him to want to be able to PIN THINGS DOWN PRECISELY, with "precisely" here carrying the meaning it has in the mind of a PhD in mathematics. Which is quite different from the meaning the word might have in the mind of the average reader of this post.

And at the conceptual level, you never get to "pin things down precisely" AND THAT'S DELIBERATE.

In those days, there was analysis and there was programming. With a *very* thick Chinese Wall between the two, and often even between the people engaging in one of those two activities (at the time it was typically considered outright impossible for any person to be proficient in both). Analysis was done *on paper* and that paperwork got stored in physical binders ending up in a dust-collecting locker. I even doubt Codd ever got to see any such paper analysis work. He did get to see programs written in the “programming” side of things. Because that’s where his job was : in an environment whose prime purpose was to [develop ‘systems’ software to] support programmers in their “technical” side of the story.

Two, Codd never pretended to address the whole of the data systems development process with his RM. The RM was targeted at a very specific and narrow problem he perceived in that process, as it typically went in those days : that of programmers writing procedural code to dig out the data from where it is stored. He just aimed for a system that would permit *programmers* to do their data manipulation *more declaratively* and *less procedurally/mechanically*. Physical data independence. Nothing more than that. And the environmentals that would make such a thing conceivable and feasible in real life. Codd was even perfectly OK with not even considering how the data got into the database ! His first proposal for a data language, Alpha, *did not have INSERT/DELETE/UPDATE* ! He was perfectly fine leaving all those IMS shops as they were and doing nothing but adding a “mapping layer” so that what came out of the mapping layer was just a relational view of data that was internally still “hierarchical”. I could go on and on about this, but my point here is : calling it a “mistake” that someone doesn’t do something he never intended to do in the first place (and possibly didn’t even have any way of knowing that doing it could be useful), is a bit over the top.
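For readers who like to see the difference spelled out, here is a minimal sketch in Python of the procedural-versus-declarative contrast and of the “mapping layer” idea. It is an illustration only : the data, the names `HIERARCHY`, `relational_view` and both query functions are made up for this post, and none of it pretends to be IMS, Alpha or anything Codd actually wrote.

```python
# Hypothetical data: departments "own" their employees, as in a hierarchical store.
HIERARCHY = {
    "Sales": [{"name": "Ann", "salary": 50}, {"name": "Bob", "salary": 40}],
    "R&D": [{"name": "Cy", "salary": 60}],
}


def high_earners_procedural(store, threshold):
    """Navigate the storage structure by hand, one level at a time."""
    found = []
    for dept, employees in store.items():   # walk the parent records
        for emp in employees:               # walk each parent's children
            if emp["salary"] > threshold:
                found.append((emp["name"], dept))
    return sorted(found)


def relational_view(store):
    """Mapping layer (assumed): expose the same data as a flat set of tuples."""
    return {
        (emp["name"], dept, emp["salary"])
        for dept, employees in store.items()
        for emp in employees
    }


def high_earners_declarative(employee_relation, threshold):
    """State which tuples are wanted; say nothing about where they are stored."""
    return sorted(
        (name, dept)
        for (name, dept, salary) in employee_relation
        if salary > threshold
    )
```

Both functions give the same answer for the same data. The difference is that if the storage gets reorganised (say, employees on top and departments underneath), the procedural version has to be rewritten, while in the other case only `relational_view` changes and the query stays as it is. That, in a nutshell, is the physical data independence the RM was after.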

Disagreement 2

It was claimed that “model translations MUST be automatic”. (The supporting argument being something of the ilk “otherwise it won’t happen anyway”.)

True and understandable (that otherwise it won’t happen), but reality won’t adapt itself so easily to management desiderata (“automatic” is management speak for “cheap” and that’s the only thing that matters) merely because management is management. Humans do if they're not the manager, reality doesn't. And the reality is that the path from highly conceptual, highly abstract, highly informal to fully specced out to the very last detail, is achieved by *adding stuff*. And *adding stuff* is design decisions taken along the way. And automated processes are very inappropriate for making *design decisions*. (By *adding stuff* I merely mean *add new design information to the set of already available design information*, I do not mean, add new symbols or tokens to an already existing schema or drawing that is already made up in some syntax.)

When can automated systems succeed in making these kinds of design decisions ? When very rigid conventions are followed. E.g. when it is okay that *every entity* modeled at the conceptual level eventually also becomes a table in the logical model/database. But that goes entirely counter to the actual purpose of modeling at the *conceptual* level ! If you take such conventions into account at the time you’re doing conceptual-level modeling, then you are deluding yourself, because in fact you are already modeling at the logical level. Because you are already thinking of the consequences at the logical level of doing things this way or that way. The purpose of conceptual-level modeling is to be able to *communicate*. You want to express the notion that *somewhere somehow* the system is aware of a notion of, say, “customer” that is in some way related to, say, a notion of “order” that our business is about. You *SHOULD NOT NEED TO WORRY* about the *logical details* of that notion of a “customer” if all you want to do is express the fact that these notions exist and are related.

So, rather contrary to the undoubtedly wise people in front of the audience, I’m inclined to conjecture that if you try to do those “model translations” automatically, you are depriving yourself of the freedom to take those design decisions that are the “right” ones for the context at hand, because the only design decisions that *can* still be taken are *[hardcoded] in [the implementation of]* the translation process. And such a translation process can *never* understand the context (central bank vs. small shop on the corner of the street kind of aspects), let alone take it into account, the way a human designer can. That is, you are depriving yourself of the opportunity to come up with the “right” designs.
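To make “hardcoded in the implementation” a bit more tangible, here is a deliberately naive sketch of such an automatic translator in Python. Everything in it is hypothetical (the `Relationship` class, the `translate` function, the naming conventions); it is not how any particular modeling tool works, just the smallest thing I could think of that shows where the design decisions end up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Relationship:
    """A conceptual-level link between two entities (hypothetical shape)."""
    name: str
    left: str
    right: str
    many_to_many: bool = False


def translate(entities, relationships):
    """Translate a conceptual sketch into table definitions by fixed convention."""
    # Hardcoded decision 1: every entity becomes a table with a surrogate key.
    tables = {e.lower(): [f"{e.lower()}_id"] for e in entities}
    for rel in relationships:
        left, right = rel.left.lower(), rel.right.lower()
        if rel.many_to_many:
            # Hardcoded decision 2: many-to-many always becomes a link table.
            tables[f"{left}_{right}"] = [f"{left}_id", f"{right}_id"]
        else:
            # Hardcoded decision 3: otherwise the right-hand side gets a foreign key.
            tables[right].append(f"{left}_id")
    return tables


conceptual_entities = ["Customer", "Order"]
conceptual_relationships = [Relationship("places", "Customer", "Order")]
tables = translate(conceptual_entities, conceptual_relationships)
```

Note that the three comments marked “hardcoded decision” are the *only* design decisions this process will ever take, and it takes them identically for the central bank and for the shop on the corner. There is no parameter for “context”, and any parameter you add is just another convention the conceptual modeler now has to keep in mind.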

A third point.

I was also surprised to find how easily even those data architects of the current generation who are genuinely motivated to improve things seem to make the association that “Codd came up with SQL”. He didn’t, and he’d turn in his grave hearing such nonsense (he might also just have given up turning because it never ends). He came up with the relational model. The *data language* he proposed himself was called Alpha. Between Alpha and SQL, several query languages have seen the light of day, the most notable among them probably being QUEL. SQL is mostly due to what good old Larry did round about 1980. It is relatively safe to assume that, once SQL was out, Codd felt about it much the same way that Dijkstra felt about BASIC and COBOL : that it was the most horrendous abomination ever conceived by a human. But that (neither the fact that the likes of Codd *have* such a denigrating opinion, nor the fact that they’re right) won’t stop adoption.

2 comments:

Unknown, January 14, 2018 at 11:43 PM
Very good blog. I will add some comments.

First:
Agree, but IMO Codd failed to make the distinction between formal concerns and formalized concerns, and somehow thought the first better than the last. From a human perspective this is not the case (from a system and mathematical perspective he was right).

Second:
The Theory of Concerns directly attacks the more traditional multi-level representation approach. Since each level has an algebra, and that can be quite complex, the algebraic transformation needs to be automated. Note, there is still the big issue of adding concerns in the different levels of representation going forward. I did not address this "concern bulldozer" effect in my talk at all.

Third point:
Agree, I will stress this more in my talks. I usually do this, but in this talk I did not talk much about the difference between SQL and relational.

Fabian Pascal, May 20, 2019 at 10:24 AM
All of this was covered in my various posts.

The levels of representation emerged in the 80s, well after the RDM (1969) and E/RM (1976). Neither Codd nor Chen distinguished a conceptual level from the logical one. The RDM is at the latter, with very sparse reference to the former; the E/RM is at the former, with a few logical/relational terms mixed in.

Friday, January 12, 2018
