Tuesday, September 18, 2012

"Why EAV sucks"

We've already seen this one before as an illustration of a cartesian product establishing a tuple type :

I:INT X C:CHAR = {(I:1,C:'A') , (I:1,C:'B') , ...}

and one possible relation deriving from this :

R = { (I:1,C:'A') , (I:1,C:'B') }

or, rewriting those tuples as sets of pairs mapping attribute names to values of arbitrary domains :

R = { {(I,1),(C,'A')} , {(I,1),(C,'B')}}

In tabular form (a fairly typical form for representing relation), this could become something like

IC
1A
2B

Well, tables consist of rows and columns, and with this particular view of a relation, both rows and columns turn out to have an interesting property.

Beginning with the rows.  It is obvious that one row contains precisely one tuple of the relation.  Nothing more and nothing less.  And one tuple corresponds to exactly one logical proposition that it represents.  That logical proposition being a sentence, a statement of fact.  For example, "Yesterday, the average temperature in room A was 1 degrees centigrade.".

Then the columns.  What is common for all the things mentioned in the column labeled 'I' ?  They are all numbers originating from domain INT.  Likewise for the things mentioned in the column labeled 'C'.  In our data manipulation language, this means that if we have a reference to column (/attribute) 'C', we just know that what we're referring to is a CHAR value.  We don't need to typecheck this at runtime.  It's guaranteed by the typechecks that occurred when the tuple was inserted.

Remember these two.  They're important.

A closer look at one of these two tuples themselves now.

{(I,1),(C,'A')}

That's a set of ordered pairs.  The first member of each pair coming from the domain of 'valid attribute names', the second coming from ...   Well from what exactly ?  We can't say that the second member of all pairs here comes from INT, nor from CHAR, nor in fact from any domain we used to build tuple types and relations with.  Rather, all we can say is that it's just some value from some domain, but we don't know which it is.  (We can know which one it is, by inspecting the value of the first member (the attribute name), and then linking that somehow to the original way in which we defined our cartesian product/tuple type.)

Anyway.  A "set of ordered pairs" is a relation, thus a tuple is itself also a relation.  (But do keep the distinction in mind between "relation over the domains INT and CHAR" and "relation over the domains ATTRIBUTENAME and SET_UNION_OF_ALL_DOMAINS").

Anf if a tuple is itself also a relation, we could try out what we get if we wanted to represent it tabularly using the usual technique :

????????????
I1
CA

Compare with the "interesting properties" of the (tabular representation of) the relation.  Does the interesting property of the column still survive ?  No it doesn't.  Unless we find the fact of possibly getting back just anything from any expression in our data manipulation language an "interesting" property.  Does the interesting property of the rows (tuples of a relation) still stand ?  No it doesn't.  It just tells us that there has been a temperature of 1 degrees centigrade yesterday, in some further unspecified room, for example.  (But note that in order to derive so much information, we need to inspect the value of that "first column", and iterate over a set of used attribute names : "if it's C then that value is a room identifier.  No it's not C.  If it's I then that value is an average temperature.  Ah yes that's the one.")

(Aside.  The question marks are there for a reason.  What would you put in their place ?  Can the user of a system based on this paradigm of tuples-as-relations be given the freedom to choose his own preferred replacements for the question marks, in a sense similar to how the designer of a database is at freedom to choose his attribute names, when defining cartesian products aka tuple types ?  End-of-aside.)

One step further in the predictable direction.  Take a look at

??????ATTRIBUTENAMEVALUE
GUID-WXCVBN-AXIOI1
GUID-WXCVBN-AXIOCA
GUID-WXCVBN-AX1OI2
GUID-WXCVBN-AX1OCB

This table was obtained by prepending something very akin to a "tuple id", "entity id" or "object id" to the table, and then using those values for identifying "which attribute values belong together to form a tuple".

Looks like anything you know ?  Rhetorical question.  This is EAV.  Strip the RM from its connection to predicate logic (that is, remove the connection with 'meaning') and strip the RM from the static type checking that is offered by its most foundational building brick, the tuple (that is, remove the most fundamental data integrity feature of all), and what is left is EAV.  As another commenter put it : "disassemble the tuple and amputate the predicate".

Some thought exercises.  Assume you want to design a database like this.  Assume the name in place of the question marks is 'OID'.  Write all the needed constraints on this three-column table to effect the following :

- values corresponding to the ATTRIBUTENAME 'I' must be of domain INT.
- values corresponding to the ATTRIBUTENAME 'C' must be of domain CHAR.
- only complete tuples can be inserted.  That is, if an I or a C is inserted, then so must the corresponding C or I be.
- only complete tuples can be deleted.  That is, if an I or a C is deleted, then so must the corresponding C or I be.


These four constraints all derive naturally from a relational declaration as simple as (two examples)

VAR RELATION {I:INT C:CHAR} IC ;
CREATE TABLE IC (I INT, C CHAR) ;

Which of the two approaches is the easiest and the simplest ?

Further thought exercises.

A constraint must be introduced to the effect that "the I value must be less or equal than the position of the C value in the alphabet."  Write a query to check whether this constraint is currently satisfied by your EAV database, and spell out the strategy you're going to follow to enforce this rule.  Is it simpler or easier or less work involved than the following ? (two examples)

CONSTRAINT IC_RULE1 ALL(IC, I<=INDEXOF(C,"ABCDEFG...XYZ")) ;
ALTER TABLE IC CHECK (I<=INDEXOF(C,"ABCDEFG...XYZ")) ;

A constraint must be introduced to the effect that the I values are unique identifiers for any tuple.  Write a query to check whether this constraint is currently satisfied by your EAV database, and spell out the strategy you're going to follow to enforce this rule, or write out the declarative constraint that will enforce this in your EAV database.  Is it simpler or easier or less work involved than the following ? (two examples)

VAR RELATION {I:INT C:CHAR} IC KEY {I};
ALTER TABLE IC KEY (I) ;

Further thought exercise.  Read this sentence carefully.  Now look at the following EAV structure.

??????ATTRIBUTENAMEVALUE
GUID-WXCVBN-AXIOPOS1
GUID-WXCVBN-AXIOWORDRead
GUID-WXCVBN-AXOOPOS3
GUID-WXCVBN-AXOOWORDsentence
GUID-WXCVBN-AXO0POS4
GUID-WXCVBN-AXO0WORDcarefully
GUID-WXCVBN-AX1OPOS2
GUID-WXCVBN-AX1OWORDthis

I could write that same sentence, in cryptic form, in four lines :

Word 1 in the sentence is Read.
Word 3 in the sentence is sentence.
Word 4 in the sentence is carefully.
Word 2 in the sentence is this.

And EAV even goes beyond that in turning that into 8 lines !  How much work and computation does it involve to reassemble the original sentence from this ?



All of that should be sufficient to illustrate how much veracity there is to the claim that "EAV makes database maintenance easier".  Instead :

If you do not care about the data type of (the values in) your database fields, then EAV is your thing.
If you do not care about the meaning of the contents of your database, then EAV is your thing.
If you do not care about the integrity of your data, then EAV is your thing.

(In each of these cases, "do not care" is supposed to also include the notion that "the user doesn't mind paying all those extra hours you put in solving problems that the DBMS has already solved for you".)

No comments:

Post a Comment