Why Data Modeling Is Important (And Why It's Not)

By Kurt Cagle  |  November 14, 2018  |  Source: CogWorld on FORBES


Data modeling does not excite passion within programmers. Your average Java or Python developer probably doesn’t even realize that they are doing it when they write programs, in great part because a data model by itself doesn’t do anything. It simply is. In computer science terms, doing things is the hallmark of imperative (command-oriented) languages, while simply being is the hallmark of declarative (assertional or existential) programming.
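A minimal Python sketch makes the distinction concrete (the names and figures here are purely illustrative): the imperative half tells the machine what to do, step by step; the declarative half simply asserts what a customer is.

```python
# Imperative: a sequence of commands that *does* something.
def total_order_value(orders):
    total = 0
    for order in orders:
        total += order["amount"]
    return total

# Declarative: a data model that simply *is* -- it asserts facts
# about a customer without executing anything.
customer = {
    "name": "Jane Doe",
    "orders": [{"amount": 40}, {"amount": 60}],
}

print(total_order_value(customer["orders"]))  # 100
```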

The “exciting” thing programmers like is making things happen — deep down, every programmer out there got their start because they wanted to write video games. Seeing a web page or application come to life is the moment that programmers live for, and to get there, you have to have something that actually moves those bits.

Those of us who are in our mid-50s and older remember when object-oriented programming didn’t exist. You had “types,” which were essentially static bundles of related variables, each of which might hold other variables in arrays or hash tables. Most of the algorithms that Don Knuth wrote about involved manipulating the data contained within those types to change them into other types, which were then fed as data structures into instructions that would read this type information to create the proper side effect.

What the advent of Smalltalk, C++, Java and other similar languages did was to impose a certain amount of discipline on how these type structures operated. First, they introduced the notion of a class, which inverted the old arrangement: rather than type structures being passed to functions defined in a library, the data structure now lived within the class, alongside the functions that operated on it.
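That inversion can be sketched in a few lines of Python (the account example is invented for illustration): the pre-OOP style hands a bare data structure to a library function, while the class bundles the data together with the operations on it.

```python
# Pre-OOP style: a bare "type" (here a dict) passed to a library function.
def account_deposit(account, amount):
    account["balance"] += amount
    return account

checking = {"owner": "Alice", "balance": 100}
account_deposit(checking, 50)

# Class style: the data structure lives *inside* the class,
# alongside the operations that manipulate it.
class Account:
    def __init__(self, owner, balance=0):
        self.owner = owner
        self._balance = balance  # encapsulated state

    def deposit(self, amount):
        self._balance += amount

    @property
    def balance(self):
        return self._balance

savings = Account("Alice", 100)
savings.deposit(50)
print(checking["balance"], savings.balance)  # 150 150
```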

This idea, called encapsulation, might seem trivial today, but its effect was roughly analogous to what happened three and a half billion years ago when free floating strands of basic RNA, basic biological computers, evolved a set of encodings that used RNA to build early protoplasmic cell walls out of proteins. Among other things, this provided a layer of protection so that the RNA could begin to last long enough to differentiate and build other capabilities. Additionally, a minor mutation to one of the four nucleotides that make up RNA, the introduction of a methyl group that turned Uracil into Thymine, made it possible for the modified RNA to create a permanent double strand and hence remain stable enough to replicate. Encapsulation made it possible for chemistry to become biology.

In that respect, there is a real analogy here between a class and an instance. The class is the DNA — it identifies the various structures that a given instance should have, and describes how they move from state to state (where state indicates the value each internal variable holds at any given time). The class is not the instance — the stringy strand of a DNA molecule is in no way the same thing as a person made up of all the cells encoded in that DNA — but the class is necessary for the person to exist.

In the biological realm, nothing is free. You need raw materials to create things like proteins, and you also need energy sources, typically in the form of sugars, along with a mechanism for converting that energy source into the energy necessary to build those proteins. In the cell, that’s generally accomplished via adenosine triphosphate, more frequently known by its acronym ATP. In the digital realm, the energy comes in the form of computer cycles powered by electricity, working on simpler data structures to construct ones that are more complex. (This metaphor, by the way, is EXTREMELY leaky, so bear with me here.)

Objects, created by classes, have internal state. One other aspect of encapsulation is the notion that in general, that internal state is “hidden.” Once you have put state information into an encapsulation, then essentially that information disappears. What remains are signals that a given object emits (“my internal state has changed!”, a.k.a. events) and handlers for signals that the object absorbs from the outside environment (methods). Calling a method on an object is simply a specialized form of event handler.
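A small sketch of that signal model, with invented names: the internal state is hidden behind an underscore-prefixed attribute, outbound signals are delivered to registered listeners, and a method is just a handler for a signal arriving from outside.

```python
# Encapsulation: internal state is hidden, and the outside world interacts
# only through methods (signals in) and events (signals out).
class Thermostat:
    def __init__(self, target):
        self._target = target    # hidden internal state
        self._listeners = []

    def on_change(self, callback):
        """Register a handler for the "my internal state has changed!" event."""
        self._listeners.append(callback)

    def set_target(self, target):
        """A method: a handler for a signal arriving from the outside."""
        self._target = target
        for listener in self._listeners:  # emit the change event
            listener(target)

events = []
t = Thermostat(20)
t.on_change(events.append)
t.set_target(22)
print(events)  # [22]
```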

It is encapsulation (far more than inheritance or polymorphism) that defines classes, yet the weakness of encapsulation is that what exists as a cascade of class instances in memory does not necessarily persist well when it needs to be frozen in some kind of storage medium. Different kinds of computers, operating systems and languages have different ways of representing binary objects, and this is compounded when you need to store a state that allows you to “resurrect” a frozen object in a different environment running a different application.

Serialization is the process of converting a cascade of objects into a persistable format, and is key to such things as writing an object to disk, to a database, or to a stream. The inverse, reading a serialization to create a cascade of objects, is called parsing. In stand-alone applications, serialization and parsing are typically done comparatively seldom, usually at the time that you save or load a file, respectively. However, the moment that your application starts becoming part of a broader network, serialization and parsing end up playing a much bigger role, with that role becoming more important the higher up the enterprise stack you go and the more people use not just the data but the model itself.
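The round trip looks something like this in Python, using JSON as the persistable format (the Customer class is invented for illustration): `serialize` freezes the object, and `parse` resurrects it, possibly in a different environment.

```python
import json

# Serialization: convert an object into a persistable format (here JSON).
class Customer:
    def __init__(self, name, orders):
        self.name = name
        self.orders = orders

    def serialize(self):
        return json.dumps({"name": self.name, "orders": self.orders})

    @classmethod
    def parse(cls, text):
        """The inverse: read a serialization back into an object."""
        data = json.loads(text)
        return cls(data["name"], data["orders"])

original = Customer("Jane Doe", [{"amount": 40}])
frozen = original.serialize()    # written to disk, a database, or a stream
thawed = Customer.parse(frozen)  # "resurrected" elsewhere
print(thawed.name, thawed.orders)
```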

Enterprise data modeling has emerged only comparatively recently (within the last decade or so) as the scope of applications has grown large enough that a consistent vocabulary becomes necessary. Most of the early such vocabularies were built using XML, but while XML is superb as a mechanism for holding structured hierarchical content (far better than relational databases), it works far less well when dealing with references to shared objects. This is one of the big reasons that most XML-based enterprise efforts have fared at best only semi-successfully.

RDF, the Resource Description Framework, has been emerging as a preferred tool for doing such enterprise level modeling. It works by breaking data structures into simple statements that can then be chained together into linked data sets known as graphs. The following illustrates how one such graph (for a customer) might look:


A model (or more properly an exemplar) showing how a customer may be modeled. ©2018 Kurt Cagle. All rights reserved.

Each of the rounded rectangles in blue is a “node” in the graph, representing an entity: a person, place, organization and so forth. The dark green nodes are more descriptive, and usually fall into the realm of reference data; they effectively categorize the information in blue. The yellow fields in turn represent literal data — a string of text, numbers, dates and so forth — usually qualified by some kind of datatype (the arrowed green boxes).

The arrows represent relationships between the other kinds of nodes. Think of a workbook of spreadsheets: each row represents one item of a given type, each column represents a relationship, each cell represents a value or a link to an entry in another sheet, and each sheet (table) represents a class of some sort. From any one of those nodes, you can then get a perspective of what the data model looks like from the context of that node.

The diagram above isn’t itself the data model. Rather, it’s what’s called an exemplar, an example showing what the data model would produce once filled with data, expressed in a Tinker Toy-like representation. The model, however, contains what you would expect to find in a model — schematic and annotation information about classes, properties and constraints (it’s just not as visually self-explanatory). What makes RDF so extraordinary is that it is an abstract framework, so that you can represent the same information in different ways, from diagrams to text files to XML and JSON and even spreadsheets. Additionally, you can also express the model itself in the same way, meaning that the model becomes just another part of the overall graph.
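A minimal sketch of the idea in Python (the identifiers and property names are invented for illustration): RDF breaks structures into subject–predicate–object statements, the graph is simply the chained set of those statements, and — note the last triple — the model itself is expressed as more statements in the same graph.

```python
# Each statement is a (subject, predicate, object) triple; the graph
# is just the set of chained statements.
graph = {
    ("cust:jane", "rdf:type", "class:Customer"),
    ("cust:jane", "schema:name", '"Jane Doe"'),
    ("cust:jane", "schema:address", "addr:jane-home"),
    ("addr:jane-home", "schema:addressLocality", '"Seattle"'),
    # The model is just more triples in the same graph:
    ("schema:name", "rdfs:domain", "class:Customer"),
}

def query(graph, subject=None, predicate=None):
    """Naive triple-pattern match: None acts as a wildcard."""
    return [
        (s, p, o) for (s, p, o) in graph
        if (subject is None or s == subject)
        and (predicate is None or p == predicate)
    ]

print(query(graph, subject="cust:jane", predicate="schema:name"))
```

From any node you can then take that node's perspective on the graph — `query(graph, subject="cust:jane")` returns every statement made about Jane.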

With RDF (and the whole discipline of semantics), you can mix the data and the metadata, reducing the number of assumptions that have to be made about the data itself. Such a model can encode expected types of units (meters vs. feet) or currencies (US dollars vs. Japanese yen), provide descriptive annotations, and reference back to specific provisions of a standard or contract (which makes semantics and blockchain complementary technologies). You can also use RDF to manage one of the most vexing problems in enterprise data management — the resolution of identifiers coming from various systems that represent the same person, place or thing.
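Both ideas fit in the same triple structure. In this sketch (identifiers and property names invented; `owl:sameAs` is the standard OWL equivalence predicate), one triple carries a measurement, a second is metadata stating the unit that measurement is expressed in, and a third resolves two systems' identifiers to the same thing.

```python
# Data and metadata side by side in one graph.
graph = [
    # Data: a measurement...
    ("pipe:42", "ex:length", "42.0"),
    # ...and metadata stating the unit it is expressed in.
    ("ex:length", "ex:expectedUnit", "unit:meter"),
    # Identifier resolution: two systems' IDs denote the same entity.
    ("crm:person-991", "owl:sameAs", "erp:employee-17"),
]

def same_as(graph, identifier):
    """Collect all identifiers asserted equivalent to this one."""
    ids = {identifier}
    for s, p, o in graph:
        if p == "owl:sameAs" and (s in ids or o in ids):
            ids.update({s, o})
    return ids

print(sorted(same_as(graph, "crm:person-991")))
```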

This ability to reference metadata becomes important in another domain: creating an association between a business concept and a technical implementation of that concept. Business analysts frequently work with “data dictionaries,” specific concepts that they want to capture, but all too often there’s a real disconnect between the business terminology used and the representation of that data within various applications. Semantics can provide the linkage necessary to ensure consistent governance between the C-Suite and the Server Room.

However, semantics — and the use of a unified model — has its own costs and disciplines. It is possible to build semantic data models that can federate external data systems, but these often carry significant performance costs, making them useful perhaps for one-time ingestions but not necessarily for complex data queries. Instead, it is usually better to build up a distributed data model around linked open data principles, where you create multiple nodes that share a unified model and work with RDF intrinsically, uploading into each node the subset of the organization’s information that node actually uses — an arrangement that provides for more effective data governance (a topic for a future article).

So, given that, do you need an enterprise data model? If you are building a more or less stand-alone application where the data has no significant reuse, then data modeling is often counterproductive. On the other hand, if your organization is large and diverse enough that different departments are working with differing aspects of the same set of resources, then a data model can be critical for data interchange, especially since the boundaries that make OOP work at the micro level usually do not work quite as well when talking about complex systems.

Indeed, it may be that computationally we’re at a level where the same processes that prompted the encapsulation of data into objects based upon classes in the first place are now re-occurring, but now at the enterprise level. For now, that discipline is at the semantic convention level, but with machine learning, blockchain, the Internet of Things and similar technologies all beginning to converge, it may very well be that a new paradigm is evolving to deal with such macroscopic systems of objects.


Kurt Cagle, managing editor/Drupal lead/contributor, has worked within the data science sphere for the last fifteen years, with experience in data systems architecture, data modeling, governance, ETL, semantic web technologies, data analytics, data lifecycle and large scale data integration projects. Kurt has also worked as a technology evangelist and writer, with nearly twenty books on related web and semantic technologies and hundreds of articles.