The Trouble with Standards

Data standards help establish uniformity, but not always for the good.


By Kurt Cagle  |  February 06, 2019  |  Source: CogWorld on FORBES

There is a running joke in standards circles:

God must love standards. He's made so many of them.

If you spend enough time working with standards, ontologies, reference data or information modeling, you will find yourself involved in the process of creating, modifying or defending specific standards. TOGAF, NIEM, XBRL, FIBO, UBL, Dublin Core, the various W3C specifications: standards are ubiquitous.

You would think, given all this, that everyone could now communicate easily over any domain, or at least over those that have some kind of standards effort. The reality is usually much uglier: standards paralysis from too many competing standards, forked versions, religious wars over how comprehensive a standard should be, poorly specified standards, not-invented-here syndrome, and data siloization all serve to make standards more miss than hit.

The Standards Life Cycle

It's worth spending some time looking at the innovation cycle, and how standards play into it. There are several key phases in this life cycle:

  • Coalescence. This is the period where there is a lot of research about the domain and implications of a given set of technologies. No formal standards exist, though towards the end of this period there is typically a point where one player within the sector attempts to create a new standard in order to stabilize the information space.
  • Pareto (80/20) Period. Eventually, one organization is able to gain enough of a market share in the sector that it can establish a de facto standard. Because this is perceived to be a competitive advantage, the remaining vendors (which collectively make up between ten and twenty percent of the market) start to form a competing standard that at least weakens the proprietary grip that the first vendor has.
  • Standards War. This is a period, sometimes extended, during which both sides make their cases to the industry, religious wars break out between partisan developers, lawsuits fly and standards bodies get involved as arbiters. The end comes when either the dominant player comes to the table to iron out an agreement, or the coalition supporting the alternate standard (there's usually only one by then) switches sides or falls away. The resulting standard typically bears only a passing resemblance to what was originally proposed; nobody is happy, but nobody is really unhappy either.
  • The Kumbaya Period. The marketing engines get into the game at this point making sure that the standards are now printed on the side of all software boxes (or web pages or app markets or telepathy or whatever the current means of distribution is). The bizarre acronyms that make up the standard are now remembered, and the bizarre acronyms that were proposed originally fall into obscurity. Interoperability bliss ensues, for a bit.
  • The Oops We Goofed Period. After a period of time, enough people have used the standard to realize that certain assumptions and ground rules were made that didn't quite work out. During this period, a 2.0 version is produced, one that is pretty damned good, actually. Some vendors complain, because they've invested so much in getting the 1.0 working, and some programmers complain because, well, new language, but overall adoption goes pretty well.
  • The Forgotten Child Period. After some time, a 3.0 version is released. It gets implemented by a handful of newer vendors, but there are too many toolchains that are built on the 2.0 version (and even a fair amount that never made the jump from 1.0). The 3.0 version, even if it is better than the 2.0 (and it usually is) becomes the accidental kid, the one that hadn't really been planned and while of course you love all of your kids you're really not quite sure what to do with #3, who is just ... well, different, you know. For some reason they usually become artists or musicians.
  • The Who Gets the Kids Period, also known as the "Fork You" Period. The industry has been on the standard for a while, long enough that the programmers who are entering into programming don't remember steps #1 through #4. They know that their parents liked the standard, but it's a lot like how kids feel about parental sex - of course it was necessary for their own personal existence, but ... ew. Gross. If you can't play video games with the standard, then what use is it. There is much angst during this period, and the older programmers (who now are PMs and architects and own their own software companies) watch in shock as the standard forks dramatically (cf., JSON), while the younger ones stay glued to their shiny iPhones and tablets, the slackers.
  • The Golden Years. The older standard remains in use, though conference devotees note the increase in fluffy white beards, hearing aids, and walkers at their annual gatherings. Typically by this time the standard has become infrastructure, used as a key component for a great number of things, but you don't get the really cool articles written about it in the tech press anymore, and investors who used to wait with bated breath for your calls now try to remember where they've met you before. Programmers who are now working with the latest and greatest call you grandpa.
  • The End of the Road. Where do standards go to die? Is there a standards graveyard out there? You occasionally see whispered references to small ISO numbers, kind of like that first issue of Action Comics bought in an estate sale from an old Victorian mansion's crazed former comic book collector. Mostly such standards are ghosts, their documentation scattered, their ardent partisans now playing croquet at the retirement home.

Maybe a bit tongue in cheek, but this is actually pretty close to what happens in the real world. Standards, and this includes ontologies, are essentially scaffolding. They are agreements between people in a group that this is the set of rules that they will abide by, until it is no longer convenient or cost effective. This is because business evolves, the things being modeled change, the problems being solved are solved and are replaced by new problems that only become significant because the old ones were solved.

Standards and Digital Transformations

This notion that standards are written in sand is important to understand when discussing enterprise digital transformations. There are benefits to having a centralized, canonical ontology. Such an ontology can reduce the overall confusion in naming, help standardize basic concepts, and make master data management easier. However, in establishing such a standard, it is also necessary to lay in a pathway for expansion, evolution and ultimately obsolescence, not just for individual concepts, but also for whole ontologies.

Why do this, if the effort of creating ontologies is so hard? A big part of it comes with the expectation that the way we store and query information is itself still very much in flux. As machine learning and artificial intelligence systems become more sophisticated, the disadvantages of dealing with legacy structures increasingly outweigh the benefits of holding data in that form. Organizations merge, get spun off, change priorities, and ultimately fade away. What we store, and how we choose to store it, shifts in response to those changes in priority and technology.

This means that ontology change management is a critical function of any digital transformation effort. It's not a once-done-and-forgotten activity; effective data governance requires that you constantly re-evaluate the state of your data, the cost of change to your organization's infrastructure, and the benefits to be gained by making that change. It means maintaining a constantly rolling window of data versioning, so that once a technology or a standard falls outside of that window, it gets turned over in a consistent fashion.

All too often companies tend to look at data investments as sunk costs, but this view means that fading software and hardware go from giving you a return on your investment to draining the value of that investment. The moment this happens is hard to determine a priori, because there are no clean tools to tell you when software is no longer worth the cost of maintaining it (though give it a few years), which is why it is usually worth erring on the side of changing both software and standards too often rather than waiting too long.

The Rise of Context-Free Modeling

This point is actually critical when talking about ontologies in particular. One of the reasons that semantic technologies are so important is that they represent a shift in thinking towards context-free modeling. The idea here is that, particularly at enterprise scale, the ability to create a single unified enterprise model is hard to the point of being well-nigh impossible. There are too many silos and sub-organizations with existing data infrastructure to force everyone into a single canonical model. Yet at the same time, trying to maintain multiple heterogeneous models that can cross-communicate is almost as bad, as it creates a combinatorial explosion of model-to-model mappings that is unsustainable.

A context-free approach to modeling in essence works on the assumption that you can identify resources within an organization across different data stores (and likely different identifiers). This usually takes a mixture of semantics, master data management and machine learning (I've seen systems that employ all three), and once you have those resources identified, you can determine the properties (attributes) that you wish to capture from various sources. This needs to go beyond simply dropping the attributes into the semantic stores as is, however. Instead, when a new data system is ingested, metadata about the attributes should be captured, including types, constraints, dimensional units, labels, provenance and, perhaps most importantly, conformance to an underlying core ontology.
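The attribute-level metadata described above can be sketched as a simple record captured at ingest time. This is a minimal illustration, not a published schema; the field names and the `core:Categorization` class are assumptions chosen for this example:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class AttributeMetadata:
    """Metadata captured about one attribute when a data system is ingested."""
    name: str                          # property name in the source system
    datatype: str                      # e.g. "xsd:string", "xsd:decimal"
    unit: Optional[str] = None         # dimensional unit, if any
    label: Optional[str] = None        # human-readable label
    provenance: Optional[str] = None   # identifier of the source system
    core_class: Optional[str] = None   # conformance link into the core ontology

# Registering an ingested column against a hypothetical core ontology term:
cust_interests_meta = AttributeMetadata(
    name="CUST_INTERESTS",
    datatype="xsd:string",
    label="Customer interests",
    provenance="crm_db_1",
    core_class="core:Categorization",
)
```

The point of the record is the last field: every ingested attribute gets tied back to a core-ontology class, so later processes can reason about it without knowing anything about its source system.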

A core ontology is usually semantically neutral - it doesn't necessarily deal topically with the domain of the broader data set, but rather identifies enough information to indicate that a given property is searchable in certain ways, should be used for display in specific manners, and follows certain patterns. This could be considered an operational ontology, one that makes the data more readily usable in different contexts but that doesn't actually contain any specific data by itself.
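One way to picture such an operational ontology is as a small table of behavioral hints, saying how a property should be searched and displayed rather than what domain it belongs to. The class names and hint values below are purely illustrative assumptions:

```python
# A sketch of a semantically neutral "operational" core ontology: it carries
# no domain data, only how conforming properties behave. All names invented.
CORE_ONTOLOGY = {
    "core:Categorization": {"searchable": "faceted", "display": "tag-list"},
    "core:Identifier":     {"searchable": "exact",   "display": "monospace"},
    "core:Measurement":    {"searchable": "range",   "display": "unit-suffixed"},
}

def display_hint(core_class: str) -> str:
    """Look up how a property conforming to a core class should be rendered."""
    return CORE_ONTOLOGY.get(core_class, {}).get("display", "plain-text")
```

A downstream application can then call `display_hint("core:Categorization")` and render the field as a tag list without ever knowing which source database it came from.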

This also plays a big part in maintaining multiple controlled vocabularies (what are sometimes referred to as categorizations). Different systems will almost certainly have different categorizations (for instance, two different customer databases may have different sets of enumerations for identifying customer types). Synchronizing these is one of the bigger headaches for most standardization efforts, because these enumerations are in effect classes in their own right. By working with a core ontology (ironically something like SKOS or a similar taxonomically oriented approach) and establishing ways of correlating enumerations from one vocabulary silo to another, semantics can help to manage these kinds of relationships far more easily than relational data systems can.
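The correlation step can be sketched as a set of SKOS-style mapping assertions between two vocabulary silos. The predicate names follow SKOS (`skos:exactMatch`, `skos:closeMatch`), but the vocabularies and terms here are hypothetical:

```python
# SKOS mapping predicates (real SKOS vocabulary terms, abbreviated as CURIEs).
SKOS_EXACT = "skos:exactMatch"
SKOS_CLOSE = "skos:closeMatch"

# Triples correlating terms across two invented customer-type vocabularies:
# (term in system A, mapping predicate, term in system B)
MAPPINGS = [
    ("crm1:custType/retail",    SKOS_EXACT, "crm2:segment/consumer"),
    ("crm1:custType/wholesale", SKOS_CLOSE, "crm2:segment/b2b"),
]

def translate(term, mappings=MAPPINGS):
    """Translate a term from one controlled vocabulary into its counterpart."""
    for a, _predicate, b in mappings:
        if term == a:
            return b
        if term == b:
            return a
    return None  # no correlation recorded yet
```

Because the mappings are just data, adding a third vocabulary silo means adding more triples, not rewriting a schema, which is exactly the advantage over baking enumerations into relational column constraints.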

For instance, you may have a property called “CUST_INTERESTS” in one database and “FAVRTS” in a second database, each containing a list of topics that a given customer may like. The two live in separate taxonomies because they were put together by different divisions. Bringing both sets of data into a central semantic data system without doing anything else means that you can manipulate both, but it otherwise tells you very little. On the other hand, if you can indicate that both of these properties are categorizations, and furthermore that the two may potentially overlap, then you can use the core ontology along with machine learning techniques to identify which field values have similar definitions, the degree to which the two fields have the same frequency distribution of terms, and so on, in order to get a feel for whether two such enumerated terms are in fact (roughly) the same thing. Once that is done, further curation can be done by hand to confirm or deny these associations, and as a consequence make it possible to integrate data from one system into another.
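The frequency-distribution comparison mentioned above can be sketched with nothing more than term counts and cosine similarity. The sample values for the two fields are invented, and cosine similarity is just one reasonable choice of measure:

```python
from collections import Counter
from math import sqrt

# Invented sample values for the two enumerated fields being compared.
cust_interests = ["sports", "music", "music", "travel", "sports"]
favrts = ["music", "sports", "sports", "cooking", "music"]

def cosine_similarity(values_a, values_b):
    """Cosine similarity of the term-frequency distributions of two fields."""
    a, b = Counter(values_a), Counter(values_b)
    terms = set(a) | set(b)
    dot = sum(a[t] * b[t] for t in terms)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

score = cosine_similarity(cust_interests, favrts)
print(round(score, 2))  # 0.89 for these samples
```

A high score flags the pair as a candidate for the same categorization; as the article notes, the final confirm-or-deny step still belongs to a human curator.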

The core ontology is key here, because it dramatically reduces the overall space of terms that needs to be searched. That core ontology doesn't really care about the overall topical domain of the associated knowledge graph (and shouldn't, because that introduces contextual bias). What it does do is make it much easier to harmonize differing data sets.

Indeed, one of the most exciting aspects of the next generation of semantics (in which machine learning plays an ever growing part) is that it actually reduces the need to create canonical models at the outset. A canonical model can be established that identifies a preferred set of classes and attributes, but especially in semantics the idea that you keep the data that you have, rather than creating a model and then shoehorning the content into that model, is a very exciting one. In this case the core ontology serves the role of a library cataloging system for your data. A system like the Dewey Decimal Classification or the Library of Congress system doesn't tell you where a resource physically is. Instead, it assigns a topical identifier (a library code) to a given book or media resource, and the library then becomes responsible for indicating where in the building (or the lending library system) that particular categorization is physically located.


My expectation is that this process (dynamic modeling) will be the direction of ontology systems for the foreseeable future. It provides a balance of centralization (for common resources) and federation (for more specialized resources). The ontologies that come out of this process will likely be emergent rather than designed, though just as a garden in the hands of a good gardener is an exercise in controlled chaos, so too is the hand of the data steward necessary to ensure that the organization of the data within your organization doesn't become too fractal and chaotic.

Dynamic modeling won't necessarily change the overall standards (data information) life cycle much, but it makes it much more likely that your organization can grow and adapt in the data space without standards that are either too restrictive or too ambiguous for your needs.


Now excuse me - I have some hooligans I need to shoo off my lawn.

Kurt Cagle, managing editor/Drupal lead/contributor, has worked within the data science sphere for the last fifteen years, with experience in data systems architecture, data modeling, governance, ETL, semantic web technologies, data analytics, data lifecycle and large scale data integration projects. Kurt has also worked as a technology evangelist and writer, with nearly twenty books and hundreds of articles on related web and semantic technologies.