January 10, 2019 | Source: CogWorld on FORBES
The engines of IT marketing recently spun out a buzz phrase that's now gaining vogue in many businesses: Digital Transformation. While the exact definition varies depending upon who is currently pushing it, the notion can be summarized roughly as follows:
Organizations run on data, and in the twenty-first century, your organization needs to be able to take advantage of all of that data to remain competitive in the marketplace. By transforming your company to work more digitally, all of that data can be leveraged to gain a deeper understanding of customers, markets, competitors and trends. This digital transformation is something that a company needs to do to not only survive but also thrive in the new economy.
This sounds like a stirring mission statement, full of high concept and calls to action, partially because there is a lot of truth in it. Most companies and organizations today are not making full use of the data resources they have, and are becoming increasingly siloed and locked down. The vast bulk of them will fail if they can't get a handle on what they are doing with the data within their organizations, especially against competitors who do successfully utilize what's around them.
However, such digital transformations are far from trivial to undertake, for three reasons. First, they require changing both infrastructure and culture within an organization. Second, most managers, especially at the middle tiers of an organization, recognize that such a digital transformation may well result in them having less control, rather than more, of their particular divisions. Third, most managers tend to have a rather dated view of the data and information within their purview, which leaves them with a number of misconceptions about what a fully digitally transformed company looks like.
The prevailing metaphor for data is that it is a liquid - it flows in streams, collects in pools and lakes, goes through pipes, becomes frozen and so forth. From the perspective of system engineering, this viewpoint makes sense, because in general the challenge in building IT systems is accessing, moving, collecting and transforming data. Plumbing is something that IT people have been doing for more than fifty years, and not surprisingly, they have become good at it.
Yet in all of these pipes and stores and lakes, it's also important to understand that data can be thought of as the snapshot of a particular thing in time. Typically, when you save something to a file system or database (and a file system can be thought of as a database with a different access protocol) what you are saving is the state of a thing being represented within an application. That thing, that resource, can be anything - a representation of an employer or customer, a transaction, a way of describing a product.
Until comparatively recently, most of the information within an organization was application-centric. This meant in general the interesting things being done with the data occurred primarily in the application layer, and the data that was persisted between sessions of the application existed primarily to be resuscitated by the application. This meant that a significant percentage of the logic and organization of that data existed primarily outside of the database, with the database serving then to store that information until the application next had a need for accessing it.
Only within the last decade has that been changing, as the idea of data being available enterprise-wide has taken hold. This in turn has created different requirements for both data storage and data transmission, as well as for agreements about how information is structured. Such structure is both syntactical, in a common metalanguage such as XML or JSON, and semantical, in what underlying model is used to describe properties (relationships) and structures (entities).
This shift in thinking brings with it a shift in how resources are identified. This has manifested in the rise of such areas as master data management, identity management and reference data management. Master data management and identity management are two different ways of dealing with the same fundamental problem: how to determine when two references to things are actually talking about the same thing. This is a problem even within the same database, as multiple people may enter the same information about a person, place or thing without being aware of the fact that a previous entry exists for that same entity. The ability to do a comprehensive search on the dataset can help with that to some extent, so long as the workflow is set up to perform such a search prior to committing a new record, though this doesn't necessarily guarantee exclusivity.
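As a minimal sketch of the kind of search a workflow might run before committing a new record, the following compares a normalized name against existing entries. The record identifiers, the sample names and the 0.85 threshold are all hypothetical, and a real system would weigh far more than a name. As noted above, a clean result does not guarantee that the entity is new.

```python
from difflib import SequenceMatcher


def normalize(name: str) -> str:
    """Lowercase, drop punctuation and collapse whitespace."""
    cleaned = "".join(ch for ch in name.lower() if ch.isalnum() or ch.isspace())
    return " ".join(cleaned.split())


def find_possible_duplicates(new_name, existing, threshold=0.85):
    """Return (record_id, score) pairs whose names resemble new_name."""
    target = normalize(new_name)
    matches = []
    for record_id, name in existing.items():
        score = SequenceMatcher(None, target, normalize(name)).ratio()
        if score >= threshold:  # threshold is an assumed cut-off
            matches.append((record_id, round(score, 2)))
    return sorted(matches, key=lambda m: m[1], reverse=True)


# Hypothetical existing records, keyed by a local identifier.
customers = {
    "c-001": "Acme Widgets, Inc.",
    "c-002": "Globex Corporation",
}

# Run the check before committing "ACME Widgets Inc" as a new record.
candidates = find_possible_duplicates("ACME Widgets Inc", customers)
```

Any candidates returned would go to a person for review rather than being merged automatically.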
However, in an enterprise setting where there are potentially hundreds or thousands of databases, this problem of identity management becomes much more complex. Keys from one database to the next will most likely not match for the same resource, and even contextual clues may be sufficiently distinct from one database to the next to make identification difficult (once you move out of the realm of “customer” in particular, this matching can be highly problematic).
A similar problem comes with reference data. In the realm of data modeling, reference data can best be thought of as the adjectives - they categorize entities so that they belong in certain buckets. However, those buckets often differ from one database to the next, primarily because most organizations have few strategies for taxonomy management, let alone for attempting to unify the various controlled vocabularies from one application to the next.
Every major vendor in the data space is attempting to sell their solution to these problems. Half of them use machine learning to attempt to identify patterns and matches. The other half use indexing mechanisms. What all of them have in common is two factors - they require that the data be centralized in a single repository, and in general they do at best a mediocre job of handling keys, because in most cases they are reliant upon consistency of patterns, something difficult to get when you're trying to pull data from multiple sources.
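One way to picture the reference data problem is as a crosswalk between each application's local codes and a shared controlled vocabulary. The sketch below is only an illustration: the system names, the local codes and the shared terms are all invented for the example.

```python
# Hypothetical crosswalk: (source system, local code) -> shared vocabulary term.
CROSSWALK = {
    ("crm", "CA"): "US-CA",
    ("billing", "Calif."): "US-CA",
    ("crm", "WA"): "US-WA",
    ("billing", "Wash."): "US-WA",
}


def to_shared_term(system, local_code):
    """Map a system-specific code to the shared term, or None if unmapped."""
    return CROSSWALK.get((system, local_code))


def unmapped_codes(records):
    """List the (system, code) pairs that still need a taxonomy decision."""
    return sorted({(s, c) for s, c in records if (s, c) not in CROSSWALK})


# Two applications describing the same bucket in different words.
same_bucket = to_shared_term("crm", "CA") == to_shared_term("billing", "Calif.")
todo = unmapped_codes([("crm", "CA"), ("billing", "Ore."), ("billing", "Ore.")])
```

The list of unmapped codes is the part that needs human taxonomy management; the lookup itself is trivial once those decisions are made.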
This, I believe, is why most digital transformation initiatives will fail. There are things that you not only can do, but must do, to move an organization so that it is digitally transformed. Many of those things will not come out of a box. Some of them will be organizationally painful, because they concern who controls the flow of data in the organization, and this is a form of power that those in positions of power will not willingly give up. While this list isn't exhaustive, digitally transforming any organization comes down to the following:
1. Identify the Entities Important For Your Business
2. Survey the External Data That Most Affects Your Business
3. Start With a Clean Slate
4. Catalog Your Metadata and Sources
5. Perform Data Triage
6. Establish Governance, Provenance and Cleanliness
7. Model Late
8. Unify Your Models
9. Remain Context Free
10. Federate Your Leaves
This next section looks at each of these points in detail. Before then, a caveat: there may be many approaches to the problem of transformation, but at least for the purposes of this article, I will focus on knowledge graphs and data catalogs. A knowledge graph is a related network of knowledge, tying data and metadata together using propositional logic. A data catalog is a specialized knowledge graph that not only contains basic information and relationships, but also identifies where within an organization that data is. In effect, these are metadata-oriented solutions, and I feel they are critical for success in such transformations. Now, onto the list:
Identify the Entities Important For Your Business
Before you can do anything else, it is critical that you spend time identifying each resource that you wish to track in your organization. If you are a manufacturer, then you need to know your products, your suppliers and supply chain, your distribution channels, your processes and the organizations within your organization that enable those processes. The same holds true for retailers, but you also likely need to incorporate sales managers, points of sale, catalog entries, marketing campaigns and so on. Most organizations need to track contracts, transactions, customers, prospects and on top of this most also need to manage rights, export controls, privacy information and so forth. Identifying ahead of time those classes of things which make up your business is a key part of establishing a road map for how your digitization process happens.
Survey the External Data That Most Affects Your Business
There is a phenomenon that many software companies run afoul of: the principle of Not Invented Here. If something was not invented here, then it's not good enough to develop on. This usually ends up creating proprietary stacks where the wheels and most of the rest of the vehicles are reinvented (sometimes several times). In the dataspace world, the corresponding concept is Not Our Data. If the information involved is not something that comes from within an organization, it's not good enough to use. This usually results in organizations redefining how regions (states or provinces) are modeled in a country, results in YAA (yet another acronym) for common concepts, and often times means potentially millions of dollars spent on reinventing those damn wheels. To the extent possible, especially when getting started, take advantage of existing data sets, of zip codes and gazetteers and linked data. Government data, when available, can prove incredibly valuable, and I suspect that companies which finally can consolidate data aggregation in specific markets will become huge over the next decade as they start selling this cleaned, curated data in as wide a variety of formats as possible.
Start With a Clean Slate
This is perhaps one of the hardest aspects of a digital transformation, largely because it flies in the face of so much vendor pressure. Everyone wants to put a magic cap on top of an existing database and just query that database directly, until the stakeholders of that existing database refuse to allow it because too many mission-critical applications depend upon that database not being hijacked. To that extent, a good knowledge base should provide just enough information about a resource to make it searchable, either via a Google Search-like app, a semantic navigator of some sort, or via a chatbot or similar natural language processing (NLP) tool. The key is to identify those things within an organization that need consistency first, and build out that information in a curated manner rather than attempting to pull this information directly from a database. Reference tables are often a good place to start, as they hold commonly used information. Make this data available and easily consumable and you can in turn drive other data systems that emerge in the future.
Catalog Your Metadata and Sources
A key aspect of digital transformation is metadata management. This means not only determining what resources you are interested in, but also what databases contain relevant information about those resources. The central goal of digital transformation is to make your data findable and addressable. This is actually a pretty critical function - there are tools that allow you to track APIs, but these usually do not give you a context for saying “if I want to find information about customers, who has that information, how is it addressable and what keys do I need to use to get it?” A data catalog, as a specialized knowledge graph, performs that function. It can also make it possible to aggregate this information in a variety of different forms. However, again, to do this you need to identify and implement different vectors for getting this information into the catalog in the first place.
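To make the catalog question concrete, here is a toy sketch that stores catalog entries as subject-predicate-object statements and answers "who has customer information, how is it addressable and what keys do I need?" The source names, the example.com endpoints, the key names and the three predicates are all assumptions for illustration; a production catalog would more likely sit in a graph store than in a Python list.

```python
# Hypothetical catalog statements: (subject, predicate, object).
CATALOG = [
    ("crm_db", "describes", "Customer"),
    ("crm_db", "endpoint", "https://crm.example.com/api/customers"),
    ("crm_db", "key", "customer_id"),
    ("billing_db", "describes", "Customer"),
    ("billing_db", "endpoint", "https://billing.example.com/api/accounts"),
    ("billing_db", "key", "account_no"),
    ("plm_db", "describes", "Product"),
    ("plm_db", "endpoint", "https://plm.example.com/api/products"),
    ("plm_db", "key", "sku"),
]


def objects(subject, predicate):
    """All objects stated for a given subject and predicate."""
    return [o for s, p, o in CATALOG if s == subject and p == predicate]


def sources_for(entity_type):
    """Which sources describe this entity type, and how to reach them."""
    sources = [s for s, p, o in CATALOG if p == "describes" and o == entity_type]
    return {
        source: {
            "endpoint": objects(source, "endpoint"),
            "key": objects(source, "key"),
        }
        for source in sources
    }


customer_sources = sources_for("Customer")
```

Here `customer_sources` lists both the CRM and billing systems along with the endpoint and key each one uses, which is exactly the context an API tracker alone does not provide.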
Perform Data Triage
Not all data is created equal. Ironically, while spreadsheets are often hideously bad places to store data (for any number of reasons) they are actually useful for gathering and managing metadata and reference data, as an example. A digital transformation strategy should be constantly triaging data as it is discovered. Should it serve as a source of record? Is the data clean? Is it going to be consistently available? Will it be useful across an organization? Will it require higher processing costs to make useful? This means that evaluating the cost of incorporating a new data source into the process should always be a part of an effective data strategy, because frankly, there are many data sources within your organization that are simply not worth the time to catalog.
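The triage questions above can be turned into a rough checklist score. This is only a sketch under stated assumptions: the equal weighting, the one-point cost penalty and the cut-off of three are arbitrary choices made for the example, and the source names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class SourceAssessment:
    name: str
    source_of_record: bool
    clean: bool
    consistently_available: bool
    useful_across_org: bool
    high_processing_cost: bool


def triage_score(a: SourceAssessment) -> int:
    """One point per favorable answer, minus one for high processing cost."""
    score = sum([a.source_of_record, a.clean,
                 a.consistently_available, a.useful_across_org])
    if a.high_processing_cost:
        score -= 1  # assumed penalty
    return score


def triage(assessments, minimum=3):
    """Split sources into those worth cataloging now and those to skip."""
    keep = [a.name for a in assessments if triage_score(a) >= minimum]
    skip = [a.name for a in assessments if triage_score(a) < minimum]
    return keep, skip


keep, skip = triage([
    SourceAssessment("product_master", True, True, True, True, False),
    SourceAssessment("old_sales_sheet", False, False, False, True, True),
])
```

The point of a score like this is less the number itself than forcing the cost question to be asked every time a new source is discovered.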
Establish Governance, Provenance and Cleanliness
As the scope of datasets has expanded beyond the application boundary to that of the enterprise (or even between enterprises), the importance of governance has risen from being a largely advisory role to becoming essential within organizations. The data governor (often known as a CDO or CIO) ultimately becomes responsible for the reliability, cleanliness, veracity and relevance of the data within the enterprise. The work for individual components of that data story falls to data stewards, responsible for specific domains of content, which in turn are typically curated by data librarians. In this context, a data steward is typically an ontologist, someone who is responsible for determining the modeling, structure and metadata requirements for a given model, while the librarians are taxonomists who add descriptive content and establish categorizations on the resource entities themselves. They also determine the provenance of data, tracking where the data came from, how it's been modified over time, and determining the relevance and reliability of given attributes from various data sources. Finally, the librarians will likely use semantic tools to develop descriptive metadata, content which usually doesn't get captured in relational databases but that becomes critical when trying to search for specific content.
Model Late
This one may seem a bit surprising, but it's actually crucial in the transformation process. Once you have determined the entities that you wish to track at the enterprise level, do not be in a rush to start creating schemas or models. There are several reasons for this. Unlike pulling data from databases, a semantic knowledge graph should be somewhat opportunistic - you take the information that you find, not necessarily the information that you believe you need. By working under an open world assumption, you can capture the information that may be useful even if it wasn't originally in your map, and then fill in the details elsewhere with deeper content. Often times, what happens is that a natural model emerges organically in this fashion, rather than one being forced by someone's preconceptions.
Unify Your Models
As you gain more insight into the attributes associated with a given entity, an effort should be made to establish clear definitions on what constitutes an entity and what attributes exist in common between entities. This requires a number of techniques, including being cognizant of dimensional modeling, working towards unifying reference data and building what amount to key chains that allow for seamless MDM. Simply throwing data into a semantic knowledge graph will not make it semantic, but performing data harmonization and smart federation will. Model unification should be seen as a long-term goal, but by unifying key pieces early, it becomes easier to build applications consistently.
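The "key chain" idea can be sketched as a small structure that ties each system's local key to a single master identifier. This is one possible reading of the term, not a standard MDM API; the class, the system names and the identifiers are all hypothetical.

```python
class KeyChain:
    """Links local keys from many systems to one master identifier."""

    def __init__(self):
        self._to_master = {}   # (system, local_key) -> master_id
        self._by_master = {}   # master_id -> {system: local_key}

    def link(self, master_id, system, local_key):
        """Record that a system's local key refers to this master entity."""
        self._to_master[(system, local_key)] = master_id
        self._by_master.setdefault(master_id, {})[system] = local_key

    def resolve(self, system, local_key):
        """Master identifier for a local key, or None if never linked."""
        return self._to_master.get((system, local_key))

    def keys_for(self, master_id):
        """Every known local key for a master entity, by system."""
        return dict(self._by_master.get(master_id, {}))


chain = KeyChain()
chain.link("cust:42", "crm", "10017")
chain.link("cust:42", "billing", "AC-553")

# Two different local keys resolve to the same master entity.
same_entity = chain.resolve("crm", "10017") == chain.resolve("billing", "AC-553")
```

Deciding which keys belong on the same chain is the hard, curated part; once they are linked, moving between systems becomes a lookup.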
Remain Context Free
This is likely a new concept to most people. Most applications have a context - they work on a particular class of data, such as customer entities, transactions, product descriptions and so forth. However, when dealing with information at an enterprise level, one of the most powerful approaches that you can employ (albeit one with its own set of headaches) is to remain context free. This means that in the structure of each of these entities, there is just enough information to allow a resource to identify its type, and from that to then determine the attributes and relationships that the type itself has. This technique, also called introspection, makes it possible to build applications that can auto-configure themselves based upon what types they are currently working on. This becomes especially important when you may have potentially hundreds or even thousands of classes involved, though even there context-free programming can reduce the overall complexity of data models dramatically by keeping classes simple and then applying different categorical constraints to determine the presentation of any given class instance.
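A small sketch may help show what introspection looks like in practice. Each resource carries only its type, and a generic function looks up what that type's attributes and relationships are before deciding what to display. The type registry, the field names and the sample resources are invented for the example.

```python
# Hypothetical type registry: what each class of resource looks like.
TYPES = {
    "Customer": {"attributes": ["name", "region"], "relationships": ["placed"]},
    "Order": {"attributes": ["total"], "relationships": ["placedBy"]},
}


def describe(resource):
    """Build a view of any resource using only its declared type."""
    type_def = TYPES[resource["type"]]
    fields = type_def["attributes"] + type_def["relationships"]
    # The function never hard-codes Customer or Order fields.
    return {field: resource.get(field) for field in fields}


customer = {"type": "Customer", "name": "Acme Widgets", "region": "US-WA",
            "placed": ["order:7"]}
order = {"type": "Order", "total": 125.0, "placedBy": "cust:42"}

customer_view = describe(customer)
order_view = describe(order)
```

Adding a new class means adding an entry to the registry rather than writing a new screen or handler, which is where the reduction in complexity comes from.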
Federate Your Leaves
Building enterprise data knowledge graphs involves a trade-off - you want enough information within a knowledge graph to handle 80% of the queries you are likely to encounter, but you don't necessarily want to replicate completely all data from all data systems. This is where federation comes into play. Federation involves retrieving content from external data stores, typically as part of a query. In pure semantic systems (typically using a meta-language called RDF), the content is added to the graph before the query itself is fully evaluated, but partial federation can be done, much more painfully, without semantics. There are several strategies that you can use for federation, though the one that I've found seems to work best is to build out the knowledge graph internally first, then when the dynamics are worked out, migrate the outer “leaves” to a more data-centric node.
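As a rough illustration of the non-semantic kind of federation, the sketch below keeps core facts in a local graph and pulls the outer "leaf" details from an external store at query time. Everything here is a stand-in: `fetch_from_billing` simulates a call to an external system, and the identifiers and field names are hypothetical.

```python
# Core facts held in the knowledge graph itself.
LOCAL_GRAPH = {
    "cust:42": {"type": "Customer", "name": "Acme Widgets", "leaf_source": "billing"},
    "cust:43": {"type": "Customer", "name": "Globex", "leaf_source": None},
}


def fetch_from_billing(resource_id):
    """Stand-in for a call to an external data store (assumed, not real)."""
    external = {"cust:42": {"open_invoices": 3}}
    return external.get(resource_id, {})


# Registry of federated sources, keyed by name.
FEDERATED_SOURCES = {"billing": fetch_from_billing}


def query(resource_id):
    """Answer from the local graph, then add leaf data from its source."""
    result = dict(LOCAL_GRAPH[resource_id])
    fetch = FEDERATED_SOURCES.get(result.pop("leaf_source"))
    if fetch is not None:
        result.update(fetch(resource_id))  # merged before returning the answer
    return result


federated = query("cust:42")
local_only = query("cust:43")
```

The local graph answers most questions on its own, and only the detail that lives elsewhere is retrieved on demand.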
You may have noticed, in going through this, that there was no talk about machine learning or artificial intelligence or Internet of Things or other beasties of the current hype landscape. These things are certainly factors in enterprise digitalization, tools that reduce your reliance upon purely human curators, but it's important to understand that none of those things by themselves is going to be as important long term as gaining control of your overall data strategy. The end goal of creating a digital enterprise is to create a strategy where your information, whether data products or documents, can be identified, curated, tied into a knowledge graph, queried and referenced. It's also going to be an ongoing process - just as agile has changed the methodology of development, so too will digital transformations change the methodology of data (and metadata) management. Such semantic data catalogs are in effect the index of your virtual organization, the way to readily identify where the resources that make your company work are located and defined. That's what digital transformation is ultimately all about.
Kurt Cagle is Managing Editor for Cognitive World, and is a contributing writer for Forbes, focusing on future technologies, science, enterprise data management, and technology ethics. He also runs his own consulting company, Semantical LLC, specializing in Smart Data, and is the author of more than twenty books on web technologies, search and data. He lives in Issaquah, WA with his wife, Cognitive World Editor Anne Cagle, daughters and cat (Bright Eyes).