Sometimes, you can enter into a technology too early. The groundwork for semantics was laid down in the late 1990s and early 2000s, with Tim Berners-Lee's stellar Semantic Web article, which debuted in Scientific American in 2001, seen by many as the movement's birth. Yet many early participants in the field of semantics discovered a harsh reality: computer systems were too slow to handle the intense indexing requirements the technology needed, the original specifications and APIs failed to handle important edge cases, and, perhaps most importantly, there were simply not enough real-world use cases where semantics made sense; most could easily be met by existing approaches and technology.
Meanwhile, the Big Data initiatives that had marked the early part of the 2010s were facing some real problems. The original promise of Hadoop as a map / reduce framework had ended up creating large numbers of data lakes that aggregated content but that sat under-utilized. Data scientists struggled to deal with dirty data that was no cleaner for having been put in data lakes. JSON databases had grown in popularity, but they were proving hard to query consistently, and all too many Hadoop projects ended up becoming large, slow, but cheap data graveyards for regulatory data (the kind of data that must be retained for five years).
Context determines the user's focus on the database within an application. (Image: Kurt Cagle)
Most databases exist primarily to service a specific application. The schema (or data model) of that database will typically be set up to provide access optimally for a particular component - customers, for instance, or parts or products. The basis then of most such applications is the record, which is a view of information made from joining multiple tables together using keys. Because most “records” in this regard are actual composite structures, applications usually make use of cursors, which can be thought of as "You are here" signs within the set of records.
NoSQL databases tend to have multiple such cursors - one describing where a data document is located within a database, a second that describes where, within that data document, the current moment of interest is. In that particular realm, these cursors are better described as contexts, and can be thought of as a way of marking the point of greatest interest in an application right now. Notice, I didn't say "in a database." Context is something that can only be determined externally, based upon the specific requirements of the application itself. To the extent that you can separate context from data storage, you have the means to widen the lens, so to speak, going from something that lets you view or look at only one aspect of the data space to one that lets you see it from all possible perspectives.
This is what semantics gives you. A semantic database is a set of interconnected resources, where each node (the starting and ending point of an arrow) is some kind of resource and each edge is a relationship. A given resource can then be seen as the sum of all the properties (attributes) and relationships that connect the resource node to other nodes, in what is collectively called a graph. While there's more to it than that, this representation can also be thought of as taking a relational database and exploding each row's id, column and value into a “triple” of information, then storing all of these triples. While it takes more memory to do this, the advantage that you get is that, with a semantic triple store, you can build context-free data systems.
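The "exploding" of a relational row into triples can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `customer` table, its columns, and the `table/id` naming scheme for subjects are hypothetical, and a real triple store would use full IRIs and typed values rather than plain strings.

```python
from typing import Any, Dict, List, Tuple

Triple = Tuple[str, str, Any]


def row_to_triples(table: str, key_column: str, row: Dict[str, Any]) -> List[Triple]:
    """Explode one relational row into (subject, predicate, object) triples."""
    # The row's id becomes the subject; each remaining column becomes a predicate.
    subject = f"{table}/{row[key_column]}"
    return [
        (subject, column, value)
        for column, value in row.items()
        if column != key_column and value is not None
    ]


# Hypothetical row from a hypothetical "customer" table.
customer_row = {"id": 42, "name": "Jane Doe", "city": "Seattle"}
triples = row_to_triples("customer", "id", customer_row)
for triple in triples:
    print(triple)
```

Each non-key column yields one triple, which is where the extra memory cost mentioned above comes from: a row with ten columns becomes nine separate statements about the same subject.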
When a system is context-free, this means that there is essentially no preferred direction that the data itself takes - there is only the context of the resource that you are currently looking at the database from. If your context is a customer object, then what can be seen from there are addresses, contracts that the customer is party to, and communication channels that they use. Contracts (including smart contracts) identify the products that the customer is contracting for, which are in turn tied to the manufacturer that produces that product, and so forth. This interconnectedness is hard to achieve with relational databases, primarily because the chaining needed to create graphs is difficult to express in SQL without knowledge of the schema. It is easier with NoSQL representations like XML or JSON as long as the information is contained within the same document, but it is more difficult to do well across documents.
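To make the "no preferred direction" idea concrete, here is a minimal sketch in which the same set of edges is walked from either end. The node names and predicates are hypothetical, and the breadth-first walk is a stand-in for what a graph query engine would do, not a description of any particular product.

```python
from collections import defaultdict, deque

# Hypothetical edges following the customer example in the text.
EDGES = [
    ("customer/42", "hasAddress", "address/7"),
    ("customer/42", "partyTo", "contract/9"),
    ("contract/9", "covers", "product/3"),
    ("product/3", "madeBy", "manufacturer/1"),
]


def build_index(edges):
    """Index every edge from both ends so that no direction is preferred."""
    neighbors = defaultdict(set)
    for subject, _predicate, obj in edges:
        neighbors[subject].add(obj)
        neighbors[obj].add(subject)
    return neighbors


def reachable(edges, start, max_hops):
    """Breadth-first walk: map each node within max_hops of start to its hop count."""
    neighbors = build_index(edges)
    distance = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if distance[node] == max_hops:
            continue  # do not expand beyond the hop limit
        for nxt in neighbors[node]:
            if nxt not in distance:
                distance[nxt] = distance[node] + 1
                queue.append(nxt)
    return distance


print(reachable(EDGES, "customer/42", 3))
print(reachable(EDGES, "manufacturer/1", 3))
```

Starting from the customer, the manufacturer is three hops away; starting from the manufacturer, the customer is three hops away. Only the starting context changed, not the data.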
This context-free aspect in turn can help to identify what semantic databases are best suited for, as well as pointing out the different architectures of the beasties in the semantic zoo. While there are almost as many potential architectures in the semantic space as there are in the relational one, a few seem particularly well suited to a semantic rather than a relational or NoSQL representation: knowledge graphs, smart data hubs, semantic data catalogs, metadata managers and smart contracts.
Knowledge graphs also work well for managing reference data and taxonomies, both of which typically have strong curatorial aspects.
Smart Data Hubs: Semantic Databases
A smart data hub is perhaps the closest thing to a relational database in the semantic space. Such hubs usually involve loading data into a system directly from an external database, spreadsheet, or structural document, with enough metadata sprinkled in via the ingestion process (the process where data is converted to a form most useful for the database) to help support search and navigation functions. A number of vendors now offer smart data hubs, including TopQuadrant and MarkLogic.
Smart data hubs can be pure RDF, where everything is converted into a single canonical namespace upon ingestion; mixed RDF, where the data is stored in RDF but there is only minimal harmonization to a single canonical standard; or minimally RDF, where the data is stored using XML or JSON while the metadata is extracted and stored in RDF. In all three cases, the RDF serves primarily to manage identity keys that make it possible to identify when two resource representations are about the same concept. In general, the more of the content that is contained in RDF, the more a programmer can use SPARQL, the RDF query language. The trade-off is that certain triple stores (such as MarkLogic) may offer other mechanisms for querying. Semantic search is not textual search: the two use different mechanisms to identify how and which nodes satisfy specific query constraints, but there is enough overlap in what they do that they can be used together.
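The flavor of a triple-pattern query, and of an identity key linking two representations of the same concept, can be sketched without a triple store. This is plain Python rather than SPARQL, written as a stand-in for a single triple pattern; the `crm:` and `billing:` prefixes, the `sameAs` predicate name, and the data are all hypothetical.

```python
# Hypothetical data: two systems describe the same customer under different ids.
TRIPLES = [
    ("crm:cust-42", "name", "Jane Doe"),
    ("billing:acct-9", "balance", 120.0),
    ("crm:cust-42", "sameAs", "billing:acct-9"),  # the identity key
]


def match(triples, subject=None, predicate=None, obj=None):
    """Return triples fitting the pattern; None acts like an unbound query variable."""
    return [
        (s, p, o)
        for s, p, o in triples
        if subject in (None, s) and predicate in (None, p) and obj in (None, o)
    ]


def describe(triples, resource):
    """Collect properties of a resource and of anything it is declared the same as."""
    aliases = {resource}
    aliases.update(o for _, _, o in match(triples, subject=resource, predicate="sameAs"))
    return {
        p: o
        for s, p, o in triples
        if s in aliases and p != "sameAs"
    }


print(match(TRIPLES, predicate="sameAs"))
print(describe(TRIPLES, "crm:cust-42"))
```

The identity triple is what lets a query that starts from the CRM record also pick up the balance held by the billing system. This sketch follows the link in one direction only; a real store would typically treat such identity statements as symmetric.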
A smart data hub can be thought of as a semantic data warehouse, where everything is self-contained within the triple store. The advantage of this, especially when you're dealing with either pure or mixed systems, is that you can extract data in a wide variety of formats, from CSV (spreadsheets) to JSON, XML, Word documents and web pages, and use it to drive any number of viewers and dashboards. The information contained within is not likely to be curated in a human sense, though human beings will likely be responsible for writing the ingestor logic that translates data from external sources. Smart data hubs also frequently incorporate provenance information (where data came from) from the ingestion in order to make governance of that data easier to manage.
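As a rough illustration of pulling the same triples out in more than one format, the sketch below regroups them by subject and emits both JSON and CSV using only the Python standard library. The product data and the `id` column name are hypothetical, and the sketch assumes each property has a single value per subject (a repeated property would overwrite the earlier one).

```python
import csv
import io
import json
from collections import defaultdict

# Hypothetical triples held in the hub.
SAMPLE = [
    ("product/3", "name", "Widget"),
    ("product/3", "price", 9.5),
    ("product/4", "name", "Gadget"),
]


def to_records(triples):
    """Regroup triples into one property dictionary per subject."""
    records = defaultdict(dict)
    for subject, predicate, obj in triples:
        records[subject][predicate] = obj  # assumes single-valued properties
    return dict(records)


def to_json(triples):
    return json.dumps(to_records(triples), indent=2, sort_keys=True)


def to_csv(triples):
    """One row per subject, one column per predicate; missing values stay blank."""
    columns = sorted({predicate for _, predicate, _ in triples})
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["id"] + columns, lineterminator="\n")
    writer.writeheader()
    for subject, props in to_records(triples).items():
        writer.writerow({"id": subject, **props})
    return buffer.getvalue()


print(to_json(SAMPLE))
print(to_csv(SAMPLE))
```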
With a sufficiently robust query system, it is possible to federate external data into a smart data hub as part of a query, but most typically this also requires that the data in question is in RDF (and hence storable and referenceable in temporary in-memory graphs) and in many cases that the external content contains enough of the metadata model used by the source system to be queryable. This is called in-memory federation, and in general should be avoided because of its impact on both performance and memory requirements.
A particular architecture that is gaining popularity in the semantic space is the semantic data catalog (SDC), which can be thought of as the marriage of knowledge bases and smart data hubs, and has a lot of similarity to a library card catalog. In essence, within an SDC, each "card" includes a reference (usually a URL) to a specific resource, whether it be an electronic book or an endpoint of a microservice with a specific set of key parameters. The card contains enough information (and connections to other cards) to allow a person to navigate across the information space, but it does not actually include the resources themselves, only their address.
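A catalog "card" of this kind might look something like the following sketch. The class names, fields, and `example.org` URLs are hypothetical and chosen only to mirror the description above: each card carries an address, some searchable metadata, and links to other cards, but never the resource itself.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class CatalogCard:
    card_id: str
    title: str
    resource_url: str  # where the resource lives; the catalog never stores the resource
    keywords: List[str] = field(default_factory=list)
    related: List[str] = field(default_factory=list)  # ids of connected cards


class Catalog:
    def __init__(self) -> None:
        self.cards: Dict[str, CatalogCard] = {}

    def add(self, card: CatalogCard) -> None:
        self.cards[card.card_id] = card

    def search(self, keyword: str) -> List[str]:
        """Return the addresses of matching resources, never their content."""
        return sorted(
            card.resource_url
            for card in self.cards.values()
            if keyword in card.keywords
        )

    def related_to(self, card_id: str) -> List[str]:
        """Follow card-to-card links to navigate the information space."""
        return [rid for rid in self.cards[card_id].related if rid in self.cards]


catalog = Catalog()
catalog.add(CatalogCard("card:1", "Quarterly sales report",
                        "https://example.org/docs/sales-q1.pdf",
                        keywords=["sales", "report"], related=["card:2"]))
catalog.add(CatalogCard("card:2", "Sales lookup service",
                        "https://example.org/api/sales?region=west",
                        keywords=["sales", "service"]))
print(catalog.search("sales"))
print(catalog.related_to("card:1"))
```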
Semantic data catalogs are often useful when you have a large number of heterogeneous, non-RDF based databases. In this particular case, the resource in question is an abstraction, an entity that holds just enough information to make the resource searchable, but that also contains the metadata needed to identify a particular representation of that resource in a traditional database (perhaps exposed via OData or a similar microservice architecture, perhaps stored within a smart data hub that has the corresponding link).
This is also critical for working with both content and digital asset management systems. The assets themselves are generally not stored within the same database as the catalog. Instead, such systems surface enough metadata, perhaps by reading EXIF data or (as is the case with SmartLogic and similar systems) by performing entity extraction, and store this annotational information within the SDC. This also helps in resolving master data, as it makes it possible to identify both the resource identifiers and the associated relationships.
Note, then, that semantic catalogs essentially retrieve links to data, not necessarily the data itself. The catalog does not automatically translate from one source to another, though having a semantic data catalog is a necessary precursor for this to happen. Schema-to-schema mapping (also known as ontology-to-ontology mapping) is a surprisingly complex process, much akin to translating between languages. However, this also brings a warning. Even though human language translation is finally gaining a fair amount of traction, the precision in meaning of technical metadata (such as dimensional modeling) and the ambiguity of business language processing mean that at best we will be manually modeling translations for at least another decade before we get to the 99.99% accuracy that businesses require.
Before leaving the topic, it is also worth noting that semantic catalogs are actually ideal for storing information for real-world catalogs as well. Most catalogs are, when you get right down to it, highly referential in nature, with lots of categorization, links to resources, and the need for consistent annotation. Certain aspects of catalog entries are less ideal, such as sales prices, counts and similar transactional content, but these can generally be stored externally and then linked to by reference. This externally stored content can also be retrieved as part of the generation of output, either within or after a semantic query.
Metadata Managers are a variant of data catalog that usually come into play with organizations that are dealing with differing but conceptually overlapping ontologies. This is typically a problem when you have a data catalog type environment, but because of acquisitions, you still end up with multiple ontologies that overlap and need to be translated. In this case, there's usually the goal of creating a single canonical ontology, but because the source ontologies are still in use, an intermediate stage is needed to manage the translation until they can be phased out. These differ from semantic data catalogs because they are managing mappings from one ontology to another, and they are actually a pretty crucial step towards a universal data conversion engine.
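The core job of a metadata manager, holding mappings from source ontologies to a canonical one, can be sketched as a simple predicate translation table. Every name here (`acme:`, `globex:`, `canon:` and the individual properties) is hypothetical, and real ontology mappings are rarely this one-to-one; the point is only that anything without a mapping is surfaced for manual modeling rather than silently dropped.

```python
# Hypothetical mapping from two acquired companies' vocabularies to a canonical one.
MAPPING = {
    "acme:custName": "canon:name",
    "acme:zip": "canon:postalCode",
    "globex:fullName": "canon:name",
}


def to_canonical(triples, mapping):
    """Translate predicates to the canonical ontology; set aside anything unmapped."""
    translated, unmapped = [], []
    for subject, predicate, obj in triples:
        if predicate in mapping:
            translated.append((subject, mapping[predicate], obj))
        else:
            unmapped.append((subject, predicate, obj))
    return translated, unmapped


source = [
    ("acme:c1", "acme:custName", "Jane Doe"),
    ("globex:p9", "globex:fullName", "John Roe"),
    ("globex:p9", "globex:tier", "gold"),  # no canonical equivalent yet
]
translated, unmapped = to_canonical(source, MAPPING)
print(translated)
print(unmapped)
```

While the source ontologies remain in use, the mapping table is the intermediate stage; once a source is phased out, its entries can simply be retired.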
One of the ultimate goals for an organization should be the creation of a single canonical conceptual model for the expression of information between organizations. This does not mean that everyone should use the same model for everything, but it does mean that if you wish to communicate information between departments or with business partners, you need to adhere internally to a common representation for the bulk of it. A digital transformation strategy therefore ultimately comes back to the definition of a clear, consistent business and technical language for the organization. Many tools, such as data catalogs, smart data hubs and metadata managers, ultimately help preserve the existing cacophony of independent data models if there is no target canonical model that everything moves towards. Adoption of any semantic technology thus usually comes back down to maintaining data discipline and governance, which is why effective metadata management is less about tools than it is about process.
Smart Contracts and Internet of Things
The final beastie in the semantic zoo is a bit different from the others, though again it relies on many of the same principles. Smart contracts first emerged in the context of blockchain, but blockchains do not, in general, actually store data - they store unique pointers. A semantic system is ultimately a global pointer system, where every resource has a unique machine-readable name. This makes semantic systems logical complements to blockchain, especially in the arena of smart contracts. A smart contract, then, is a resource that binds other resources together. The title of a car is a good example - it binds a vehicle (identified by a VIN) to an owner (identified by any number of different systems) through an issuing authority, over a specific period. The title is a contract that indicates that the issuing authority recognizes the owner has the right to possess and use the vehicle for a certain set of purposes. It's also a contract because it stipulates penalties, imposed by the same issuing authority, for non-compliance with the terms of the title, and as such, the fulfillment of those penalties also needs to be tracked.
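The car title example can be modeled as a resource whose only job is to bind other named resources together. The sketch below is illustrative only: the identifiers (including the placeholder VIN), the predicate names, and the dates are hypothetical, and penalty tracking is left out.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Tuple


@dataclass(frozen=True)
class VehicleTitle:
    title_id: str
    vehicle: str   # identifier derived from the VIN
    owner: str     # identifier from whatever system names the owner
    issuer: str    # the issuing authority
    valid_from: date
    valid_to: date

    def in_force(self, on: date) -> bool:
        """True if the title binds owner and vehicle on the given day."""
        return self.valid_from <= on <= self.valid_to

    def as_triples(self) -> List[Tuple[str, str, str]]:
        """Express the contract as relationships between named resources."""
        return [
            (self.title_id, "bindsVehicle", self.vehicle),
            (self.title_id, "bindsOwner", self.owner),
            (self.title_id, "issuedBy", self.issuer),
            (self.title_id, "validFrom", self.valid_from.isoformat()),
            (self.title_id, "validTo", self.valid_to.isoformat()),
        ]


# All identifiers below are placeholders, not real records.
title = VehicleTitle(
    title_id="title:0001",
    vehicle="vin:EXAMPLE0000000001",
    owner="person:jane-doe",
    issuer="authority:example-dmv",
    valid_from=date(2024, 1, 1),
    valid_to=date(2028, 12, 31),
)
print(title.in_force(date(2025, 6, 1)))
for triple in title.as_triples():
    print(triple)
```

Note that the title holds only pointers to the vehicle, owner and issuer. That is the same role the text assigns to a blockchain entry, which is why the two fit together naturally.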
Smart contracts are also tied intimately with the Internet of Things (IoT). IoT is ultimately about networks and the relationships between resources, not just in simple properties but also with such factors as security, actions, discovery and related areas. Increasingly IoT systems are making use of semantic graphs to keep the complex web of interconnectedness manageable and easy to traverse and query.
This is a high level view of semantic architectures that I believe will become significant to the enterprise over the course of the next three to five years. The take-away from this is the understanding that just as relational databases tend to be useful for certain types of applications, so too will semantic databases, where what is being stored - the information architecture - is easily as important as how it is being stored (the structural architecture).
Kurt Cagle, managing editor/Drupal lead/contributor, has worked within the data science sphere for the last fifteen years, with experience in data systems architecture, data modeling, governance, ETL, semantic web technologies, data analytics, data lifecycle and large scale data integration projects. Kurt has also worked as a technology evangelist and writer, with nearly twenty books on related web and semantic technologies and hundreds of articles.