Garbage in, garbage out. There’s no truer saying in computer science, and it applies especially to artificial intelligence. Machine-learning algorithms depend heavily on accurate, clean, and well-labeled training data to produce accurate results. If you train your machine-learning models with garbage, it's no surprise you'll get garbage results. It's for this reason that the vast majority of the time spent on AI projects goes to the data collection, cleaning, preparation, and labeling phases.
According to a recent report from AI research and advisory firm Cognilytica, over 80% of the time spent on AI projects is spent dealing with and wrangling data. Even more important, and perhaps surprising, is how human-intensive much of this data preparation work is. For supervised forms of machine learning to work, especially the multi-layered deep learning neural network approaches, they must be fed large volumes of correct examples that are appropriately annotated, or "labeled," with the desired output. For example, if you're trying to get your machine-learning algorithm to correctly identify cats inside of images, you need to feed that algorithm thousands of images of cats, appropriately labeled as cats, with the images not having any extraneous or incorrect data that will throw the algorithm off as you build the model. (Disclosure: I’m a principal analyst with Cognilytica.)
Data Preparation: More than Just Data Cleaning
According to Cognilytica’s report, there are many steps required to get data into the right “shape” so that it works for machine-learning projects:
- Removing or correcting bad data and duplicates - Data in the enterprise environment is exceedingly “dirty” with incorrect data, duplicates, and other information that will easily taint machine-learning models if not removed or replaced.
- Standardizing and formatting data - Just how many different ways are there to represent names, addresses, and other information? Images are many different sizes, shapes, formats, and color depths. In order to use any of this for machine-learning projects, the data needs to be represented in the same manner or you’ll get unpredictable results.
- Updating out-of-date information - The data might be in the right format and accurate, but out of date. You can’t train machine-learning systems when you’re mixing current with obsolete (and irrelevant) data.
- Enhancing and augmenting data - Sometimes you need extra data to make the machine-learning model work, such as calculated fields or additional sourced data to get more from existing data sets. If you don’t have enough image data, you can actually “multiply” it by simply flipping or rotating images while keeping their data formats consistent.
- Reducing noise - Images, text, and data can have “noise,” which is extraneous information or pixels that don’t really help with the machine-learning project. Data preparation activities will clear those up.
- Anonymizing and de-biasing data - Remove all unnecessary personally identifiable information from machine-learning data sets and remove all unnecessary data that can bias algorithms.
- Normalization - For many machine-learning algorithms, especially Bayesian classifiers and other approaches, data needs to be represented in standard ranges so that one input doesn’t overpower others. Normalization works to make training more effective and efficient.
- Data sampling - If you have very large data sets, you need to sample that data to be used for the training, test and validation phases, and also extract subsamples to make sure that the data is representative of what the real-world scenario will be like.
- Feature enhancement - Machine-learning algorithms work by training on “features” in the data. Data preparation tools can accentuate and enhance the data so that the features the algorithms should be trained on are easier to separate from less relevant data.
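To make a few of the steps above concrete, here is a minimal sketch of removing bad rows and duplicates, standardizing a text field, and normalizing a numeric field. It is an illustration only, not the method from the Cognilytica report or from any vendor's product. The record layout (`name`, `age`) and the helper names `standardize_name`, `clean_records`, and `min_max_normalize` are hypothetical, and min-max scaling is just one common normalization choice.

```python
def standardize_name(raw):
    """Collapse repeated whitespace and unify casing so variants compare equal."""
    return " ".join(raw.split()).title()


def clean_records(records):
    """Drop rows with missing fields, standardize names, and remove duplicates."""
    seen = set()
    cleaned = []
    for rec in records:
        # Bad data: skip any row that is missing a required field.
        if rec.get("name") is None or rec.get("age") is None:
            continue
        row = {"name": standardize_name(rec["name"]), "age": float(rec["age"])}
        # Duplicates: after standardization, identical rows share the same key.
        key = (row["name"], row["age"])
        if key in seen:
            continue
        seen.add(key)
        cleaned.append(row)
    return cleaned


def min_max_normalize(values):
    """Rescale values into [0, 1] so no single input dominates by magnitude."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant column: nothing to scale
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical raw input with a formatting variant, a duplicate, and a bad row.
raw = [
    {"name": "  ada   lovelace ", "age": "36"},
    {"name": "Ada Lovelace", "age": 36},
    {"name": "alan turing", "age": None},
    {"name": "Grace HOPPER", "age": 85},
]
rows = clean_records(raw)
scaled_ages = min_max_normalize([r["age"] for r in rows])
```

In this toy input, the two spellings of the same name collapse into one row only because standardization runs before the duplicate check; the order of the steps matters.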
You can imagine that performing all these steps on gigabytes, or even terabytes, of data can take significant amounts of time and energy, especially if you have to do it over and over until you get things right. It’s no surprise that these steps take up the vast majority of machine-learning project time. Fortunately, the report also details solutions from third-party vendors, including Melissa Data, Paxata, and Trifacta, whose products can perform the above data preparation operations on large volumes of data at scale.
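Two of the steps listed earlier, augmentation by flipping or rotating images and sampling into training, validation, and test sets, can be sketched in a few lines. This is a simplified illustration under stated assumptions: an image is represented as a plain list of pixel rows, and the helper names (`flip_horizontal`, `rotate_90_clockwise`, `augment`, `split_dataset`) and the 70/15/15 split are hypothetical choices, not figures from the report.

```python
import random


def flip_horizontal(image):
    """Mirror each row of pixels left to right."""
    return [row[::-1] for row in image]


def rotate_90_clockwise(image):
    """Rotate the pixel grid a quarter turn clockwise."""
    return [list(col) for col in zip(*image[::-1])]


def augment(image):
    """Return the original image plus two transformed copies."""
    return [image, flip_horizontal(image), rotate_90_clockwise(image)]


def split_dataset(items, train=0.7, validation=0.15, seed=0):
    """Shuffle the items and cut them into train, validation, and test lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    n_train = round(len(shuffled) * train)
    n_val = round(len(shuffled) * validation)
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_val],
        shuffled[n_train + n_val:],  # whatever remains becomes the test set
    )


# A tiny 2x2 "image" of pixel values, used only for illustration.
image = [[1, 2], [3, 4]]
variants = augment(image)

train_set, val_set, test_set = split_dataset(range(20))
```

A plain random split like this does not guarantee that each subset is representative of real-world conditions; with imbalanced classes, a stratified split would be the safer choice.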
In order for machine-learning systems to learn, they need to be trained with data that represents the thing the system needs to know. As detailed above, that data needs to be not only of good quality but also “labeled” with the right information. Simply having a bunch of pictures of cats doesn’t train the system unless you tell the system that those pictures are cats, or a specific breed of cat, or just an animal, or whatever it is you want the system to know. Computers can’t put those labels on the images themselves, because it would be a chicken-and-egg problem. How can you label an image if you haven’t fed the system labeled images to train it on?
The answer is that you need people to do that. Yes, the secret heart of all AI systems is the human intelligence that labels the images that systems later train on. Human-powered data labeling is a necessary component for any machine-learning model that needs to be trained on data that hasn't already been labeled. There is a growing set of vendors providing on-demand labor to help with this labeling, so companies don't have to build up their own staff or expertise to do so. Companies like CloudFactory, Figure Eight, and iMerit have emerged to provide this capability to organizations that are wise enough not to build up their own labor force for necessary data labeling.
Eventually, there will be a large number of pretrained neural networks that organizations can use for their own models, or extend via transfer learning to new applications. But until that time, organizations need to deal with the human-dominated labor involved in data labeling, something Cognilytica has found takes up to 25% of total machine-learning project time and cost.
AI Helping Data Preparation
Even with all this activity in data preparation and labeling, Cognilytica sees that AI will have an impact on this process. Increasingly, data preparation firms are using AI to automatically identify data patterns, autonomously clean data, apply normalization and augmentation based on previously learned patterns, and aggregate data where necessary based on previous machine-learning projects. Likewise, machine learning is being applied to data labeling to speed up the process by suggesting potential labels and applying bounding boxes. In this way, AI is being applied to help make future AI systems even better.
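The idea of a model suggesting labels for humans to confirm can be sketched as follows. This is a toy illustration, not a description of how any labeling vendor works: it uses a nearest-centroid classifier as a stand-in for a real model, and the function names (`fit_centroids`, `suggest_label`), the two-dimensional feature vectors, and the `min_margin` threshold are all hypothetical.

```python
import math


def centroid(points):
    """Average a list of equal-length feature vectors."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))


def fit_centroids(labeled):
    """Compute one centroid per label from (features, label) pairs."""
    by_label = {}
    for features, label in labeled:
        by_label.setdefault(label, []).append(features)
    return {label: centroid(points) for label, points in by_label.items()}


def suggest_label(features, centroids, min_margin=0.5):
    """Return (suggested_label, needs_human_review).

    The item is flagged for review when the closest and second-closest
    centroids are nearly the same distance away, meaning the model is unsure.
    """
    ranked = sorted(centroids, key=lambda lab: math.dist(features, centroids[lab]))
    best = ranked[0]
    if len(ranked) == 1:
        return best, False
    gap = math.dist(features, centroids[ranked[1]]) - math.dist(features, centroids[best])
    return best, gap < min_margin


# A handful of human-labeled examples (hypothetical 2-D features).
labeled = [
    ((0.0, 0.0), "cat"),
    ((0.0, 1.0), "cat"),
    ((5.0, 5.0), "dog"),
    ((6.0, 5.0), "dog"),
]
centroids = fit_centroids(labeled)

confident = suggest_label((0.5, 0.5), centroids)    # clearly near the "cat" examples
uncertain = suggest_label((2.75, 2.75), centroids)  # equidistant from both centroids
```

The design choice here is that the model never replaces the human labeler; it pre-fills the easy cases and routes ambiguous ones to a person, which is where the time savings come from.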
The final conclusion of this report is that the data side of any machine-learning project is usually the most labor-intensive part. The market is emerging to help make those labor tasks less onerous and costly, but they can never be eliminated. Successful AI projects will learn how to leverage third-party software and services to minimize the overall cost and impact and lead to quicker real-world deployment.
Ronald Schmelzer, columnist, is senior analyst and founder of the Artificial Intelligence-focused analyst and advisory firm Cognilytica, and is also the host of the AI Today podcast, SXSW Innovation Awards Judge, founder and operator of TechBreakfast demo format events, and an expert in AI, Machine Learning, Enterprise Architecture, venture capital, startup and entrepreneurial ecosystems, and more. Prior to founding Cognilytica, Ron founded and ran ZapThink, an industry analyst firm focused on Service-Oriented Architecture (SOA), Cloud Computing, Web Services, XML, & Enterprise Architecture, which was acquired by Dovel Technologies in August 2011.