Why the Love Affair with Deep Learning?
One factor often credited for the latest boom in artificial intelligence (AI) investment, research, and related cognitive technologies is the emergence of deep learning algorithms, together with the large volumes of data and the computing power that make deep learning practical. However, deep learning is just one approach to machine learning (ML). It has proven capable across a wide range of problem areas, but it remains one particular approach among many. Increasingly, we’re seeing news and research that show the limits of deep learning and some of its downsides. So we have to ask: is people’s enthusiasm for AI tied to their enthusiasm for deep learning, and can deep learning really deliver on its many promises?
The Origins of Deep Learning’s Promises
Since the very beginnings of the field of artificial intelligence, AI researchers have struggled to understand how the brain learns. Because the brain is primarily a collection of interconnected neurons, it comes as no surprise that AI researchers sought to recreate its structure through artificial neurons, and specifically neural networks. Way back in 1943, Warren McCulloch and Walter Pitts described the first “thresholded logic unit”, an attempt to mimic the way neurons work. It was primarily a proof of concept, but Frank Rosenblatt picked up on the idea in 1957 with the perceptron, which remarkably could recognize written numbers and letters, and even distinguish male from female faces. That was more than six decades ago!
Rosenblatt’s enthusiasm for the perceptron’s promise was widely shared: in 1958 The New York Times described it as “the embryo of an electronic computer that [the Navy] expects will be able to walk, talk, see, write, reproduce itself and be conscious of its existence.” Sound familiar? But of course the enthusiasm didn’t last. AI researcher Marvin Minsky noted how sensitive the perceptron was to small changes in the images, and how easily it could be fooled. Maybe the perceptron wasn’t really that smart at all. Minsky and fellow AI researcher Seymour Papert basically took apart the whole perceptron idea in their 1969 book Perceptrons, and made the claim that perceptrons, and neural networks like them, are fundamentally flawed in their inability to handle certain kinds of problems, notably “non-linear functions”. That is to say, it was easy to train a neural network like a perceptron to sort data into classes, such as male/female or types of numbers. For these simple neural networks, you can graph a bunch of data, draw a line, and say that things on one side of the line are in one category and things on the other side are in a different category, thereby classifying them. But there’s a whole bunch of problems where you can’t draw a line like this, such as speech recognition or many forms of decision-making. These are non-linear problems, and Minsky and Papert proved that single-layer perceptrons are incapable of solving them; the exclusive-or (XOR) function is the classic example.
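To make the line-drawing limitation concrete, here is a minimal sketch in Python of a single threshold unit trained with the classic perceptron learning rule. The function names, the learning rate, and the number of passes are illustrative assumptions, not Rosenblatt’s original setup. The unit learns the logical AND function, where one straight line separates the two classes, but it cannot classify every point of exclusive-or (XOR), where no such line exists.

```python
def predict(weights, bias, point):
    """Fire (1) when the weighted sum crosses the threshold, else stay off (0)."""
    total = weights[0] * point[0] + weights[1] * point[1] + bias
    return 1 if total > 0 else 0


def train_perceptron(samples, epochs=50, learning_rate=1.0):
    """Perceptron learning rule: nudge the weights only when a sample is misclassified."""
    weights = [0.0, 0.0]
    bias = 0.0
    for _ in range(epochs):
        for point, target in samples:
            error = target - predict(weights, bias, point)
            weights[0] += learning_rate * error * point[0]
            weights[1] += learning_rate * error * point[1]
            bias += learning_rate * error
    return weights, bias


def accuracy(samples, weights, bias):
    """Fraction of samples the trained unit classifies correctly."""
    hits = sum(1 for point, target in samples if predict(weights, bias, point) == target)
    return hits / len(samples)


# Truth tables as ((input1, input2), expected_output) pairs.
AND = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
XOR = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

and_accuracy = accuracy(AND, *train_perceptron(AND))
xor_accuracy = accuracy(XOR, *train_perceptron(XOR))
print("AND accuracy:", and_accuracy)
print("XOR accuracy:", xor_accuracy)
```

On AND the unit settles on a separating line within a few passes. On XOR no choice of two weights and a bias can classify all four points, so the accuracy stays below 100% however long the unit trains.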
During this period, while neural network approaches to ML settled into being an afterthought in AI, other approaches to ML were in the limelight, including knowledge graphs, decision trees, genetic algorithms, similarity models, and other methods. In fact, during this period, IBM’s purpose-built Deep Blue computer defeated Garry Kasparov in a chess match, the first computer to do so, using a brute-force alpha-beta search algorithm (so-called “Good Old-Fashioned AI” [GOFAI]) rather than new-fangled deep learning approaches. Yet even this approach didn’t go far, as some said that the system wasn’t intelligent at all.
However, the neural network story doesn’t end here. In 1986, AI researcher Geoff Hinton, along with David Rumelhart and Ronald Williams, published a research paper entitled “Learning representations by back-propagating errors”. In this paper, Hinton and his co-authors detailed how you can use “hidden layers” of neurons to get around the problems faced by perceptrons. With sufficient data and computing power, these layers can be trained to identify specific features in the data that they can classify on, and as a group they can learn non-linear functions, a capability later formalized as the “universal approximation theorem”. The approach works by propagating errors backward from the higher layers of the network to the lower ones (“backprop”), so that every weight can be adjusted according to its share of the error, which expedites training. Now, if you have enough layers, enough data to train those layers, and sufficient computing power to calculate all the interconnections, you can train a neural network to identify and classify almost anything. Researcher Yann LeCun developed LeNet-5 at AT&T Bell Labs in 1998 to recognize handwritten digits on checks, using an iteration of this approach known as Convolutional Neural Networks (CNNs), and researchers Yoshua Bengio and Jürgen Schmidhuber further advanced the field.
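As a rough illustration of backpropagation, the sketch below trains a tiny network with one hidden layer on exclusive-or (XOR), a function a single-layer perceptron cannot represent. Everything specific here is an assumption chosen for brevity: three sigmoid hidden units, a squared-error loss, plain full-batch gradient descent, and the learning rate, seed, and epoch count. It is not the configuration from the 1986 paper.

```python
import math
import random

XOR = [((0.0, 0.0), 0.0), ((0.0, 1.0), 1.0), ((1.0, 0.0), 1.0), ((1.0, 1.0), 0.0)]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def init_params(n_hidden=3, seed=0):
    """Small random weights; the seed only makes the run repeatable."""
    rng = random.Random(seed)
    return {
        "W1": [[rng.uniform(-1, 1) for _ in range(2)] for _ in range(n_hidden)],
        "b1": [0.0] * n_hidden,
        "W2": [rng.uniform(-1, 1) for _ in range(n_hidden)],
        "b2": 0.0,
    }


def forward(p, x):
    """Return hidden activations and the network output for one input."""
    h = [sigmoid(w[0] * x[0] + w[1] * x[1] + b) for w, b in zip(p["W1"], p["b1"])]
    y = sigmoid(sum(w * hj for w, hj in zip(p["W2"], h)) + p["b2"])
    return h, y


def loss(p, data):
    """Half the mean squared error over the data set."""
    return sum((forward(p, x)[1] - t) ** 2 for x, t in data) / (2 * len(data))


def gradients(p, data):
    """Backpropagation: push the output error back through the hidden layer."""
    n = len(p["W2"])
    g = {"W1": [[0.0, 0.0] for _ in range(n)], "b1": [0.0] * n, "W2": [0.0] * n, "b2": 0.0}
    for x, t in data:
        h, y = forward(p, x)
        delta_out = (y - t) * y * (1 - y) / len(data)  # error signal at the output unit
        g["b2"] += delta_out
        for j in range(n):
            g["W2"][j] += delta_out * h[j]
            delta_h = delta_out * p["W2"][j] * h[j] * (1 - h[j])  # error sent back to hidden unit j
            g["b1"][j] += delta_h
            g["W1"][j][0] += delta_h * x[0]
            g["W1"][j][1] += delta_h * x[1]
    return g


def train(p, data, learning_rate=0.5, epochs=5000):
    """Plain gradient descent using the backpropagated gradients."""
    for _ in range(epochs):
        g = gradients(p, data)
        p["b2"] -= learning_rate * g["b2"]
        for j in range(len(p["W2"])):
            p["W2"][j] -= learning_rate * g["W2"][j]
            p["b1"][j] -= learning_rate * g["b1"][j]
            p["W1"][j][0] -= learning_rate * g["W1"][j][0]
            p["W1"][j][1] -= learning_rate * g["W1"][j][1]
    return p


params = init_params()
initial_loss = loss(params, XOR)
train(params, XOR)
final_loss = loss(params, XOR)
print("loss before training:", initial_loss)
print("loss after training:", final_loss)
```

The loss should fall as training proceeds, but how close the outputs get to the XOR targets depends on the random starting weights and the number of epochs, so a run that stalls is worth retrying with a different seed.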
Yet, as so often happens in AI, research stalled when these early neural networks couldn’t scale. Surprisingly little development happened until 2006, when Hinton re-emerged onto the scene with the ideas of unsupervised pretraining and deep belief nets. The idea is to train a simple two-layer network in an unsupervised way, then stack a new layer on top of it and train only that layer’s parameters. Repeat for dozens, hundreds, even thousands of layers. Eventually you get a “deep” network with many layers that can learn and understand something complex. This is what deep learning is all about: using lots of layers of trained neural nets to learn just about anything. Or perhaps only within certain constraints.
In 2009, Stanford researcher Fei-Fei Li and her collaborators released ImageNet, a large database of millions of labeled images. The images were labeled with a hierarchy of classifications, such as animal or vehicle, down to very granular levels, such as husky or trimaran. Starting in 2010, the ImageNet database was paired with an annual competition called the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) to see which computer vision system had the lowest number of classification and recognition errors. In 2012, Geoff Hinton, Alex Krizhevsky, and Ilya Sutskever submitted their “AlexNet” entry, which had almost half the number of errors of previous winning entries. What made their approach win was that they moved from ordinary computers with CPUs to specialized graphics processing units (GPUs) that could train much larger models in reasonable amounts of time. They also popularized now-standard deep learning methods such as “dropout”, which reduces a problem called overfitting (when the network is trained too tightly on the example data and can’t generalize to broader data), and the rectified linear unit (ReLU) activation, which speeds training. After that submission, everyone took notice, and deep learning was off to the races.
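For readers who want to see what these two methods amount to, here is a minimal sketch in plain Python. The function names are illustrative, and the dropout shown is the “inverted” variant that rescales the surviving values during training; that scaling choice is an assumption made for the example and is not taken from the AlexNet submission.

```python
import random


def relu(values):
    """Rectified linear unit: pass positive values through, clamp the rest to zero."""
    return [max(0.0, v) for v in values]


def dropout(values, drop_prob, rng, training=True):
    """Randomly zero activations while training so units cannot rely on each other.

    Inverted dropout (an assumption for this sketch): surviving values are
    divided by the keep probability so their expected size is unchanged.
    """
    if not 0.0 <= drop_prob < 1.0:
        raise ValueError("drop_prob must be in [0, 1)")
    if not training or drop_prob == 0.0:
        return list(values)
    keep_prob = 1.0 - drop_prob
    return [v / keep_prob if rng.random() < keep_prob else 0.0 for v in values]


rng = random.Random(0)  # fixed seed only so the example is repeatable
activations = relu([-2.0, -0.5, 0.0, 1.5, 3.0])
print("after ReLU:", activations)
print("training pass:", dropout(activations, 0.5, rng))
print("inference pass:", dropout(activations, 0.5, rng, training=False))
```

At inference time dropout is switched off, so the same input always yields the same output; the randomness exists only to make training harder to overfit.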
Deep Learning’s Shortcomings
The fuel that keeps the deep learning fires roaring is data and compute power. Specifically, you need large volumes of well-labeled data to train deep learning networks. The more layers, the greater the learning power, but to train those layers you need data that is already well labeled. Since deep neural networks are primarily a bunch of calculations that all have to be done at the same time, you also need a lot of raw computing power, specifically numerical computing power. Imagine tuning a million knobs at the same time to find the combination that makes the system learn from the millions of pieces of data being fed into it. This is why deep neural networks were not practical in the 1950s but are today: we now have lots of data and lots of computing power to handle that data.
Deep learning is being applied successfully in a wide range of situations, such as natural language processing, computer vision, machine translation, bioinformatics, gaming, and many other applications where classification, pattern matching, and this automatically tuned deep neural network approach work well. However, these advantages come with a number of disadvantages.
The most notable of these disadvantages is that, because a deep learning network consists of many layers, each with many interconnected nodes, and each node is configured with different weights and other parameters, there is no straightforward way to inspect the network and understand how any particular decision, clustering, or classification is actually made. It’s a black box, which means deep learning networks are inherently hard to explain. As we’ve detailed in some of our other research on Explainable AI (XAI), any system that’s being used to make decisions of significance will eventually need explainability to satisfy issues of trust, compliance, verifiability, and understandability. While DARPA and others are working on ways to explain deep learning neural networks, the lack of explainability is a significant drawback for many.
The second disadvantage is that deep learning networks are great at classifying and clustering information, but not nearly as good at other decision-making or learning scenarios. Not every learning situation is one of classifying something into a category or grouping information into a cluster. Sometimes you have to deduce what to do based on what you’ve learned before. Deduction and reasoning are not a forte of deep learning networks.
As mentioned earlier, deep learning is also very data and resource hungry. AI researcher Yoshua Bengio famously said when asked how many layers a deep learning network needs: “Very simple. Just keep adding layers until the test error does not improve anymore.” One measure of a neural network’s complexity is the number of parameters that need to be learned and tuned. For deep learning neural networks, there can be hundreds of millions of parameters. Training such models requires a significant amount of data to adjust these parameters. For example, a speech recognition neural net often requires terabytes of clean, labeled data to train on. Without a sufficiently large, clean, labeled data set, developing a deep neural net for that problem domain is difficult. And even if you have the data, you need to crunch through it to generate the model, which takes a significant amount of time and processing power.
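To get a feel for how quickly the parameter count grows, the short Python sketch below counts the weights and biases in a plain fully connected network. The helper name and the layer sizes are hypothetical, chosen only for illustration; convolutional and other architectures count their parameters differently.

```python
def count_dense_parameters(layer_sizes):
    """Weights plus biases for a fully connected network.

    Each pair of adjacent layers contributes n_in * n_out weights
    and n_out biases.
    """
    return sum(
        n_in * n_out + n_out
        for n_in, n_out in zip(layer_sizes, layer_sizes[1:])
    )


# A small image classifier: 784 inputs, two hidden layers of 512, 10 outputs.
small_network = [784, 512, 512, 10]
print(count_dense_parameters(small_network))  # 669706

# Widening the hidden layers by 8x multiplies the count many times over.
wide_network = [784, 4096, 4096, 10]
print(count_dense_parameters(wide_network))
```

Even this modest three-layer example has well over half a million values to tune, and every one of them has to be adjusted from the training data.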
Another challenge of deep learning is that the models produced can be very specific to a problem domain. A model trained on a certain dataset of faces may recognize only faces like those; it can generalize poorly to faces with different skin tones or hair color, and it can’t be used to identify non-face images. While this problem is not unique to deep learning approaches to machine learning, it can be particularly troublesome when factoring in the overfitting problem mentioned above. Deep learning neural nets can be so tightly constrained (fitted) to the training data that, for example, even small perturbations in the images can lead to wildly inaccurate classifications. There are well-known examples of turtles being misrecognized as guns or polar bears being misrecognized as other animals due to just small changes in the image data. Clearly, if you’re using such a network in mission-critical situations, those mistakes would be significant.
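The effect of small perturbations can be illustrated without a deep network at all. The Python sketch below uses a toy linear scorer over 1,000 made-up “pixel” values and nudges each pixel by at most 0.01 in the direction that lowers the score, which is the same sign-of-the-gradient idea used in published attacks on image classifiers. The weights, bias, input, and step size are all invented for the example and do not come from any real model.

```python
def score(weights, bias, x):
    """Linear score; the class is 1 when the score is positive."""
    return sum(w * xi for w, xi in zip(weights, x)) + bias


def classify(weights, bias, x):
    return 1 if score(weights, bias, x) > 0 else 0


def sign(value):
    return (value > 0) - (value < 0)


def perturb_against(weights, x, epsilon):
    """Move every feature by at most epsilon in the direction that lowers the score.

    For a linear scorer the gradient with respect to the input is just the
    weight vector, so the step direction is the sign of each weight.
    """
    return [xi - epsilon * sign(w) for w, xi in zip(weights, x)]


# Hypothetical model and input: 1,000 features, each weight small.
weights = [0.1 if i % 2 == 0 else -0.1 for i in range(1000)]
bias = 0.5
image = [0.5] * 1000

adversarial = perturb_against(weights, image, epsilon=0.01)
original_class = classify(weights, bias, image)
adversarial_class = classify(weights, bias, adversarial)
largest_change = max(abs(a - b) for a, b in zip(adversarial, image))

print("original class:", original_class)
print("class after perturbation:", adversarial_class)
print("largest change to any single value:", largest_change)
```

No single value moves by more than about 0.01, yet the many tiny changes add up across 1,000 inputs and flip the predicted class. High-dimensional inputs such as images give a perturbation many small directions to exploit at once.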
Machine Learning is not (just) Deep Learning
Enterprises looking at using cognitive technologies in their business need to look at the whole picture. Machine learning is not just one approach, but rather a collection of approaches that are applicable in different scenarios. Some machine learning approaches are very simple, using small amounts of data and an understandable logic or deduction path that suits particular situations, while others are very complex and use lots of data and processing power to handle more complicated situations. The key thing to realize is that deep learning isn’t all of machine learning, let alone AI. Even Geoff Hinton, the “Einstein of deep learning”, is starting to rethink core elements of deep learning and its limitations.
The key for organizations is to understand which approaches are most viable for which problem areas, and how to plan, develop, deploy, and manage a given machine learning approach in practice. Since AI use in the enterprise is still nascent, especially for these more advanced cognitive approaches, the best practices for employing cognitive technologies successfully are still maturing. Further complicating the picture is the morass of vendors and consulting firms that have their own perspectives and products to sell to clients, and that don’t want to give enterprises the full picture they need.