Ever since the launch of Amazon’s Echo device in 2014, it seems that every month brings a new development in dedicated devices that process voice commands and perform actions. However, what exactly are these devices? The popular media calls them “smart speakers” or “voice assistants” or “intelligent personal assistants”, but these words describe very different concepts. A smart speaker conjures up a primarily output-oriented device that aims to replace keyboard or button interaction with voice commands. Yet, that seems to be a particularly trivial application for the significant investments and competitive posture that Amazon, Google, Microsoft, Apple, Alibaba, Tencent, Baidu, and countless others are taking. After all, why are all these vendors so aggressively marketing and promoting these devices if all they do is allow you to play Taylor Swift on vocal demand or let you ask about the weather?
Clearly there’s a bigger play here than simply the smart speaker. The smart speaker is just a way to initially get their product into a larger number of households and businesses and get people comfortable with using these devices. The real play is something bigger than just a speaker you can control with your voice. The power is not in the speaker, but in the cloud-based technology that powers the device. These devices are really low-cost input and output hardware that are a gateway to the much more powerful infrastructure that sits at the major tech companies’ data centers. The device itself is the giveaway to this. You can even build your own full-featured conversational device for just a few dollars. So let’s dispense with the clearly ill-fitting term “smart speaker.” It belies the real power of these devices.
Not Smart Speakers. Intelligent Conversational Assistants
If you ask Amazon, Google, Microsoft, Apple, and others, you’ll hear that playing music, games, and responding to simple queries is not the end state of their vision for what these conversational gateway devices will be. This week’s demonstration of Google Duplex at Google I/O 2018 clearly shows the power of what an intelligent conversational assistant can truly be. Rather than just being passive devices, intelligent conversational assistants can proactively act on your behalf, performing tasks that require interaction with other humans, and perhaps soon, other conversational assistants on the other end. The power is not in the speaker device.
Indeed, where exactly is the device? The device (speaker) is completely missing in the Google Duplex scenario. We don’t see a device because a device is not necessary here as the devices are just gateways to the real activity that’s happening in the data centers. The conversational agent is acting completely behind the scenes from Google’s data center interacting through voice-over IP (VoIP) telephone lines with a human on the other end.
So, why are devices needed at all if they’re just gateways? They’re needed because they provide the user interface to the cloud-based intelligence services. Without a device, the only way to access these services is through a web, desktop, or mobile interface. But this is inefficient. Amazon wasn’t truly the first to bring voice-based assistants to market. Apple had them beat by over three years with Siri, and Google introduced their voice-based assistant in Android just a short while after. What made Amazon stand out though with their Echo devices is that the mobile phone was eliminated entirely. Rather than activating the device through a phone, you can simply speak in the comfort of whatever activity you’re doing and trigger intelligent capabilities. Basically, the value of the device is in its hands-free mode of interaction, but the intelligence of the device is in the back-end infrastructure.
How Intelligent Are These Devices?
Earlier this year, Cognilytica announced the creation of our Voice Assistant Benchmark. The purpose of the benchmark isn’t to test the natural language processing (NLP) or natural language generation (NLG) capabilities of the devices. Nor is the intent of the benchmark to see what sort of skills these devices can perform. We know that better NLP/NLG means the ability to handle a wider range of voices, accents, languages, and speaker characteristics, and more skills mean more single-task capabilities. Those are all “table stakes” as far as we’re concerned.
If the power of the devices is not in the device itself, but in the back-end intelligence that gives these devices real capabilities, then we need to test to see how intelligent that back end really is. Can the conversational agents understand when you’re comparing two things? Do they understand implicit, unspoken things that require common sense or cultural knowledge? For example, a conversational agent scheduling a hair appointment should know that you shouldn’t schedule a haircut a few days after your last haircut, or schedule a root canal dentist appointment right before a dinner party. These are things that humans can do because we have knowledge and intelligence and common sense. Yet as it stands and as we demonstrated in our initial benchmark, neither the Google Home nor Amazon Echo nor Apple Siri devices can answer the question “what’s larger, the sun or the earth?” Are these devices you’d trust running your life? Not yet. But we aim to help move things in that direction.
The Implications of an Intelligent Conversational Assistant
In the not-so-distant future, intelligent assistants will be everywhere. We’ll be interacting with them daily in both our personal and business lives. We’ll be chatting with assistants in our homes, and also interacting with other people’s and businesses’ conversational agents. In a future where everyone will have a personal electronic virtual assistant, we’ll have them do everything from messaging friends when we’re putting together a birthday party, to scheduling all the logistics for that party, to dealing with inbound calls from late attendees who can’t make it. Soon enough, as dependent as we are now on our GPS systems for keeping us from getting lost and our mobile phones for keeping us always connected, we’ll be dependent on these intelligent assistants for keeping our lives in order. This is just an inevitable direction of where things are heading.
However, there’s a downside to the use of intelligent assistants. In a recent article in The Verge, experts point out that humans will want to know if they’re talking to a robot or not. Clearly people will be frustrated by the earlier generations of intelligent assistants as they make frustrating mistakes. Yet, there’s an even darker potential outcome. Criminals and mischief makers can use voice assistants to tie up phone lines, cause retail “denial of service” attacks by scheduling fake appointments, or cause harm by feeding false information to people to get them to leave their houses or otherwise tie up resources. In the future, we’ll need a sure-fire way to make sure that we know who the speaker on the phone is, what their intentions are, and how real the requests are. The future, which is really here now, is that we can’t believe anything we see or hear. This makes verifying reality incredibly important in an AI-Enabled Future where intelligent assistants are part of our everyday lives.