Microsoft’s latest acquisition shows speech recognition is big business


The tech giant will buy Nuance, a company that provides speech recognition and artificial intelligence.

Microsoft this week announced it will acquire Nuance, a Boston-based speech recognition and artificial intelligence company, for around $16 billion. It’s the company’s second-largest acquisition, after LinkedIn, and a big bet on speech recognition technology.

Nuance’s technology is used most in health care. About 10,000 health care facilities worldwide use it to capture conversations between patients and doctors and transcribe them in real time. I spoke with Daniel Hong, a research director at Forrester. He told me that a controlled environment like a clinic or doctor’s office can make the tech more accurate. The following is an edited transcript of our conversation.


Daniel Hong: In medicine, it’s pretty finite, the way people will express themselves about pain, about discomfort, and how the doctor can also talk about prescriptions. So if you have something that’s a bit more finite, the accuracy rates actually are higher.

Molly Wood: It sounds like you’re describing kind of two paths for interaction with voice recognition technologies. And one is sort of this high-accuracy, maybe low-flexibility, interaction, right? Where you’re using a lot of bank terms, using a lot of medical terms. And then, the other is this sort of free-floating, “I’m going to say whatever I want to Google.”

Hong: The Googles and the Amazons and Apples are actually pretty good with general questions. But now we’re seeing a little bit more complexity in terms of the questions that you can ask. So you can ask, “How old was Joe Biden’s wife when he became president?” Another area is context, having that conversation with Google Homes and Amazon Alexas. “How’s the weather in Florence? What about Paris? What about the temperature in these areas in Celsius?” And that’s an area where we’ve seen quite a bit of improvement over the last few years. But there’s still a long way to go in order to have that natural dialogue and full conversation with a speech recognition system.

Wood: So, of course, Google, Amazon and Apple have been in this space for a while now, from the consumer perspective, of course. But how far ahead are they in the competition?

Hong: When it comes to speech, it’s really about how much data you have, how many utterances you’re capturing, from people that are engaging that speech application. And when you’re looking at the whole consumerism of speech, and you have all these devices, from the smartphones to the smart speakers, it’s just a lot more data. So if they’re capturing all that data, and they’re using machine learning, deep neural nets, to constantly improve and refine the accuracy and understanding what the consumer is trying to say, then they are kind of light-years ahead of others.

Wood: What do you think are the biggest growth areas you see for speech recognition? I mean, I’m surprised at how many people I know who still don’t just dictate every single thing to their phone like I do.

Hong: Yeah, I think we’re just gonna see the application of speech being used more on a day-to-day basis to conduct tasks. So I think, as a consumer in the home, you’re able to operate lights, you’re able to use that as command-and-control features of searching for things on your television, to getting alerts and actually talking to some of your smart refrigerators, talking to your car and interacting in that way, changing the temperature control, [and] so forth. It’s essentially a user interface. We’ve been so used to just typing or turning things on and off, [that] we’ll be more and more inclined to use voice as the technology improves. And actually, there are a lot more microphones in the house to capture all that. In addition, I think that there will be a lot more innovation when it comes to the use of voice in the future, of using voice as a good way to interact, like coaching or having an assistant there, like your own personal assistant [and] being able to kind of interact with you via voice. There’s also things in health care, like with health checks [and] being able to diagnose just through the use of voice. I think we’re really on the cusp of voice being able to have a lot more innovative use cases. It’s just we need to get the accuracy there. And we have to essentially get more devices with voice interfaces out there to consumers.

Wood: Certainly, the reaction that is inevitable when you say “more microphones” is a question of privacy, especially if we start using this in doctors’ offices. Is there, do you think, a corresponding effort to do on-device processing or make sure that recordings aren’t kept? Or is that still pretty much Wild West?

Hong: I think it’s a bit of both. I think because of [personal health information] and HIPAA, there’s a lot of compliance [required] in health care. So a lot of these systems and hospitals, and even those being used by health care insurers, have to meet these very strict guidelines to be able to have the technology rolled out to their patients and to their members. That said, it’ll have to evolve over time. You should probably let the patient know that this is being recorded and they have to have the opt-in capability. All these things will be ironed out as speech becomes more mainstream in health care and in other areas of the consumer’s world.

Wood: In terms of accuracy and the ability to respond to a query, ultimately, is that driven by the database on the back end as much as the actual speech recognition? Like, I’m thinking of how Alexa doesn’t use Google, but Google does. And as a result, I can say to Google, “What’s that movie about the kids in high school that has Matthew McConaughey in it?” And Google will know that, but Alexa won’t. And I just wonder how much of that is driven by the sources they’ve chosen to pull from.

Hong: It is. It’s that. I mean, if you break it down, you’re going to break down what is being said, is one part. And the other part would be, what is this person asking? So there’s kind of two different parts there. Once you get that, then you can identify what this person is asking. And then, you have to match that to the information that they’re looking for or whatever it is that they want to get done. And Google has Google search results. Google has a lot of information that’s synced to the Google speech experience. So I think that could be a reason why you’ll get perhaps more answers on that Google front than some of your Alexa devices.

