Mobile Platforms, User Experience

Cortana Opens Up Where Siri Remains a Recluse

A Big Step Forward that Leaves Much To Be Desired Cortana in Halo 4

Given Apple’s and Google’s dominance, not many of us follow Microsoft news anymore. But instead of coming apart at the seams, it looks like Microsoft is adopting the only credible strategy – trying to out-innovate its competition to the point where it becomes a leader again. Signs of success are visible with Azure becoming the most credible competition to AWS, and it seems like some if its artificial intelligence efforts are just as ambitious. Against that backdrop, the recent Cortana / Windows Speech Platform developments are steps in the right direction.

App vs. Platform

Back in September 2013 ahead of the iPhone 5 / iOS 6 launch we were trying to predict Apple’s next move. Siri launched a year earlier on the iPhone 4S, and our wager at the time (at Desti / SRI) was that iOS 6 will open Siri-as-a-platform, allowing application developers to tie their offerings into the speech-driven UX paradigm, bringing speech interaction to a critical mass. Guess what – 18 months later, Siri is still a limited, closed service, and even Google Now API is still a rumor. So Microsoft’s announcements last week is a breath of fresh air and potentially a strategic move. In a nutshell – here are the main points of what was announced (and here’s a link to the lecture at //build/):

  • Cortana available on all Windows platforms
  • 3rd party apps can extend Cortana by “registering” to respond to requests, e.g. “Tell my group on Slack that we will meet 30 minutes later”
  • Requests can be handled in the app, or the app can interact using Cortana’s dialog UI

Extending Cortana to Windows 10 is an important step towards making voice interaction with computers mainstream. Making Cortana pluggable turns it into a platform that can hope to be pervasive through a network effect. However – what was announced leaves much to be desired with regards to both platform strategy and platform capabilities.

Cortana API: Speech without Natural Language is an Unfinished Bridge

I’m a frequent user of Siri. There are simply many situations where immediate, hands-free action is the quickest / safest way to get some help or to record some information. One of Siri’s biggest issues in such situations is its linear behavior – once it goes down a path, its very hard to correct and go down another. Consider for instance searching for gas stations while you’re driving down a highway – you get a list of stations and then it kind of cycles through them by order of distance (which is not very helpful if you’ve already past something). But going back and forth in that list (“show me the previous one”) or adding something to your intent (“show me the one near the airport”) is impossible. So often you end up going back to tapping and typing. That’s where a more powerful natural-language-understanding platform is needed, e.g. SRI’s VPA, or potentially (now owned by Facebook) or Cortana’s API allows you to create rudimentary grammars where you more-or-less need to literally specify exactly the sentences your app should understand, with rudimentary capabilities to describe sentence templates. There is no real notion of synonyms, pursuing intent completion (i.e. “filling all the mandatory fields in the form”), going back to change something etc. So this is more or less an IVR-specification platform, and we all know how we love IVRs, right?  If you want to do more – the app can get the text and “parse it” itself. That means that every app developer that wants to go beyond the IVR model needs to be learn how to build a natural-language-understanding system. That’s not how platforms work, and will not support the proliferation of this mode of interaction – crucial for making Cortana a strategic asset. Now arguably you could say – well, maybe they never saw it as a strategic asset, maybe they were just towing the line set by Apple and Google. That, however, would be a missed opportunity.

Speech-enabling Things is a Credible Platform Strategy

The Internet of Things is coming, and it is going to be an all-encompassing experience – after all, we are surrounded by things. For many reasons, these things will not all come from the same company. A company that will own a meaningful part of the experience of these things and make them dependent on its platform – for UI, for personal data, for connectivity etc. – that company would own the user experience for so much of the user’s world. In other words – give these device makers a standardized, integrated interaction platform for their devices and you own billions of consumers’ lives. Cortana in the clowd can be a (front-end to) a platform that 3rd party developers can use to speech-enable interactions with devices – whether they make the devices (e.g. the wearable camera that needs to upload images taken) or the experiences that use them (e.g. activating Pandora on your wireless speaker). Give these app / device developers a way to create this experience and connect it to the user’s personal profile (that he/she already accesses through their laptop, smartphone, tablet etc.) and you become the glue that holds the world together. This type of software-driven platform play is exactly the strategy Microsoft’s excelled at for so many years. To be an element of such a strategy, Cortana needs to be a cloud service. Not just a service available across Windows devices, but a cloud-based platform-as-a-service that can integrate with non-Windows Things. That can be part of a wider strategy of IoT-focused platform-as-a-service (for instance – connecting your things to your personal profile, so they can recognize you and interact in a personalized context), but mostly it needs to be Damn Good. Cause Google is coming. Building a platform ecosystem and then sucking it for all its worth used to be Microsoft’s forte. Cortana in the cloud, as a strong NLU and speech platform could be an important element of its comeback strategy.


The Case for Siri

Since Siri’s public debut as a key iPhone feature 18 months ago, I keep getting involved in conversations (read: heated arguments) with friends and colleagues, debating whether Siri is the 2nd coming or the reason Apple stock lost 30%. I figure it’d be more efficient to just write some of this stuff down…

siri icon

Due Disclosure:

I run Desti, an SRI International spin-out that is utilizes post-Siri technology. However, despite some catchy headlines, Desti is not “Siri for Travel”, nor do I have any vested interest in Siri’s success. What Desti is, however, is the world’s most awesome semantic search engine for travel, and that does provide me some perspective on the technology.

Oh, and by the way, I confess, I’m a Siri addict.

Siri is great. Honest.

The combination of being very busy and very forgetful, means there are at least 20 important things that go through my mind every day and get lost. Not forever – just enough to stump me a few days later.  Having an assistant at my fingertips that allows me to do some things – typically set a reminder, or send an immediate message to someone – makes a huge difference in my productivity. The typical use-case for me is driving or walking, realizing there is something I forgot, or thinking up a great new idea and knowing that I will forget all about it by the time I reach my destination. These are linear use cases, where the action only has a few steps (e.g. set a reminder, with given text, at a given time) and Siri’s advantage is simply that it allows me to manipulate my iPhone immediately, hands-free, and complete the action in seconds. I also use Siri for local search, web search and driving directions.

Voice command on steroids – is that all it is?

Frankly – yes. When Siri made its public debut as an independent company, it was integrated with many 3rd party services that were scrapped and replaced with deep integration with the iPhone platform when Apple re-launched it. Despite my deep frustration with Siri not booking hotels these days, for instance (not), I think the decision to do one thing really well – provide a hands-free interface to core smartphone functionality (we used to call it PIM, back in the days), was the right way to go. Done well, and marketed well, this makes the smartphone a much stronger tool.

But I hate Siri. It doesn’t understand Scottish and it doesn’t tell John Malkovich good jokes

As mentioned, I’ve run into a lot of Siri-bashers in the last year. Generally they break down into two groups. The people who say Siri never understands them, and the people who say Siri is stupid. I’m going to discuss the speech recognition story in a minute (SRI spin-out, right?) but regarding the latter point I have to say two things. First, most people don’t really know what the “right” use-cases for Siri are. Somewhere between questionable marketing decisions and too little built-in tutorial, I find that people’s expectations of Siri are often closer to a “talking replacement for Google, Wikipedia and the bible” than to what Siri really is. That is a shame; because the bottom line is that it is under-appreciated by many people who could really put it to good use. Apple marketing is great, but it’s better at drawing a grand vision than it is at explaining specific features (did I mention my loss on my AAPL?). While the Siri team has done great work at giving Siri a character, at the end of the day it should be a tool, not an entertainment app (my 8-year old daughter begs to differ, though).

OK, but it still doesn’t understand ME

First, let me explain what Siri is. Siri is NOT voice-recognition software. Apple licenses this capability from Nuance. Siri is a system that takes voice recognition output – “natural language”, figures out what the intent is – e.g send an email, then goes through a certain conversational workflow to collect the info needed to complete that intent. Natural language understanding is a hard problem, and weaving multiple possible intents with all the possible different flows is complex. It is hard because there is a multitude of ways for people to express the same intent, and errors in the speech recognition add complexity. Siri is the first such system to do it well and certainly the first one to do it well on such a massive scale.

So what? If it doesn’t understand what I said, it doesn’t help me.

That is absolutely true. If speech is not recognized – garbage in, garbage out. Personally I find that despite my accent Siri usually works well for me, unless I’m expressing foreign names, or there is significant ambient noise (unfortunately, we don’t all drive Teslas). There are however some design flaws that do seem to repeat themselves.

In order to improve the success rate of the automatic speech recognizer (ASR), Siri seems to communicate your address book to it. So names that appear in your address book are likely to be understood, despite the fact they may be very rare words in general. However this is often overdone, and these names start dominating the ASR output. One problem seems to be that Nuance uses the first and last names as separate words, so every so often I will get “I do not know who Norman Gordon is” because I have a Norman Winarsky and a Noam Gordon as contacts. I believe I see a similar flaw when words from one possible intent’s domain (e.g. sending an email) are recognized mistakenly when Siri already knows I’m doing something else (e.g. looking at movie listings).

This probably says something about the integration between the Nuance ASR and Apple’s Siri software. It looks like there is off-line integration – as in transferring my contacts’ names a-priori, but no real-time integration – in this case Siri telling the ASR that “Norman Gordon” is not a likely result. Such integration between the ASR and the natural language understanding software is possible, but often complex not just for technical reasons but for organizational reasons. It requires very close integration that is hard to achieve between separate companies.

So when will it get better?

It will get better. Because it has to. Speech control is here to stay – in smartphones as well as TVs, cars and most other consumer electronics. ASRs are getting better, mostly for one reason. ASRs are trained by listening to people. The biggest hurdle is how much training data they have. In the early days of ASRs, decades ago, this consisted of “listening” to news commentators – people with perfect diction and accent, in a perfect environment. In the last year, more speech sample data was collected through apps like Siri then probably in the two decades prior, and this data is (can be?) tagged with location, context and user information, and is being fed back into these systems to train them. And as this explanation was borrowed from Adam Cheyer, Siri’s co-Founder and formerly Siri’s Engineering Director at Apple – you better believe it. We are nearing an inflection point, where great speech recognition is as pervasive as internet access.

So will Siri then do everything?

That’s actually not something I believe will happen as such. Siri is a user interface platform that has been integrated with key phone features and several web services. But to assume it will be the front-end to everything is almost analogous to assuming Apple will write all of the iOS apps. That is clearly not the case.

However – Siri as a gateway to 3rd party apps, as an API that allows other apps that need the hands-free, speech-driven UI to integrate into this user interface, could be really revolutionary. Granted – app developers will have to learn a few new tricks, like managing ontologies, resolving ambiguity, and generally designing natural language user experiences. Apple will need to build methodology and instruct iOS developers, and frankly this is a tad more complex than putting UI elements on the screen. Also I have no idea whether Siri was built as a platform this way, and can dynamically manage new intents, plugging them in and out as apps are installed or removed. But when it does, it enables a world where Siri can learn to do anything – and each thing it “learns”, it learns from a company that excels at doing it, because that is that third party’s core business.

… and then, maybe, a great jammy dodger bakery chain can solve the wee problem with Scotland with a Siri-enabled app.

Oh, and by the way – you can learn more about Siri, speech, semantic stuff and AI in general at my upcoming SXSW 2013 Panel – How AI is improving User Experiences. So come on, it will be fun.