With almost no fanfare, speech recognition technology has made tremendous strides in the last few years. It’s what you might call a stealth technology: the kind that keeps academics and serious R&D departments busy for years showing incremental improvement, until all at once the development reaches critical mass and it’s everywhere, in all sorts of applications. It promises to change the way customers interact with automated systems, broadening the range of telephony interactions and giving the call center a strong new tool on the front-end for capturing customer data.
According to one analyst, the worldwide market for automated speech recognition products is projected to jump from $100 million in 1998 to $1.6 billion in 2002. The US market potential for speech-enabled auto attendant products is expected to grow from less than $10 million in 1998 to $250 million by 2001.
The reason for this explosive growth is twofold. First, the speed and power of the typical PC grew along the expected curve until it was strong enough to process speech in real time. Second, the underlying algorithms steadily improved, allowing computers to discern the patterns that underlie speech regardless of accent, speed of speech or other eccentricity.
Speech rec is starting to gain a toehold in call centers as an autoselector — a tool that the customer uses to interact with an automated system to either route himself to the proper person (an auto attendant or ACD front-end) or extract the information he needs from a host database, à la IVR.
In the short and medium term, the interaction of choice for a customer wanting information is still going to be the telephone. While customers are migrating to the Internet in huge numbers, call centers will still be deluged with phone requests for information, service, problem solving and order taking. IVR is still the dominant way for callers to route themselves to their information destinations. When you put an intelligent speech engine in front of that, you decrease the chances that the customer will ultimately have to make use of an agent’s time. Costs are shaved and customers go away slightly more satisfied.
By itself, speech rec doesn’t add new functionality to the call center. Instead, it adds new callers: those with rotary phones, those who are mobile, those who are so pressed for time that they can’t be bothered to do anything but speak. It then processes those callers using the same traditional tools that call centers have used for years. The same benefits flow from speech recognition as from IVR: fewer calls that have to go to an agent, shorter calls, and more self-service.
This technology generates a lot of excitement in the public because of its association with things like voice typing, or dialing a cell phone by voice. But clearly, the specialty applications call centers need — those that need to be speaker-independent — are more powerful in the long term, with the potential to save agents time on data collection.
Consider an application created by Nuance Communications for Schwab’s automated brokerage system. When I first saw this demoed in 1996, I thought it was pretty good: it understood me more than half the time, and seemed flexible. Now it’s even better. And according to Nuance, it now handles half of Schwab’s daily telephone stock quote volume, with 97% accuracy. With the migration of personal financial services to the Internet (and with price and service the determining factor in a competitive industry), giving a customer the ability to say “I’d like a quote on IBM” instead of typing out some ridiculous code is a key differentiator.
There are a lot of companies working on applications for this. As processing power improves and the cost of delivering a working application drops, it is likely that speech rec will succeed IVR as the standard “non-agent” telephony transaction.
The kinds of input that a speech rec system would have to process are very well defined: sequences of digits for things like account numbers, phone numbers, social security IDs or passwords. Some apps also use discrete letters for getting stock quotes. There are a million ways to use it to extract information.
There are two distinct kinds of speech recognition, known as speaker-dependent and speaker-independent. The two diverge wildly in the kinds of things they are good at, and the kinds of systems needed to make them run.
Call center apps necessarily focus on speaker-independent recognition. Many people will call, obviously. The human brain in the form of a receptionist can recognize a huge number of variations of the same basic input—there are literally an infinite number of ways to intonate the word “hello.” What you want in a call center is a system that will respond to the likely inputs—the most common words like yes, no, stop, help, operator, etc., the digits, the letters of the alphabet, and so on.
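To make the idea of a small, well-defined vocabulary concrete, here is a minimal sketch of what happens after the recognizer has done its work. It assumes the speech engine hands back a plain-text transcript; the function name `interpret` and the word lists are illustrative assumptions, not part of any vendor's product.

```python
# Hypothetical post-processing step: map a recognized transcript onto the
# limited set of inputs a call center application expects.
# The vocabulary below is an illustrative assumption, not a vendor grammar.

DIGITS = {
    "zero": "0", "oh": "0", "one": "1", "two": "2", "three": "3",
    "four": "4", "five": "5", "six": "6", "seven": "7",
    "eight": "8", "nine": "9",
}
COMMANDS = {"yes", "no", "stop", "help", "operator"}


def interpret(transcript):
    """Classify a transcript as a command, a digit string, or unknown."""
    words = transcript.lower().split()
    # A single word from the command list is treated as a command.
    if len(words) == 1 and words[0] in COMMANDS:
        return ("command", words[0])
    # A run of spoken digits becomes an account-number-style string.
    if words and all(w in DIGITS for w in words):
        return ("digits", "".join(DIGITS[w] for w in words))
    # Anything else is outside the expected vocabulary.
    return ("unknown", transcript)


print(interpret("four oh one two"))
print(interpret("Operator"))
```

Anything that falls outside the expected vocabulary comes back as unknown, which is the point at which a real system would re-prompt the caller or hand the call to an agent.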
Telecom has gradually been accepting the technology in operator assistance and routing systems. (But not everywhere you think. Some automated applications that ask users for spoken input, like directory assistance, are actually just recording it and playing it for the operator, who inputs it manually—it saves time, but speech rec it isn’t.)
Internationally, touch tone penetration is still very low, leaving a vast installed base of potential callers who cannot access IVR. It follows that these callers are then going to be expensive to process when they come into a call center because they have to be held in queue until there’s an agent ready for them: high telecom charges from the longer than average wait, coupled with the cost of agent-service (rather than self-service). On the downside, international call centers, particularly those that serve multiple countries, can field calls in multiple languages. If you use an IVR front-end to have the caller select their language then you, by definition, don’t need speech rec. These are surmountable problems that have more to do with the operation of speech rec in practice than with the underlying technology.
Speech rec costs a lot to develop and perfect, but once it’s done, it’s done forever. The cost of maintaining it is negligible, and it has little of the headaches involved in CTI or other “fancy” call center technologies. Once you tease meaning out of the speech, it becomes input like any other, just like information entered via the Web, DTMF or told to an agent.
That’s the essence of speech recognition in the call center: it’s a simple front-end, albeit with limited application. But that’s what they said about IVR ten years ago, and look where we are today.
Giant retailer Sears has turned to speech recognition in a big way, implementing it in 750 of their retail stores nationwide as part of a program to redirect calls more efficiently. It’s also hoped that this will help them re-deploy almost 3,000 people to other, more productive tasks.
The automated speech system, built by Nuance, is part of Sears’ Central Call Taking initiative. About three-quarters of the calls that come in to each of the stores’ general numbers will be handled by the system, for a total of 120,000 calls each day.
Callers, when prompted, will be able to say the name of the department they want to reach — “shoes,” for example. The cost of automating the department transfer is much lower than the cost of having an operator do it. (Reps are available to assist in case a customer has trouble.) Sears will be able to use this system instead of hiring temps during the busy holiday season and other peak periods.
The system was piloted at some Sears stores during a recent holiday season. Development began the prior summer, a fairly quick turnaround for a system complex enough to stand in for 3,000 operators at 750 locations. Sears integrated the Nuance system into their existing IT infrastructure; a custom app uses the speech rec system’s ODBC hooks to query a centralized Oracle database for individual department phone extensions, reducing call transfer time.
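The lookup step described above can be sketched roughly as follows. This is not Sears' or Nuance's code: it uses Python's built-in sqlite3 module as a stand-in for an ODBC connection to Oracle, and the table name, column names, store number and extensions are all invented for illustration.

```python
import sqlite3

# Illustrative stand-in for the centralized department directory.
# Table layout, store_id and extensions are assumptions for this sketch.


def build_directory():
    """Create an in-memory directory with a few sample departments."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE departments (store_id INTEGER, name TEXT, extension TEXT)"
    )
    conn.executemany(
        "INSERT INTO departments VALUES (?, ?, ?)",
        [(101, "shoes", "4410"), (101, "appliances", "4425")],
    )
    return conn


def lookup_extension(conn, store_id, spoken_name):
    """Return the extension for a recognized department name, or None."""
    row = conn.execute(
        "SELECT extension FROM departments WHERE store_id = ? AND name = ?",
        (store_id, spoken_name.strip().lower()),
    ).fetchone()
    return row[0] if row else None


conn = build_directory()
print(lookup_extension(conn, 101, "Shoes"))
```

When the lookup returns nothing, the call would fall through to a live rep, matching the fallback the article mentions.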
This is one of the biggest examples of speech rec being used in consumer apps outside financial services. When retailers take on a technology, that’s a sure sign they feel comfortable with both the consumer acceptance and the technical sophistication of it.
Airlines have also been early adopters, out in front in pushing this technology as a way for their customers to get quick access to a wealth of information. American Airlines has one such system, which they call Dial-AA-Flight.
The airline’s automated flight information system gives customers data on arrivals, departures and gates. According to American, it handles approximately 19 million customer calls a year.
AA is moving to add a speech front-end to the system, at first in a pilot program (no pun intended) that will take 10% of the traffic. This is not American’s first use of speech rec. In 1999 they rolled out a similar service to their VIP customers.
(It’s interesting that speech, which just a few years ago was ridiculously bad at speaker-independent recognition, is now moving into real world applications from the top down, from best customers into the general pool.)