July 2019

VOICE-BASED VIRTUAL ASSISTANTS HAVE COME CALLING!

RAJNIKANT RAO

Gone are the days when we said, "Let's Google it". The millennials say, "Hey Google!" or "Alexa!" or "Hey Siri!"

 

Haven’t we seen ads of Alexa and Google Home on local TV where songs are played, or the latest news is delivered, or informative general knowledge is easily dispensed just for the asking? Such devices are getting immensely popular and changing the fabric of how home entertainment works.

 

So, how do they work? Essentially, these virtual assistant devices have a mic (that’s short for microphone!) and a speaker. They also have the circuitry to connect to a WiFi network. When you ask a question, it is captured by the mic and sent via WiFi to the respective server, where the request is processed. The response from the server is sent back to the device, which delivers the content via its speaker. This content may be a piece of information or a song. Today, you can even ask these devices to cast a YouTube video on to your TV or play a Netflix film!
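For readers who like to see things in code, the round trip from mic to server to speaker can be sketched in a few lines of Python. This is only an illustration: the names `server_process` and `device_handle` are made up for this sketch, plain text stands in for the recorded audio, and the "server" is a simple function rather than a real remote machine.

```python
from dataclasses import dataclass


@dataclass
class Response:
    kind: str     # "information" or "song" (illustrative categories)
    content: str  # what the speaker should deliver


def server_process(request_text: str) -> Response:
    """Hypothetical stand-in for the remote server that processes the request."""
    if request_text.lower().startswith("play "):
        # Everything after the word "play" is treated as the song title.
        return Response(kind="song", content=request_text[5:])
    return Response(kind="information",
                    content=f"Here is what I found about: {request_text}")


def device_handle(spoken_request: str) -> str:
    """Hypothetical device: capture the request, send it, deliver the reply."""
    # Step 1: the mic captures the request (text stands in for audio here).
    captured = spoken_request.strip()
    # Step 2: the request goes over WiFi to the server (a function call here).
    response = server_process(captured)
    # Step 3: the speaker delivers whatever the server sent back.
    if response.kind == "song":
        return f"[speaker] Now playing: {response.content}"
    return f"[speaker] {response.content}"


print(device_handle("Play Vande Mataram"))
print(device_handle("what is the latest news"))
```

The point of the sketch is that the device itself does very little; the understanding happens on the server.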

 

Now, voice-based virtual assistants are available on the phone, too. No more clumsy typing or even tapping on the screen. Just ask what you want the phone to do. This is the future of interaction on the phone. The creators of these technologies initially provided a simple way to get routine mobile tasks done via voice. These included: "Set an Alarm...", or "Call...", or "Send text to...". But that was a few years ago. Later, they added more capabilities such as playing songs. Today, almost any information that is available in the public domain is accessible via voice. Such apps are available on both iPhones and Android phones.

 

Another popular term is "chatbots". This usually refers to the virtual assistants that are available online. Their most popular application is customer support on websites that receive millions of support requests daily. Such requests are usually typed into a chat window on the website and are processed by a virtual assistant at the backend. Usually, chatbots are not voice-enabled.

 

This article focuses on Google Assistant and how a virtual assistant can be used for an organisation or association. But first, a not-so-technical understanding of how it all works.

 

The Google Assistant is a voice-enabled virtual assistant app built by Google. It is available for free on phones (both Android phones and iPhones) and allows you to ask for any information that is in the public domain. Just open the app and ask for it. Examples of what you can ask are: "what is the latest news", or "when is the next eclipse", or "what is today's Sensex", or "when was GST implemented in India"? You will be surprised at how much of what you want to know can just be asked and answered. Easier and quicker.

 

The only difference between the Google Assistant app and the Google Home device is that the phone app has a screen (the phone) to show information besides speaking out the answer. In fact, both Amazon (Alexa devices) and Google (Home devices) now have devices with a screen. Think of them as specialised screens meant for the Assistant app.

 

What powers the ability of such applications and devices?

 

The first technology is the Speech-to-Text engine, or S2T. Its job is to convert the speech into text as accurately as possible. Imagine the challenge such a system faces in understanding all the different ways in which humans speak. Each one of us has a different tone of voice, a different speed at which we speak, a different depth / shrillness of voice and our own style / accent while speaking. Even when watching movies we know the difference in understanding the words spoken in an American film as opposed to a British one, and how different both are from an Aussie accent. The S2T engine must have the ability to support all this. Added to this are the external noises that cannot be avoided. Imagine you are in a local train or bus and asking for information via the virtual assistant. It needs to distinguish the sound that comes from you from the external sounds that penetrate the mic. Once it does that, it should ignore those extraneous sounds. And all that is done today by the S2T engine.

Unknown to most of us, today’s technology has been improved thanks to the work done over decades. In the beginning, the quality of the S2T engines was very poor and the user had to speak to the system for a few hours before it could recognise her / his voice. Today, it works with no training or very limited training. The magic behind this is a statistical method called "Machine Learning". Millions of sample voices of different dialects and regions and people have been fed into massive computers. Each of these samples is accompanied by the actual words spoken, which are also fed into the computers. Statistical algorithms crunch all this data and come up with what is called a "model". The words spoken by the user are fed into this model, which predicts the likely text being spoken. You will be amazed at its accuracy!
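The idea of learning a "model" from labelled samples can be shown with a toy example in Python. Please treat it as a sketch under heavy assumptions: real speech engines are far more sophisticated, the two numbers per sample are invented stand-ins for measurements taken from a voice recording, and the functions `train` and `predict` are written only for this illustration.

```python
import math
from collections import defaultdict


def train(samples):
    """Build a toy 'model': the average measurement for each known word."""
    sums = {}
    counts = defaultdict(int)
    for features, word in samples:
        if word not in sums:
            sums[word] = [0.0] * len(features)
        sums[word] = [s + f for s, f in zip(sums[word], features)]
        counts[word] += 1
    return {word: [s / counts[word] for s in total]
            for word, total in sums.items()}


def predict(model, features):
    """Return the word whose average measurement is closest to the input."""
    return min(model, key=lambda word: math.dist(model[word], features))


# Invented samples: (measurements from a recording, the word actually spoken).
samples = [
    ([1.0, 0.2], "news"),
    ([0.9, 0.3], "news"),
    ([0.1, 0.8], "song"),
    ([0.2, 0.9], "song"),
]

model = train(samples)
print(predict(model, [0.8, 0.3]))  # a new voice, not among the samples
print(predict(model, [0.3, 0.7]))
```

The more samples of different voices that are fed in, the better such averages represent how a word "usually" sounds, which is why millions of samples matter.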

 

For this to work, the voice spoken on the phone is sent to large servers sitting in a "cloud", which process it and return the equivalent text. This is then moved to the next stage. In fact, within the next year even this step of sending the voice to the “cloud” will be eliminated and the voice will get transcribed on the phone itself, making it almost instantaneous!

 

Assuming that it has correctly transcribed the spoken words, the system next needs to interpret them correctly. This is the most difficult part of the entire process. It is called “Natural Language Processing” or NLP. Some also call it “Natural Language Understanding” or NLU. Imagine that you are a librarian who has access to a vast body of knowledge. When someone approaches you with a question, you understand the query and, thanks to your knowledge of the library, you go to the right section to dig out the information and give it out. That ability is the job of the NLP. Since the request is now known (after getting converted by the S2T engine), the NLP needs to first figure out what the information is about. Is it about a person, or about some geographical data, or about some prices, or about current affairs? Possibly, for each of these categories of information, there is a source available that can provide the information. Much like the different sections of the library.

 

Have you tried asking any of the virtual assistants for the latest news? If not, please do. How would people ask for the latest news? "Tell me the latest news", or "What is the news now", or "What is happening in the world now"? People will not have a standard way of asking for a particular bit of information. Each one of us has our own style and choice of words. The NLP needs to understand that all these are different ways of asking and mean the same thing, viz., "Tell me the news". Once it has established what the user is asking, the response will be something like, "The news as per... is..."

What the NLP engine is doing is simple: having interpreted the request as one for news, it gets the information from one of the popular sources to which it is linked. It could be BBC, or Times News Network, or any other source with which it has a relationship.
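How differently worded requests can be recognised as the same thing can be sketched in Python. This is a deliberately crude illustration: it simply looks for tell-tale words, whereas a real NLP engine relies on statistical models. The keyword lists and the function name `classify_request` are assumptions made for this sketch.

```python
import re

# Hypothetical tell-tale words for each type of request.
REQUEST_KEYWORDS = {
    "get_news": {"news", "happening", "headlines"},
    "play_song": {"play", "song", "music"},
}


def classify_request(utterance: str) -> str:
    """Pick the request type whose keywords overlap most with the words spoken."""
    words = set(re.findall(r"[a-z]+", utterance.lower()))
    best_type, best_score = "unknown", 0
    for request_type, keywords in REQUEST_KEYWORDS.items():
        score = len(words & keywords)
        if score > best_score:
            best_type, best_score = request_type, score
    return best_type


for phrase in ("Tell me the latest news",
               "What is the news now",
               "What is happening in the world now"):
    print(phrase, "->", classify_request(phrase))
```

All three phrasings land on the same request type, which is the behaviour described above.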

 

Can you guess what happens when we ask the system to play a song? Well, once it establishes that it is being asked to play a song, it immediately forwards the request to a songs library, which could be Google Music or Saavn or Gaana, from where the song is played.

 

Have you tried asking for information about a person? Even if the person is not very famous, the Google Assistant will provide some info with links to its source. How does it do it? When it detects that you are asking about a person, it usually goes to one of its two popular sources, LinkedIn or Wikipedia, and delivers the best-guess person's details. In case there are many people with the same name, it will use some other criterion to decide which amongst them to choose.

 

The effectiveness of the voice-based digital assistant primarily lies in the NLP engine rightly detecting what the user is asking for and retrieving the relevant information from one of its sources. This is called determining the “Intent” of the request.
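Once the intent is known, passing the request to the right source is the easy part. A minimal Python sketch, assuming three made-up intents and placeholder functions standing in for the sources mentioned above (no real service is contacted, and none of these function names belong to any actual product):

```python
from typing import Callable, Dict


def news_source(query: str) -> str:
    # Placeholder for a linked news provider.
    return f"[news source] Latest headlines for: {query}"


def songs_library(query: str) -> str:
    # Placeholder for a linked music service.
    return f"[songs library] Now playing: {query}"


def people_source(query: str) -> str:
    # Placeholder for a source of profiles of people.
    return f"[people source] Best-guess profile of: {query}"


# Each intent is linked to the source that can answer it,
# much like the sections of a library.
SOURCES: Dict[str, Callable[[str], str]] = {
    "get_news": news_source,
    "play_song": songs_library,
    "about_person": people_source,
}


def answer(intent: str, query: str) -> str:
    """Forward the query to the source linked to the detected intent."""
    handler = SOURCES.get(intent)
    if handler is None:
        return "Sorry, I did not understand that."
    return handler(query)


print(answer("play_song", "Vande Mataram"))
print(answer("book_flight", "Mumbai to Delhi"))
```

Adding a new category of information then amounts to adding one more entry to the table of sources.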

 

Can it go wrong? Of course! Just like humans can make mistakes, the NLP, too, would. Besides, the NLP is not as wise as a human. It does not have the versatility of a human being. But over time it does a pretty good job. The first point of failure can come where the S2T engine does not transcribe your speech correctly. Perhaps, re-asking it with greater care would solve that problem. Then, when you ask for information about a person, it could so happen that it picks another person with a name similar to the one you asked for. In that case, the query should perhaps be more specific. At times it may misunderstand the category. You are asking about a place while it may misunderstand it to be something else. Most users of virtual assistants accept that it is not perfect, yet it serves an important function and seems to be improving over time.

 

Since the Google Assistant is such a wonderful and easily-used app, how do we enable it to answer questions about information that is private or local to a company? For example, would it not be convenient to query the HR manual of a company using such a feature? Or to train all the office personnel on the products of a company? Or to know the rules of GST for a particular category of products? Just by asking. Sounds like a perfect fit, doesn't it?

Google Assistant has a feature whereby an organisation or association (like the BCA) can set up its own channel. Google calls this an “Action”. In such cases, user requests are not processed by the Google engine but by the company's engine. Let’s take an example. Suppose BCA wishes to provide information to its members which is similar to what its website provides today.

 

BCA can inform Google that it wishes to set up a channel called, say, "Chartered News". Then, if the user says, "Talk to Chartered News", the request will be passed on to the BCA's server for processing. It will not be processed by Google. Now, all that BCA needs to do is to have some relevant software put in place which will "understand" the request and give a suitable response. And this will continue for all requests that the user makes until the user says "Goodbye". If required, such a channel can be restricted to only the members of BCA.
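The hand-over described here can be pictured with a small Python sketch. It is not Google's actual programming interface; `google_engine`, `bca_engine` and `run_conversation` are hypothetical names, and the replies are placeholders. It only shows who handles each request before, during and after a "Chartered News" session.

```python
def google_engine(request: str) -> str:
    # Placeholder for Google's own processing.
    return f"[Google] answer to: {request}"


def bca_engine(request: str) -> str:
    # Placeholder for the software running on BCA's server.
    return f"[Chartered News] answer to: {request}"


def run_conversation(requests):
    """Route each request to Google or to the channel, depending on the session."""
    replies = []
    in_channel = False
    for request in requests:
        text = request.strip().lower()
        if not in_channel and text == "talk to chartered news":
            in_channel = True   # from now on, BCA's server answers
            replies.append("[Chartered News] Welcome!")
        elif in_channel and text == "goodbye":
            in_channel = False  # control returns to Google
            replies.append("[Chartered News] Goodbye!")
        elif in_channel:
            replies.append(bca_engine(request))
        else:
            replies.append(google_engine(request))
    return replies


for reply in run_conversation([
    "What is the latest news",
    "Talk to Chartered News",
    "When is the next seminar",
    "Goodbye",
    "When is the next eclipse",
]):
    print(reply)
```

Everything said between "Talk to Chartered News" and "Goodbye" goes to the channel; everything else stays with Google.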

 

This is an extremely potent way of dispensing information, and it is likely to be how most information is delivered in future. There are tools available that help organisations create such a channel quite easily. These tools have to be configured to understand queries based on the content that is put up by the organisation.

 

Where is the technology moving?

 

Well, today Google Assistant supports Hindi and has announced that it will soon be adding other Indian languages such as Gujarati, Kannada, Urdu, Bengali, Marathi, Tamil, Telugu and Malayalam. New phones (like Nokia 3.2 and Nokia 4.2) are being introduced which have a dedicated “Google Assistant” button. This makes it more convenient for users to access the virtual assistant. Just click on it and ask!

 

This is the new reality: Virtual assistants are the new way to access information. If you have not started yet, please do so or you will be left behind!
