
July 2019

VOICE-BASED VIRTUAL ASSISTANTS HAVE COME CALLING!

By RAJNIKANT RAO
Reading Time 11 mins

Gone are the days when we said, “Let’s Google it”. The millennials say, “Hey Google!” or “Alexa!” or “Hey Siri!”

Haven’t we seen the ads for Alexa and Google Home on local TV, where songs are played, the latest news is delivered, or general knowledge is dispensed just for the asking? Such devices are getting immensely popular and are changing the fabric of how home entertainment works.

So, how do they work? Essentially, these virtual assistant devices have a mic (that’s short for microphone!) and a speaker, along with the circuitry to connect to WiFi. When you ask a question, it is captured by the mic and sent via WiFi to the respective server, where the request is processed. The server’s response is sent back to the device, and the content is delivered via the speaker. This content may be a piece of information or a song. Today, you can even ask these devices to cast a YouTube video on to your TV or play a Netflix film!
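
For the technically curious, the loop just described can be sketched in a few lines of Python. Everything here is illustrative — the server URL is a made-up placeholder and real devices use their vendors’ own protocols — but the shape of the round-trip is the same:

```python
# A minimal sketch of a smart speaker's ask-and-answer loop.
# The endpoint URL is a placeholder; real devices use vendor protocols.
import numpy as np
import requests
import sounddevice as sd

RATE = 16000                                    # 16 kHz mono audio
ENDPOINT = "https://assistant.example.com/ask"  # hypothetical server

def ask_assistant(seconds=5):
    # 1. The mic captures the spoken question.
    question = sd.rec(seconds * RATE, samplerate=RATE,
                      channels=1, dtype="int16")
    sd.wait()
    # 2. The raw audio travels over WiFi to the server for processing.
    reply = requests.post(ENDPOINT, data=question.tobytes())
    # 3. The server's spoken answer is played through the speaker.
    answer = np.frombuffer(reply.content, dtype=np.int16)
    sd.play(answer, samplerate=RATE)
    sd.wait()
```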

 

Now, voice-based virtual assistants are available on the phone, too. No more clumsy typing or even tapping on the screen; just ask what you want the phone to do. This is the future of interaction on the phone. The creators of these technologies initially provided a simple way to get routine mobile tasks done via voice, such as “Set an alarm…”, “Call…” or “Send a text to…”. But that was a few years ago. Later, they added more capabilities, like playing songs and so on. Today, almost any information available in the public domain is accessible via voice. Such assistants are available on both iPhone and Android phones.

Another popular term is “chatbots”. This usually refers to virtual assistants that are available online. Customer support is the most popular application: websites get millions of support requests on a daily basis. Such requests are usually typed into a chat window on the website and are processed by a virtual assistant at the backend. Usually, chatbots are not voice-enabled.

This article focuses on Google Assistant and how a virtual assistant can be used by an organisation or association. But first, some not-so-technical understanding of how it all works.

The Google Assistant is a voice-enabled virtual assistant app built by Google. It is available for free on phones (both Android and iPhone) and allows you to ask for any information that is in the public domain. Just open the app and ask. Examples of questions that can be asked are: “What is the latest news?”, “When is the next eclipse?”, “What is today’s Sensex?”, or “When was GST implemented in India?” You will be surprised at how much of what you want to know can simply be asked and answered. Easier and quicker.

The only difference between the Google Assistant app and the Google Home device is that the phone app has a screen (the phone’s) to show information besides speaking out the answer. In fact, both Amazon (Alexa devices) and Google (Home devices) now offer devices with a screen. Think of them as specialised screens meant for the Assistant.

What powers the ability of such applications and devices?

The first technology is the Speech-to-Text engine, or S2T. Its job is to convert speech into text as accurately as possible. Imagine the challenge such a system faces in understanding all the different ways in which humans speak. Each of us has a different tone of voice, speaks at a different speed and has a distinct pitch, style and accent. Even when watching movies, we notice the difference between understanding the words spoken in an American film as opposed to a British one, and how different both are from an Aussie accent. The S2T engine must be able to handle all of this. Added to this are the external noises that cannot be avoided. Imagine you are in a local train or bus, asking for information via the virtual assistant. The engine needs to distinguish the sound of your voice from the external sounds that reach the mic, and then ignore those extraneous sounds. All of that is done by today’s S2T engines.

Unknown to most of us, today’s technology has improved thanks to work done over decades. In the beginning, the quality of S2T engines was very poor, and the user had to train the system on her / his voice for a few hours before it could be recognised. Today, it works with little or no training. The magic behind this is a statistical method called “Machine Learning”. Millions of voice samples from people of different dialects and regions have been fed into massive computers, each sample paired with the actual words spoken. Statistical algorithms crunch all this data and come up with what is called a “model”. The words spoken by the user are fed into this model, which predicts the likely text being spoken. You will be amazed at its accuracy!
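
To a programmer, using such a pre-trained model is now a matter of a few lines of code. Here is a sketch using Google’s Cloud Speech-to-Text service, one of the commercially available engines built this way; it assumes a Google Cloud account with credentials configured, and “question.wav” is a placeholder for any short recording:

```python
# Sketch: transcribing a recording with a cloud S2T engine.
# Assumes Google Cloud credentials are set up; "question.wav" is
# a placeholder for a 16 kHz mono WAV clip.
from google.cloud import speech

client = speech.SpeechClient()

with open("question.wav", "rb") as f:
    audio = speech.RecognitionAudio(content=f.read())

config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
    sample_rate_hertz=16000,
    language_code="en-IN",  # model tuned for Indian English accents
)

# The trained model predicts the likely text for the spoken words.
response = client.recognize(config=config, audio=audio)
for result in response.results:
    best = result.alternatives[0]
    print(best.transcript, "(confidence: %.2f)" % best.confidence)
```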

 

For this to work, the voice spoken into the phone is sent to large servers sitting in the “cloud”, which process it and return the equivalent text. This is then passed to the next stage. In fact, within the next year even this step of sending the voice to the “cloud” is expected to be eliminated, with the voice getting transcribed on the phone itself, making it almost instantaneous!

Assuming that the spoken words have been correctly transcribed, the system next needs to interpret them correctly. This is the most difficult part of the entire process. It is called “Natural Language Processing”, or NLP; some also call it “Natural Language Understanding”, or NLU. Imagine that you are a librarian with access to a vast body of knowledge. When someone approaches you with a question, you understand the query and, thanks to your knowledge of the library, go to the right section to dig out the information and hand it over. That ability is the job of the NLP. Since the request is now known (after being converted by the S2T engine), the NLP first needs to figure out what the information is about. Is it about a person, or some geographical data, or some prices, or current affairs? For each of these categories of information, there is typically a source available that can provide it, much like the different sections of the library.

Have you tried asking any
of the virtual assistants for the latest news? If not, please do. How would
people ask for the latest news? “Tell me the latest news”, or
“What is the news now”, or “What is happening in the world
now”? People will not have a standard way of asking for a particular bit
of information. Each one of us has our own style and choice of words. The NLP
needs to understand that all these are different ways of asking and mean the
same thing, viz., “Tell me the news”. Once it has established what
the user is asking, the response will be something like, “The news as
per… is…”

What the NLP engine is doing is simple: having interpreted the request as asking for news, it gets the information from one of the popular sources to which it is linked. It could be the BBC, or Times News Network, or any other source with which it has a relationship.
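
A toy version of this step might look like the sketch below: several phrasings collapse into one “intent”, and each intent is wired to a source. Real NLP engines use trained statistical models rather than keyword lists, so treat this purely as an illustration of the idea:

```python
# Toy intent matcher: many phrasings map to one intent, and each
# intent is routed to a source. Real NLP engines use trained models,
# not keyword lists; this only illustrates the idea.
INTENT_KEYWORDS = {
    "get_news":  ["news", "happening in the world", "headlines"],
    "play_song": ["play", "song", "music"],
}

SOURCES = {
    "get_news":  "BBC",           # or Times News Network, etc.
    "play_song": "Google Music",  # or Saavn, or Gaana, etc.
}

def detect_intent(utterance):
    text = utterance.lower()
    for intent, keywords in INTENT_KEYWORDS.items():
        if any(phrase in text for phrase in keywords):
            return intent
    return None

for ask in ["Tell me the latest news",
            "What is happening in the world now?",
            "Play a song for me"]:
    intent = detect_intent(ask)
    print('"%s" -> %s, served by %s' % (ask, intent, SOURCES[intent]))
```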

 

Can you guess what happens when we ask the system to play a song? Well, once it establishes that it is being asked to play a song, it immediately forwards the request to a music library, which could be Google Music or Saavn or Gaana, from where the song is played.

Have you tried asking for information about a person? Even if the person is not very famous, the Google Assistant will provide some information with links to its source. How does it do this? When it detects that you are asking about a person, it usually goes to one of its two popular sources, LinkedIn or Wikipedia, and delivers its best guess of the person’s details. In case there are many people with the same name, it will use some other criterion to decide which among them to choose.

The effectiveness of a voice-based digital assistant primarily lies in the NLP engine correctly detecting what the user is asking for and retrieving the relevant information from one of its sources. This is called determining the “Intent” of the request.

Can it go wrong? Of course! Just as humans can make mistakes, the NLP can, too. Besides, the NLP is not as wise as a human; it does not have a human being’s versatility. But over time it does a pretty good job. The first point of failure can come when the S2T engine does not transcribe your speech correctly; re-asking with greater care would perhaps solve that problem. Then, when you ask for information about a person, it could happen that it picks another person with a similar name, in which case the query should perhaps be more refined. At times it may misunderstand the category: you are asking about a place while it takes it to be something else. Most users of virtual assistants accept that the technology is not perfect, yet it serves an important function and seems to be improving over time.

Since the Google Assistant is such a wonderful and easy-to-use app, how do we enable it to answer questions about information that is private or local to a company? For example, would it not be convenient to query the HR manual of a company using such a feature? Or to train all the office personnel on the products of a company? Or to know the rules of GST for a particular category of products? Just by asking. Sounds like a perfect fit, doesn’t it?

Google Assistant has a feature whereby an organisation or association (like the BCA) can set up its own channel. Google calls this an “Action”. In such cases, user requests are not processed by the Google engine but by the organisation’s own engine. Let’s take an example. Suppose BCA wishes to provide its members information similar to what its website provides today.

BCA can inform Google that it wishes to set up a channel called, say, “Chartered News”. Now, if the user says, “Talk to Chartered News”, the request will be passed on to BCA’s server for processing; it will not be processed by Google. All that BCA needs to do is put some relevant software in place that will “understand” the request and give a suitable response. This will continue for all requests the user makes until the user says “Goodbye”. If required, such a channel can be restricted to members of the BCA only.
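
Under the hood, “BCA’s server” would typically be a small web service — a webhook — that Google forwards each request to once the user has entered the channel. A minimal sketch of such a service in Python, using Flask and the Dialogflow fulfillment format on which many Actions are built, might look like this (the intent names and answers are purely illustrative, not BCA’s actual service):

```python
# Minimal webhook sketch for a hypothetical "Chartered News" Action.
# Uses the Dialogflow fulfillment request/response format; the intent
# names and the answers below are illustrative only.
from flask import Flask, request, jsonify

app = Flask(__name__)

@app.route("/webhook", methods=["POST"])
def webhook():
    body = request.get_json()
    # Dialogflow includes the matched intent's name in each request.
    intent = body["queryResult"]["intent"]["displayName"]

    if intent == "latest_updates":        # hypothetical intent
        answer = "Here are the latest updates from Chartered News..."
    elif intent == "membership_query":    # hypothetical intent
        answer = "Please quote your membership number to proceed."
    else:
        answer = "Sorry, I did not follow that. Could you rephrase?"

    # "fulfillmentText" is what the Assistant speaks back to the user.
    return jsonify({"fulfillmentText": answer})

if __name__ == "__main__":
    app.run(port=8080)
```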

 

This is an extremely potent way in which information is likely to be dispensed in the future. Tools are available that help organisations create such a channel quite easily; they have to be configured to understand queries based on the content the organisation puts up.

Where is the technology moving?

Well, today Google Assistant supports Hindi, and Google has announced that it will soon add other Indian languages such as Gujarati, Kannada, Urdu, Bengali, Marathi, Tamil, Telugu and Malayalam. New phones (like the Nokia 3.2 and Nokia 4.2) are being introduced with a dedicated “Google Assistant” button, making it even more convenient to access the virtual assistant. Just press it and ask!

This is the new reality:
Virtual assistants are the new way to access information. If you have not
started yet, please do so or you will be left behind!
