Amazon Echo, Siri, and Cortana are popular VUIs. However, these assistants were designed with different goals in mind: Echo was designed as a voice-first interface, while Siri was designed as just another way of interacting with your iPhone. This is changing.
Speech recognition refers to what the recognition system hears: the recognition engine returns a set of candidate transcriptions for every user utterance. As this technology improves, the challenge of designing a great VUI lies in how the system responds. Natural language understanding (NLU) is how the VUI interprets those transcriptions. It concerns the way input is handled, rather than the accuracy of transcribing what was said.
Handling user responses
When you say "Read me this article", I can interpret this as "Read me this article" or "Reed me this article." The second interpretation does not really make sense. As an intelligent person, I am expected to understand the context and the logic of the statement in order to respond to it. Likewise, an intelligent system needs to know what the user meant and respond accordingly. Here are the different types of responses that the system needs to handle:
Constrained/finite responses
For many of the questions that a system asks, there is only a finite set of responses that are logically valid. For example, when the system asks "What is your favorite animal?" or "Do you want to go outside?", saying "orange" is not a logical response. In the first case, the VUI will have a list of accepted animal names, while in the second one we only need to look for variations of "yes" and "no." Providing a mechanism to capture and interpret valid responses while discarding irrelevant ones is especially challenging for a voice-based system, where the user can choose from a seemingly infinite response set. If the user says something outside the finite set of responses, it can be handled with a null response like "I'm sorry, I don't understand" (see the sketch after the examples below).
Here are some examples that require constrained responses:
- What is your favorite fruit? : Mango, Apple, Banana, etc.
- What country are you from? : India, USA, China, etc.
- What song do you want to hear? : "Fix You", "Lasya", "Roja", etc.
- Would you like to book the tickets? : Yes, No, Naah, Oh yeah, etc.
The response sets might be lengthy; however, they are still finite.
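Here is a minimal sketch of this matching logic in Python. The accepted-answer sets, question IDs, and null response are illustrative assumptions, not the API of any particular toolkit:

```python
# Match a transcribed answer against the finite set that is valid
# for the current question; anything else gets the null response.
ACCEPTED = {
    "favorite_animal": {"dog", "cat", "elephant"},
    "confirmation": {"yes", "yeah", "oh yeah", "no", "naah", "nope"},
}

NULL_RESPONSE = "I'm sorry, I don't understand."

def handle_constrained(question_id: str, transcript: str) -> str:
    answer = transcript.strip().lower()
    if answer in ACCEPTED[question_id]:
        return f"Got it: {answer}."
    return NULL_RESPONSE

print(handle_constrained("confirmation", "Oh yeah"))    # Got it: oh yeah.
print(handle_constrained("favorite_animal", "orange"))  # I'm sorry, I don't understand.
```

A production system would match against recognition hypotheses and their confidence scores rather than a single transcript, but the shape of the logic is the same.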
Open speech
One case is when you want the conversation to have a natural flow but do not explicitly want to handle the input. For example, the assistant might say "Hey there! Long time no see!", to which the user might reply "Nothing much" or "Gotta go work now." In that case, the assistant can give a generic reply like "Hmm.. I see" or "Ok, alright." The user's response is not critical for the conversation to continue; the user could say anything and the logic of the response would still be valid, so a generic answer is fine. Generic replies also work well for confirmations, like "Alright! Done!" or "I'll send this information to our customer service team."
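Since the reply does not depend on what was said, a sketch can simply rotate through neutral acknowledgements (the phrases are placeholders):

```python
import random

# Generic acknowledgements for open speech: the conversation continues
# regardless of what the user actually said.
ACKNOWLEDGEMENTS = ["Hmm.. I see.", "Ok, alright.", "Good to know."]

def acknowledge(user_utterance: str) -> str:
    # The utterance is deliberately ignored; any input is acceptable here.
    return random.choice(ACKNOWLEDGEMENTS)

print(acknowledge("Gotta go work now"))
```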
Categorization of input
A good strategy for handling user input is to sort it into broadly defined buckets such as positive/negative, happy/sad/excited, good/bad, etc. The VUI simply maps the input to a category rather than matching an exact response.
For example, the VUI can ask "How was your experience at our restaurant?" and bucket the replies:
- Good: Good, Amazing, Terrific, Awesome, etc.
- Bad: Depressing, Bad, Irritating, etc.
The assistant can then respond accordingly.
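A sketch of this bucketing, with replies wired to the buckets (the keyword lists and responses are illustrative):

```python
# Map free-form feedback onto broad buckets instead of exact responses.
BUCKETS = {
    "good": {"good", "amazing", "terrific", "awesome", "great"},
    "bad": {"depressing", "bad", "irritating", "awful"},
}

REPLIES = {
    "good": "Glad to hear it! Hope to see you again soon.",
    "bad": "I'm sorry to hear that. Want to tell us more?",
    None: "Thanks for the feedback!",  # no bucket matched
}

def categorize(transcript: str):
    words = set(transcript.lower().split())
    for bucket, keywords in BUCKETS.items():
        if words & keywords:
            return bucket
    return None

print(REPLIES[categorize("It was pretty awesome")])  # good bucket
```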
Pro tip: Try not to announce what the user is already feeling. For example, when the user says that the experience was bad, do not say "It seems you've had a bad experience. Let us know how we can improve." The user has already indicated the mood; it's unnatural to repeat it. Instead, try saying something reassuring like "I'm sorry to hear that. Want to tell us more?"
Logical expressions
Looking for specific keywords or phrases is a simpler method; however, it is important for a voice-based system to allow for more complex queries. For example, the intent behind the following queries is the same:
- "My computer is really slow"
- "My computer is really really slow"
- "Computer is slow. What to do?"
Booking a cab or ordering food are simple intents, but there can be many variations in the way a user asks for the same thing. It would be a huge task to write a condition for each of these variations. Instead, the system can recognize common patterns.
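A single pattern can cover many surface variations of one intent. The sketch below uses a regular expression; a real system would use a trained NLU model instead:

```python
import re

# One pattern covers many phrasings of the same intent,
# instead of a hand-written condition per variation.
SLOW_COMPUTER = re.compile(r"\bcomputer\b.*\bslow\b", re.IGNORECASE)

queries = [
    "My computer is really slow",
    "My computer is really really slow",
    "Computer is slow. What to do?",
]

for q in queries:
    if SLOW_COMPUTER.search(q):
        print(f"intent=slow_computer: {q!r}")  # all three match
```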
Negation
Imagine a VUI asking you "How was your experience at the restaurant?" and you say "Not very good." If the designers have not considered this response, the system picks up the keyword "good" and replies "Awesome! Thanks!" The VUI immediately sounds stupid, and the user may become wary of trusting the assistant. Handling negation is a much more difficult task, but the cost of ignoring it is high.
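A minimal guard against this failure mode is to check for a negation cue before trusting a positive keyword. The cue list below is a rough illustration, far from exhaustive:

```python
import re

# Common negation cues, including contracted "n't" (as in "wasn't").
NEGATIONS = re.compile(r"n't\b|\b(?:not|no|never|hardly)\b", re.IGNORECASE)

def experience_was_good(transcript: str) -> bool:
    has_positive = "good" in transcript.lower()
    negated = bool(NEGATIONS.search(transcript))
    return has_positive and not negated

print(experience_was_good("It was good"))     # True
print(experience_was_good("Not very good."))  # False, not misread as praise
```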
Disambiguation
The word disambiguation literally means removing uncertainty, and it is arguably one of the most important problems that voice interfaces need to tackle. A simple example is placing a phone call: if you ask Siri to "call John" and there are multiple Johns in your contact list, it will ask "Which John?" and list the full names of the matching contacts. The system also needs to disambiguate when the user provides insufficient or excessive information. For example, if a user says "I'd like a large pizza," a natural follow-up question could be "What kind of pizza would you like?" This is a case of insufficient information, which the assistant can resolve by asking a leading question. When the user gives excessive information that the system is not built to handle, the system can ask the user to provide one piece of information at a time; however, it is more beneficial to account for multiple pieces of information at once.
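The contact-call case in a minimal sketch (the contact list and phrasing are made up):

```python
CONTACTS = ["John Smith", "John Appleseed", "Mary Jones"]

def resolve_contact(spoken_name: str) -> str:
    matches = [c for c in CONTACTS if spoken_name.lower() in c.lower()]
    if len(matches) == 1:
        return f"Calling {matches[0]}."
    if len(matches) > 1:
        # Ambiguous: ask a follow-up that lists the candidates.
        return f"Which {spoken_name}? " + " or ".join(matches) + "?"
    return f"I couldn't find {spoken_name} in your contacts."

print(resolve_contact("John"))  # Which John? John Smith or John Appleseed?
print(resolve_contact("Mary"))  # Calling Mary Jones.
```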
Capturing intent
For more complex VUIs, you need smarter ways of handling speech input. Take messaging: there are multiple things you can do with a messaging app. You could say "Send a message to mom," "Read my last message," or "Have I got any messages?" In each case the intent is different, and handling these queries by searching for the keyword "message" is not the best strategy. Instead, the VUI's NLU model should be trained to handle each of these queries as a separate intent.
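To make the point concrete, here is a toy intent router; the phrase lists stand in for a trained classifier and are purely illustrative:

```python
# Three distinct intents whose utterances all contain "message":
# keyword search alone cannot tell them apart.
INTENT_PATTERNS = {
    "send_message": ["send a message", "text mom"],
    "read_message": ["read my last message", "read my messages"],
    "check_messages": ["have i got any messages", "any new messages"],
}

def classify(transcript: str) -> str:
    t = transcript.lower()
    for intent, phrases in INTENT_PATTERNS.items():
        if any(p in t for p in phrases):
            return intent
    return "fallback"

print(classify("Have I got any messages?"))  # check_messages
print(classify("Send a message to mom"))     # send_message
```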
Capturing objects
In cases where the user utters multiple pieces of information at once, the NLU model should be able to handle the query and capture the objects needed for the intent. For example, if a user says "Order me a large cappuccino from Starbucks at home," the user has already specified the type of coffee, the size, the store, and the place of delivery. The system should be able to pre-fill this information.
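A toy slot-filling pass for that coffee order. The slot vocabularies are assumptions for illustration; NLU platforms learn these from annotated examples instead:

```python
import re

# Each slot has a small vocabulary; a match pre-fills that field.
SLOTS = {
    "size": r"\b(small|medium|large)\b",
    "drink": r"\b(cappuccino|latte|espresso)\b",
    "store": r"\b(starbucks|costa)\b",
    "location": r"\b(home|office)\b",
}

def extract_slots(transcript: str) -> dict:
    t = transcript.lower()
    found = {}
    for slot, pattern in SLOTS.items():
        m = re.search(pattern, t)
        if m:
            found[slot] = m.group(1)
    return found

print(extract_slots("Order me a large cappuccino from Starbucks at home"))
# {'size': 'large', 'drink': 'cappuccino', 'store': 'starbucks', 'location': 'home'}
```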
You can use tools like Api.ai, Microsoft LUIS, Nuance Mix, Wit.ai, etc. to build and test these models.
Wake words
Wake words are often used to invoke a VUI system. For example, "Alexa" is the wake word for Amazon Echo, while "Hey Google" and "OK Google" are wake words for Google Assistant. Using a wake word is one way to start an interaction with the VUI system without having to touch any device.
Here are the things to keep in mind when designing a wake word:
- It should be easily recognizable. Short words like "Jim" or "Will" are difficult to recognize.
- It should be easy for users to say.
- Use words with multiple syllables. Note that Alexa's and Siri's wake words all have multiple syllables.
- Don't choose words that people might say regularly in conversation.
Another important thing to note is that wake words should be handled locally. Your device should always listen for the wake word locally, and only then start recording the user's voice to send to the cloud for processing. Always recording and sending data to the cloud is unethical and will lead to serious distrust among users.
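The loop below sketches this listen-locally, upload-after-wake design. The frame source here is a plain list of already-transcribed snippets, a deliberate stand-in for a real on-device audio pipeline:

```python
WAKE_WORD = "alexa"

def run_voice_loop(frames):
    frames = iter(frames)
    for frame in frames:
        # Wake-word matching happens entirely on-device;
        # nothing has left the device at this point.
        if WAKE_WORD in frame.lower():
            utterance = next(frames, None)  # only now start capturing
            print(f"sending to cloud for processing: {utterance!r}")

run_voice_loop(["background chatter", "Alexa", "what's the weather?"])
```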
TTS versus recorded voice
Another important decision you need to make is whether to use text-to-speech (TTS) or a recorded voice to answer user queries. Although a recorded voice feels more natural, it is expensive and time-consuming to record every answer. TTS, on the other hand, works in real time but sounds robotic. Although it is improving, TTS still has difficulty pronouncing certain words, and emotion is difficult to convey.
TTS can be improved by applying Speech Synthesis Markup Language (SSML), which can help add more natural sounding intonations and pronunciations on the fly. Despite this, there are still words and phrases that the TTS engine might have difficulty with and it might be necessary for you to build a pronunciation dictionary.
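For illustration, here is a small snippet using standard W3C SSML tags, wrapped in a Python string only to keep one language across the examples in this article (the phrasing and IPA transcription are illustrative):

```python
# <break> inserts a pause, <say-as> controls interpretation, and
# <phoneme> spells out a pronunciation the engine might otherwise miss.
ssml = """\
<speak>
  Thanks! Your order is confirmed.
  <break time="400ms"/>
  The total comes to <say-as interpret-as="cardinal">42</say-as> dollars.
  Enjoy your <phoneme alphabet="ipa" ph="ˌkæpʊˈtʃiːnoʊ">cappuccino</phoneme>!
</speak>
"""
print(ssml)
```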
As a rule of thumb, it is a good strategy to use a combination of TTS and recorded voice. Recordings can be used for the most common responses, like confirmations and greetings. If you are using such a hybrid model, it is also a good idea to build a voice font of your recording artist so the two voices match.
Voice biometric authentication
Voice biometric authentication, also known as voice ID or speaker verification, is a mechanism that allows users to authenticate themselves by voice alone. Although the technology is improving, it is generally not advisable to use voice ID as the sole factor for authentication (I'm sure I'll have to eat my words about this in the future). Speaker recognition can also be used for identification rather than authentication: the VUI can identify who is speaking and respond accordingly.
Sentiment analysis
Another way of making your VUI smarter is to use sentiment analysis to detect emotions. Sentiment analysis refers to identifying and categorizing opinions in a piece of text. Although it sounds complicated, you can start doing sentiment analysis quite easily. First define your categories (e.g. positive, negative, or neutral), then compare what the user said against them to pick a category. You can use open-source tools like KNIME to get started.
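The crudest possible version of that recipe, a lexicon lookup, fits in a few lines. The word lists are placeholders; tools like KNIME ship far more robust models:

```python
# Count positive and negative cue words and pick a category.
POSITIVE = {"good", "great", "amazing", "love", "awesome"}
NEGATIVE = {"bad", "awful", "irritating", "hate", "slow"}

def sentiment(text: str) -> str:
    words = set(text.lower().split())
    score = len(words & POSITIVE) - len(words & NEGATIVE)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(sentiment("The service was amazing"))  # positive
```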
Emotion detection
This is a fairly new field in which companies like Affectiva have begun using facial characteristics to detect users' emotions, which the system can use to give an appropriate response. Emotion detection can also be done by analyzing voice samples in real time: Beyond Verbal detects emotions from a voice stream, which its Moodies app analyzes to display the primary emotion.
One key thing to remember when detecting emotions is to err on the side of caution. Getting the emotional state right is good, but getting it wrong can have disastrous consequences.