The tools we use for voice recognition and speech synthesis


One of DevelopEx’s projects required human speech recognition and speech synthesis. The market offers dozens of good solutions. We researched several speech recognition tools, weighing their benefits and drawbacks, and then tried to choose one… or, as time showed, more than one. After weeks of trial and error we settled on the three most valuable systems and used them in the project. Below we briefly explain our reasoning.

General specification

The software (the final DevelopEx project) should let the user control various kinds of electronic devices hands-free, using only voice commands. The software should:

be reliable (the application must always respond to the user with feedback and instructions),
react quickly to the user’s commands,
work without an internet connection,
work against a noisy background,
understand multiple languages.

The main task was to find the right set of speech recognition tools to meet these requirements. No single tool could cover all of them. We started with Google products, then realized they could not cover all the requirements and tried Nuance products. In the end we integrated several tools, each covering its own set of cases.

Nuance with Google

The tools that we are actually using are:

Recognizer and Text-to-speech (from Google),
VoCon (from Nuance),
NLU (from Nuance).

Since we were developing an Android app, we started with Google Recognizer, the native option on Android. Recognizer is fine at understanding phrases from live speech, but it sometimes failed on short commands (like ‘Yes’ or ‘Cancel’) and on people’s names.
Nuance VoCon handled short commands and names well. Another major benefit is its offline support: if the connection is lost, the application still works, which is exactly what we needed. The downside of VoCon is that it does not handle live speech at all, so we use it only for the cases described above: short commands, names, and offline operation.
The third tool we applied is Nuance Cloud Recognizer with NLU processing. This pair can work with live speech: the Cloud converts the audio stream to text, and NLU then does the post-processing. We found the quality of the Nuance product to be roughly equal to Google’s, so we started using NLU as an alternative way to recognise live speech. The combination of Nuance Cloud and NLU worked so well that it became the dominant one in our project, and we dropped Recognizer for safety and better recognition.
Finally, we need to leave a good impression on the user while he or she communicates with the device. Google Text-to-speech (TTS) is the best way to do that: first, it is built into Android, and second, it has many extensions that make the voice sound natural.
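
The division of work between the recognizers described above can be sketched as a small routing rule. This is only an illustration in Python, not the project’s Android code: the names Engine and choose_engine and the three boolean flags are hypothetical, and the rule itself is our reading of the text (VoCon for offline use, short commands and names; Nuance Cloud with NLU for live speech).

```python
from enum import Enum


class Engine(Enum):
    """Recognizers named in the article (labels are illustrative)."""
    VOCON = "Nuance VoCon (embedded, works offline)"
    CLOUD_NLU = "Nuance Cloud Recognizer + NLU"


def choose_engine(online: bool, expects_short_command: bool, expects_name: bool) -> Engine:
    """Pick a recognizer for the next utterance (hypothetical routing rule)."""
    # Without a connection only the embedded engine can be used.
    if not online:
        return Engine.VOCON
    # Short commands ('Yes', 'Cancel') and contact names go to VoCon.
    if expects_short_command or expects_name:
        return Engine.VOCON
    # Free-form live speech goes to the cloud recognizer with NLU post-processing.
    return Engine.CLOUD_NLU


if __name__ == "__main__":
    print(choose_engine(online=True, expects_short_command=False, expects_name=False).value)
    print(choose_engine(online=False, expects_short_command=False, expects_name=False).value)
```

Keeping the choice in one function makes it easy to change the rule later, for example if a different engine turns out to handle names better.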


Example

Here is a simple example of how the program is used. The task is to send a message to John Presley, whose contact is saved on the phone. Rest assured that the software can process much more complex commands, such as mapping a route or finding and playing particular music.
But for now, let’s just send an SMS without touching the device.

Activate the program.
Say ‘Send message to John Presley’.
Make sure the virtual assistant has recognized the command (the assistant should repeat it).
Say the message: ‘Hi John, how are you today?’
Again, make sure the message is correct.
Confirm sending the message.

That’s it.
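
The steps above can be read as a short confirmation dialog. The sketch below walks through it in Python with already-recognized phrases as input. Everything in it is an assumption made for illustration: the function name send_message_dialog, the wording of the prompts, and the rule that only ‘yes’ confirms sending. Nothing is actually sent.

```python
def send_message_dialog(heard):
    """Run the confirmation flow over a scripted list of recognized phrases.

    `heard` holds what the recognizer returned at each user turn:
    the command, the message text, and the final confirmation.
    Returns (sent, prompts), where prompts are the assistant's replies.
    """
    prompts = []
    turns = iter(heard)

    # Turn 1: the command itself.
    command = next(turns, "")
    prefix = "send message to "
    if not command.lower().startswith(prefix):
        prompts.append("Sorry, I did not understand the command.")
        return False, prompts
    contact = command[len(prefix):]
    # The assistant repeats the command so the user can check it.
    prompts.append(f"Sending a message to {contact}. What is the message?")

    # Turn 2: the message text, repeated back for checking.
    text = next(turns, "")
    prompts.append(f"Your message is: {text} Send it?")

    # Turn 3: explicit confirmation (a short command).
    if next(turns, "").strip().lower() != "yes":
        prompts.append("Cancelled.")
        return False, prompts

    prompts.append(f"Message sent to {contact}.")
    return True, prompts


if __name__ == "__main__":
    ok, log = send_message_dialog(
        ["Send message to John Presley", "Hi John, how are you today?", "Yes"]
    )
    for line in log:
        print(line)
```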

Addition 1.

Here is an abstract example of an NLU result structure (Calling domain):

{
	"Input": {
		"Interpretations": [ "send message to john presley" ],
		"Type": "text"
	},
	"Instances": [
		{
			"nlu_classification": {
				"Domain": "Messaging",
				"Intention": "Dial"
			},
			"nlu_slot_details": {
				"Contact-name": {
					"literal": "John"
				},
				"Phone-type": {
					"canonical": [ "mobile" ],
					"literal": "Mobile"
				}
			}
		}
	],
	"type": "nlu_results"
}

User query: Send message to John Presley
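
To show how an application might read such a result, here is a minimal Python sketch that pulls the domain, the intention and the slot literals out of the structure above. The field names are taken from the sample; the function name summarize_nlu and the shortened SAMPLE string are ours, and a real response may contain fields this sketch ignores.

```python
import json

# Shortened copy of the sample result shown above.
SAMPLE = """
{
  "Input": {"Interpretations": ["send message to john presley"], "Type": "text"},
  "Instances": [
    {
      "nlu_classification": {"Domain": "Messaging", "Intention": "Dial"},
      "nlu_slot_details": {
        "Contact-name": {"literal": "John"},
        "Phone-type": {"canonical": ["mobile"], "literal": "Mobile"}
      }
    }
  ],
  "type": "nlu_results"
}
"""


def summarize_nlu(raw: str):
    """Return a list of (domain, intention, {slot: literal}) tuples."""
    data = json.loads(raw)
    summary = []
    for instance in data.get("Instances", []):
        classification = instance.get("nlu_classification", {})
        # Keep only the literal text of each slot; other keys are skipped.
        slots = {
            name: details.get("literal")
            for name, details in instance.get("nlu_slot_details", {}).items()
        }
        summary.append(
            (classification.get("Domain"), classification.get("Intention"), slots)
        )
    return summary


if __name__ == "__main__":
    for domain, intention, slots in summarize_nlu(SAMPLE):
        print(domain, intention, slots)
```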

Addition 2.

A short visualisation of how NLU parses voice input. If you say ‘Send message to John Presley’, the relevant part of the NLU log looks like this:

NluRecognizer:Send
NluRecognizer:Send me
NluRecognizer:Send my
NluRecognizer:Send message
NluRecognizer:Send message to
NluRecognizer:Send message to John
NluRecognizer:Send message to Johnna Prez
NluRecognizer:Send message to Johnna Pezley
NluRecognizer:Send message to John Presley
I/NluRecognizer: contacts=[John Presley]
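
The log shows the hypothesis being refined word by word until the contact name is matched. As an illustration, the Python sketch below separates the partial hypotheses from the final contact list in a log of this shape. The function name parse_nlu_log is hypothetical, and splitting several contacts on commas is an assumption, since the sample contains only one name.

```python
import re

LOG = """\
NluRecognizer:Send
NluRecognizer:Send me
NluRecognizer:Send my
NluRecognizer:Send message
NluRecognizer:Send message to
NluRecognizer:Send message to John
NluRecognizer:Send message to Johnna Prez
NluRecognizer:Send message to Johnna Pezley
NluRecognizer:Send message to John Presley
I/NluRecognizer: contacts=[John Presley]
"""


def parse_nlu_log(log: str):
    """Split a log excerpt into partial hypotheses and resolved contacts."""
    partials, contacts = [], []
    for line in log.splitlines():
        # Accept both 'NluRecognizer:' and the 'I/NluRecognizer:' prefix.
        match = re.match(r"(?:I/)?NluRecognizer:\s*(.*)", line)
        if not match:
            continue
        payload = match.group(1)
        found = re.fullmatch(r"contacts=\[(.*)\]", payload)
        if found:
            # Assumption: multiple contacts would be comma-separated.
            contacts = [name.strip() for name in found.group(1).split(",") if name.strip()]
        else:
            partials.append(payload)
    return partials, contacts


if __name__ == "__main__":
    partials, contacts = parse_nlu_log(LOG)
    print("final hypothesis:", partials[-1])
    print("contacts:", contacts)
```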

As always, feel free to contact us for a consultation!
