* A draft of this article was dictated and then converted into text by one of the SR systems.
Speech recognition (SR) is a technology that enables computers to recognize spoken language and translate it into text. It is also known as “speech to text” (STT or S2T) or “voice to text” (V2T).
Systems that use SR are now well known, and even Siri is no longer anything special. While the IT giants have already presented SR-based solutions, other companies are only beginning to implement SR in their products. There are plenty of well-known solutions: Google Now, Siri, Amazon Alexa, Cortana. It is therefore not surprising that there are many open APIs ready for use by developers. But there is no universal solution that suits almost every case: all open APIs are different, and often no single system can meet all requirements, yet to preserve project consistency only one API should be selected. The success of the project often depends on this initial choice.

In this overview we’ll look at SR systems that support the Android platform, mentioning cross-platform support where it exists. We won’t cover all open APIs, only the most popular ones that developers typically use in production. All SR systems can be divided into three groups: ready-to-embed solutions, solutions based on a client-server architecture, and solutions that are both ready to embed and usable as a base for your own client-server architecture. You can see the comparison in the table below.
| SR API | Languages | Offline | Cross-platform | Customizability | Cost | Our demo app |
|---|---|---|---|---|---|---|
| Android SpeechRecognizer* | 20 | no / yes** | no | + | free | Android SpeechRecognizer Demo |
| Google Cloud Speech | 80+ | no | yes | ++ | free / paid | Google Cloud Speech Demo |
| Microsoft Bing Speech | 26 | no | yes | ++ | free / paid | Microsoft Bing Speech Demo |
| CMU Sphinx | English only | yes | yes*** | +++ | free | CMU Sphinx Demo |
* an embedded solution out of the box, part of the Android API
** only English by default; other language models must be downloaded by the user manually
*** the core is written in C, and there are many ports to popular platforms
SpeechRecognizer API
This API has been built into Android since version 2.2 (Froyo). It works online by default but can work offline (offline mode can even be prioritized) when there is no connection, provided the user has manually downloaded the language model beforehand. A separate model must be downloaded for each language the user wants to recognize offline, and this download cannot be triggered programmatically; only the English model is pre-installed by default. The user can also update the models on the device when Google rolls out an update, so recognition quality improves over time. A significant advantage is the simplicity of the API: a developer can start using it quickly and easily.
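For illustration, a minimal sketch of its use (assuming a `Context` and an already granted `RECORD_AUDIO` permission; the helper method name is ours):

```java
import android.content.Context;
import android.content.Intent;
import android.os.Bundle;
import android.speech.RecognitionListener;
import android.speech.RecognizerIntent;
import android.speech.SpeechRecognizer;

import java.util.ArrayList;

// Call from an Activity or Service with RECORD_AUDIO already granted
void startBuiltInRecognition(Context context) {
    SpeechRecognizer recognizer = SpeechRecognizer.createSpeechRecognizer(context);
    recognizer.setRecognitionListener(new RecognitionListener() {
        @Override public void onResults(Bundle results) {
            ArrayList<String> hypotheses =
                    results.getStringArrayList(SpeechRecognizer.RESULTS_RECOGNITION);
            // hypotheses.get(0) is the most likely transcription
        }
        @Override public void onError(int error) { /* e.g. ERROR_NO_MATCH */ }
        // Remaining callbacks left empty for brevity
        @Override public void onReadyForSpeech(Bundle params) {}
        @Override public void onBeginningOfSpeech() {}
        @Override public void onRmsChanged(float rmsdB) {}
        @Override public void onBufferReceived(byte[] buffer) {}
        @Override public void onEndOfSpeech() {}
        @Override public void onPartialResults(Bundle partialResults) {}
        @Override public void onEvent(int eventType, Bundle params) {}
    });

    Intent intent = new Intent(RecognizerIntent.ACTION_RECOGNIZE_SPEECH);
    intent.putExtra(RecognizerIntent.EXTRA_LANGUAGE_MODEL,
            RecognizerIntent.LANGUAGE_MODEL_FREE_FORM);
    // API 23+: prefer the downloaded offline model when available
    intent.putExtra(RecognizerIntent.EXTRA_PREFER_OFFLINE, true);
    recognizer.startListening(intent);
}
```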
At the same time, there are disadvantages. First of all, the system works only in short sessions (about one minute), which is enough to recognize a separate word or even a short sentence. The API also offers little customization: for example, the session duration cannot be changed, and there is no way to build an uninterrupted sequence of sessions, because every session begins with a short audio signal (informing the user that the session has started) during which speech recording is off. This invitation signal cannot be disabled or replaced.
Another significant disadvantage is that only the Android platform is supported. If the project needs applications on several mobile platforms, or on web or desktop platforms, and uses a different SR system on each of them, then the UX, the recognition quality, and the set of supported languages on Android will differ from the other platforms. It should be mentioned that each platform has its own separate API that works on that platform only, whether iPhone or Windows. The SpeechRecognizer API is a reasonable choice if the project targets Android only and there are no very specific requirements demanding flexible customization; otherwise, consider one of the following solutions.
The most widely used client-server solutions are Google Cloud Speech, Microsoft Bing Speech, and the like. This group of solutions requires a stable connection and cannot work offline. They support a wide range of languages and deliver a higher quality of recognition, which also tends to improve constantly. It is also possible to supply a special set of words that the user is expected to pronounce (phrase hints). For example, if the system expects speech about the weather, we can pass in words specific to that topic, which significantly improves the speed and quality of recognition; see the sketch below. However, this type of API has usage limits: there is a free quota of requests or total speech time, but once you exceed it, the API either stops working or you have to pay.
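To make the phrase-hint idea concrete, here is a sketch using the Java client of Google Cloud Speech (reviewed next); the phrase list and the `recognizeWeatherPhrase` helper are our illustrative assumptions:

```java
import com.google.cloud.speech.v1.RecognitionAudio;
import com.google.cloud.speech.v1.RecognitionConfig;
import com.google.cloud.speech.v1.RecognizeResponse;
import com.google.cloud.speech.v1.SpeechClient;
import com.google.cloud.speech.v1.SpeechContext;
import com.google.protobuf.ByteString;

// audioBytes: raw LINEAR16 (16-bit PCM, 16 kHz) audio recorded by the app
RecognizeResponse recognizeWeatherPhrase(byte[] audioBytes) throws Exception {
    try (SpeechClient speechClient = SpeechClient.create()) {
        RecognitionConfig config = RecognitionConfig.newBuilder()
                .setEncoding(RecognitionConfig.AudioEncoding.LINEAR16)
                .setSampleRateHertz(16000)
                .setLanguageCode("en-US")
                // The "expected words" for the weather topic
                .addSpeechContexts(SpeechContext.newBuilder()
                        .addPhrases("weather")
                        .addPhrases("sunny")
                        .addPhrases("forecast")
                        .build())
                .build();
        RecognitionAudio audio = RecognitionAudio.newBuilder()
                .setContent(ByteString.copyFrom(audioBytes))
                .build();
        // Synchronous request: the whole utterance is sent at once
        return speechClient.recognize(config, audio);
    }
}
```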
Google Cloud Speech
Google Cloud Speech recognizes more than 80 languages and variants, the largest number among the reviewed systems. In addition, the API has good recognition quality and is well documented. There are many official demo projects for all popular platforms, which speeds up development in the initial stage. Google Cloud Speech is highly customizable because the developer controls the process at a very low level, from recording the audio to sending requests of several types, and can even add substeps (for example, not sending the recorded audio when the audio level is too low, thereby saving API quota). The flip side of such a low-level API is that more development time is required, even for a basic project.
The API can be used in two ways: with synchronous requests, where a single request covers up to about one minute of audio, or with streaming (asynchronous) requests, where the audio is recorded in small chunks that are sent to the Google cloud one by one, a result being received after each chunk; in this mode the total duration can reach 80 minutes. There are separate limits for each type of API usage.
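A sketch of the second (streaming) mode with the same Java client, following the observer pattern from Google’s official samples (exact class names vary between client versions, and `audioChunks` is an assumption standing in for your recording code):

```java
import com.google.api.gax.rpc.ApiStreamObserver;
import com.google.cloud.speech.v1.RecognitionConfig;
import com.google.cloud.speech.v1.SpeechClient;
import com.google.cloud.speech.v1.StreamingRecognitionConfig;
import com.google.cloud.speech.v1.StreamingRecognizeRequest;
import com.google.cloud.speech.v1.StreamingRecognizeResponse;
import com.google.protobuf.ByteString;

void streamAudio(Iterable<byte[]> audioChunks) throws Exception {
    try (SpeechClient speechClient = SpeechClient.create()) {
        ApiStreamObserver<StreamingRecognizeResponse> responseObserver =
                new ApiStreamObserver<StreamingRecognizeResponse>() {
                    @Override public void onNext(StreamingRecognizeResponse response) {
                        // A result arrives after each chunk; interim results are possible
                    }
                    @Override public void onError(Throwable t) { /* network errors */ }
                    @Override public void onCompleted() { /* stream closed by server */ }
                };

        ApiStreamObserver<StreamingRecognizeRequest> requestObserver =
                speechClient.streamingRecognizeCallable()
                        .bidiStreamingCall(responseObserver);

        // The first request carries only the configuration
        RecognitionConfig config = RecognitionConfig.newBuilder()
                .setEncoding(RecognitionConfig.AudioEncoding.LINEAR16)
                .setSampleRateHertz(16000)
                .setLanguageCode("en-US")
                .build();
        requestObserver.onNext(StreamingRecognizeRequest.newBuilder()
                .setStreamingConfig(StreamingRecognitionConfig.newBuilder()
                        .setConfig(config)
                        .setInterimResults(true)
                        .build())
                .build());

        // Subsequent requests carry the audio in small chunks
        for (byte[] chunk : audioChunks) {
            requestObserver.onNext(StreamingRecognizeRequest.newBuilder()
                    .setAudioContent(ByteString.copyFrom(chunk))
                    .build());
        }
        requestObserver.onCompleted();
    }
}
```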
Microsoft Bing Speech
Microsoft Bing Speech is built on a similar core principle and offers similar quality, but supports fewer languages. The API is also well documented, and there are several official demos.
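For comparison, a sketch of the plain REST variant of Bing Speech; the endpoint, query parameters, and headers below reflect the service as documented at the time of writing and should be verified against the current Microsoft documentation:

```java
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Scanner;

public class BingSpeechSketch {
    public static void main(String[] args) throws Exception {
        byte[] wav = Files.readAllBytes(Paths.get("speech.wav")); // 16 kHz mono PCM WAV

        URL url = new URL("https://speech.platform.bing.com/speech/recognition"
                + "/interactive/cognitiveservices/v1?language=en-US&format=simple");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("POST");
        conn.setRequestProperty("Ocp-Apim-Subscription-Key", "YOUR_SUBSCRIPTION_KEY");
        conn.setRequestProperty("Content-Type",
                "audio/wav; codec=audio/pcm; samplerate=16000");
        conn.setDoOutput(true);
        try (OutputStream out = conn.getOutputStream()) {
            out.write(wav);
        }

        // The response is JSON, e.g. {"RecognitionStatus":"Success","DisplayText":"..."}
        try (Scanner scanner = new Scanner(conn.getInputStream(), "UTF-8")) {
            System.out.println(scanner.useDelimiter("\\A").next());
        }
    }
}
```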
We should also mention additional services related to speech recognition. Google has an open API for text analysis (for extracting information from text and recognizing morphology). Bing, in turn, offers an additional API that can recognize and separate the speech of different speakers when more than one person participates in the conversation. There are many other open APIs related to speech recognition that can be used in your projects.
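For example, text produced by an SR system can be passed to Google’s Natural Language API for entity extraction; a sketch with its Java client (the sample sentence is illustrative, and entity analysis is only one of several analysis types):

```java
import com.google.cloud.language.v1.AnalyzeEntitiesRequest;
import com.google.cloud.language.v1.AnalyzeEntitiesResponse;
import com.google.cloud.language.v1.Document;
import com.google.cloud.language.v1.EncodingType;
import com.google.cloud.language.v1.Entity;
import com.google.cloud.language.v1.LanguageServiceClient;

void analyzeTranscription(String transcribedText) throws Exception {
    try (LanguageServiceClient language = LanguageServiceClient.create()) {
        Document doc = Document.newBuilder()
                .setContent(transcribedText) // e.g. "what is the weather in London"
                .setType(Document.Type.PLAIN_TEXT)
                .build();
        AnalyzeEntitiesResponse response = language.analyzeEntities(
                AnalyzeEntitiesRequest.newBuilder()
                        .setDocument(doc)
                        .setEncodingType(EncodingType.UTF8)
                        .build());
        for (Entity entity : response.getEntitiesList()) {
            // e.g. "London" is recognized as a LOCATION entity
            System.out.println(entity.getName() + " -> " + entity.getType());
        }
    }
}
```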
Open-source speech recognition
There are open-source, already trained solutions (CMU Sphinx, Kaldi, etc.) that are ready for use in your project. They are usually written in C/C++, and there are many ports to other languages and platforms; even if there is no port for your language, you can write one yourself, since the source is open. These solutions are integrated entirely into your project and can therefore be used offline. On embedded systems or phones, however, they have serious disadvantages: recognizing even a single word against the entire dictionary takes a long time, so developers usually select a smaller sub-dictionary (2-100 words) that users are expected to use, as shown in the sketch below, and the language models cannot be updated automatically. If you need to recognize every word available in a model, you have to build a client-server architecture, as Google Speech and Bing Speech do.
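A sketch of the sub-dictionary approach using the pocketsphinx-android port; the model file names come from the official demo assets, and `commands.gram` is an assumed grammar file listing your expected words:

```java
import android.content.Context;

import java.io.File;

import edu.cmu.pocketsphinx.Assets;
import edu.cmu.pocketsphinx.SpeechRecognizer;
import edu.cmu.pocketsphinx.SpeechRecognizerSetup;

// Runs fully offline; call from a background thread because syncAssets() does file I/O
void startOfflineRecognition(Context context) throws Exception {
    Assets assets = new Assets(context);
    File assetsDir = assets.syncAssets(); // copies bundled models to device storage

    SpeechRecognizer recognizer = SpeechRecognizerSetup.defaultSetup()
            .setAcousticModel(new File(assetsDir, "en-us-ptm"))
            .setDictionary(new File(assetsDir, "cmudict-en-us.dict"))
            .getRecognizer();

    // A small grammar search restricted to the expected 2-100 words/commands
    recognizer.addGrammarSearch("commands", new File(assetsDir, "commands.gram"));
    // recognizer.addListener(...); // receive hypotheses via RecognitionListener
    recognizer.startListening("commands");
}
```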
If none of the previously mentioned APIs meets the project’s requirements, there is one last resort: a completely custom solution based on neural networks, with big data used to train the model. With this approach you can satisfy any specific requirement, and for special cases even reach higher recognition quality than Google Speech, or support languages that no other open API supports. It is possible to develop any model or architecture (client-server, embedded, or hybrid, like the Android SpeechRecognizer API). But this solution is not suitable for the general case: you cannot match the results of Google Speech or Bing Speech, because Bing and Google train their systems on far larger amounts of data than is available to anyone else. It should also be understood that this path takes much longer than any of the solutions mentioned earlier, because the models need to be developed, trained, and then supported.
In any case, the final choice rests on the shoulders of the client, under the watchful eye of a technical specialist experienced in implementing speech recognition.