Voice Recording Quality For Speech Recognition – More Key Factors to Consider (Infographic)

As we discussed in our recent post about key parameters and constraints when recording voice calls for speech recognition, obtaining a high-quality call recording is crucial if the recording is to undergo speech recognition, a process that most companies in regulated sectors such as financial services and government undertake to meet their voice call archiving and supervision obligations.

While that post explains the fundamental factors to take into account when recording mobile calls for speech recognition, there are other, more complex aspects that can affect the quality of a speech-to-text output:

[Infographic: Voice Recording Quality For Speech Recognition]

1. Short Files, Long Files, and Batch Transcription

The duration of the voice call recording primarily determines the method of speech recognition that can be used.

  • Short Files – Short call recordings, those with a duration of 1 minute or less, can be processed via synchronous recognition. Because it only handles short audio recordings, synchronous recognition can deliver results immediately after the audio data has been processed.
  • Long Files – Call recordings with a duration of more than 1 minute (up to 180 minutes) can only be processed via asynchronous recognition. This starts a “long running operation”, which usually returns an ID confirming that the request has been enqueued for processing. That ID can be used to check on the status and retrieve the results when done.
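
The duration rule above can be sketched as a small helper. This is a minimal illustration only: the 1-minute and 180-minute thresholds are taken from the text, while the names `RecognitionMethod` and `choose_recognition_method` are hypothetical and do not belong to any particular speech-to-text platform, whose actual limits may differ.

```python
from enum import Enum

SYNC_LIMIT_SECONDS = 60           # "1 minute or less", per the text above
ASYNC_LIMIT_SECONDS = 180 * 60    # "up to 180 minutes", per the text above


class RecognitionMethod(Enum):
    SYNCHRONOUS = "synchronous"
    ASYNCHRONOUS = "asynchronous"


def choose_recognition_method(duration_seconds: float) -> RecognitionMethod:
    """Pick a recognition method from the recording duration (hypothetical helper)."""
    if duration_seconds <= 0:
        raise ValueError("duration must be positive")
    if duration_seconds <= SYNC_LIMIT_SECONDS:
        return RecognitionMethod.SYNCHRONOUS
    if duration_seconds <= ASYNC_LIMIT_SECONDS:
        return RecognitionMethod.ASYNCHRONOUS
    # Longer recordings would need to be split before submission.
    raise ValueError("recording exceeds the 180-minute limit")
```

For example, a 45-second voicemail would be routed to synchronous recognition, while a 30-minute call would be routed to asynchronous recognition.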

On most speech-to-text platforms, both synchronous and asynchronous recognition can be applied to audio files located on premises or in cloud storage.

Other speech recognition tools, such as Microsoft Azure, also support Batch Transcription, which is also a form of asynchronous recognition but with additional features, such as creating batch processing requests, querying their status, and downloading transcriptions.
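
The submit, check status, and retrieve pattern shared by asynchronous recognition and batch transcription might look roughly like the sketch below. It does not call any real platform: `FakeTranscriptionService` is a hypothetical in-memory stand-in, and `submit`, `get_status`, and `wait_for_transcript` are illustrative names, not a vendor API.

```python
import time
import uuid


class FakeTranscriptionService:
    """Hypothetical in-memory stand-in for a speech-to-text backend."""

    def __init__(self, polls_until_done: int = 3):
        self._polls_until_done = polls_until_done
        self._jobs = {}

    def submit(self, audio_uri: str) -> str:
        # Enqueue the request and hand back an operation ID immediately.
        operation_id = str(uuid.uuid4())
        self._jobs[operation_id] = {"uri": audio_uri, "polls": 0}
        return operation_id

    def get_status(self, operation_id: str) -> dict:
        # Pretend the job completes after a fixed number of status checks.
        job = self._jobs[operation_id]
        job["polls"] += 1
        if job["polls"] >= self._polls_until_done:
            return {"done": True, "transcript": f"(transcript of {job['uri']})"}
        return {"done": False}


def wait_for_transcript(service, operation_id, interval_seconds=1.0, max_polls=60):
    """Poll the operation until it finishes or the poll budget runs out."""
    for _ in range(max_polls):
        status = service.get_status(operation_id)
        if status["done"]:
            return status["transcript"]
        time.sleep(interval_seconds)
    raise TimeoutError(f"operation {operation_id} did not finish in time")
```

The caller keeps only the returned ID between checks, which is what lets long recordings be processed without holding a connection open.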

2. Recognition of Different Speakers, Channels, and Noise Filtering

Accurate recognition of every speaker in a voice call recording, correct handling of multiple channels, and effective filtering of background noise are all crucial to ensuring a quality speech-to-text output.

To recognize multiple speakers in a single voice call recording, speech-to-text recognition platforms such as those of Google use a technique called speaker diarization. This process attempts to distinguish the different voices included in the audio sample, which helps produce a transcription where each word is tagged with the number assigned to its speaker.
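
To show how word-level speaker numbers can be turned into a readable transcript, here is a minimal sketch. The input format, a list of `(word, speaker_tag)` pairs, is an assumption made for illustration; real platforms return richer structures, and `group_into_turns` is a hypothetical helper.

```python
from itertools import groupby


def group_into_turns(tagged_words):
    """Merge consecutive words that share a speaker tag into one turn."""
    turns = []
    for speaker, run in groupby(tagged_words, key=lambda pair: pair[1]):
        turns.append((speaker, " ".join(word for word, _ in run)))
    return turns


# Hypothetical diarized output: each word carries its speaker's number.
words = [("hello", 1), ("there", 1), ("hi", 2), ("how", 1), ("are", 1), ("you", 1)]

for speaker, text in group_into_turns(words):
    print(f"Speaker {speaker}: {text}")
```

Grouping only consecutive words keeps the conversational order intact, so a speaker who talks twice appears as two separate turns.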

To transcribe audio data with multiple channels, which is the case for most voice call recordings, the number of channels present in the audio data must be manually provided to the speech recognition tool.
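
Rather than guessing the channel count, it can often be read from the file header before the request is built. The sketch below assumes the recording is available as a WAV file, since Python's standard `wave` module does not read MP3; `count_channels` and `make_silent_wav` are hypothetical helper names, and the second exists only to produce demonstration data.

```python
import io
import wave


def count_channels(wav_file) -> int:
    """Return the channel count stored in a WAV file's header."""
    with wave.open(wav_file, "rb") as reader:
        return reader.getnchannels()


def make_silent_wav(channels: int, sample_rate: int = 8000, frames: int = 80) -> io.BytesIO:
    """Build a small in-memory 16-bit WAV file of silence for demonstration."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as writer:
        writer.setnchannels(channels)
        writer.setsampwidth(2)              # 2 bytes per sample = 16 bit
        writer.setframerate(sample_rate)
        writer.writeframes(b"\x00\x00" * channels * frames)
    buffer.seek(0)                          # rewind so the file can be read back
    return buffer
```

Calling `count_channels(make_silent_wav(channels=2))` reads back the value that would then be passed to the speech recognition tool.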

Most speech recognition tools are also designed to ignore background noise in any audio data, though excessive noise can reduce the accuracy of the output, especially if a lossy codec is used for the audio recording.

3. Language Detection and Supported Languages

Another factor that can affect the quality of a speech-to-text output is the tool’s language detection capability. For organizations whose voice calls can contain speech in a variety of languages, selecting a speech-to-text solution that recognizes multiple languages is the best way to ensure that all voice call recordings are processed correctly.

4. Recorded File Stream Recognition

It is also possible to process the audio of a phone call while it takes place in real time, though most speech-to-text tools only support this feature using RPC request styles (whereas synchronous and asynchronous recognition support both REST and RPC).

Using this operation, you can stream voice call audio data to the speech-to-text tool and receive a stream of speech recognition results in real time as the audio is processed.
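
On the sending side, streaming usually means cutting the audio into small, fixed-duration pieces. The generator below is a sketch of that step only, under assumed parameters (mono, 16-bit PCM at 8 kHz, 100 ms per chunk) that are not taken from any specific platform; `audio_chunks` is a hypothetical name and the actual RPC call is left out.

```python
from typing import Iterator


def audio_chunks(pcm: bytes, sample_rate: int = 8000, sample_width: int = 2,
                 chunk_ms: int = 100) -> Iterator[bytes]:
    """Yield successive fixed-duration chunks of mono PCM audio."""
    # Bytes per chunk = samples per second * bytes per sample * chunk duration.
    chunk_size = sample_rate * sample_width * chunk_ms // 1000
    if chunk_size <= 0:
        raise ValueError("chunk size must be at least one byte")
    for start in range(0, len(pcm), chunk_size):
        yield pcm[start:start + chunk_size]
```

With the assumed defaults each chunk is 1,600 bytes, so one second of audio is sent as ten requests; the final chunk may be shorter when the audio does not divide evenly.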

5. Audio Pre-Processing

In cases where the recorded voice call has poor audio quality (significant noise in the background, weak input from speakers), pre-processing of the audio is often warranted.

However, some speech-to-text platforms can automatically optimize noisy audio data without requiring additional noise processing, so check the tool you plan to use before sending your voice call recordings for audio pre-processing.
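
As one simple example of pre-processing, weak input from speakers can be raised with peak normalization. The sketch below is not noise reduction and is not a method described by any particular platform; it assumes signed 16-bit PCM samples, and `normalize_peak` and its default target (about 90% of full scale) are illustrative choices.

```python
from array import array


def normalize_peak(samples: array, target_peak: int = 29490) -> array:
    """Scale signed 16-bit samples so the loudest one reaches target_peak."""
    peak = max((abs(s) for s in samples), default=0)
    if peak == 0:
        return array("h", samples)          # silence: nothing to scale
    gain = target_peak / peak
    scaled = (int(round(s * gain)) for s in samples)
    # Clamp to the signed 16-bit range to avoid overflow.
    return array("h", (max(-32768, min(32767, s)) for s in scaled))
```

Because the same gain is applied to every sample, background noise is amplified along with the speech, which is one reason to check first whether the chosen platform already handles quiet or noisy audio on its own.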

Again, using an enterprise mobile archiving solution that can capture and record voice calls in high quality can make the speech recognition process much faster. At TeleMessage, we currently capture and record voice calls as MP3 files in mono (one channel), 16 bit at 32 kbps. Businesses that intend to use the recordings for further processing, such as speech recognition, can request higher-quality formats.

To learn more about TeleMessage and our Mobile Archiver capabilities, visit our website today at www.telemessage.com
