Voice Recording for Speech Recognition – Key Parameters and Constraints (Infographic)


Most industry regulators require organizations to capture and record telephone conversations as listenable audio files in which words and phrases can be heard and understood clearly.

While seemingly straightforward, compliant recording of voice calls is a considerable task given the unique nature of audio data. Compliance teams often rely on tools that search recordings by call metadata, such as who made and received the call and its date and time, and then listen to a sample of the results.

In sectors such as financial services, where a massive number of phone calls is generated daily, monitoring and supervising those calls for fraud detection and risk management can be challenging. To address this, many financial firms use speech analytics and speech-to-text tools to track keywords and phrases for more efficient supervision and surveillance.

Companies that use these systems need higher-quality audio files of the captured voice calls. Read on as this infographic details the vital technical factors to consider for compliant voice recording for speech recognition.

Voice Recording for Speech Recognition

1. Sample Rate

For the uninitiated, sample rate refers to how “fast” samples are taken. It is measured in samples per second and usually expressed in kilohertz (kHz), a unit meaning 1,000 times per second. In general, the higher the sample rate, the better the audio quality.

From a call recording perspective, the sample rate of the audio typically depends on the capabilities of the mobile network, i.e., the frequency range of audio signals the connection can carry, as well as on the sound recording capabilities of the mobile device. For instance, a mobile carrier with 4G (LTE) capabilities can deliver an audio bandwidth of 7 kHz, or even up to 22 kHz, yielding noticeably clearer audio than the roughly 3,100 Hz bandwidth of a traditional telephone connection.
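The link between bandwidth and sample rate follows from the Nyquist theorem: the sample rate must be at least twice the highest frequency you want to capture. Here is a minimal Python sketch using the bandwidth figures quoted above (real codecs round up to standard rates such as 8, 16, or 48 kHz):

def min_sample_rate(bandwidth_hz: float) -> float:
    """Nyquist: sampling must run at >= 2x the highest audio frequency."""
    return 2 * bandwidth_hz

# Bandwidth figures quoted above (approximate, network-dependent)
for label, bw in [("Narrowband PSTN", 3100),
                  ("Wideband (HD Voice)", 7000),
                  ("Super-wideband LTE", 22000)]:
    print(f"{label}: needs >= {min_sample_rate(bw) / 1000:.1f} kHz sample rate")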

2. Bit Rate and Bit Depth

Bit rate, meanwhile, refers to the number of bits processed per unit of time. Measured in kilobits per second (kbps), a higher bit rate means a bigger file size and better audio quality.

A related term, bit depth, sometimes referred to as audio resolution, describes the resolution of the sound data captured and stored in an audio file. Audio can be recorded at 16-bit, 24-bit, 32-bit, or even 64-bit; the higher the bit depth, the better the audio quality.

While a low bit rate and bit depth are acceptable for simply recording voice calls, a higher bit rate is required for accurate speech recognition processing.
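For uncompressed PCM audio, the raw bit rate follows directly from the parameters above: bit rate = sample rate × bit depth × channels. A quick sketch with illustrative values:

def pcm_bit_rate_kbps(sample_rate_hz: int, bit_depth: int, channels: int) -> float:
    """Raw (uncompressed) PCM bit rate in kilobits per second."""
    return sample_rate_hz * bit_depth * channels / 1000

# Narrowband mono call audio: 8 kHz, 16-bit, 1 channel
print(pcm_bit_rate_kbps(8000, 16, 1))   # 128.0 kbps
# Wideband mono audio often fed to speech recognition: 16 kHz, 16-bit
print(pcm_bit_rate_kbps(16000, 16, 1))  # 256.0 kbps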

When a call takes place, an enterprise archiving platform receives two streams of audio: one from the person who initiated the call and one from the person receiving it. The two streams may have different sample rates if the parties are on different networks. The platform merges them into a single file, which can be saved at different bit rates, bit depths, and formats, depending on the purpose it will serve for the business; a simplified sketch of that merge step follows.
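To make the merge step concrete, here is a deliberately simplified Python sketch: it resamples both call legs to a common rate with naive linear interpolation and averages them into one mono signal. The function names and the 16 kHz target are illustrative assumptions; production platforms use proper resampling filters and may keep the legs in separate stereo channels instead of mixing them.

def resample_linear(samples: list[float], src_rate: int, dst_rate: int) -> list[float]:
    """Naive linear-interpolation resampler (real systems use filter-based resampling)."""
    if src_rate == dst_rate:
        return samples[:]
    out_len = int(len(samples) * dst_rate / src_rate)
    out = []
    for i in range(out_len):
        pos = i * src_rate / dst_rate   # position in the source signal
        j = int(pos)
        frac = pos - j
        nxt = samples[j + 1] if j + 1 < len(samples) else samples[j]
        out.append(samples[j] * (1 - frac) + frac * nxt)
    return out

def mix_legs(leg_a, rate_a, leg_b, rate_b, target_rate=16000):
    """Bring both call legs to one sample rate, then average them into a mono mix."""
    a = resample_linear(leg_a, rate_a, target_rate)
    b = resample_linear(leg_b, rate_b, target_rate)
    n = max(len(a), len(b))
    a += [0.0] * (n - len(a))   # pad the shorter leg with silence
    b += [0.0] * (n - len(b))
    return [(x + y) / 2 for x, y in zip(a, b)]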

3. Encoding Format

Audio encoding refers to the way audio data is stored and transmitted. Depending on the call recording platform, voice recordings can be stored using two broad types of codec: uncompressed and compressed.

  • Uncompressed audio is mainly found in the Pulse Code Modulation (PCM) format of audio CDs. Generally, audio encoding means going from uncompressed PCM to some kind of compressed audio format; uncompressed files require significantly more storage space.
  • Compressed audio is split into two groups, lossless and lossy:
  1. Lossless compression shrinks digital audio data using complex rearrangements of the stored data, with no degradation in the quality of the original digital sample. This is useful for archiving voice recordings at the highest quality possible, and for companies for whom storage space is not an issue.
  2. Lossy compression eliminates certain types of information during the construction of the compressed audio data, hence the term “lossy.” The sketch after this list demonstrates the difference.
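The defining property of lossless compression is a bit-exact round trip. The sketch below uses Python's general-purpose zlib, not an audio codec such as FLAC (which adds audio-specific prediction on top of similar entropy coding), purely to demonstrate that property on a stand-in byte string:

import zlib

# Stand-in for raw 16-bit PCM bytes pulled from a recording
pcm = bytes(range(256)) * 1000

compressed = zlib.compress(pcm, level=9)
restored = zlib.decompress(compressed)

assert restored == pcm  # lossless: every original byte is recovered exactly
print(f"compressed size: {len(compressed) / len(pcm):.2%} of original")
# A lossy codec (MP3, AAC) would never pass this byte-equality check:
# it discards perceptually less important detail to shrink the file.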

4. File Format

File format refers to the file container that holds one or more codecs. A thorough understanding of sample rate and bit rate is required when choosing the right format for voice call recordings. Commonly used audio formats include the following (a small WAV-writing sketch follows the list):

  • MP3 – The MP3 format is a commonly used lossy compression audio format. It reduces file size by omitting data from the file, using perceptual audio coding and psychoacoustic compression to keep the quality as close to the original as possible.
  • WAV – The WAV audio format stores uncompressed audio data on Windows computers. It is based on the RIFF bitstream format method of storing data.
  • AIFF – The Audio Interchange File Format (AIFF) is an uncompressed audio format commonly used for storing audio data on Apple Macintosh systems.
  • AAC – The Advanced Audio Coding (AAC) format is another lossy compression audio format, developed as the successor to MP3. It offers better audio quality than MP3 at smaller file sizes.
  • WMA – The Windows Media Audio (WMA) format is a lossy compression audio format designed by Microsoft to compete against the MP3.
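To make the relationship between container and parameters concrete, here is a minimal sketch using Python's standard wave module to write a WAV file with explicit channel, bit-depth, and sample rate settings. The one second of silence and the mono/16-bit/16 kHz values are purely illustrative:

import wave

# Illustrative parameters: mono, 16-bit, 16 kHz (a common speech recognition input format)
CHANNELS, SAMPLE_WIDTH_BYTES, SAMPLE_RATE = 1, 2, 16000

with wave.open("call_recording.wav", "wb") as wav:
    wav.setnchannels(CHANNELS)             # mono = one channel
    wav.setsampwidth(SAMPLE_WIDTH_BYTES)   # 2 bytes per sample = 16-bit depth
    wav.setframerate(SAMPLE_RATE)          # samples per second
    wav.writeframes(b"\x00\x00" * SAMPLE_RATE)  # one second of silence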

Selecting an enterprise mobile archiving solution that can capture and record voice calls in high quality can make the speech recognition process much faster. At TeleMessage, we currently capture and record voice calls as mono (one-channel) MP3 files at 16-bit, 32 kbps. Businesses that intend to use the recordings for further processing, such as speech recognition, can request higher-quality formats.

To learn more about TeleMessage and our Mobile Archiver capabilities, visit our website today at www.telemessage.com
