Speech is how we express our feelings, thoughts, ideas, and opinions. Oral communication is common to humans and many other living creatures. Unlike written text, speech is gone once it has been delivered, and we cannot hear it again unless it was recorded. To preserve it, we need to record and store speech for future use.
Once speech is recorded, retrieving the right audio file is still challenging, because the recordings come from different sources, were made on different equipment, and span different time periods. In this blog, we will explore one system built to overcome these challenges.
Evolution of SpeechFind
To accomplish this, we have a system called SpeechFind. It is an online spoken document retrieval system, designed specifically to index, search, and retrieve historical audio recordings from the National Gallery of the Spoken Word (NGSW). The work was supported by the US National Science Foundation.
National Gallery of the Spoken Word (NGSW)
NGSW is the first large-scale repository of its kind. It holds nearly 60,000 hours of audio recordings spanning the 20th century, including speeches, news broadcasts, and recordings of historical events. The US National Science Foundation supported an effort to make library collections like this one accessible in digital form. To that end, Michigan State University (MSU) and the University of Colorado Boulder teamed up and split the roles and responsibilities between them.
MSU’s Roles and Responsibilities
The primary tasks of MSU are:
- Digitizing the audio recordings
- Organizing the catalogues
- Providing meta-tagging for audio content
- Developing compression strategies
- Applying digital watermarking
University of Colorado Boulder’s Roles and Responsibilities
The fundamental responsibility of the University of Colorado Boulder is to develop robust automatic speech recognition for transcript generation and a prototype audio/metadata/transcript-based user search engine called SpeechFind.
SpeechFind focuses on generating a transcript of each audio recording and performing text-based searches on that transcript. The transcripts are tied to a time-stamped reverse index, so a search can take the user to the exact segment of the speech.
Transcribing NGSW recordings is harder than transcribing ordinary voicemail or mobile phone conversations. The recordings pose many challenges, such as archaic vocabulary, advertisements, and background noise.
Overview of SpeechFind System
SpeechFind contains four modules, each with a specific function:
- Audio Spider and Transcoder module
- Spoken Document Transcriber module
- Linked Files module
- Online Search Engine module
Audio Spider and Transcoder Module
This module automatically fetches audio recordings from various servers. The recordings may arrive in different formats, so the module identifies the format of each recording and converts it to a uniform 16 kHz, 16-bit format. It also separates any metadata from the audio recording and saves it to a transcript database.
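The conversion step can be illustrated with a small Python sketch. It is not SpeechFind's transcoder: the function names are hypothetical, the input is assumed to be mono samples that have already been decoded, and plain linear interpolation stands in for a proper resampler (it applies no anti-aliasing filter).

```python
import io
import struct
import wave

TARGET_RATE = 16_000  # 16 kHz, as described in the text


def resample_linear(samples, src_rate, dst_rate=TARGET_RATE):
    """Resample mono integer samples by linear interpolation (illustrative only)."""
    if not samples or src_rate == dst_rate:
        return list(samples)
    n_out = max(1, round(len(samples) * dst_rate / src_rate))
    out = []
    for i in range(n_out):
        pos = i * src_rate / dst_rate           # position in the source signal
        left = min(int(pos), len(samples) - 1)
        right = min(left + 1, len(samples) - 1)
        frac = pos - left
        value = samples[left] * (1 - frac) + samples[right] * frac
        # Clamp to the signed 16-bit range.
        out.append(max(-32768, min(32767, round(value))))
    return out


def to_wav_bytes(samples, rate=TARGET_RATE):
    """Pack samples as a mono, 16-bit PCM WAV file held in memory."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)   # 2 bytes per sample = 16-bit
        wav.setframerate(rate)
        wav.writeframes(struct.pack(f"<{len(samples)}h", *samples))
    return buffer.getvalue()


uniform = resample_linear([0, 100, 200, 300], src_rate=8_000)
wav_data = to_wav_bytes(uniform)
```

A production system would more likely rely on a dedicated audio tool for decoding and resampling; the sketch only shows what "uniform 16 kHz, 16-bit" means in practice.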
Spoken Document Transcriber Module
It contains two components: an audio segmenter and a transcriber. The audio segmenter splits the audio into smaller segments by identifying speaker, channel, and environmental change points. The transcriber then produces a text transcription for each segment.
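To give a feel for change-point segmentation, here is a toy Python sketch. It is a stand-in, not the segmenter SpeechFind uses: it only looks at frame energy, whereas a real segmenter compares richer acoustic features. The function names, window size, and ratio threshold are all assumptions made for illustration.

```python
def frame_energies(samples, frame_len):
    """Mean squared amplitude of each non-overlapping frame."""
    return [
        sum(s * s for s in samples[i:i + frame_len]) / frame_len
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]


def change_points(energies, window=3, ratio=4.0):
    """Frame indices where average energy jumps or drops by more than `ratio`.

    For each index, the mean of the `window` frames before it is compared with
    the mean of the `window` frames from it onward. Within a run of indices
    that exceed the threshold, only the strongest one is reported.
    """
    points = []
    best_i, best_score = None, 0.0
    for i in range(window, len(energies) - window + 1):
        before = sum(energies[i - window:i]) / window
        after = sum(energies[i:i + window]) / window
        low, high = sorted((before, after))
        score = high / max(low, 1e-9)
        if score > ratio:
            if score > best_score:
                best_i, best_score = i, score
        elif best_i is not None:
            points.append(best_i)
            best_i, best_score = None, 0.0
    if best_i is not None:
        points.append(best_i)
    return points


def split_segments(n_frames, points):
    """Turn change points into (start_frame, end_frame) segments."""
    bounds = [0, *points, n_frames]
    return [(a, b) for a, b in zip(bounds, bounds[1:]) if b > a]


# Six quiet frames followed by six loud frames.
energies = [1, 1, 1, 1, 1, 1, 100, 100, 100, 100, 100, 100]
segments = split_segments(len(energies), change_points(energies))
```

Each resulting segment would then be handed to the transcriber on its own.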
If a human transcription is available, the segmenter finds the speaker, channel, and environmental changes in a guided manner. It also acts as a forced aligner, matching the given text transcription to the audio.
The transcriber relies on an acoustic model and a language model. The acoustic model relates the audio signal to speech sounds, which is essential for telling speech apart from background sound. The language model is used to choose the most likely words given the time period and genre of the audio.
In simple terms, the audio stream is given as input to an acoustic model and a language model, which produce the text transcript as output.
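How the two models work together can be sketched in a few lines of Python. This is a simplified illustration, not SpeechFind's recognizer: the probabilities below are made up, and `best_word` and `lm_weight` are hypothetical names. The idea shown is the common one of adding the log acoustic score to a weighted log language-model score and keeping the best candidate.

```python
import math


def best_word(acoustic_scores, lm_probs, lm_weight=1.0):
    """Pick the candidate with the highest combined log score.

    acoustic_scores: how well each candidate word matches the audio.
    lm_probs: how likely each candidate is given the preceding words.
    """
    def total(word):
        acoustic = math.log(acoustic_scores[word])
        language = math.log(lm_probs.get(word, 1e-6))  # small floor for unseen words
        return acoustic + lm_weight * language

    return max(acoustic_scores, key=total)


# Made-up numbers: three words that sound alike, heard after the word "civil".
acoustic = {"right": 0.40, "write": 0.45, "rite": 0.15}
after_civil = {"right": 0.20, "write": 0.01, "rite": 0.001}

with_lm = best_word(acoustic, after_civil)                    # "right"
without_lm = best_word(acoustic, after_civil, lm_weight=0.0)  # "write"
```

On sound alone, "write" scores slightly higher, but the language model makes "right" the more plausible choice in context. A language model matched to the time period of the recording shifts these probabilities in the same way.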
Linked Files Module
To make the search more reliable after transcription, each audio stream is linked with three associated files:
- The audio stream (.wav format)
- The transcript file (.trs format)
- The extended archive descriptor (.ead format)
The .wav format stores uncompressed raw audio data at high quality, but it takes up a large amount of storage space.
The .trs format is a type of XML file designed to pair each audio segment with its text transcript and timing. It is the format used by the Transcriber annotation tool.
The .ead file is an archive descriptor that carries descriptive metadata about the recording. Together with the time-aligned transcript, it connects the catalogued item to the exact point in the audio.
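One simple way to keep the three files together is to derive them from a shared base name. The Python sketch below assumes that naming convention; the source does not say how SpeechFind actually links the files, and `LinkedFiles` and `linked_files` are hypothetical names.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath


@dataclass(frozen=True)
class LinkedFiles:
    audio: PurePosixPath        # .wav audio stream
    transcript: PurePosixPath   # .trs transcript
    descriptor: PurePosixPath   # .ead archive descriptor


def linked_files(audio_path):
    """Derive the sibling .trs and .ead paths from a .wav path (assumed convention)."""
    audio = PurePosixPath(audio_path)
    return LinkedFiles(
        audio=audio,
        transcript=audio.with_suffix(".trs"),
        descriptor=audio.with_suffix(".ead"),
    )


files = linked_files("ngsw/speech_001.wav")
```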
Each audio stream has a reverse-index word histogram. Stop words are removed from it, and the histogram is used by the natural language processing text search engine.
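A reverse index of this kind can be sketched in Python as follows. The stop-word list, the tokenizer, and the function names are assumptions made for illustration; the source does not list the stop words SpeechFind removes.

```python
import re
from collections import Counter, defaultdict

# A tiny illustrative stop-word list (assumption, not SpeechFind's list).
STOP_WORDS = frozenset({"the", "a", "an", "or", "and", "of", "to", "in", "is"})


def word_histogram(text):
    """Count the words in a transcript, skipping stop words."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(w for w in words if w not in STOP_WORDS)


def build_reverse_index(transcripts):
    """Map each word to the audio streams that contain it, with counts."""
    index = defaultdict(dict)
    for stream_id, text in transcripts.items():
        for word, count in word_histogram(text).items():
            index[word][stream_id] = count
    return dict(index)


transcripts = {
    "a.wav": "The moon and the stars",
    "b.wav": "To the moon, to the moon",
}
index = build_reverse_index(transcripts)
# index["moon"] lists both streams; "the" never enters the index.
```

Looking a word up in the index gives the list of audio streams to consider, which is much cheaper than scanning every transcript for every query.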
Online Search Engine Module
This module is responsible for all information retrieval tasks. Its functionality is divided into two parts: the front end and the back end.
1- Role of Front-End
It is a web-based interface where the user types a query, that is, the text they expect to appear in the audio transcript. The front end acts as an intermediary between the user and the back-end process, giving the end user a friendly way to find what they need.
2- Role of Back-End
The back end receives the user query from the front end and executes it. When the retrieval command is launched, it searches the transcripts for the user-entered text and assigns each match a relevance score that reflects how well it fits the query. It then pairs the ranked matches with their timing information. Finally, it returns web links that let the user listen to the exact part of the audio they want.
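The back-end flow of query, relevance score, and timing can be sketched in Python. This is a minimal illustration under stated assumptions: the scoring rule (the share of query terms found in a segment) is a simple stand-in for whatever ranking SpeechFind applies, and the segment data, stop-word list, and function names are hypothetical.

```python
import re

# Illustrative stop-word list (assumption).
STOP_WORDS = frozenset({"the", "a", "an", "or", "and", "of", "to", "in", "is"})


def terms(text):
    """Lowercase a text and return its words without stop words."""
    return [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOP_WORDS]


def search(query, segments, top_k=5):
    """Rank time-stamped transcript segments against a text query.

    segments: iterable of (stream_id, start_seconds, transcript_text).
    Returns (score, stream_id, start_seconds) tuples, best match first.
    """
    query_terms = set(terms(query))
    if not query_terms:
        return []
    results = []
    for stream_id, start, text in segments:
        matched = query_terms & set(terms(text))
        if matched:
            score = len(matched) / len(query_terms)
            results.append((score, stream_id, start))
    # Highest score first; ties broken by stream name, then start time.
    results.sort(key=lambda r: (-r[0], r[1], r[2]))
    return results[:top_k]


segments = [
    ("apollo.wav", 0.0, "We choose to go to the moon"),
    ("apollo.wav", 12.5, "not because they are easy"),
    ("fireside.wav", 3.0, "the moon landing and the space program"),
]
hits = search("moon landing", segments)
```

Each result carries the start time of the matching segment, which is what lets the front end open the audio at the right position instead of at the beginning.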
Because of copyright and disk space constraints, many of the audio collections remain on web servers rather than being copied into SpeechFind. MSU digitizes a number of the audio files, which SpeechFind then accesses.
Advantages of SpeechFind
SpeechFind offers several technical and functional advantages for audio search.
- It makes finding a specific recording much faster than listening through the audio by hand.
- It makes old audio files easier to access, because they can be searched through their transcripts.
- It is designed to cope with difficult recordings, such as those with background noise.
- It provides the time index for the search result, so the user can jump directly to the exact portion without listening to the entire audio.
Summary
So far, we have walked through what SpeechFind is and how it works. In short, it makes historical audio recordings searchable. With the right keywords, we can find the audio we need efficiently and in a well-structured, organized manner.