Building a Real-Time Speech Transcription System with Vosk

3 min read · Aug 26, 2024

Introduction

Real-time speech transcription has become an essential tool across industries, from medical applications to customer service. Whether you’re building a voice-controlled application or simply need an accurate speech-to-text solution, the open-source Vosk speech recognition toolkit provides an efficient way to transcribe speech in real time. In this post, we’ll explore how to set up Vosk and use it for transcription in your projects.

Why Vosk

Vosk is an offline, open-source speech recognition toolkit that supports multiple languages. Unlike cloud-based services, which require an internet connection and send audio to a remote server, Vosk runs entirely on the local machine, making it well suited to edge devices and applications that need low latency.

Some of the key features of Vosk include:

High accuracy: Offers pre-trained models for more than 20 languages, with larger models available when accuracy matters more than footprint.
Low resource consumption: Works efficiently on low-end devices, making it suitable for embedded systems and mobile applications.
Real-time processing: Capable of transcribing speech in real time, which is crucial for applications like voice assistants and real-time translation.
Flexible integration: Easily integrates with Python, Node.js, and other popular programming languages, allowing you to incorporate it into various projects.

Setting Up Vosk

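Before we dive into the code, let’s set up the environment. The first step is to install the Vosk package, along with PyAudio, which the script below uses to read from the microphone. Both can be installed with pip:

pip install vosk pyaudio

On Linux, PyAudio may also need the PortAudio development headers from your system package manager before the install succeeds.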

Next, you’ll need to download a pre-trained language model. Vosk provides models for many languages; for this tutorial, we’ll use the small US English model:

wget https://alphacephei.com/vosk/models/vosk-model-small-en-us-0.15.zip
unzip vosk-model-small-en-us-0.15.zip

Once the model is downloaded and unzipped, you’re ready to start building your transcription application.
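If wget and unzip are not available (on Windows, for example), the same step can be done from Python with the standard library. The following is a minimal sketch, not an official Vosk utility: the function names fetch_model and unpack_model are hypothetical, and the code assumes the archive contains a single top-level folder, as the unzip step above suggests.

```python
import urllib.request
import zipfile
from pathlib import Path

# Same archive as in the wget command above.
MODEL_URL = "https://alphacephei.com/vosk/models/vosk-model-small-en-us-0.15.zip"


def unpack_model(zip_path, dest_dir="."):
    """Extract a model archive and return the path of its top-level folder."""
    with zipfile.ZipFile(zip_path) as archive:
        # Assumption: every entry lives under one top-level folder.
        top_level = {name.split("/")[0] for name in archive.namelist() if name.strip("/")}
        if len(top_level) != 1:
            raise ValueError(f"expected one top-level folder, found {sorted(top_level)}")
        archive.extractall(dest_dir)
    return Path(dest_dir) / top_level.pop()


def fetch_model(url=MODEL_URL, dest_dir="."):
    """Download the archive into dest_dir (if not already there) and unpack it."""
    zip_path = Path(dest_dir) / url.rsplit("/", 1)[-1]
    if not zip_path.exists():
        urllib.request.urlretrieve(url, str(zip_path))
    return unpack_model(zip_path, dest_dir)


# Usage (downloads the archive on first run):
# model_dir = fetch_model()
# print(model_dir)
```

The returned path is what you would later pass to Model in the transcription script.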

Writing the Transcription Script

Let’s create a simple Python script to transcribe audio in real time using Vosk. The following code will capture audio from your microphone and transcribe it on the fly:

import os
import sys
import json
import pyaudio
from vosk import Model, KaldiRecognizer

# Path to the unzipped model directory (adjust if you extracted it elsewhere)
model_path = "vosk-model-small-en-us-0.15"

# Load the Vosk model
if not os.path.exists(model_path):
    print(f"Model path '{model_path}' does not exist")
    sys.exit(1)

model = Model(model_path)
recognizer = KaldiRecognizer(model, 16000)

# Initialize PyAudio
audio = pyaudio.PyAudio()
stream = audio.open(format=pyaudio.paInt16, channels=1, rate=16000, input=True, frames_per_buffer=8192)
stream.start_stream()

print("Listening...")

try:
    while True:
        # Read 4096 frames; don't raise if the input buffer overflows
        data = stream.read(4096, exception_on_overflow=False)
        if len(data) == 0:
            break
        if recognizer.AcceptWaveform(data):
            result = recognizer.Result()
            text = json.loads(result)["text"]
            print(f"Recognized: {text}")
        else:
            partial = json.loads(recognizer.PartialResult())["partial"]
            if partial:  # skip empty partials during silence
                print(f"Partial: {partial}")
except KeyboardInterrupt:
    print("Terminating...")
finally:
    stream.stop_stream()
    stream.close()
    audio.terminate()

Understanding the Code

Loading the Model: The Vosk model is loaded using the Model class. This model is then passed to the KaldiRecognizer, which is responsible for recognizing speech from the audio stream. The second argument, 16000, is the sample rate in Hz and must match the rate of the audio stream.
Audio Capture: We use the PyAudio library to capture audio from the microphone. Each call to stream.read reads 4096 frames, which is 8192 bytes of 16-bit mono audio or about a quarter of a second at 16 kHz. Each chunk is then passed to the recognizer.
Speech Recognition: The recognizer.AcceptWaveform(data) function processes the audio data and returns True once the recognizer has detected the end of an utterance, typically at a pause in speech. At that point, recognizer.Result() returns the finished transcription as a JSON string, which is then parsed and printed.
Partial Results: While an utterance is still in progress, AcceptWaveform returns False and recognizer.PartialResult() provides the text recognized so far. These partial results are useful for real-time feedback, such as showing users what they are saying as they speak.
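The result handling described above can be isolated into a small helper, which also makes it easy to skip empty or repeated partials. This is a sketch rather than part of Vosk: extract_text and should_print are hypothetical names, and the sample strings are hand-written to mimic the "text" and "partial" keys the script parses, not captured recognizer output.

```python
import json


def extract_text(raw_json):
    """Return (kind, text) from a recognizer result string.

    kind is "final" for payloads with a "text" key (Result/FinalResult)
    and "partial" for payloads with a "partial" key (PartialResult).
    """
    payload = json.loads(raw_json)
    if "text" in payload:
        return "final", payload["text"].strip()
    return "partial", payload.get("partial", "").strip()


def should_print(kind, text, last_partial):
    """Skip empty results and partials that repeat the previous one."""
    if not text:
        return False
    return kind == "final" or text != last_partial


# Hand-written sample payloads standing in for recognizer output.
samples = [
    '{"partial": ""}',
    '{"partial": "hello"}',
    '{"partial": "hello"}',
    '{"partial": "hello world"}',
    '{"text": "hello world"}',
]

last_partial = ""
for raw in samples:
    kind, text = extract_text(raw)
    if should_print(kind, text, last_partial):
        print(f"{kind}: {text}")
    # Remember the latest partial; reset once an utterance is finalized.
    last_partial = text if kind == "partial" else ""
```

In the microphone loop, the same two helpers could wrap the strings returned by recognizer.Result() and recognizer.PartialResult() so the console only updates when the text actually changes.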

Use Cases

Vosk’s flexibility makes it suitable for various applications:

Voice-controlled applications: Integrate speech recognition into your voice assistant or home automation system.
Transcription services: Build a transcription tool for meetings, lectures, or interviews.
Language learning: Create an application that helps users practice pronunciation by comparing their speech to accurate transcriptions.
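For the transcription-service case, audio usually arrives as a recorded file rather than from a microphone. The sketch below shows one way to feed a WAV file to the same recognizer. The names is_supported_wav and transcribe_wav are hypothetical, the default model directory assumes the archive unzipped earlier, and the chunk size of 4000 frames is an arbitrary choice. It expects mono, 16-bit PCM input, so other recordings would need converting first.

```python
import json
import wave


def is_supported_wav(path):
    """Return True for mono, 16-bit, uncompressed PCM WAV files."""
    with wave.open(path, "rb") as wf:
        return (wf.getnchannels() == 1
                and wf.getsampwidth() == 2
                and wf.getcomptype() == "NONE")


def transcribe_wav(path, model_dir="vosk-model-small-en-us-0.15", chunk_frames=4000):
    """Return the transcript of a WAV file as a single string."""
    # Imported here so the format check above works without Vosk installed.
    from vosk import Model, KaldiRecognizer

    if not is_supported_wav(path):
        raise ValueError("expected a mono, 16-bit PCM WAV file")

    pieces = []
    with wave.open(path, "rb") as wf:
        # Use the file's own sample rate instead of hard-coding 16000.
        recognizer = KaldiRecognizer(Model(model_dir), wf.getframerate())
        while True:
            data = wf.readframes(chunk_frames)
            if not data:
                break
            if recognizer.AcceptWaveform(data):
                pieces.append(json.loads(recognizer.Result())["text"])
        # FinalResult() flushes whatever is still buffered after the last chunk.
        pieces.append(json.loads(recognizer.FinalResult())["text"])
    return " ".join(piece for piece in pieces if piece)


# Usage (requires Vosk and the model directory):
# print(transcribe_wav("meeting.wav"))
```

Calling FinalResult() at the end matters for files: without it, speech after the last detected pause would be left out of the transcript.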

Conclusion

Vosk is a robust and efficient tool for real-time speech transcription. Its offline operation and ease of integration make it an excellent choice for developers looking to add speech recognition to their projects. With a short script like the one above, you can start transcribing audio in real time.

Whether you’re working on a small hobby project or a large-scale enterprise application, Vosk provides the tools you need to implement high-quality speech recognition quickly and easily.

Call to Action

Ready to start building with Vosk? Head over to the official Vosk GitHub repository to explore more features, models, and documentation. Happy coding!