Building a Web Speech Kit with Python: A Comprehensive Guide

Fahiz
4 min read · Sep 14, 2024


Speech recognition has evolved from an ambitious dream to a fundamental tool in modern applications. With smart assistants, voice-controlled devices, and real-time transcription, we are witnessing the profound power of speech recognition and synthesis technologies. In this post, we’ll dive into building a Web Speech Kit using Python, focusing on speech recognition and text-to-speech capabilities.

Python's versatility and strong support for third-party libraries make it a good choice for building such applications. In this guide, we will cover:

  1. Speech Recognition using Python
  2. Text-to-Speech Synthesis using Python
  3. Building a basic web interface to demonstrate these capabilities
  4. Combining everything into a Web Speech Kit

Requirements

Before we begin, let’s install the required dependencies.

  • Python 3.x
  • SpeechRecognition for speech-to-text
  • PyAudio for microphone access (SpeechRecognition's Microphone class depends on it)
  • gTTS (Google Text-to-Speech) for text-to-speech
  • Flask for building a web interface

You can install the required libraries using the following command:

pip install SpeechRecognition PyAudio gTTS Flask

Step 1: Speech Recognition Using Python

The first step in our Web Speech Kit is to recognize speech using Python. We’ll leverage the SpeechRecognition library, which provides an easy-to-use interface for different speech recognition engines, including Google's Web Speech API.

Here’s how to implement basic speech recognition:

import speech_recognition as sr

def recognize_speech_from_microphone():
    recognizer = sr.Recognizer()
    mic = sr.Microphone()

    with mic as source:
        print("Adjusting for ambient noise...")
        recognizer.adjust_for_ambient_noise(source)
        print("Listening...")
        audio = recognizer.listen(source)

    try:
        print("Recognizing speech...")
        text = recognizer.recognize_google(audio)
        print(f"Transcription: {text}")
        return text
    except sr.RequestError:
        print("API was unreachable or unresponsive")
    except sr.UnknownValueError:
        print("Unable to recognize speech")

# Test the function
recognize_speech_from_microphone()

Explanation:

  1. Recognizer: This class is responsible for recognizing speech.
  2. Microphone: We use this to capture audio input from the user's microphone.
  3. adjust_for_ambient_noise: This adjusts the recognizer sensitivity based on the ambient noise level.
  4. recognize_google: This method sends the audio data to Google’s Web Speech API for transcription.
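To get a feel for what calibrating against ambient noise means, here is a simplified sketch. It is not the SpeechRecognition library's actual algorithm: the function names, the sample values, and the margin of 1.5 are all assumptions chosen for illustration. The idea is to measure how loud the background is, then treat only audio that is clearly louder as speech.

```python
import math

def rms(samples):
    """Root-mean-square energy of a list of audio samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def calibrate_threshold(ambient_samples, margin=1.5):
    """Set the speech threshold a little above the background noise level."""
    return rms(ambient_samples) * margin

def is_speech(frame, threshold):
    """Treat a frame as speech only if it is louder than the threshold."""
    return rms(frame) > threshold

# Hypothetical sample values: quiet background noise, then two test frames
threshold = calibrate_threshold([3, -3, 3, -3])
print(is_speech([10, -10, 10, -10], threshold))  # loud frame
print(is_speech([1, -1, 1, -1], threshold))      # quiet frame
```

In the real library, the calibrated value is stored on the recognizer as energy_threshold, and adjust_for_ambient_noise accepts a duration argument if you want a longer calibration window.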

Step 2: Text-to-Speech Synthesis with gTTS

Now that we have a way to recognize speech, let’s convert text to speech. We’ll use the gTTS library (Google Text-to-Speech), which converts text into an mp3 file that can be played back.

Here’s the code to implement text-to-speech:

from gtts import gTTS
import os

def text_to_speech(text):
    tts = gTTS(text=text, lang='en')
    tts.save("output.mp3")
    # 'start' works on Windows only; use 'afplay output.mp3' on macOS
    # or 'mpg321 output.mp3' on Linux
    os.system("start output.mp3")

# Test the function
text_to_speech("Hello! This is your Python-based web speech kit.")

Explanation:

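  1. gTTS: This library sends the text to Google Translate's text-to-speech service and receives the spoken audio, so it needs an internet connection.
  2. save: This saves the converted speech into an mp3 file.
  3. os.system: This runs a shell command that opens the mp3 file. The start command works on Windows only; use afplay on macOS or a player such as mpg321 on Linux.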
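Since the playback command differs per operating system, one option is to pick it at runtime. The sketch below is one possible way to do that, not part of gTTS; the helper names playback_command and play are hypothetical, and it assumes mpg321 is installed on Linux.

```python
import platform
import subprocess

def playback_command(path, system=None):
    """Return the argument list that plays `path` on the given OS."""
    system = system or platform.system()
    if system == "Windows":
        # 'start' is a cmd.exe builtin; the empty string is the window title
        return ["cmd", "/c", "start", "", path]
    if system == "Darwin":  # macOS
        return ["afplay", path]
    # Fallback for Linux; assumes the mpg321 player is installed
    return ["mpg321", path]

def play(path):
    """Play an audio file with the platform's command-line player."""
    subprocess.run(playback_command(path), check=False)
```

With this in place, text_to_speech could call play("output.mp3") instead of hard-coding the Windows command.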

Step 3: Building a Simple Web Interface

To make this kit accessible via a browser, we’ll create a basic web interface using Flask. Besides the home page, we will create two routes: one for speech recognition and one for text-to-speech.

Setting up Flask

from flask import Flask, render_template, request, redirect, url_for
import speech_recognition as sr
from gtts import gTTS
import os

app = Flask(__name__)

# Home route
@app.route('/')
def index():
    return render_template('index.html')

# Speech recognition route
@app.route('/recognize', methods=['POST'])
def recognize_speech():
    recognizer = sr.Recognizer()
    mic = sr.Microphone()

    with mic as source:
        recognizer.adjust_for_ambient_noise(source)
        audio = recognizer.listen(source)

    try:
        text = recognizer.recognize_google(audio)
        return render_template('result.html', transcription=text)
    except sr.RequestError:
        return "API is unavailable"
    except sr.UnknownValueError:
        return "Unable to recognize speech"

# Text-to-Speech route
@app.route('/speak', methods=['POST'])
def speak():
    text = request.form['text']
    tts = gTTS(text=text, lang='en')
    tts.save("output.mp3")
    os.system("start output.mp3")
    return redirect(url_for('index'))

if __name__ == '__main__':
    app.run(debug=True)

Explanation:

  1. Flask Application: This is the basic structure of a Flask app. We define three routes:
  • / to display the home page
  • /recognize to process speech recognition requests
  • /speak to handle text-to-speech conversion.

  2. Speech Recognition: This is similar to our previous implementation but integrated with a web interface. The result.html template it renders is not shown in this guide; it needs to exist in the templates folder and display the transcription variable.

  3. Text-to-Speech: We fetch the text from the form, convert it to speech, save it as an mp3 file, and play it on the machine running the Flask server.
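Because os.system plays the audio on the server, a visitor on another machine would hear nothing. One possible alternative is to build the MP3 in memory and send it back to the browser. The sketch below covers only the in-memory part. The function name synthesize_mp3_bytes and its tts_factory parameter are hypothetical, added so the function can be tried without a network connection; write_to_fp is the gTTS method that writes audio to a file-like object.

```python
import io

def synthesize_mp3_bytes(text, tts_factory=None):
    """Return MP3 audio for `text` as bytes, without touching the disk."""
    if tts_factory is None:
        # Imported lazily so the module loads even where gTTS is absent
        from gtts import gTTS
        tts_factory = lambda t: gTTS(text=t, lang="en")
    buffer = io.BytesIO()
    # gTTS objects can write their audio to any file-like object
    tts_factory(text).write_to_fp(buffer)
    return buffer.getvalue()
```

Inside the /speak route, these bytes could then be returned as a Flask Response with the mimetype set to audio/mpeg, so that the browser receives the audio rather than the server playing it.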

Step 4: Creating the Web Interface (HTML)

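We need an HTML page to interact with our Flask backend. Here’s a simple index.html for our web app. Save it in a folder named templates next to your Python file, since that is where Flask looks for templates by default.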

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>Web Speech Kit</title>
</head>
<body>
    <h1>Web Speech Kit</h1>

    <h2>Speech Recognition</h2>
    <form action="/recognize" method="POST">
        <button type="submit">Recognize Speech</button>
    </form>

    <h2>Text-to-Speech</h2>
    <form action="/speak" method="POST">
        <input type="text" name="text" placeholder="Enter text to convert to speech">
        <button type="submit">Convert to Speech</button>
    </form>

</body>
</html>

Explanation:

  • Speech Recognition: The form sends a POST request to /recognize when the button is clicked. The server then records from the microphone attached to the machine running Flask (not the browser's microphone) and runs speech recognition on the recording.
  • Text-to-Speech: The input field collects text, and the form submits this text to the backend, where it’s converted into speech and played on that same machine.
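The form can be submitted with an empty field, so it may be worth checking the text before handing it to gTTS. Here is a minimal sketch of such a check; the helper name clean_tts_text and the 500-character limit are assumptions for illustration, not something Flask or gTTS requires.

```python
def clean_tts_text(raw, max_chars=500):
    """Return trimmed text ready for synthesis, or None if unusable."""
    if raw is None:
        return None
    # Collapse runs of whitespace and strip the ends
    text = " ".join(raw.split())
    if not text or len(text) > max_chars:
        return None
    return text
```

In the /speak route, you could read the field with request.form.get('text'), pass it through this helper, and redirect back to the home page without synthesizing anything when the result is None.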

Step 5: Running the Application

To run the application, save the Flask code from Step 3 as app.py and execute the following command:

python app.py

Open your browser and navigate to http://127.0.0.1:5000/. You'll see a basic interface where you can test both the speech recognition and text-to-speech features.

Conclusion

You’ve just built a Web Speech Kit using Python that can recognize speech and synthesize speech from text. While this is a simple example, it can be expanded into more sophisticated applications, including chatbots, virtual assistants, and accessibility tools. Speech technology is an exciting area with vast potential, and Python makes it accessible to developers of all skill levels.

Now that you have a basic understanding, you can start building more complex projects using these fundamentals. Enjoy coding!
