Live Speech-to-Text Transcription Implementation
Abstract
This article explains how to implement live speech-to-text transcription using web technologies. It covers the system architecture, its components, and the techniques used to achieve real-time performance, including the audioStore and transcriptionStore for frontend state management and the TranscriptionHandler for backend operations.
Introduction
Live speech-to-text transcription enables live captioning, virtual assistants, and voice response systems. This implementation uses Vue.js and Nuxt.js with Pinia for the frontend, Python and FastAPI for the backend, WebSockets for communication, and OpenAI’s Whisper for transcription.
Architecture Overview
The implementation follows a client-server architecture divided into two main parts:
- Frontend: Handles audio capture, processing, state management, and user interaction.
- Backend: Manages transcription services, threading, and WebSocket communication.
High-Level Architecture Diagram
You can visualize the architecture using the following PlantUML diagram:
Figure 1: High-level architecture of the live transcription implementation.
Frontend Components
Overview
The frontend is built using Vue.js and Nuxt.js, with Pinia as the state management library. It consists of:
- Components: Primarily responsible for user interface and display.
- Pinia Stores: Manage application state and logic.
- Web Workers: Perform computationally intensive tasks without blocking the main thread.
AudioRecorder Component (AudioRecorder.vue)
Role:
- Provides the user interface for recording controls (start/stop recording).
- Displays the transcribed text in real time.
- Interacts with audioStore and transcriptionStore for recording and transcription functionality.
Key Implementation Details:
- The component primarily handles user interactions and displays information.
- It delegates recording and transcription logic to the respective Pinia stores.
Code Snippet:
<template>
<div class="space-y-4">
<div class="flex items-center space-x-2">
<button @click="handleRecordingToggle" :disabled="disabled || audioStore.isStopping">
<span>{{ buttonText }}</span>
</button>
<!-- Additional UI elements -->
</div>
<!-- Display transcribed text -->
<div v-if="conversationStore.currentRequirement">
<p>{{ conversationStore.currentRequirement }}</p>
</div>
</div>
</template>
<script setup lang="ts">
import { computed } from 'vue';
import { useAudioStore } from '~/stores/audioStore';
import { useTranscriptionStore } from '~/stores/transcriptionStore';
import { useConversationStore } from '~/stores/conversationStore';
const audioStore = useAudioStore();
const transcriptionStore = useTranscriptionStore();
const conversationStore = useConversationStore();
const buttonText = computed(() => {
if (audioStore.isStopping) return 'Stopping...';
return audioStore.isRecording ? 'Recording...' : 'Start Recording';
});
// workspaceId and stepId identify the recording context; they are passed in as props
const props = defineProps<{ workspaceId: string; stepId: string; disabled?: boolean }>();
const handleRecordingToggle = async () => {
  if (!audioStore.isRecording) {
    await audioStore.startRecording(props.workspaceId, props.stepId);
  } else {
    await audioStore.stopRecording(props.workspaceId, props.stepId);
  }
};
</script>
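For context, a parent component would supply these identifiers as props. A minimal, hypothetical usage (prop names inferred from the snippet above):

<AudioRecorder :workspace-id="currentWorkspaceId" :step-id="currentStepId" />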
audioStore (Pinia Store)
Role:
- Manages the audio recording process.
- Interfaces with the Audio Processor Worker for audio processing.
- Handles starting and stopping of recording sessions.
- Maintains the state of recorded audio chunks.
Key Implementation Details:
- Utilizes the Web Audio API to capture audio from the user’s microphone.
- Uses an AudioWorklet (the Audio Processor Worker) to process raw audio data into WAV format.
- Stores audio chunks and provides functionality to manage them (e.g., download, delete).
Code Snippet:
// audioStore.ts
import { defineStore } from 'pinia';
export const useAudioStore = defineStore('audio', {
state: () => ({
isRecording: false,
audioChunks: [],
// other state properties
}),
actions: {
async startRecording(workspaceId: string, stepId: string) {
// Initialize media stream and audio context
// Connect to Audio Processor Worker
// Start recording
},
async stopRecording(workspaceId: string, stepId: string) {
// Stop media stream
// Flush Audio Processor Worker
// Finalize transcription
},
// Additional actions for managing audio chunks
},
});
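To make the placeholder comments concrete, here is a minimal sketch of what startRecording might look like, assuming the worklet file is served at /audio-processor.worklet.js and that useTranscriptionStore is imported in this file; names and error handling are illustrative, not the exact implementation:

// Sketch of startRecording (illustrative; assumes the worklet path, omits error handling)
async startRecording(workspaceId: string, stepId: string) {
  // Ask the browser for microphone access
  this.mediaStream = await navigator.mediaDevices.getUserMedia({ audio: true });
  this.audioContext = new AudioContext();

  // Load the AudioWorklet that converts raw samples into WAV chunks
  await this.audioContext.audioWorklet.addModule('/audio-processor.worklet.js');
  const source = this.audioContext.createMediaStreamSource(this.mediaStream);
  this.workletNode = new AudioWorkletNode(this.audioContext, 'audio-chunk-processor');

  // Forward each processed WAV chunk to the transcription store
  this.workletNode.port.onmessage = (event) => {
    this.audioChunks.push(event.data);
    useTranscriptionStore().sendAudioChunk(event.data);
  };

  source.connect(this.workletNode);
  this.isRecording = true;
}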
transcriptionStore (Pinia Store)
Role:
- Manages communication with the backend via WebSocket.
- Handles sending processed audio data for transcription.
- Receives transcribed text and updates the state accordingly.
- Interfaces with the Transcription Worker for WebSocket communication.
Key Implementation Details:
- Establishes and maintains a WebSocket connection with the backend.
- Sends audio data received from audioStore to the backend.
- Updates the conversationStore with transcribed text.
Code Snippet:
// transcriptionStore.ts
import { defineStore } from 'pinia';
export const useTranscriptionStore = defineStore('transcription', {
state: () => ({
isConnected: false,
transcription: '',
// other state properties
}),
actions: {
initializeWorker() {
this.worker = new Worker(new URL('~/workers/transcriptionWorker.ts', import.meta.url), {
type: 'module',
});
this.setupWorkerHandlers();
},
setupWorkerHandlers() {
this.worker.onmessage = (event) => {
const { type, payload } = event.data;
if (type === 'MESSAGE') {
this.handleWorkerMessage(payload);
}
// Handle other message types
};
},
handleWorkerMessage(message) {
if (message.type === 'transcription') {
this.transcription += message.text + ' ';
// Update conversationStore with new transcription
}
},
async sendAudioChunk(audioChunk: ArrayBuffer) {
// Send audio chunk to Transcription Worker
},
// Additional actions for managing WebSocket connection
},
});
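To flesh out the placeholders, here is a minimal sketch of the connection and send actions, assuming the CONNECT/SEND_AUDIO message protocol used by transcriptionWorker.ts (shown below):

// Sketch of the WebSocket-related actions (illustrative)
connect(workspaceId: string, stepId: string, transcriptionWsEndpoint: string) {
  if (!this.worker) this.initializeWorker();
  // Ask the worker to open the WebSocket for this session
  this.worker.postMessage({
    type: 'CONNECT',
    payload: { workspaceId, stepId, transcriptionWsEndpoint },
  });
},
async sendAudioChunk(audioChunk: ArrayBuffer) {
  // Transfer the buffer to the worker instead of copying it
  this.worker.postMessage({ type: 'SEND_AUDIO', payload: { wavData: audioChunk } }, [audioChunk]);
},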
Audio Processor Worker (audio-processor.worklet.js)
Purpose:
- Processes raw audio data captured by the audioStore.
- Converts raw audio streams into WAV format compatible with the Whisper model.
- Operates as an AudioWorklet, a high-performance audio processing script that runs on the audio rendering thread.
Key Implementation Details:
- Handles audio data in small chunks for real-time processing.
- Resamples audio to the target sample rate (e.g., 16kHz).
- Encodes audio data into 16-bit PCM WAV format.
Code Snippet:
// audio-processor.worklet.js
class AudioChunkProcessor extends AudioWorkletProcessor {
constructor(options) {
super();
// Initialize processor options
}
process(inputs, outputs, parameters) {
// Process audio data
// Resample and encode to WAV
// Post message with processed audio data
return true;
}
}
registerProcessor('audio-chunk-processor', AudioChunkProcessor);
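A fleshed-out sketch of the processing loop follows. The one-second chunk size, the naive decimation resampler, and the encodeWav helper (sketched in the Audio Format Conversion section below) are illustrative assumptions; a production implementation would apply a low-pass filter before downsampling:

// Illustrative version of the processing loop (simplified)
class AudioChunkProcessor extends AudioWorkletProcessor {
  constructor() {
    super();
    this._buffer = []; // accumulated Float32 samples
  }
  process(inputs) {
    const channel = inputs[0][0]; // first input, first (mono) channel
    if (channel) {
      this._buffer.push(...channel);
      // sampleRate is a global in the AudioWorklet scope; ~1 second per chunk
      if (this._buffer.length >= sampleRate) {
        const samples = this._buffer.splice(0);
        const resampled = downsample(samples, sampleRate, 16000); // naive helper below
        const wav = encodeWav(resampled, 16000);
        this.port.postMessage(wav, [wav]); // transfer the ArrayBuffer, no copy
      }
    }
    return true; // keep the processor alive
  }
}

// Naive nearest-sample downsampler (illustrative only)
function downsample(samples, fromRate, toRate) {
  const ratio = fromRate / toRate;
  const out = new Float32Array(Math.floor(samples.length / ratio));
  for (let i = 0; i < out.length; i++) {
    out[i] = samples[Math.floor(i * ratio)];
  }
  return out;
}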
Transcription Worker (transcriptionWorker.ts)
Responsibilities:
- Establishes a WebSocket connection with the backend server.
- Sends processed audio data to the backend for transcription.
- Receives transcribed text from the backend and relays it to the transcriptionStore.
Key Implementation Details:
- Manages the WebSocket lifecycle (connect, disconnect, error handling).
- Handles binary data transmission for audio chunks.
- Parses incoming messages and forwards them to the transcriptionStore.
Code Snippet:
// transcriptionWorker.ts
let socket;
onmessage = (event) => {
const { type, payload } = event.data;
switch (type) {
case 'CONNECT':
initWebSocket(payload);
break;
case 'SEND_AUDIO':
  // Guard against sends before the socket is open
  if (socket && socket.readyState === WebSocket.OPEN) {
    socket.send(payload.wavData);
  }
  break;
// Handle other message types
}
};
function initWebSocket({ workspaceId, stepId, transcriptionWsEndpoint }) {
socket = new WebSocket(`${transcriptionWsEndpoint}/${workspaceId}/${stepId}`);
socket.onmessage = (event) => {
const message = JSON.parse(event.data);
postMessage({ type: 'MESSAGE', payload: message });
};
}
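The excerpt covers connection and message parsing only. A minimal sketch of the disconnect and error handling mentioned above, assuming the store reacts to STATUS and ERROR message types (illustrative names; these handlers would sit inside initWebSocket):

// Illustrative lifecycle handling (sketch)
socket.onopen = () => postMessage({ type: 'STATUS', payload: { connected: true } });
socket.onclose = () => postMessage({ type: 'STATUS', payload: { connected: false } });
socket.onerror = () => postMessage({ type: 'ERROR', payload: 'WebSocket error' });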
Backend Components
Overview
The backend is implemented using Python and FastAPI. It consists of:
- Transcription Handler (handler.py): Manages threading and multiple WebSocket sessions, and coordinates transcription requests.
- Transcription Service (service.py): Handles the transcription of audio data without managing threading.
Transcription Handler (handler.py)
Manages:
- WebSocket connections with clients.
- Threading and coordination of multiple WebSocket sessions.
- Queuing transcription requests and distributing them to the transcription worker.
- Processing transcription results and sending them back to clients.
Key Implementation Details:
- Uses threading and asynchronous programming to handle multiple connections efficiently.
- Maintains separate queues for transcription requests and results.
- The TranscriptionWorker thread processes transcription requests sequentially.
- Each client session is identified by a unique session_id.
Code Snippet:
# handler.py
import asyncio
import queue
import threading
import time
import uuid
from dataclasses import dataclass
from fastapi import WebSocket
from .service import TranscriptionService

@dataclass
class TranscriptionRequest:
    """A queued transcription request for a single client session."""
    session_id: str
    audio_data: bytes
    timestamp: float
class TranscriptionHandler:
def __init__(self):
self.transcription_service = TranscriptionService()
self.active_connections = {}
self.output_queues = {}
self.loop = asyncio.get_event_loop()
self.worker = TranscriptionWorker(self.transcription_service, self.loop)
self.worker.start()
async def connect(self, websocket: WebSocket, workspace_id: str, step_id: str) -> str:
await websocket.accept()
session_id = str(uuid.uuid4())
self.active_connections[session_id] = websocket
self.output_queues[session_id] = asyncio.Queue()
self.worker.result_queues[session_id] = self.output_queues[session_id]
asyncio.create_task(self._receive_audio(websocket, session_id))
asyncio.create_task(self._send_results(websocket, session_id))
await websocket.send_json({"type": "session_init", "session_id": session_id})
return session_id
async def _receive_audio(self, websocket: WebSocket, session_id: str):
while True:
audio_data = await websocket.receive_bytes()
request = TranscriptionRequest(session_id=session_id, audio_data=audio_data, timestamp=time.time())
self.worker.request_queue.put_nowait(request)
async def _send_results(self, websocket: WebSocket, session_id: str):
while True:
result = await self.output_queues[session_id].get()
await websocket.send_json(result)
TranscriptionWorker:
- A separate thread that processes transcription requests sequentially.
- Interacts with the TranscriptionService to perform the actual transcription.
class TranscriptionWorker(threading.Thread):
def __init__(self, transcription_service: TranscriptionService, loop: asyncio.AbstractEventLoop):
super().__init__()
self.transcription_service = transcription_service
self.request_queue = queue.Queue()
self.result_queues = {}
self.loop = loop
def run(self):
while True:
request = self.request_queue.get()
if request is None:
break # Shutdown signal
transcription = self.transcription_service.transcribe(request.audio_data)
if request.session_id in self.result_queues:
result_queue = self.result_queues[request.session_id]
asyncio.run_coroutine_threadsafe(
result_queue.put({
"type": "transcription",
"text": transcription,
"timestamp": request.timestamp
}),
self.loop
)
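For completeness, here is a minimal sketch of how a FastAPI route might wire up the handler. The route path and the cleanup logic are assumptions, not part of the original source:

# Illustrative FastAPI wiring (sketch)
import asyncio
from fastapi import FastAPI, WebSocket
from starlette.websockets import WebSocketState

app = FastAPI()
handler = TranscriptionHandler()

@app.websocket("/ws/transcribe/{workspace_id}/{step_id}")
async def transcribe_ws(websocket: WebSocket, workspace_id: str, step_id: str):
    # connect() accepts the socket and spawns the receive/send tasks
    session_id = await handler.connect(websocket, workspace_id, step_id)
    try:
        # Returning from this coroutine would close the socket, so idle here
        # until the client disconnects (detected by the background receive task)
        while websocket.client_state == WebSocketState.CONNECTED:
            await asyncio.sleep(1)
    finally:
        # Drop per-session state so the worker stops routing results here
        handler.active_connections.pop(session_id, None)
        handler.output_queues.pop(session_id, None)
        handler.worker.result_queues.pop(session_id, None)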
Transcription Service (service.py)
Handles:
- Transcription of audio data using the OpenAI Whisper model.
- Does not manage threading; it is called by the TranscriptionWorker when processing requests.
Key Implementation Details:
- Initializes the Whisper model and processor.
- Determines the appropriate device and data type based on system capabilities.
- Performs transcription without concern for threading, as threading is managed by TranscriptionHandler and TranscriptionWorker.
Code Snippet:
# service.py
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
class TranscriptionService(metaclass=SingletonMeta):
def __init__(self):
self.device, self.torch_dtype = self._setup_device_and_dtype()
model_id = "openai/whisper-large-v3-turbo"
self.model = AutoModelForSpeechSeq2Seq.from_pretrained(
model_id,
torch_dtype=self.torch_dtype,
low_cpu_mem_usage=True,
use_safetensors=True,
).to(self.device)
self.processor = AutoProcessor.from_pretrained(model_id)
device_arg = self._get_pipeline_device()
self.pipe = pipeline(
"automatic-speech-recognition",
model=self.model,
tokenizer=self.processor.tokenizer,
feature_extractor=self.processor.feature_extractor,
torch_dtype=self.torch_dtype,
device=device_arg,
)
self.sampling_rate = 16000
def transcribe(self, audio_data: bytes) -> str:
transcription = self.pipe(audio_data)
return transcription.get("text", "").strip()
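The snippet references a SingletonMeta metaclass and two private helpers that are not shown. A minimal sketch of what they might look like (illustrative, not the original implementation):

# Illustrative definitions for the pieces not shown above (sketch)
import threading
import torch

class SingletonMeta(type):
    """Ensure a single shared service instance; the Whisper model is large."""
    _instances = {}
    _lock = threading.Lock()

    def __call__(cls, *args, **kwargs):
        with cls._lock:
            if cls not in cls._instances:
                cls._instances[cls] = super().__call__(*args, **kwargs)
        return cls._instances[cls]

# Methods of TranscriptionService:
def _setup_device_and_dtype(self):
    # Half precision on GPU keeps memory usage manageable; float32 on CPU
    if torch.cuda.is_available():
        return "cuda", torch.float16
    return "cpu", torch.float32

def _get_pipeline_device(self):
    # transformers pipelines accept a device index: 0 = first GPU, -1 = CPU
    return 0 if torch.cuda.is_available() else -1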
Emphasis on Separation of Concerns:
- TranscriptionHandler: Manages threading and handles multiple WebSocket sessions, ensuring that transcription requests are processed efficiently.
- TranscriptionService: Focuses solely on the transcription logic without managing threading, making it reusable and easier to maintain.
Data Flow and Communication
The data flow involves multiple components working in tandem to capture, process, transmit, and transcribe audio data in real time.
Data Flow Diagram
Figure 2: Detailed data flow between frontend and backend components, highlighting threading and session management.
Process Steps
- Audio Capture: The user initiates recording via the AudioRecorder component.
- Start Recording: AudioRecorder calls audioStore.startRecording().
- Audio Processing: audioStore sets up the Audio Processor Worker to process raw audio data into WAV format.
- Data Transmission: Processed audio chunks are sent from audioStore to transcriptionStore, which sends them to the backend via the Transcription Worker.
- Session Management: TranscriptionHandler accepts the WebSocket connection and assigns a unique session_id.
- Threading and Queuing: TranscriptionHandler queues transcription requests and manages threading via the TranscriptionWorker thread.
- Transcription Handling: TranscriptionWorker processes requests sequentially and calls TranscriptionService for transcription.
- Result Delivery: Transcribed text is placed in the output queue and sent back to the frontend through the WebSocket connection.
- Update State: The Transcription Worker sends the transcribed text to transcriptionStore, which updates the conversationStore.
- Display: The AudioRecorder component displays the transcribed text in real time.
Key Techniques for Real-Time Performance
Thread Management in Backend
- Purpose: Efficiently handle multiple client connections and transcription requests.
- Implementation:
- The TranscriptionHandler uses asynchronous tasks to manage WebSocket connections.
- A separate TranscriptionWorker thread processes transcription requests from all clients sequentially.
- Requests are queued, and results are sent back via output queues specific to each session.
Benefits:
- Resource Efficiency: By using a single worker thread for transcription, resource usage is optimized, especially important when dealing with heavy models like Whisper.
- Scalability: Can handle multiple clients without spawning excessive threads or processes.
Audio Chunking and Asynchronous Processing
- Purpose: Reduces latency and ensures smooth live transcription.
- Implementation:
- Audio data is processed and sent in chunks to allow for incremental transcription.
- Asynchronous programming is used both on the frontend and backend to handle tasks without blocking the main thread.
Worker Threads on Frontend
- Purpose: Offload intensive tasks from the main thread to prevent UI blocking.
- Components Using Workers:
- Audio Processor Worker for audio format conversion.
- Transcription Worker for handling WebSocket communication.
Implementation Details
Threading in TranscriptionHandler
Why It’s Critical:
- Efficiently manages multiple WebSocket sessions.
- Ensures that the heavy transcription tasks do not block the main event loop.
Key Points:
- TranscriptionHandler maintains a separate output queue for each session, while all sessions share a single request queue.
- TranscriptionWorker thread processes requests from a shared queue and distributes results back to the appropriate session.
Code Highlights:
# handler.py
class TranscriptionHandler:
def __init__(self):
self.worker = TranscriptionWorker(self.transcription_service, self.loop)
self.worker.start()
async def connect(self, websocket: WebSocket, workspace_id: str, step_id: str) -> str:
# Assign session_id and set up queues
# Start background tasks for receiving and sending data
Separation of Concerns
- TranscriptionHandler: Manages session lifecycle, threading, and coordination of requests and responses.
- TranscriptionService: Focuses solely on the transcription logic, making it modular and testable.
Audio Format Conversion in Audio Processor Worker
Conversion Steps:
- Resampling: Adjust the sample rate to 16kHz if necessary.
- Encoding: Package the PCM data into a WAV file format with correct headers.
Code Snippet:
// audio-processor.worklet.js
process(inputs, outputs, parameters) {
// Collect samples
// When enough samples are collected for a chunk:
// - Convert to 16-bit PCM
// - Create WAV header
// - Combine header and PCM data
// - Send chunk via postMessage
}
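The header-writing step can be made concrete. A minimal sketch of a WAV encoder for mono 16-bit PCM (the encodeWav name is illustrative; it writes the standard 44-byte RIFF/WAVE header followed by the samples):

// Illustrative WAV encoder: Float32 samples -> 16-bit PCM WAV (mono)
function encodeWav(samples, sampleRate) {
  const buffer = new ArrayBuffer(44 + samples.length * 2);
  const view = new DataView(buffer);
  const writeString = (offset, s) => {
    for (let i = 0; i < s.length; i++) view.setUint8(offset + i, s.charCodeAt(i));
  };
  writeString(0, 'RIFF');
  view.setUint32(4, 36 + samples.length * 2, true); // total size minus 8
  writeString(8, 'WAVE');
  writeString(12, 'fmt ');
  view.setUint32(16, 16, true);             // fmt chunk size
  view.setUint16(20, 1, true);              // audio format: PCM
  view.setUint16(22, 1, true);              // channels: mono
  view.setUint32(24, sampleRate, true);     // sample rate
  view.setUint32(28, sampleRate * 2, true); // byte rate
  view.setUint16(32, 2, true);              // block align
  view.setUint16(34, 16, true);             // bits per sample
  writeString(36, 'data');
  view.setUint32(40, samples.length * 2, true);
  // Clamp each float to [-1, 1] and scale to signed 16-bit
  for (let i = 0; i < samples.length; i++) {
    const s = Math.max(-1, Math.min(1, samples[i]));
    view.setInt16(44 + i * 2, s < 0 ? s * 0x8000 : s * 0x7fff, true);
  }
  return buffer;
}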
Conclusion
This implementation serves as a foundation for building live transcription applications. The modular architecture allows for easy integration of new features and performance optimizations. The complete source code is available on GitHub at https://github.com/ryan-zheng-teki/live-transcription-whisper, and a live demo is available on YouTube at https://www.youtube.com/watch?v=m8yYaIrgBNY. Feel free to use it in your application.