Live Speech-to-Text Transcription Implementation
Abstract
This article explains how to implement live speech-to-text transcription using web technologies. It covers the system architecture, its components, and the techniques used to achieve real-time performance, including the audioStore and transcriptionStore for frontend state management and the TranscriptionHandler for backend operations.
Introduction
Live speech-to-text transcription enables live captioning, virtual assistants, and voice response systems. This implementation uses Vue.js and Nuxt.js with Pinia for the frontend, Python and FastAPI for the backend, WebSockets for communication, and OpenAI’s Whisper for transcription.
Architecture Overview
The implementation follows a client-server architecture divided into two main parts:
- Frontend: Handles audio capture, processing, state management, and user interaction.
- Backend: Manages transcription services, threading, and WebSocket communication.
High-Level Architecture Diagram
You can visualize the architecture using the following PlantUML diagram:
Figure 1: High-level architecture of the live transcription implementation.
Frontend Components
Overview
The frontend is built using Vue.js and Nuxt.js, with Pinia as the state management library. It consists of:
- Components: Primarily responsible for user interface and display.
- Pinia Stores: Manage application state and logic.
- Web Workers: Perform computationally intensive tasks without blocking the main thread.
AudioRecorder Component (AudioRecorder.vue)
Role:
- Provides the user interface for recording controls (start/stop recording).
- Displays the transcribed text in real time.
- Interacts with audioStore and transcriptionStore for recording and transcription functionality.
Key Implementation Details:
- The component primarily handles user interactions and displays information.
- It delegates recording and transcription logic to the respective Pinia stores.
Code Snippet:
<template>
<div class="space-y-4">
<div class="flex items-center space-x-2">
<button @click="handleRecordingToggle" :disabled="disabled || audioStore.isStopping">
<span>{{ buttonText }}</span>
</button>
<!-- Additional UI elements -->
</div>
<!-- Display transcribed text -->
<div v-if="conversationStore.currentRequirement">
<p>{{ conversationStore.currentRequirement }}</p>
</div>
</div>
</template>
<script setup lang="ts">
import { computed } from 'vue';
import { useAudioStore } from '~/stores/audioStore';
import { useTranscriptionStore } from '~/stores/transcriptionStore';
import { useConversationStore } from '~/stores/conversationStore';
const audioStore = useAudioStore();
const transcriptionStore = useTranscriptionStore();
const conversationStore = useConversationStore();
const buttonText = computed(() => {
if (audioStore.isStopping) return 'Stopping...';
return audioStore.isRecording ? 'Recording...' : 'Start Recording';
});
// workspaceId and stepId identify the recording context; they are passed in as props
const props = defineProps<{ workspaceId: string; stepId: string; disabled?: boolean }>();
const handleRecordingToggle = async () => {
  if (!audioStore.isRecording) {
    await audioStore.startRecording(props.workspaceId, props.stepId);
  } else {
    await audioStore.stopRecording(props.workspaceId, props.stepId);
  }
};
</script>
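For context, a parent component would supply these identifiers as props. A minimal, hypothetical usage (prop names inferred from the snippet above):

<AudioRecorder :workspace-id="currentWorkspaceId" :step-id="currentStepId" />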
audioStore (Pinia Store)
Role:
- Manages the audio recording process.
- Interfaces with the Audio Processor Worker for audio processing.
- Handles starting and stopping of recording sessions.
- Maintains the state of recorded audio chunks.
Key Implementation Details:
- Utilizes the Web Audio API to capture audio from the user’s microphone.
- Uses an AudioWorklet (the Audio Processor Worker) to process raw audio data into WAV format.
- Stores audio chunks and provides functionality to manage them (e.g., download, delete).
Code Snippet:
// audioStore.ts
import { defineStore } from 'pinia';
export const useAudioStore = defineStore('audio', {
state: () => ({
isRecording: false,
audioChunks: [],
// other state properties
}),
actions: {
async startRecording(workspaceId: string, stepId: string) {
// Initialize media stream and audio context
// Connect to Audio Processor Worker
// Start recording
},
async stopRecording(workspaceId: string, stepId: string) {
// Stop media stream
// Flush Audio Processor Worker
// Finalize transcription
},
// Additional actions for managing audio chunks
},
});
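To make the placeholder comments concrete, here is a minimal sketch of what startRecording might look like, assuming the worklet file is served at /audio-processor.worklet.js and that useTranscriptionStore is imported in this file; names and error handling are illustrative, not the exact implementation:

// Sketch of startRecording (illustrative; assumes the worklet path, omits error handling)
async startRecording(workspaceId: string, stepId: string) {
  // Ask the browser for microphone access
  this.mediaStream = await navigator.mediaDevices.getUserMedia({ audio: true });
  this.audioContext = new AudioContext();

  // Load the AudioWorklet that converts raw samples into WAV chunks
  await this.audioContext.audioWorklet.addModule('/audio-processor.worklet.js');
  const source = this.audioContext.createMediaStreamSource(this.mediaStream);
  this.workletNode = new AudioWorkletNode(this.audioContext, 'audio-chunk-processor');

  // Forward each processed WAV chunk to the transcription store
  this.workletNode.port.onmessage = (event) => {
    this.audioChunks.push(event.data);
    useTranscriptionStore().sendAudioChunk(event.data);
  };

  source.connect(this.workletNode);
  this.isRecording = true;
}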
transcriptionStore (Pinia Store)
Role:
- Manages communication with the backend via WebSocket.
- Handles sending processed audio data for transcription.
- Receives transcribed text and updates the state accordingly.
- Interfaces with the Transcription Worker for WebSocket communication.
Key Implementation Details:
- Establishes and maintains a WebSocket connection with the backend.
- Sends audio data received from audioStore to the backend.
- Updates the conversationStore with transcribed text.
Code Snippet:
// transcriptionStore.ts
import { defineStore } from 'pinia';
export const useTranscriptionStore = defineStore('transcription', {
state: () => ({
isConnected: false,
transcription: '',
// other state properties
}),
actions: {
initializeWorker() {
this.worker = new Worker(new URL('~/workers/transcriptionWorker.ts', import.meta.url), {
type: 'module',
});
this.setupWorkerHandlers();
},
setupWorkerHandlers() {
this.worker.onmessage = (event) => {
const { type, payload } = event.data;
if (type === 'MESSAGE') {
this.handleWorkerMessage(payload);
}
// Handle other message types
};
},
handleWorkerMessage(message) {
if (message.type === 'transcription') {
this.transcription += message.text + ' ';
// Update conversationStore with new transcription
}
},
async sendAudioChunk(audioChunk: ArrayBuffer) {
// Send audio chunk to Transcription Worker
},
// Additional actions for managing WebSocket connection
},
});
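To flesh out the placeholders, here is a minimal sketch of the connection and send actions, assuming the CONNECT/SEND_AUDIO message protocol used by transcriptionWorker.ts (shown below):

// Sketch of the WebSocket-related actions (illustrative)
connect(workspaceId: string, stepId: string, transcriptionWsEndpoint: string) {
  if (!this.worker) this.initializeWorker();
  // Ask the worker to open the WebSocket for this session
  this.worker.postMessage({
    type: 'CONNECT',
    payload: { workspaceId, stepId, transcriptionWsEndpoint },
  });
},
async sendAudioChunk(audioChunk: ArrayBuffer) {
  // Transfer the buffer to the worker instead of copying it
  this.worker.postMessage({ type: 'SEND_AUDIO', payload: { wavData: audioChunk } }, [audioChunk]);
},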
Audio Processor Worker (audio-processor.worklet.js)
Purpose:
- Processes raw audio data captured by the audioStore.
- Converts raw audio streams into WAV format compatible with the Whisper model.
- Operates as an AudioWorklet, a high-performance audio processing script that runs on the audio rendering thread.
Key Implementation Details:
- Handles audio data in small chunks for real-time processing.
- Resamples audio to the target sample rate (e.g., 16kHz).
- Encodes audio data into 16-bit PCM WAV format.
Code Snippet:
// audio-processor.worklet.js
class AudioChunkProcessor extends AudioWorkletProcessor {
constructor(options) {
super();
// Initialize processor options
}
process(inputs, outputs, parameters) {
// Process audio data
// Resample and encode to WAV
// Post message with processed audio data
return true;
}
}
registerProcessor('audio-chunk-processor', AudioChunkProcessor);
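A fleshed-out sketch of the processing loop follows. The one-second chunk size, the naive decimation resampler, and the encodeWav helper (sketched in the Audio Format Conversion section below) are illustrative assumptions; a production implementation would apply a low-pass filter before downsampling:

// Illustrative version of the processing loop (simplified)
class AudioChunkProcessor extends AudioWorkletProcessor {
  constructor() {
    super();
    this._buffer = []; // accumulated Float32 samples
  }
  process(inputs) {
    const channel = inputs[0][0]; // first input, first (mono) channel
    if (channel) {
      this._buffer.push(...channel);
      // sampleRate is a global in the AudioWorklet scope; ~1 second per chunk
      if (this._buffer.length >= sampleRate) {
        const samples = this._buffer.splice(0);
        const resampled = downsample(samples, sampleRate, 16000); // naive helper below
        const wav = encodeWav(resampled, 16000);
        this.port.postMessage(wav, [wav]); // transfer the ArrayBuffer, no copy
      }
    }
    return true; // keep the processor alive
  }
}

// Naive nearest-sample downsampler (illustrative only)
function downsample(samples, fromRate, toRate) {
  const ratio = fromRate / toRate;
  const out = new Float32Array(Math.floor(samples.length / ratio));
  for (let i = 0; i < out.length; i++) {
    out[i] = samples[Math.floor(i * ratio)];
  }
  return out;
}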
Transcription Worker (transcriptionWorker.ts)
Responsibilities:
- Establishes a WebSocket connection with the backend server.
- Sends processed audio data to the backend for transcription.
- Receives transcribed text from the backend and relays it to the transcriptionStore.
Key Implementation Details:
- Manages the WebSocket lifecycle (connect, disconnect, error handling).
- Handles binary data transmission for audio chunks.
- Parses incoming messages and forwards them to the transcriptionStore.
Code Snippet:
// transcriptionWorker.ts
let socket;
onmessage = (event) => {
const { type, payload } = event.data;
switch (type) {
case 'CONNECT':
initWebSocket(payload);
break;
case 'SEND_AUDIO':
  // Guard against sends before the socket is open
  if (socket && socket.readyState === WebSocket.OPEN) {
    socket.send(payload.wavData);
  }
  break;
// Handle other message types
}
};
function initWebSocket({ workspaceId, stepId, transcriptionWsEndpoint }) {
socket = new WebSocket(`${transcriptionWsEndpoint}/${workspaceId}/${stepId}`);
socket.onmessage = (event) => {
const message = JSON.parse(event.data);
postMessage({ type: 'MESSAGE', payload: message });
};
}
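The excerpt covers connection and message parsing only. A minimal sketch of the disconnect and error handling mentioned above, assuming the store reacts to STATUS and ERROR message types (illustrative names; these handlers would sit inside initWebSocket):

// Illustrative lifecycle handling (sketch)
socket.onopen = () => postMessage({ type: 'STATUS', payload: { connected: true } });
socket.onclose = () => postMessage({ type: 'STATUS', payload: { connected: false } });
socket.onerror = () => postMessage({ type: 'ERROR', payload: 'WebSocket error' });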
Backend Components
Overview
The backend is implemented using Python and FastAPI. It consists of:
- Transcription Handler (handler.py): Manages threading and multiple WebSocket sessions, and coordinates transcription requests.
- Transcription Service (service.py): Handles the transcription of audio data without managing threading.
Transcription Handler (handler.py)
Manages:
- WebSocket connections with clients.
- Threading and coordination of multiple WebSocket sessions.
- Queuing transcription requests and distributing them to the transcription worker.
- Processing transcription results and sending them back to clients.
Key Implementation Details:
- Uses threading and asynchronous programming to handle multiple connections efficiently.
- Maintains separate queues for transcription requests and results.
- The TranscriptionWorker thread processes transcription requests sequentially.
- Each client session is identified by a unique session_id.
Code Snippet:
# handler.py
import asyncio
import queue
import threading
import time
import uuid
from dataclasses import dataclass
from fastapi import WebSocket
from .service import TranscriptionService

@dataclass
class TranscriptionRequest:
    """A queued transcription request for a single client session."""
    session_id: str
    audio_data: bytes
    timestamp: float
class TranscriptionHandler:
def __init__(self):
self.transcription_service = TranscriptionService()
self.active_connections = {}
self.output_queues = {}
self.loop = asyncio.get_event_loop()
self.worker = TranscriptionWorker(self.transcription_service, self.loop)
self.worker.start()
async def connect(self, websocket: WebSocket, workspace_id: str, step_id: str) -> str:
await websocket.accept()
session_id = str(uuid.uuid4())
self.active_connections[session_id] = websocket
self.output_queues[session_id] = asyncio.Queue()
self.worker.result_queues[session_id] = self.output_queues[session_id]
asyncio.create_task(self._receive_audio(websocket, session_id))
asyncio.create_task(self._send_results(websocket, session_id))
await websocket.send_json({"type": "session_init", "session_id": session_id})
return session_id
async def _receive_audio(self, websocket: WebSocket, session_id: str):
while True:
audio_data = await websocket.receive_bytes()
request = TranscriptionRequest(session_id=session_id, audio_data=audio_data, timestamp=time.time())
self.worker.request_queue.put_nowait(request)
async def _send_results(self, websocket: WebSocket, session_id: str):
while True:
result = await self.output_queues[session_id].get()
await websocket.send_json(result)
TranscriptionWorker:
- A separate thread that processes transcription requests sequentially.
- Interacts with the TranscriptionService to perform the actual transcription.
class TranscriptionWorker(threading.Thread):
def __init__(self, transcription_service: TranscriptionService, loop: asyncio.AbstractEventLoop):
super().__init__()
self.transcription_service = transcription_service
self.request_queue = queue.Queue()
self.result_queues = {}
self.loop = loop
def run(self):
while True:
request = self.request_queue.get()
if request is None:
break # Shutdown signal
transcription = self.transcription_service.transcribe(request.audio_data)
if request.session_id in self.result_queues:
result_queue = self.result_queues[request.session_id]
asyncio.run_coroutine_threadsafe(
result_queue.put({
"type": "transcription",
"text": transcription,
"timestamp": request.timestamp
}),
self.loop
)
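For completeness, here is a minimal sketch of how a FastAPI route might wire up the handler. The route path and the cleanup logic are assumptions, not part of the original source:

# Illustrative FastAPI wiring (sketch)
import asyncio
from fastapi import FastAPI, WebSocket
from starlette.websockets import WebSocketState

app = FastAPI()
handler = TranscriptionHandler()

@app.websocket("/ws/transcribe/{workspace_id}/{step_id}")
async def transcribe_ws(websocket: WebSocket, workspace_id: str, step_id: str):
    # connect() accepts the socket and spawns the receive/send tasks
    session_id = await handler.connect(websocket, workspace_id, step_id)
    try:
        # Returning from this coroutine would close the socket, so idle here
        # until the client disconnects (detected by the background receive task)
        while websocket.client_state == WebSocketState.CONNECTED:
            await asyncio.sleep(1)
    finally:
        # Drop per-session state so the worker stops routing results here
        handler.active_connections.pop(session_id, None)
        handler.output_queues.pop(session_id, None)
        handler.worker.result_queues.pop(session_id, None)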
Transcription Service (service.py)
Handles:
- Transcription of audio data using the OpenAI Whisper model.
- Does not manage threading; it is called by the TranscriptionWorker when processing requests.
Key Implementation Details:
- Initializes the Whisper model and processor.
- Determines the appropriate device and data type based on system capabilities.
- Performs transcription without concern for threading, as threading is managed by TranscriptionHandler and TranscriptionWorker.
Code Snippet:
# service.py
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
class TranscriptionService(metaclass=SingletonMeta):
def __init__(self):
self.device, self.torch_dtype = self._setup_device_and_dtype()
model_id = "openai/whisper-large-v3-turbo"
self.model = AutoModelForSpeechSeq2Seq.from_pretrained(
model_id,
torch_dtype=self.torch_dtype,
low_cpu_mem_usage=True,
use_safetensors=True,
).to(self.device)
self.processor = AutoProcessor.from_pretrained(model_id)
device_arg = self._get_pipeline_device()
self.pipe = pipeline(
"automatic-speech-recognition",
model=self.model,
tokenizer=self.processor.tokenizer,
feature_extractor=self.processor.feature_extractor,
torch_dtype=self.torch_dtype,
device=device_arg,
)
self.sampling_rate = 16000
def transcribe(self, audio_data: bytes) -> str:
transcription = self.pipe(audio_data)
return transcription.get("text", "").strip()
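The snippet references a SingletonMeta metaclass and two private helpers that are not shown. A minimal sketch of what they might look like (illustrative, not the original implementation):

# Illustrative definitions for the pieces not shown above (sketch)
import threading
import torch

class SingletonMeta(type):
    """Ensure a single shared service instance; the Whisper model is large."""
    _instances = {}
    _lock = threading.Lock()

    def __call__(cls, *args, **kwargs):
        with cls._lock:
            if cls not in cls._instances:
                cls._instances[cls] = super().__call__(*args, **kwargs)
        return cls._instances[cls]

# Methods of TranscriptionService:
def _setup_device_and_dtype(self):
    # Half precision on GPU keeps memory usage manageable; float32 on CPU
    if torch.cuda.is_available():
        return "cuda", torch.float16
    return "cpu", torch.float32

def _get_pipeline_device(self):
    # transformers pipelines accept a device index: 0 = first GPU, -1 = CPU
    return 0 if torch.cuda.is_available() else -1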
Emphasis on Separation of Concerns:
- TranscriptionHandler: Manages threading and handles multiple WebSocket sessions, ensuring that transcription requests are processed efficiently.
- TranscriptionService: Focuses solely on the transcription logic without managing threading, making it reusable and easier to maintain.
Data Flow and Communication
The data flow involves multiple components working in tandem to capture, process, transmit, and transcribe audio data in real time.
Data Flow Diagram
Figure 2: Detailed data flow between frontend and backend components, highlighting threading and session management.
Process Steps
- Audio Capture: The user initiates recording via the AudioRecorder component.
- Start Recording: AudioRecorder calls audioStore.startRecording().
- Audio Processing: audioStore sets up the Audio Processor Worker to process raw audio data into WAV format.
- Data Transmission: Processed audio chunks are sent from audioStore to transcriptionStore, which sends them to the backend via the Transcription Worker.
- Session Management: TranscriptionHandler accepts the WebSocket connection and assigns a unique session_id.
- Threading and Queuing: TranscriptionHandler queues transcription requests and manages threading via the TranscriptionWorker thread.
- Transcription Handling: TranscriptionWorker processes requests sequentially and calls TranscriptionService for transcription.
- Result Delivery: Transcribed text is placed in the output queue and sent back to the frontend through the WebSocket connection.
- Update State: The Transcription Worker sends the transcribed text to transcriptionStore, which updates the conversationStore.
- Display: The AudioRecorder component displays the transcribed text in real time.
Key Techniques for Real-Time Performance
Thread Management in Backend
- Purpose: Efficiently handle multiple client connections and transcription requests.
- Implementation:
- The TranscriptionHandler uses asynchronous tasks to manage WebSocket connections.
- A separate TranscriptionWorker thread processes transcription requests from all clients sequentially.
- Requests are queued, and results are sent back via output queues specific to each session.
Benefits:
- Resource Efficiency: By using a single worker thread for transcription, resource usage is optimized, especially important when dealing with heavy models like Whisper.
- Scalability: Can handle multiple clients without spawning excessive threads or processes.
Audio Chunking and Asynchronous Processing
- Purpose: Reduces latency and ensures smooth live transcription.
- Implementation:
- Audio data is processed and sent in chunks to allow for incremental transcription.
- Asynchronous programming is used both on the frontend and backend to handle tasks without blocking the main thread.
Worker Threads on Frontend
- Purpose: Offload intensive tasks from the main thread to prevent UI blocking.
- Components Using Workers:
- Audio Processor Worker for audio format conversion.
- Transcription Worker for handling WebSocket communication.
Implementation Details
Threading in TranscriptionHandler
Why It’s Critical:
- Efficiently manages multiple WebSocket sessions.
- Ensures that the heavy transcription tasks do not block the main event loop.
Key Points:
- TranscriptionHandler maintains a separate output queue for each session, while all sessions share a single request queue.
- TranscriptionWorker thread processes requests from a shared queue and distributes results back to the appropriate session.
Code Highlights:
# handler.py
class TranscriptionHandler:
def __init__(self):
self.worker = TranscriptionWorker(self.transcription_service, self.loop)
self.worker.start()
async def connect(self, websocket: WebSocket, workspace_id: str, step_id: str) -> str:
# Assign session_id and set up queues
# Start background tasks for receiving and sending data
Separation of Concerns
- TranscriptionHandler: Manages session lifecycle, threading, and coordination of requests and responses.
- TranscriptionService: Focuses solely on the transcription logic, making it modular and testable.
Audio Format Conversion in Audio Processor Worker
Conversion Steps:
- Resampling: Adjust the sample rate to 16kHz if necessary.
- Encoding: Package the PCM data into a WAV file format with correct headers.
Code Snippet:
// audio-processor.worklet.js
process(inputs, outputs, parameters) {
// Collect samples
// When enough samples are collected for a chunk:
// - Convert to 16-bit PCM
// - Create WAV header
// - Combine header and PCM data
// - Send chunk via postMessage
}
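The header-writing step can be made concrete. A minimal sketch of a WAV encoder for mono 16-bit PCM (the encodeWav name is illustrative; it writes the standard 44-byte RIFF/WAVE header followed by the samples):

// Illustrative WAV encoder: Float32 samples -> 16-bit PCM WAV (mono)
function encodeWav(samples, sampleRate) {
  const buffer = new ArrayBuffer(44 + samples.length * 2);
  const view = new DataView(buffer);
  const writeString = (offset, s) => {
    for (let i = 0; i < s.length; i++) view.setUint8(offset + i, s.charCodeAt(i));
  };
  writeString(0, 'RIFF');
  view.setUint32(4, 36 + samples.length * 2, true); // total size minus 8
  writeString(8, 'WAVE');
  writeString(12, 'fmt ');
  view.setUint32(16, 16, true);             // fmt chunk size
  view.setUint16(20, 1, true);              // audio format: PCM
  view.setUint16(22, 1, true);              // channels: mono
  view.setUint32(24, sampleRate, true);     // sample rate
  view.setUint32(28, sampleRate * 2, true); // byte rate
  view.setUint16(32, 2, true);              // block align
  view.setUint16(34, 16, true);             // bits per sample
  writeString(36, 'data');
  view.setUint32(40, samples.length * 2, true);
  // Clamp each float to [-1, 1] and scale to signed 16-bit
  for (let i = 0; i < samples.length; i++) {
    const s = Math.max(-1, Math.min(1, samples[i]));
    view.setInt16(44 + i * 2, s < 0 ? s * 0x8000 : s * 0x7fff, true);
  }
  return buffer;
}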
Conclusion
This implementation serves as a foundation for building live transcription applications. The modular architecture allows for easy integration of new features and performance optimizations. The complete source code is available on GitHub at https://github.com/ryan-zheng-teki/live-transcription-whisper, and a live demo is available on YouTube at https://www.youtube.com/watch?v=m8yYaIrgBNY. Feel free to use it in your application.