This project is a fork of the original speech-to-text repository by reriiasu:
https://github.com/reriiasu/speech-to-text
The original project provides real-time transcription using faster-whisper. This fork extends it with real-time translation and latency optimization, designed for live lecture support (e.g., English Zoom lectures → Japanese).
Audio input is captured from a microphone via sounddevice. Silero VAD (Voice Activity Detection) detects the silent parts, and the speech between silences is treated as a single chunk of voice data. Each chunk is then converted to text with faster-whisper.
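The buffering idea can be sketched with a simplified, energy-based stand-in for the VAD step (the project itself uses Silero VAD proper; the frame size and RMS threshold below are illustrative assumptions):

```python
import numpy as np

def split_on_silence(audio, sample_rate=16000, frame_ms=30,
                     threshold=0.01, min_silence_frames=10):
    """Split mono float audio into utterances at silent stretches.

    Simplified, energy-based stand-in for Silero VAD: frames whose
    RMS falls below `threshold` count as silence, and a run of
    `min_silence_frames` silent frames ends the current utterance.
    """
    frame_len = int(sample_rate * frame_ms / 1000)
    utterances, current, silent_run = [], [], 0
    for start in range(0, len(audio), frame_len):
        frame = audio[start:start + frame_len]
        rms = float(np.sqrt(np.mean(frame ** 2)))
        if rms < threshold:
            silent_run += 1
            # Close the utterance once enough consecutive silence passes.
            if current and silent_run >= min_silence_frames:
                utterances.append(np.concatenate(current))
                current = []
        else:
            silent_run = 0
            current.append(frame)
    if current:
        utterances.append(np.concatenate(current))
    return utterances
```

Each returned chunk would then be passed to faster-whisper (e.g. `WhisperModel.transcribe`) as one utterance.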
The HTML-based GUI lets you check the transcription results and adjust faster-whisper's settings in detail.
- Real-time translation (e.g., English → Japanese)
- Differential translation updates (reduce redundant re-translation)
- Optimized OpenAI API token usage
- Lower latency buffering strategy
- uv-based reproducible environment
If sentences are well separated, transcription takes less than a second.
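The differential-translation idea above can be sketched as follows. `DifferentialTranslator` and its sentence-level caching are an illustrative assumption rather than the project's exact implementation; `translate` can be any callable, e.g. a small wrapper around an OpenAI chat request.

```python
def split_sentences(text):
    # Naive sentence splitter, for illustration only.
    return [s.strip() + "." for s in text.split(".") if s.strip()]

class DifferentialTranslator:
    """Re-translate only the sentences that changed since the last call."""

    def __init__(self, translate):
        self.translate = translate   # callable: source str -> translated str
        self.cache = {}              # sentence -> cached translation

    def update(self, transcript):
        """Return the full translation, calling the API only for new sentences."""
        out = []
        for sentence in split_sentences(transcript):
            if sentence not in self.cache:
                self.cache[sentence] = self.translate(sentence)
            out.append(self.cache[sentence])
        return " ".join(out)
```

Because a live transcript mostly grows at the tail, repeated calls to `update` touch only the newly finished sentences, which is what keeps both latency and token usage low.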

Tested with the large-v2 model, executed with CUDA 11.7 on an NVIDIA GeForce RTX 3060 12GB.
This project uses uv for environment and dependency management.
If you don't have uv installed, install it first (for example with pip install uv, or see https://docs.astral.sh/uv/ for the official installers).
From the project root:
uv sync
This will:
- Create a virtual environment
- Install all required dependencies
Dependency versions are locked via uv.lock for reproducibility.
uv run python -m speech_to_text
- Select "App Settings" and configure the settings.
- Select "Model Settings" and configure the settings.
- Select "Transcribe Settings" and configure the settings.
- Select "VAD Settings" and configure the settings.
- Start Transcription
If you want to use real-time translation with the OpenAI API, set OPENAI_API_KEY in the .env file.
The .env file will be loaded automatically at startup.
For example (in the .env file):
OPENAI_API_KEY=your_openai_api_key_here
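Loading works along these lines; the project presumably uses a library such as python-dotenv, and this minimal parser only illustrates the mechanism:

```python
import os

def load_env_file(path=".env"):
    """Minimal .env loader: KEY=VALUE lines become environment variables.

    Simplified illustration of what a dotenv library does at startup;
    existing environment variables are not overwritten.
    """
    try:
        with open(path) as f:
            lines = f.readlines()
    except FileNotFoundError:
        return
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        os.environ.setdefault(key.strip(), value.strip())
```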
When using GPT-4o-mini with optimized differential translation, real-time lecture translation typically costs on the order of a few cents per hour.
Actual cost depends on:
- Model choice
- Buffer size
- Speaking speed
- Translation frequency
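A back-of-envelope estimate consistent with the figure above (every number here is an illustrative assumption, and the per-token prices must be taken from the provider's current pricing page):

```python
def estimate_hourly_cost(words_per_minute=150, tokens_per_word=1.3,
                         retranslation_factor=1.5,
                         usd_per_1m_input=0.15, usd_per_1m_output=0.60):
    """Back-of-envelope USD cost for one hour of lecture translation.

    Assumptions (not measured values): ~150 wpm speech, ~1.3 tokens
    per word, and a 1.5x overhead for re-translated text.
    """
    tokens = words_per_minute * 60 * tokens_per_word * retranslation_factor
    # Input (source text + prompt) and output (translation) are
    # assumed to be roughly the same size.
    return tokens / 1e6 * (usd_per_1m_input + usd_per_1m_output)
```

With these defaults the estimate comes out to roughly a cent per hour of speech; halving the re-translation overhead or switching models moves it proportionally.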
- If you select local_model in "Model size or path", the model with the same name in the local folder will be used.
The following updates were made in the original repository by reriiasu.
- Add generation of audio files from the input sound.
- Add synchronization of audio files with transcription.
Audio and text highlighting are linked.
- Add transcription from audio files (WAV format only).
- Add sending transcription results from a WebSocket server to a WebSocket client.
Example use: displaying subtitles in a live stream.
- Add generation of SRT files from transcription results.
- Add support for mp3, ogg, and other audio files.
Depends on soundfile support.
- Add setting to include non-speech data in the buffer.
While this will increase memory usage, it will improve transcription accuracy.
- Add non-speech threshold setting.
- Add text proofreading option via the OpenAI API.
Transcription results can be proofread.
- Add feature where audio and word highlighting are synchronized when Word Timestamps is true.
- Support for repetition_penalty and no_repeat_ngram_size in transcribe_settings.
- Updating packages.
- Support "large-v3" model.
- Update faster-whisper requirement to include the latest version "0.10.0".
- Support "Faster Distil-Whisper" model.
- Update faster-whisper requirement to include the latest version "1.0.3".
- Updating packages.
- Add run.bat for Windows.
- Added real-time translation feature
- Implemented differential translation updates
- Optimized OpenAI API token usage
- Migrated environment management to uv
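As an illustration of the SRT generation mentioned in the changelog above, the conversion from timed segments to SubRip text boils down to timestamp formatting (the segment tuples and field order here are an assumption, not the project's actual data structure):

```python
def to_srt(segments):
    """Render (start_sec, end_sec, text) tuples as an SRT document.

    Illustrative sketch of SRT generation from transcription segments.
    """
    def ts(seconds):
        # SRT timestamps look like HH:MM:SS,mmm
        ms = round(seconds * 1000)
        h, ms = divmod(ms, 3_600_000)
        m, ms = divmod(ms, 60_000)
        s, ms = divmod(ms, 1000)
        return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

    blocks = []
    for i, (start, end, text) in enumerate(segments, start=1):
        blocks.append(f"{i}\n{ts(start)} --> {ts(end)}\n{text}\n")
    return "\n".join(blocks)
```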
This project is licensed under the MIT License.
Original work: Copyright (c) 2023 reriiasu
Modifications: Copyright (c) 2026 constantpi
See LICENSE for details.

