WhisperJAV is an open-source speech transcription pipeline designed for generating subtitles for Japanese adult video (JAV) content. It addresses problems that standard speech recognition models have with this type of audio, which often has a low signal-to-noise ratio and many non-verbal vocalizations. Standard automatic speech recognition (ASR) systems can misinterpret these sounds as words, producing inaccurate transcripts. WhisperJAV separates text generation from timestamp alignment: the system first generates a transcript, then aligns it to the audio using forced alignment. The framework supports several speech recognition models, including Qwen-based ASR systems and fine-tuned Whisper models trained on domain-specific dialogue.
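The separation of text generation from timestamp alignment can be pictured as two interchangeable stages. The following is a minimal sketch of that structure only, not WhisperJAV's actual code: `TimedWord`, `run_pipeline`, `fake_transcribe`, and `uniform_align` are hypothetical names, the transcriber returns a fixed string, and the aligner spreads words evenly over an assumed duration as a stand-in for real forced alignment, which would use an acoustic model.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class TimedWord:
    text: str
    start: float  # seconds
    end: float    # seconds


# Hypothetical stage signatures: stage 1 produces text only,
# stage 2 attaches timestamps to that text.
Transcriber = Callable[[bytes], str]
Aligner = Callable[[bytes, str], List[TimedWord]]


def run_pipeline(audio: bytes, transcribe: Transcriber, align: Aligner) -> List[TimedWord]:
    """Run the two stages in order; either stage can be swapped independently."""
    text = transcribe(audio)      # stage 1: text generation, no timing
    return align(audio, text)     # stage 2: timestamp alignment


def fake_transcribe(audio: bytes) -> str:
    # Stand-in for an ASR model call; returns a fixed, space-separated string.
    return "こんにちは 元気 です か"


def uniform_align(audio: bytes, text: str, duration: float = 4.0) -> List[TimedWord]:
    # Stand-in for forced alignment: divides an assumed duration evenly
    # between tokens. A real aligner would locate each token in the audio.
    tokens = text.split()
    if not tokens:
        return []
    step = duration / len(tokens)
    return [TimedWord(tok, i * step, (i + 1) * step) for i, tok in enumerate(tokens)]


if __name__ == "__main__":
    for word in run_pipeline(b"", fake_transcribe, uniform_align):
        print(f"{word.start:.2f}-{word.end:.2f} {word.text}")
```

Keeping the stages behind plain function signatures means a different ASR model can be substituted without touching the alignment step, which is the property the two-stage design is meant to provide.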
Features
- Domain-specific transcription pipeline for Japanese adult video audio
- Support for multiple ASR models including Qwen3-ASR and anime-whisper
- Two-stage pipeline separating transcription and timestamp alignment
- Forced alignment system for precise word-level subtitle timing
- Multiple processing modes optimized for different audio conditions
- Configurable sensitivity settings to reduce transcription hallucinations
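To illustrate how word-level timing from an alignment step can become subtitle timing, here is a minimal sketch that groups timed words into cues and writes them in SubRip (SRT) form. It is an assumption-laden illustration rather than the project's implementation: `format_timestamp`, `words_to_srt`, and the `max_gap` threshold of 0.6 seconds are hypothetical, and the input words are made-up sample data.

```python
from typing import List, Tuple

# (text, start_seconds, end_seconds); sample shape, not a WhisperJAV type.
Word = Tuple[str, float, float]


def format_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp: HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def words_to_srt(words: List[Word], max_gap: float = 0.6) -> str:
    """Merge consecutive words into one cue unless the pause exceeds max_gap."""
    cues = []  # each cue is [tokens, start, end]
    for text, start, end in words:
        if cues and start - cues[-1][2] <= max_gap:
            cues[-1][0].append(text)
            cues[-1][2] = end
        else:
            cues.append([[text], start, end])
    if not cues:
        return ""
    blocks = []
    for index, (tokens, start, end) in enumerate(cues, 1):
        # Japanese text is joined without spaces between tokens.
        line = "".join(tokens)
        blocks.append(f"{index}\n{format_timestamp(start)} --> {format_timestamp(end)}\n{line}")
    return "\n\n".join(blocks) + "\n"


if __name__ == "__main__":
    sample = [("はい", 0.0, 0.4), ("そう", 0.5, 0.9), ("です", 2.0, 2.5)]
    print(words_to_srt(sample))
```

Splitting cues on pauses is only one possible policy; a real subtitle writer would likely also cap cue length and duration.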