Style-Bert-VITS2 is a text-to-speech system based on Bert-VITS2 that focuses on highly controllable voice styles and emotional expression. It extends the original Bert-VITS2 v2.1 and its Japanese-Extra variant so that emotion and speaking style can be controlled with fine-grained intensity, rather than selected from a few generic tones.

The project targets both beginners and power users: Windows users without Git or Python can install and run it through bundled .bat scripts, while advanced users can work with virtual environments, uv, and standard Python tooling. A full GUI editor lets you script dialogue, assign a different style to each line, edit dictionaries, and save/load projects; a separate web UI and Colab notebooks cover training and experimentation.

For synthesis-only use, the project is also published as a Python library (pip install style-bert-vits2) and runs on CPU without an NVIDIA GPU, although training still requires GPU hardware.
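As a rough sketch of library-based synthesis (the model, config, and style-vector paths below are placeholders for files you would download or train yourself, and the exact `TTSModel` interface should be checked against the library documentation for your installed version):

```python
# pip install style-bert-vits2
import soundfile as sf

from style_bert_vits2.constants import Languages
from style_bert_vits2.nlp import bert_models
from style_bert_vits2.tts_model import TTSModel

# Load the Japanese BERT model and tokenizer used for prosody features.
bert_models.load_model(Languages.JP, "ku-nlp/deberta-v2-large-japanese-char-wwm")
bert_models.load_tokenizer(Languages.JP, "ku-nlp/deberta-v2-large-japanese-char-wwm")

# Placeholder paths to a trained model's assets.
model = TTSModel(
    model_path="model_assets/your-model/your-model.safetensors",
    config_path="model_assets/your-model/config.json",
    style_vec_path="model_assets/your-model/style_vectors.npy",
    device="cpu",  # CPU-only inference; no NVIDIA GPU required
)

# Synthesize with a named style and an intensity weight.
sample_rate, audio = model.infer(
    text="こんにちは、今日もいい天気ですね。",
    style="Neutral",    # style name defined in the model's style vectors
    style_weight=5.0,   # higher values strengthen the style/emotion
)
sf.write("output.wav", audio, sample_rate)
```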
## Features
- Fine-grained control of emotion and speaking style via learned style vectors
- Beginner-friendly Windows installers plus advanced Python/uv setup options
- Graphical editor for multi-line scripts, per-line style settings, and dictionary editing
- Dataset tools for slicing long recordings, auto-transcribing, and preparing training corpora
- CPU-only inference support via the style-bert-vits2 Python package for synthesis-focused use cases
- Built-in FastAPI server to expose text-to-speech as an HTTP API for external integrations (see the client sketch after this list)
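A minimal client sketch for the bundled API server. The endpoint name, port, and query parameters below are assumptions based on the default server configuration; verify them against the auto-generated FastAPI docs (e.g. http://127.0.0.1:5000/docs) on your installation.

```python
import requests

# Assumes the server was started (e.g. via server_fastapi.py) and exposes
# a GET /voice endpoint returning audio/wav bytes; parameter names are
# assumptions to be checked against the server's /docs page.
resp = requests.get(
    "http://127.0.0.1:5000/voice",
    params={
        "text": "こんにちは",
        "model_id": 0,        # index of the loaded model
        "style": "Neutral",   # style name from the model's style vectors
        "style_weight": 5.0,  # style intensity
    },
)
resp.raise_for_status()

with open("output.wav", "wb") as f:
    f.write(resp.content)
```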