How does voice cloning work in Fish Audio S2?

Voice cloning works by placing reference audio tokens in the system prompt. SGLang's RadixAttention caches the resulting states, so later requests that share the same reference prefix can reuse them, which reduces overhead and yields high prefix-cache hit rates.
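To make the prefix-reuse idea concrete, here is a minimal toy sketch in Python. It is not SGLang's RadixAttention and not Fish Audio S2 code: `ToyPrefixCache`, the token values, and the request layout are all hypothetical. It assumes only that a request is a list of token IDs starting with the reference audio tokens, and shows how a second request with the same reference voice hits the cached prefix.

```python
from typing import List, Set, Tuple


class ToyPrefixCache:
    """Hypothetical stand-in for a prefix cache.

    A real serving engine caches attention states in a radix tree; this toy
    only remembers which token prefixes it has already seen.
    """

    def __init__(self) -> None:
        self._seen: Set[Tuple[int, ...]] = set()

    def longest_cached_prefix(self, tokens: List[int]) -> int:
        """Return how many leading tokens are already cached."""
        for n in range(len(tokens), 0, -1):
            if tuple(tokens[:n]) in self._seen:
                return n
        return 0

    def insert(self, tokens: List[int]) -> None:
        """Record every prefix of a processed request."""
        for n in range(1, len(tokens) + 1):
            self._seen.add(tuple(tokens[:n]))


# Made-up token IDs: the reference audio sits at the start of each request.
reference_voice = [101, 102, 103, 104]
request_a = reference_voice + [7, 8]   # first sentence to speak
request_b = reference_voice + [9]      # a different sentence, same voice

cache = ToyPrefixCache()
hit_a = cache.longest_cached_prefix(request_a)  # nothing cached yet
cache.insert(request_a)
hit_b = cache.longest_cached_prefix(request_b)  # reference prefix is reused

print(f"request_a reused {hit_a} tokens, request_b reused {hit_b} tokens")
```

In this toy, the second request only needs fresh work for the tokens after the shared reference prefix, which is the effect the answer above describes.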

What languages are supported by Fish Audio S2?

It supports over 80 languages. Tier 1 languages (Japanese, English, and Chinese) receive the highest quality, and Tier 2 languages include Korean, Spanish, and German, among others.

Is Fish Audio S2 open-source?

Yes, Fish Audio S2 is fully open-source, with model weights, fine-tuning code, and a production-ready inference stack available on GitHub and HuggingFace.

How efficient is the inference performance?

On a single NVIDIA H200 GPU, it achieves a Real-Time Factor (RTF) of 0.195, a time-to-first-audio of about 100 ms, and a throughput of over 3,000 acoustic tokens per second, using LLM-native serving optimizations.
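As a rough illustration of what a Real-Time Factor of 0.195 means, the sketch below assumes the common definition of RTF as processing time divided by the duration of the generated audio. The function names are hypothetical, and the 60-second clip is an arbitrary example, not a benchmark from the source.

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTF under the assumed definition: compute time / audio duration."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


def seconds_to_generate(audio_seconds: float, rtf: float) -> float:
    """Estimated compute time needed to produce audio of the given length."""
    return audio_seconds * rtf


REPORTED_RTF = 0.195  # figure quoted above for a single NVIDIA H200

clip_seconds = 60.0  # arbitrary example length
compute_seconds = seconds_to_generate(clip_seconds, REPORTED_RTF)
speedup = 1.0 / REPORTED_RTF  # how many times faster than real time

print(f"{clip_seconds:.0f} s of audio takes about {compute_seconds:.1f} s to generate")
print(f"That is roughly {speedup:.1f}x faster than real time")
```

An RTF below 1.0 means audio is produced faster than it plays back, which is what makes streaming output practical.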

Fish Audio S2

Real Expressive AI Voices

What is Fish Audio S2?

Fish Audio S2 is an open-source text-to-speech model that provides fine-grained control over voice prosody and emotion using natural-language cues like [whisper] or [laughing nervously]. It supports over 80 languages and enables multi-speaker dialogue generation in a single pass with a production-ready streaming inference engine. Built on a dual-autoregressive architecture, it delivers high-quality, expressive AI voices suitable for various applications.
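The bracketed cues mentioned above are written inline with the text. The sketch below is a hypothetical helper for inspecting a script before sending it to a TTS model. It is not part of Fish Audio S2 or its API, and it assumes only that cues appear in square brackets, as in the `[whisper]` and `[laughing nervously]` examples; `split_cues` and `CUE_PATTERN` are made-up names.

```python
import re
from typing import List, Tuple

# Assumption: a cue is any non-empty run of text inside square brackets.
CUE_PATTERN = re.compile(r"\[([^\[\]]+)\]")


def split_cues(script: str) -> List[Tuple[str, str]]:
    """Split a script into ("cue", ...) and ("text", ...) segments, in order."""
    segments: List[Tuple[str, str]] = []
    position = 0
    for match in CUE_PATTERN.finditer(script):
        spoken = script[position:match.start()].strip()
        if spoken:
            segments.append(("text", spoken))
        segments.append(("cue", match.group(1).strip()))
        position = match.end()
    tail = script[position:].strip()
    if tail:
        segments.append(("text", tail))
    return segments


script = "[whisper] I have a secret. [laughing nervously] Please don't tell."
for kind, value in split_cues(script):
    print(f"{kind:>4}: {value}")
```

A helper like this can be useful for checking which cues a script contains; the model itself is described as reading the cues directly from the text.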

Key Features

Natural-language control for fine-grained prosody and emotion

Multi-speaker dialogue generation in one pass

Support for 80+ languages with high-quality output

Open-source with model weights, fine-tuning code, and inference engine

Efficient production streaming via SGLang-based architecture

Use Cases

Content creators adding emotional voiceovers to videos and podcasts
Game developers creating dynamic, expressive character dialogue
Educators generating interactive learning materials with natural-sounding speech
Accessibility tool developers improving text-to-speech applications for visually impaired users
Voice cloning services offering personalized, realistic voice synthesis

Why do startups need this tool?

Fish Audio S2 gives startups a cost-effective, open-source way to add expressive AI voices to their products and improve user engagement without high licensing fees. Its production-ready streaming engine supports scaling, and its natural-language control makes voices easy to customize and integrate into different applications.