Kokoro voice files (0.bin) are stored as [max_seq_len, 1, 256] f32
tables. Each row is the style embedding the model expects for input
sequences of that token length. load_voices flattens the file into a
single Vec<f32>, but parse_voice_style was iterating from offset 0
and taking the first 256 floats — i.e. style[0], the embedding for
an *empty* (zero-token) sequence — for every synthesis call.
The audible result: clean speech for the first ~600ms-1000ms then a
collapse into low-amplitude murmurs / silence. Reproduces deterministically
on identical input across multiple voices and over the entire
ort 2.0.0-rc.11 ↔ 2.0.0-rc.12 range — i.e. unrelated to the runtime,
the model file, or speed scaling.
Fix: lift the style lookup out of synthesize_with_options and run it
inside synthesize_segment (which is the place that already tokenises
the phoneme string), passing tokens.len() to parse_voice_style. The
helper now picks the [tokens_len*256 .. tokens_len*256+256] slice of
the flat per-voice array, clamping to the highest available row for
inputs longer than the stored table.
Mirrors how upstream Kokoros (lucasjinreal/Kokoros) handles this —
their mix_styles(name, tokens.len()) returns style[tokens_len][0].
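The corrected lookup can be sketched as follows. This is illustrative only (the names `style_row` and `STYLE_DIM` are assumptions, not the crate's actual identifiers); it shows the row-selection-with-clamp described above, given the flat `Vec<f32>` that load_voices produces:

```rust
/// Illustrative sketch: pick the 256-float style row for a given token count,
/// clamping to the last stored row for inputs longer than the table.
/// `style_row` / `STYLE_DIM` are hypothetical names, not the crate's API.
const STYLE_DIM: usize = 256;

fn style_row(flat: &[f32], tokens_len: usize) -> &[f32] {
    let rows = flat.len() / STYLE_DIM;
    // Clamp: sequences longer than the stored table reuse the highest row.
    let row = tokens_len.min(rows.saturating_sub(1));
    &flat[row * STYLE_DIM..row * STYLE_DIM + STYLE_DIM]
}
```

The old behaviour corresponds to always calling this with `tokens_len = 0`, i.e. the empty-sequence embedding.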
Verified end-to-end on a herodemo deploy:
| metric | before fix | after fix |
|---|---|---|
| silent windows (<200) | 32 / 48 (66%) | 12 / 52 (23%) |
| normal windows (>=1000) | 7 / 48 | 34 / 52 |
| duration | 4.75s | 5.20s |
| audible content span | ~600-1200ms | ~400-4400ms |
— same input ('The quick brown fox jumps over the lazy dog.', voice
af_heart, speed 1.0). The fix produces continuous speech across the
whole utterance instead of dropping after the first ~1s.
Signed-off-by: mik-tf
kokoro-micro
A minimal, embeddable Text-to-Speech (TTS) library for Rust using the Kokoro 82M parameter model.
This is a reduced version of kokoro-tiny, created by 8b-is.
Features
- Minimal dependencies - Only essential crates for TTS synthesis
- Auto-downloading - Model files (310MB + 27MB) download automatically to ~/.cache/k/
- Multiple voices - Support for various voice styles with mixing capability
- Speed & gain control - Adjust speech speed and volume
- WAV export - Save synthesized audio to WAV files
- Long text support - Automatic chunking and crossfading for longer texts
- Silent by default - No output unless KOKORO_DEBUG=1 is set
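The long-text feature chunks the input and crossfades adjacent chunks so joins are not audible. A minimal sketch of a linear crossfade join (a hypothetical helper illustrating the idea, not the crate's internals):

```rust
/// Join two audio chunks with a linear crossfade over `overlap` samples.
/// Hypothetical helper; shows the chunk-joining idea, not kokoro-micro's code.
fn crossfade(a: &[f32], b: &[f32], overlap: usize) -> Vec<f32> {
    let overlap = overlap.min(a.len()).min(b.len());
    let mut out = Vec::with_capacity(a.len() + b.len() - overlap);
    out.extend_from_slice(&a[..a.len() - overlap]);
    for i in 0..overlap {
        let t = i as f32 / overlap as f32; // fade a out while b fades in
        out.push(a[a.len() - overlap + i] * (1.0 - t) + b[i] * t);
    }
    out.extend_from_slice(&b[overlap..]);
    out
}
```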
Installation
Add to your Cargo.toml:
[dependencies]
kokoro-micro = "0.2.0"
tokio = { version = "1", features = ["rt", "macros"] }
Quick Start
use kokoro_micro::TtsEngine;
#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
// Initialize TTS engine (downloads model on first run)
let mut tts = TtsEngine::new().await?;
// Synthesize speech
// Parameters: text, voice (None for default), speed, gain, language
let audio = tts.synthesize_with_options(
"Hello world!",
None, // voice: None = default "af_sky"
1.0, // speed: 1.0 = normal
1.0, // gain: 1.0 = normal volume
Some("en") // language
)?;
// Save to WAV file
tts.save_wav("output.wav", &audio)?;
Ok(())
}
API Reference
TtsEngine
Main struct for text-to-speech synthesis.
Methods
- new() -> Result<Self, String>
  Create a new TTS engine. Downloads model files to ~/.cache/k/ on first run.
- with_paths(model_path: &str, voices_path: &str) -> Result<Self, String>
  Create engine with custom model file paths.
- voices() -> Vec<String>
  List all available voice names.
- synthesize_with_options(text: &str, voice: Option<&str>, speed: f32, gain: f32, lang: Option<&str>) -> Result<Vec<f32>, String>
  Synthesize text to audio samples.
  - text - Text to synthesize
  - voice - Voice name (e.g., "af_sky", "af_nicole", "am_adam") or None for default
  - speed - Speech speed (0.5 = slower, 1.0 = normal, 2.0 = faster)
  - gain - Volume multiplier (0.5 = quieter, 1.0 = normal, 2.0 = louder)
  - lang - Language code (e.g., "en", "es", "fr") or None for default "en"
- save_wav(path: &str, audio: &[f32]) -> Result<(), String>
  Save audio samples to a WAV file.
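For context, writing f32 samples as a WAV boils down to a 44-byte RIFF header plus 16-bit PCM data. A self-contained sketch of what a helper like save_wav typically does (the 24 kHz sample rate below is an assumption about Kokoro's output, not taken from the crate):

```rust
use std::io::Write;

/// Sketch: write mono f32 samples as 16-bit PCM WAV.
/// Illustrative only; kokoro-micro's save_wav may differ in detail.
fn write_wav(path: &str, audio: &[f32], sample_rate: u32) -> std::io::Result<()> {
    let data_len = (audio.len() * 2) as u32;
    let mut f = std::fs::File::create(path)?;
    f.write_all(b"RIFF")?;
    f.write_all(&(36 + data_len).to_le_bytes())?;
    f.write_all(b"WAVEfmt ")?;
    f.write_all(&16u32.to_le_bytes())?;             // fmt chunk size
    f.write_all(&1u16.to_le_bytes())?;              // PCM format
    f.write_all(&1u16.to_le_bytes())?;              // mono
    f.write_all(&sample_rate.to_le_bytes())?;
    f.write_all(&(sample_rate * 2).to_le_bytes())?; // byte rate
    f.write_all(&2u16.to_le_bytes())?;              // block align
    f.write_all(&16u16.to_le_bytes())?;             // bits per sample
    f.write_all(b"data")?;
    f.write_all(&data_len.to_le_bytes())?;
    for &s in audio {
        let v = (s.clamp(-1.0, 1.0) * i16::MAX as f32) as i16;
        f.write_all(&v.to_le_bytes())?;
    }
    Ok(())
}
```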
Voice Mixing
You can mix multiple voices by using weighted combinations:
// Mix 40% af_sky + 50% af_nicole
let audio = tts.synthesize_with_options(
"Hello!",
Some("af_sky.4+af_nicole.5"),
1.0,
1.0,
Some("en")
)?;
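A suffix like .4 reads as a weight of 0.4 (40%). A rough sketch of how such a spec could be parsed into (voice, weight) pairs (a hypothetical helper; the crate's actual parser may differ):

```rust
/// Parse a mix spec like "af_sky.4+af_nicole.5" into (voice, weight) pairs.
/// ".4" is read as 0.4, ".25" as 0.25; a bare name gets weight 1.0.
/// Hypothetical helper for illustration only.
fn parse_mix(spec: &str) -> Vec<(String, f32)> {
    spec.split('+')
        .map(|part| match part.rsplit_once('.') {
            Some((name, w)) if !w.is_empty() && w.chars().all(|c| c.is_ascii_digit()) => {
                // Scale by the number of digits: "4" -> 0.4, "25" -> 0.25.
                let scale = 10f32.powi(w.len() as i32);
                (name.to_string(), w.parse::<f32>().unwrap() / scale)
            }
            _ => (part.to_string(), 1.0),
        })
        .collect()
}
```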
Available Voices
Common voices include:
- af_sky (default) - Female, gentle
- af_nicole - Female
- af_bella - Female
- am_adam - Male
- am_michael - Male
Use tts.voices() to list all available voices.
Debug Logging
By default, kokoro-micro runs silently with no console output. To enable debug logging (model download progress, synthesis details, etc.), set the KOKORO_DEBUG environment variable:
# Enable debug logging
KOKORO_DEBUG=1 cargo run --example simple
# Or in your code
std::env::set_var("KOKORO_DEBUG", "1");
Debug logging shows:
- Model download progress
- Long-form synthesis chunking information
- Phoneme conversion details
- Audio generation statistics
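The silent-by-default behaviour amounts to a guard like the following (a sketch of the pattern, not the crate's exact code; `debug_enabled` and `debug_log` are illustrative names):

```rust
/// Sketch of a KOKORO_DEBUG gate: log only when the variable is set to "1".
fn debug_enabled() -> bool {
    std::env::var("KOKORO_DEBUG").map(|v| v == "1").unwrap_or(false)
}

fn debug_log(msg: &str) {
    if debug_enabled() {
        eprintln!("[kokoro-micro] {msg}");
    }
}
```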
Example
See examples/simple.rs:
# Run without debug output
cargo run --example simple
# Run with debug output
KOKORO_DEBUG=1 cargo run --example simple
Features
Optional Features
- cuda - Enable CUDA acceleration for ONNX Runtime
[dependencies]
kokoro-micro = { version = "0.2.0", features = ["cuda"] }
Model Files
Model files are automatically downloaded on first use to $HOME/.cache/k/:
- $HOME/.cache/k/0.onnx (310MB) - Kokoro ONNX model
- $HOME/.cache/k/0.bin (27MB) - Voice embeddings
The same cache directory is used on all platforms (Linux, macOS, Windows):
- Linux/macOS: $HOME/.cache/k/ (e.g., /home/user/.cache/k/)
- Windows: %USERPROFILE%/.cache/k/ (e.g., C:\Users\Username\.cache\k\)
Files are cached and shared across all applications using kokoro-micro.
License
Apache-2.0
Credits
Built with the Kokoro 82M parameter TTS model. A reduced version of kokoro-tiny by 8b-is.