[nu-demo] kokoro TTS produces mostly silence on ort rc.12 / ONNX 1.24 — quality regression vs rc.11 (home#173 follow-up) #197
Symptom
TTS audio generated via POST /tts (kokoro-micro 1.0.0 backend) on the ort 2.0.0-rc.12 / ONNX 1.24.4 build produces audio where only the first 3-4 syllables are clearly audible. The remaining 70%+ of the output WAV is near-silence (RMS < 200), with intermittent low-amplitude noise/whispers in place of the rest of the input text.
The user described it as: "1 word out of 2 is murmurs."
Reproduction (against herodemo, rc.12 + ONNX 1.24.4 build)
Three test phrases, all with voice=af_heart, speed=1.0. RMS analyzed in 100 ms windows:
Hello world. Testing one two three. This is a test of the voice synthesis system. The quick brown fox jumps over the lazy dog.
Pattern across all phrases: the first ~1 second of speech is clear (RMS 1000-5000), then it degrades into long stretches of near-silence (RMS 0-200), occasionally surfacing for the final word or two.
WAV header confirms valid output format (RIFF, 24kHz mono 16-bit PCM, kokoro's expected output).
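The per-window RMS measurement used above can be sketched in Rust as follows. This is a minimal illustration, assuming the WAV payload has already been decoded into i16 samples; the function names window_rms and silent_fraction are hypothetical and are not taken from the tooling used for this report. The 200 cutoff in the usage note mirrors the near-silence level quoted in the symptom.

```rust
/// Root-mean-square amplitude of each fixed-length window of 16-bit PCM samples.
/// A trailing partial window is measured over the samples it actually contains.
fn window_rms(samples: &[i16], sample_rate: u32, window_ms: u32) -> Vec<f64> {
    // At 24 kHz, a 100 ms window holds 24000 * 100 / 1000 = 2400 samples.
    let window_len = ((sample_rate as usize) * (window_ms as usize) / 1000).max(1);
    samples
        .chunks(window_len)
        .map(|w| {
            let sum_sq: f64 = w.iter().map(|&s| (s as f64) * (s as f64)).sum();
            (sum_sq / w.len() as f64).sqrt()
        })
        .collect()
}

/// Fraction of windows whose RMS is below `threshold` (0.0 for empty input).
fn silent_fraction(rms: &[f64], threshold: f64) -> f64 {
    if rms.is_empty() {
        return 0.0;
    }
    rms.iter().filter(|&&r| r < threshold).count() as f64 / rms.len() as f64
}
```

For the 24 kHz mono output described here, window_rms(&samples, 24_000, 100) yields one value per 2400 samples, and silent_fraction(&rms, 200.0) gives the share of near-silent windows.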
Hypothesis
The move from ort rc.11 / ONNX 1.23 to ort rc.12 / ONNX 1.24 introduced a numerical regression in an operator the kokoro vocoder/decoder relies on. The model file (kokoro_v1.0.onnx) is identical between builds; only the runtime changed.
This is exactly the failure mode Kristof was wary of when he asked "is there conflict somewhere? or is it too new?" on home#173.
Needs
Rebuild with ort = "=2.0.0-rc.11" (the previous pin), run the same TTS calls, and compare the RMS pattern. If rc.11 produces clean audio for the same input, the regression is confirmed.
Decision impact
This blocks the home#173 follow-up squash-merge: pulling hero_voice up to rc.12 was the whole point of that work. If TTS is degraded on rc.12, the right call may be to go back to Kristof's original instinct: stay on rc.11/ONNX 1.23 across the stack, even though it costs us the modern runtime.
Three branches currently on development_mik waiting on this verdict:
- f24619b
- e3e4d68
- 62af07a
Filed during herodemo end-to-end validation 2026-04-27. Tracking against home#173.
Signed-off-by: mik-tf
Update: A/B test result — rc.12 is INNOCENT
Rebuilt hero_voice on ort = "=2.0.0-rc.11" on herodemo (kept ONNX 1.24.4 since load-dynamic doesn't care). Hit the /tts endpoint with the same two test phrases the original report used.
RMS analysis confirms: same peak (5980), same 73% silent windows, same per-window energy pattern down to the digit. The audio dropouts are not introduced by ort rc.12 / ONNX 1.24. This pre-existed on the rc.11 / ONNX 1.23.2 stack; we just never listened to long-form output.
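A comparison like the one above (same peak, same silent share, same per-window pattern) could be automated roughly as follows. This is only a sketch, assuming each run's per-window RMS has been rounded to integers; Profile, summarize and first_divergence are hypothetical names, not part of hero_voice or kokoro-micro.

```rust
/// Summary of one RMS profile: peak window RMS and percentage of silent windows.
#[derive(Debug, PartialEq)]
struct Profile {
    peak: u32,
    silent_pct: u32,
}

/// Peak value and integer percentage of windows below `silence_threshold`.
fn summarize(rms: &[u32], silence_threshold: u32) -> Profile {
    let peak = rms.iter().copied().max().unwrap_or(0);
    let silent = rms.iter().filter(|&&r| r < silence_threshold).count();
    let silent_pct = if rms.is_empty() {
        0
    } else {
        (silent * 100 / rms.len()) as u32
    };
    Profile { peak, silent_pct }
}

/// Index of the first window where two profiles differ, or None if identical.
/// If one profile is a strict prefix of the other, the shorter length is returned.
fn first_divergence(a: &[u32], b: &[u32]) -> Option<usize> {
    if let Some(i) = a.iter().zip(b).position(|(x, y)| x != y) {
        return Some(i);
    }
    if a.len() != b.len() {
        Some(a.len().min(b.len()))
    } else {
        None
    }
}
```

Two builds count as matching when their summaries are equal and first_divergence returns None, which is the outcome reported for rc.11 versus rc.12.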
Implication
development_mik branches (hero_voice, hero_embedder, hero_skills): no rc.12 regression to worry about. Reverting to rc.11 would not have improved TTS quality.
Reframed scope
This ticket should be retitled / repurposed to track the kokoro-micro 1.0.0 audio quality issue. Suggested rename:
Reproduction is in the original report. Investigation paths:
- af_heart specific)
- kokoro-micro::TtsEngine (what does it do with text > N tokens)
No timeline pressure since the rc.12 work proceeds. Filing as P3 (cosmetic: TTS still produces some audio, AI Assistant remains usable for written replies).
Signed-off-by: mik-tf
Root cause + fix landed
Root cause:
kokoro-micro::TtsEngine::parse_voice_style was using voice_style[0..256] as the style vector for every synthesis. Kokoro voice files (0.bin) are stored as [max_seq_len, 1, 256] f32 tables: each row is the style embedding the model expects for sentences of that token length. Reading from offset 0 always feeds style[0] (the embedding for an empty zero-token sentence), which puts the model out of distribution and produces the audible "first ~1 second clear, rest is murmurs/silence" dropout. Confirmed against upstream Kokoros (lucasjinreal/Kokoros), which does this correctly via mix_styles(name, tokens.len()).
A secondary cosmetic issue, three $$$ boundary pad tokens instead of one, was fixed in the same fork.
Fix landed
Forked kokoro-micro v1.0.0 to https://forge.ourworld.tf/lhumina_code/kokoro-micro on branch development:
925cd6f b8f0de2 $$$) 98dcc6e
hero_voice now pins to the fork via git dep: e6573f0 on development.
Verification
User-confirmed in browser: "I confirm on herodemo.gent01.grid.tf the TTS works now".
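For reference, the style-vector lookup behind the root cause can be sketched as below. This is not the code from the fork or from Kokoros; it is a minimal illustration, assuming the voice file has already been decoded into a flat f32 slice laid out as [max_seq_len, 1, 256]. The name style_for_len is hypothetical, and clamping over-long inputs to the last row is an assumption made here so the sketch never reads out of bounds.

```rust
/// Width of one style embedding row, per the [max_seq_len, 1, 256] layout.
const STYLE_DIM: usize = 256;

/// Returns the style row matching a sentence of `token_count` tokens.
///
/// The buggy behaviour described in this issue is equivalent to always
/// returning `&table[0..STYLE_DIM]`, i.e. the row for a zero-token sentence.
///
/// Assumption: token counts past the end of the table are clamped to the
/// last row. Returns None if the table is empty or not a multiple of 256.
fn style_for_len(table: &[f32], token_count: usize) -> Option<&[f32]> {
    let rows = table.len() / STYLE_DIM;
    if rows == 0 || table.len() % STYLE_DIM != 0 {
        return None;
    }
    let row = token_count.min(rows - 1);
    let start = row * STYLE_DIM;
    Some(&table[start..start + STYLE_DIM])
}
```

The key point is that the row index comes from the token count of the sentence being synthesized, which is what mix_styles(name, tokens.len()) in Kokoros is reported to do.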
Long-term
Next: open a PR upstream against DavidValin/kokoro-micro (GitHub) so the fork can be retired in favour of an upstream release. Will track here.
Closing this once the upstream PR is filed.
Signed-off-by: mik-tf
Upstream PR opened
https://github.com/DavidValin/kokoro-micro/pull/2 has both fixes rebased on the upstream main branch (e59ca2d), src/lib.rs only, public API unchanged.
Closing: the fork serves the live fix, and the upstream PR will let us retire the fork once it merges.
Signed-off-by: mik-tf