[nu-demo] kokoro TTS produces mostly silence on ort rc.12 / ONNX 1.24 — quality regression vs rc.11 (home#173 follow-up) #197

Closed
opened 2026-04-27 14:42:35 +00:00 by mik-tf · 3 comments
Owner

Symptom

TTS audio generated via POST /tts (kokoro-micro 1.0.0 backend) on the ort 2.0.0-rc.12 / ONNX 1.24.4 build produces audio where only the first 3-4 syllables are clearly audible. The remaining 70%+ of the output WAV is near-silence (RMS < 200), with intermittent low-amplitude noise/whispers in place of the rest of the input text.

The user described it as: "1 word out of 2 is murmurs."

Reproduction (against herodemo, rc.12 + ONNX 1.24.4 build)

Five test phrases, all with voice=af_heart, speed=1.0. RMS analyzed in 100ms windows:

| Text | Duration | Silent windows (RMS < 200) | Normal windows (RMS ≥ 1000) |
|---|---|---|---|
| Hello world. | 2.70s | 18/27 | 5/27 |
| Testing one two three. | 3.25s | 21/33 | 8/33 |
| This is a test of the voice synthesis system. | 4.42s | 30/45 | 7/45 |
| The quick brown fox jumps over the lazy dog. | 4.75s | 32/48 | 7/48 |
| 18-word ONNX migration sentence | 7.50s | 55/75 | 13/75 |

Pattern across all phrases: first ~1 second of speech is clear (RMS 1000-5000), then degrades into long stretches of near-silence (RMS 0-200), occasionally surfacing for the final word or two.

WAV header confirms valid output format (RIFF, 24kHz mono 16-bit PCM, kokoro's expected output).
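The windowed RMS measurement used in the table can be sketched roughly as follows. This is a minimal illustration, not the script that produced the numbers above: the function names are hypothetical, and keeping the trailing partial window is an assumption chosen because it appears to match the window counts in the table (for example 48 windows for 4.75s). The 200 and 1000 thresholds are the ones quoted in the report.

```python
import math
import sys
import wave
from array import array


def read_wav_mono16(path):
    """Read a mono 16-bit PCM WAV file and return (samples, sample_rate)."""
    with wave.open(path, "rb") as wav:
        if wav.getnchannels() != 1 or wav.getsampwidth() != 2:
            raise ValueError("expected mono 16-bit PCM")
        rate = wav.getframerate()
        samples = array("h")
        samples.frombytes(wav.readframes(wav.getnframes()))
    if sys.byteorder == "big":
        samples.byteswap()  # WAV sample data is little-endian
    return samples, rate


def window_rms(samples, sample_rate, window_ms=100):
    """RMS per window; the trailing partial window is kept (assumption)."""
    size = sample_rate * window_ms // 1000
    if size <= 0:
        raise ValueError("window is shorter than one sample")
    values = []
    for start in range(0, len(samples), size):
        chunk = samples[start:start + size]
        values.append(math.sqrt(sum(s * s for s in chunk) / len(chunk)))
    return values


def summarize(rms_values, silent_below=200, normal_from=1000):
    """Return (silent_count, normal_count, total_windows)."""
    silent = sum(1 for v in rms_values if v < silent_below)
    normal = sum(1 for v in rms_values if v >= normal_from)
    return silent, normal, len(rms_values)
```

Windows with RMS between 200 and 1000 fall into neither bucket, which is why the silent and normal counts in the table do not add up to the total.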

Hypothesis

The ort rc.11 → rc.12 bump (ONNX Runtime 1.23 → 1.24) introduced a numerical regression in an operator the kokoro vocoder/decoder relies on. The model file (kokoro_v1.0.onnx) is identical between builds; only the runtime changed.

This is exactly the failure mode Kristof was wary of when he asked "is there conflict somewhere? or is it too new?" on home#173.

Needs

  1. A/B confirm: rebuild hero_voice with ort = "=2.0.0-rc.11" (the previous pin), run the same TTS calls, compare RMS pattern. If rc.11 produces clean audio for the same input, the regression is confirmed.
  2. If confirmed: decide between (a) revert to rc.11 / ONNX 1.23, (b) escalate to ort upstream / Microsoft (ONNX issue), (c) workaround at the kokoro-micro level.

Decision impact

This blocks the home#173 follow-up squash-merge, since pulling hero_voice up to rc.12 was the whole point of that work. If TTS is degraded on rc.12, the right call may be to go back to Kristof's original instinct: stay on rc.11/ONNX 1.23 across the stack, even though it costs us the modern runtime.

Three branches currently on development_mik waiting on this verdict:

  • hero_voice f24619b (https://forge.ourworld.tf/lhumina_code/hero_voice/commit/f24619b)
  • hero_embedder e3e4d68 (https://forge.ourworld.tf/lhumina_code/hero_embedder/commit/e3e4d68)
  • hero_skills 62af07a (https://forge.ourworld.tf/lhumina_code/hero_skills/commit/62af07a)

Filed during herodemo end-to-end validation 2026-04-27. Tracking against home#173 (https://forge.ourworld.tf/lhumina_code/home/issues/173).

Signed-off-by: mik-tf
Author
Owner

Update: A/B test result — rc.12 is INNOCENT

Rebuilt hero_voice on ort = "=2.0.0-rc.11" on herodemo (kept ONNX 1.24.4 since load-dynamic doesn't care). Hit the /tts endpoint with the same two test phrases the original report used:

rc12 pangram: 932eabdada2cdbdab5e9189d5997b757
rc11 pangram: 932eabdada2cdbdab5e9189d5997b757   ← byte-identical

rc12 long:    a77d4a009cb7c4ef8019cbad717ced44
rc11 long:    a77d4a009cb7c4ef8019cbad717ced44   ← byte-identical

RMS analysis confirms — same peak (5980), same 73% silent windows, same per-window energy pattern down to the digit. The audio dropouts are not introduced by ort rc.12 / ONNX 1.24. This pre-existed on the rc.11 / ONNX 1.23.2 stack — we just never listened to long-form output.
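A byte-identity check of this kind can be sketched as below. The 32-hex-digit values above are consistent with MD5 digests, but the hash algorithm is an assumption here, and the function names and file paths are hypothetical.

```python
import hashlib


def md5_hex(data):
    """Hex MD5 digest of a bytes object."""
    return hashlib.md5(data).hexdigest()


def file_md5(path, chunk_size=1 << 16):
    """Hex MD5 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for block in iter(lambda: fh.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def same_output(path_a, path_b):
    """True when both files hash to the same digest."""
    return file_md5(path_a) == file_md5(path_b)


# Hypothetical usage, with one WAV saved from each build:
#   same_output("rc11_pangram.wav", "rc12_pangram.wav")
```

Matching digests show the two builds emitted the same bytes for the same input, which is a stronger statement than matching RMS statistics.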

Implication

  1. home#173 follow-up unblocked. Safe to merge the three development_mik branches (hero_voice, hero_embedder, hero_skills) — no rc.12 regression to worry about. Reverting to rc.11 would not have improved TTS quality.
  2. The audio dropout issue itself is real but lives lower in the stack — likely in kokoro-micro 1.0.0 itself (chunking, vocoder, or G2P via espeak-rs producing wrong phonemes). Not on our migration's critical path. Reframing this issue accordingly:

Reframed scope

This ticket should be retitled / repurposed to track the kokoro-micro 1.0.0 audio quality issue. Suggested rename:

[nu-demo] kokoro-micro 1.0.0 produces ~70% silence on multi-syllable input — only first ~3-4 syllables clear

Reproduction is in the original report. Investigation paths:

  • Try other voices (issue may be af_heart specific)
  • Try other speeds
  • Compare vs upstream kokoro-micro examples / canonical Python kokoro reference
  • Look at the chunk/window logic in kokoro-micro::TtsEngine (what does it do with text > N tokens)

No timeline pressure since the rc.12 work proceeds. Filing as P3 (cosmetic — TTS still produces some audio, AI Assistant remains usable for written replies).

Signed-off-by: mik-tf

Author
Owner

Root cause + fix landed

Root cause: kokoro-micro::TtsEngine::parse_voice_style was using voice_style[0..256] as the style vector for every synthesis. Kokoro voice files (0.bin) are stored as [max_seq_len, 1, 256] f32 tables — each row is the style embedding the model expects for sentences of that token length. Reading from offset 0 always feeds style[0] (the embedding for an empty zero-token sentence), which puts the model out of distribution and produces the audible "first ~1 second clear, rest is murmurs/silence" dropout. Confirmed against upstream Kokoros (lucasjinreal/Kokoros), which does this correctly via mix_styles(name, tokens.len()).
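The difference between the two lookups can be illustrated with a small sketch. This is not the kokoro-micro or Kokoros code: the function names are hypothetical, little-endian f32 storage is an assumption, and clamping an out-of-range token count to the last row is an assumption added only to keep the example safe.

```python
import struct

STYLE_DIM = 256  # width of one style embedding, per the [max_seq_len, 1, 256] layout


def load_style_table(raw):
    """Decode f32 bytes (little-endian assumed) into rows of STYLE_DIM floats."""
    if len(raw) % (4 * STYLE_DIM) != 0:
        raise ValueError("voice data is not a whole number of style rows")
    count = len(raw) // 4
    flat = struct.unpack("<%df" % count, raw)
    return [flat[i:i + STYLE_DIM] for i in range(0, count, STYLE_DIM)]


def style_at_offset_zero(table):
    """The behaviour described as the bug: always the first row."""
    return table[0]


def style_for_tokens(table, token_count):
    """Row selected by token count, clamped to the table (clamping is an assumption)."""
    index = min(max(token_count, 0), len(table) - 1)
    return table[index]
```

With a table whose row i is filled with the value i, `style_for_tokens(table, 5)` returns the row of fives, while `style_at_offset_zero(table)` returns the row of zeros regardless of input length.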

A secondary cosmetic issue — three $$$ boundary pad tokens instead of one — was fixed in the same fork.

Fix landed

Forked kokoro-micro v1.0.0 to https://forge.ourworld.tf/lhumina_code/kokoro-micro on branch development:

| Commit | Fix |
|---|---|
| 925cd6f | v1.0.0 baseline |
| b8f0de2 | Single BOS/EOS pad (was $$$) |
| 98dcc6e | Index voice style by token count (root cause) |

hero_voice now pins to the fork via git dep — e6573f0 on development.

Verification

metric                   before fix     after fix
silent windows (<200)    32 / 48 (66%)  12 / 52 (23%)
normal windows (>=1000)   7 / 48         34 / 52
audible content span     ~600-1200ms    ~400-4400ms

User-confirmed in browser: "I confirm on herodemo.gent01.grid.tf the TTS works now".

Long-term

Next: open a PR upstream against DavidValin/kokoro-micro (GitHub) so the fork can be retired in favour of an upstream release. Will track here.

Closing this once the upstream PR is filed.

Signed-off-by: mik-tf

Author
Owner

Upstream PR opened

https://github.com/DavidValin/kokoro-micro/pull/2 — both fixes rebased on the upstream main branch (e59ca2d), src/lib.rs only, public API unchanged.

Closing — fork serves the live fix and the upstream PR will let us retire the fork once it merges.

Signed-off-by: mik-tf

Reference
lhumina_code/home#197