Voice: hybrid streaming TTS with trackbar player #89

Open
opened 2026-03-25 04:06:44 +00:00 by mik-tf · 0 comments
Owner

Context

v0.7.1-dev has working TTS (Kokoro local with Groq fallback), pause/play/stop controls, and progress-tracking infrastructure in voice.rs. Currently, TTS generates audio only after the full AI response completes, so the user waits for the entire response before hearing anything.

Goal

Hybrid streaming TTS: the user hears audio sentence by sentence while the AI is still responding, then gets a full trackbar with seek/replay once the response completes.

Implementation

Phase 1: Sentence-level SSE audio streaming

  • hero_agent: split response text into sentences as tokens stream in
  • hero_agent: send TTS for each sentence immediately via SSE event: audio (multiple events per response)
  • hero_archipelagos: queue and play audio chunks sequentially
  • hero_archipelagos: keep all chunks in a combined AudioBuffer
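
A minimal sketch of how the per-sentence `event: audio` frames could be written and read back. The payload shape (`seq`, `audio` as base64) and both function names are assumptions for illustration; the issue does not specify them. Only the `event:` / `data:` line layout and the blank-line terminator come from the SSE format itself.

```typescript
// Hypothetical payload shape; not defined anywhere in this issue.
interface AudioChunkEvent {
  seq: number;   // position of the sentence within the response
  audio: string; // base64-encoded audio for one sentence
}

// Serialize one chunk as a Server-Sent Events frame.
// A frame is a block of "field: value" lines terminated by a blank line.
function formatAudioEvent(chunk: AudioChunkEvent): string {
  return `event: audio\ndata: ${JSON.stringify(chunk)}\n\n`;
}

// Parse a buffer of complete SSE frames and return the audio chunks in
// playback order. Frames with other event names are ignored.
function parseAudioEvents(stream: string): AudioChunkEvent[] {
  const chunks: AudioChunkEvent[] = [];
  for (const frame of stream.split("\n\n")) {
    let eventName = "message"; // SSE default when no event field is present
    const dataLines: string[] = [];
    for (const line of frame.split("\n")) {
      const m = /^(event|data):\s?(.*)$/.exec(line);
      if (m === null) continue;
      if (m[1] === "event") eventName = m[2];
      else dataLines.push(m[2]);
    }
    if (eventName === "audio" && dataLines.length > 0) {
      chunks.push(JSON.parse(dataLines.join("\n")) as AudioChunkEvent);
    }
  }
  // Sort by sequence number so playback order does not depend on arrival order.
  return chunks.sort((a, b) => a.seq - b.seq);
}
```

The parser assumes it is handed whole frames; a real client would need to buffer partial frames between reads. If the response stream is consumed through a browser `EventSource`, `addEventListener("audio", ...)` would replace the manual parsing.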

Phase 2: Full trackbar after response

  • Once all chunks received, stitch into single AudioBuffer
  • Render progress bar: [⏸] [━━━━●━━━━] 0:12/0:35
  • Click-to-seek on trackbar (create new BufferSource at offset)
  • Replay button (seek to 0:00)
  • Time display (elapsed / total)
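
The trackbar steps above can be sketched as pure functions, kept separate from the Web Audio objects so they are easy to test. This assumes mono chunks already decoded to `Float32Array` at one shared sample rate; all function names are hypothetical.

```typescript
// Concatenate decoded per-sentence sample arrays into one contiguous array.
// Assumes every chunk is mono and uses the same sample rate.
function stitchChunks(chunks: Float32Array[]): Float32Array {
  let total = 0;
  for (const c of chunks) total += c.length;
  const out = new Float32Array(total);
  let offset = 0;
  for (const c of chunks) {
    out.set(c, offset);
    offset += c.length;
  }
  return out;
}

// Format a number of seconds as m:ss, rounding down to whole seconds.
function formatTime(seconds: number): string {
  const s = Math.max(0, Math.floor(seconds));
  const sec = s % 60;
  return `${Math.floor(s / 60)}:${sec < 10 ? "0" : ""}${sec}`;
}

// Build the "elapsed/total" label shown next to the trackbar.
function timeLabel(elapsed: number, total: number): string {
  return `${formatTime(elapsed)}/${formatTime(total)}`;
}

// Convert a click position on the trackbar into a seek offset in seconds.
// clickX and barLeft are in the same pixel coordinate space.
function seekOffset(
  clickX: number,
  barLeft: number,
  barWidth: number,
  duration: number
): number {
  if (barWidth <= 0) return 0;
  const fraction = Math.min(1, Math.max(0, (clickX - barLeft) / barWidth));
  return fraction * duration;
}
```

In the browser, the stitched samples would be copied into an `AudioBuffer` (for example via `getChannelData(0).set(...)`), and the value from `seekOffset` would be passed as the second argument of `start(0, offsetSeconds)` on a freshly created `AudioBufferSourceNode`, since a source node can only be started once.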

Phase 3: Polish

  • Smooth progress animation (requestAnimationFrame)
  • Mini player bar below toolbar (collapses when not playing)
  • Persist play speed from Settings (0.5x-2.0x)
  • Keyboard shortcuts (space = pause, left/right = seek)
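
A rough sketch of the state logic behind the polish items, assuming the 0.5x-2.0x range from Settings and a hypothetical 5-second seek step (the issue does not give a step size). The names are illustrative only.

```typescript
const MIN_SPEED = 0.5;
const MAX_SPEED = 2.0;

// Clamp a persisted speed setting into the supported range.
// Falls back to 1.0 if the stored value is not a finite number.
function clampSpeed(speed: number): number {
  if (!isFinite(speed)) return 1.0;
  return Math.min(MAX_SPEED, Math.max(MIN_SPEED, speed));
}

// Current position inside the buffer, in seconds.
// now and startTime are AudioContext clock readings; offset is where
// playback started in the buffer; rate is the playback speed.
function playbackPosition(
  now: number,
  startTime: number,
  offset: number,
  rate: number,
  duration: number
): number {
  const pos = offset + (now - startTime) * rate;
  return Math.min(duration, Math.max(0, pos));
}

type PlayerAction =
  | { kind: "togglePause" }
  | { kind: "seekBy"; seconds: number }
  | { kind: "none" };

// Map a KeyboardEvent.key value to a player action.
// seekStep is a hypothetical default, not taken from the issue.
function actionForKey(key: string, seekStep: number = 5): PlayerAction {
  switch (key) {
    case " ":
      return { kind: "togglePause" };
    case "ArrowLeft":
      return { kind: "seekBy", seconds: -seekStep };
    case "ArrowRight":
      return { kind: "seekBy", seconds: seekStep };
    default:
      return { kind: "none" };
  }
}
```

A `requestAnimationFrame` callback could call `playbackPosition` once per frame and set the trackbar fill to `position / duration`, which keeps the animation tied to the audio clock rather than to wall-clock timers.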

Technical notes

  • AudioBuffer.duration gives total seconds
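  • AudioContext.currentTime - startTime gives elapsed seconds since the source started (add the seek offset after a seek, and scale by the playback rate when speed is not 1.0x)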
  • source.start(0, offsetSeconds) for seeking
  • AudioContext.suspend/resume for pause/play
  • Sentence splitting: split on ". ", "! ", "? " (punctuation followed by a space) or "\n", with a minimum of 20 characters per sentence
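
The sentence-splitting rule could look roughly like this when applied to a token stream. This is a sketch under one assumption the issue leaves open: a fragment shorter than the 20-character minimum is merged into the following sentence rather than dropped. The class name and `flush` method are hypothetical.

```typescript
const MIN_SENTENCE_CHARS = 20;

class SentenceSplitter {
  private buffer = "";

  // Feed the next streamed token; returns any sentences that became complete.
  push(token: string): string[] {
    this.buffer += token;
    const sentences: string[] = [];
    let searchFrom = 0;
    for (;;) {
      const cut = this.findBoundary(searchFrom);
      if (cut === -1) break;
      const candidate = this.buffer.slice(0, cut).trim();
      if (candidate.length >= MIN_SENTENCE_CHARS) {
        sentences.push(candidate);
        this.buffer = this.buffer.slice(cut);
        searchFrom = 0;
      } else {
        // Too short to synthesize on its own: keep it and look for the
        // next boundary so it merges with the following sentence.
        searchFrom = cut;
      }
    }
    return sentences;
  }

  // Call once the response has finished to emit whatever text is left.
  flush(): string[] {
    const rest = this.buffer.trim();
    this.buffer = "";
    return rest.length > 0 ? [rest] : [];
  }

  // Index just past the first boundary at or after `from`, or -1 if none.
  // Boundaries are ". ", "! ", "? " and "\n".
  private findBoundary(from: number): number {
    for (let i = from; i < this.buffer.length; i++) {
      const ch = this.buffer[i];
      if (ch === "\n") return i + 1;
      if ((ch === "." || ch === "!" || ch === "?") && this.buffer[i + 1] === " ") {
        return i + 2;
      }
    }
    return -1;
  }
}
```

Requiring the trailing space means a period at the very end of the buffer is not treated as a boundary until the next token arrives, which avoids cutting in the middle of something like "3.14" when the digits are split across tokens.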

Related

  • #78 Voice AI pipeline
  • #88 SPA/WASM migration
Reference
lhumina_code/home#89