Voice AI Phase 2: cross-browser wake word + local Whisper STT #78

Closed
opened 2026-03-23 13:16:32 +00:00 by mik-tf · 14 comments
Owner

Context

Phase 1 (issue #74) delivered wake word detection using browser webkitSpeechRecognition — Chrome/Edge only. This phase makes wake word and STT work on ALL browsers by moving detection server-side.

Architecture

Browser → mic WebSocket → hero_voice
                            ├─ Silero VAD (local ONNX, already have)
                            ├─ Rustpotter wake word (local, <1MB)
                            ├─ Whisper ONNX tiny (local, 75MB)
                            └─ Groq Whisper (cloud fallback)

Level 1: Server-side wake word via Rustpotter

Rustpotter is a pure Rust wake word engine (~500KB model). Detects keywords without full STT.

  • Add rustpotter crate to hero_voice
  • Train/configure "Hero" wake word model
  • hero_voice WebSocket: run Rustpotter on incoming audio stream
  • When "Hero" detected, send {"type": "wake_word", "word": "hero"} via WebSocket
  • Browser: on wake_word message, activate conversation mode
  • Keep browser-side webkitSpeechRecognition as fallback for Chrome
  • Works on: ALL browsers, ALL platforms
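The detection flow above can be sketched as a tiny per-connection state machine. This is a hedged illustration, not the real hero_voice code: `VoiceState`, `on_audio_frame`, and the `detected` flag are hypothetical stand-ins for the WebSocket handler and Rustpotter's per-frame detection result; only the `wake_word` JSON shape comes from this issue.

```rust
#[derive(Debug, PartialEq)]
enum VoiceState {
    Passive,      // streaming audio, only wake word detection runs
    Conversation, // wake word heard, full STT active
}

/// Build the WebSocket message sent to the browser when the wake word fires.
fn wake_word_message(word: &str) -> String {
    format!(r#"{{"type": "wake_word", "word": "{}"}}"#, word)
}

/// Advance the per-connection state when the detector reports a hit.
/// `detected` stands in for Rustpotter's per-frame detection result.
fn on_audio_frame(state: &mut VoiceState, detected: bool) -> Option<String> {
    if *state == VoiceState::Passive && detected {
        *state = VoiceState::Conversation;
        return Some(wake_word_message("hero"));
    }
    None
}

fn main() {
    let mut state = VoiceState::Passive;
    assert_eq!(on_audio_frame(&mut state, false), None);
    let msg = on_audio_frame(&mut state, true).unwrap();
    assert_eq!(msg, r#"{"type": "wake_word", "word": "hero"}"#);
    assert_eq!(state, VoiceState::Conversation);
}
```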

Level 2: Local Whisper STT via ONNX

Replace cloud Groq dependency with local Whisper inference. We already have ort (ONNX Runtime) in hero_embedder.

  • Export Whisper tiny/small to ONNX format
  • Add ort dependency to hero_voice
  • Implement local transcription: Silero VAD → extract segment → Whisper ONNX → text
  • Auto-download model on first use (like hero_embedder does)
  • Fallback chain: local Whisper → Groq Whisper (cloud)
  • Config: HERO_VOICE_STT_LOCAL=true env var to prefer local
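The fallback chain can be sketched as follows. This is a minimal illustration, not the hero_voice implementation: the two closures stand in for the real local-ONNX and Groq HTTP transcribers, and `prefer_local` would be derived from the `HERO_VOICE_STT_LOCAL` env var.

```rust
/// Fallback chain sketch: prefer local Whisper when enabled, fall back to the
/// Groq cloud path when local inference fails.
fn transcribe<L, C>(
    audio: &[f32],
    prefer_local: bool,
    local: L,
    cloud: C,
) -> Result<String, String>
where
    L: Fn(&[f32]) -> Result<String, String>,
    C: Fn(&[f32]) -> Result<String, String>,
{
    if prefer_local {
        if let Ok(text) = local(audio) {
            return Ok(text);
        }
        // Local inference failed (model missing, OOM, …) → cloud fallback.
    }
    cloud(audio)
}

fn main() {
    // The service would read the toggle like this:
    let _prefer_local = std::env::var("HERO_VOICE_STT_LOCAL")
        .map(|v| v == "true")
        .unwrap_or(false);

    // Local path broken → the cloud path answers.
    let out = transcribe(
        &[0.0_f32; 160],
        true,
        |_| Err("model not downloaded".to_string()),
        |_| Ok("hello hero".to_string()),
    );
    assert_eq!(out, Ok("hello hero".to_string()));
}
```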

Level 3: Client-side WASM wake word (future)

For fully offline client-side detection (no WebSocket needed):

  • Evaluate Whisper.cpp WASM vs Porcupine WASM vs Vosk.js
  • Progressive enhancement: download model on opt-in
  • 50-75MB download — only for users who want offline

Key decisions

| Decision | Choice | Why |
|----------|--------|-----|
| Wake word engine | Rustpotter (Rust, open source) | Tiny model, pure Rust, no C deps, Apache licensed |
| Local STT | Whisper ONNX via ort | Already have ort in ecosystem (hero_embedder), proven |
| NOT Parakeet | Skip | NVIDIA-focused, no Rust crate, huge models |
| NOT Candle Whisper | Skip for now | Pure Rust but less mature than ort path |

Existing infrastructure we reuse

  • hero_voice WebSocket (conversation mode from #74)
  • Silero VAD V5 (already in hero_voice)
  • ONNX Runtime ort crate (already in hero_embedder)
  • Browser mic streaming code (already in AI island from #74)
Author
Owner

Phasing update

This is next round work. Current round focuses on fixing basics first:

  • Read aloud (speechSynthesis user gesture)
  • Conversation persistence (API format)
  • uv for MCP execute_code
  • Convo AudioContext + AudioWorkletNode

Once basics work → Phase 2 (this issue) for cross-browser wake word + local Whisper.

Author
Owner

Starting now — server-side audio stack

#80 delivered conversation CRUD, voice input, auto-scroll, and web-sys foundation. TTS playback deferred here because server-side audio is the production solution (no browser gesture fights).

Deliverables

  1. Rustpotter wake word (hero_voice)

    • Pure Rust wake word engine, no browser API dependency
    • Detect "Hero" / "Hey Hero" server-side from WebSocket audio stream
    • Send {"type":"wake_word"} to client → triggers conversation mode
    • Works ALL browsers (Chrome, Firefox, Safari, Brave)
  2. Local Whisper STT (hero_voice)

    • ort crate + Whisper tiny ONNX model
    • Fallback chain: local ONNX → Groq cloud
    • Near-zero latency for short utterances (no network round-trip)
    • Model pre-downloaded in Docker image (like BGE embedder)
  3. Server TTS via WebSocket (hero_agent + hero_voice)

    • Server generates TTS audio (gpt-4o-mini-tts via aibroker)
    • Sends audio frames back via WebSocket
    • Client plays via AudioContext (no gesture needed — WebSocket data)
    • Eliminates ALL browser speechSynthesis issues
  4. AudioWorkletNode (hero_archipelagos)

    • Replace deprecated ScriptProcessorNode in convo mode
    • Modern Web Audio API, runs on audio thread
    • Better performance, no main thread blocking

Repos

  • hero_voice — Rustpotter, Whisper ONNX
  • hero_agent — server TTS WebSocket endpoint
  • hero_archipelagos — AudioWorkletNode, WebSocket TTS playback
  • hero_services — Dockerfile (ONNX runtime, Rustpotter model)

Build

make dist-clean-wasm (island changes) + model downloads in Docker

Signed-off-by: mik-tf

Author
Owner

Ready to start — bundled scope

Current state (v0.7.5-dev on herodev)

Working: SSE chat, multiple messages, voice transcription, conversations CRUD, error dismiss, transcribing text reset, MCP 62 tools, 21/22 build

Broken: auto-scroll (#84), read aloud/TTS (#78), wake word Firefox (#78)

Deliverables for #78 (includes #84 fix)

1. Auto-scroll fix (#84) — 15 min

  • File: hero_archipelagos/archipelagos/intelligence/ai/src/views/message_list.rs
  • Current: set_scroll_top fires before DOM update
  • Fix: wrap in request_animation_frame callback via web_sys::window().request_animation_frame() + Closure::once
  • Pure Rust, no JS eval

2. Server-side TTS via WebSocket — hero_agent + hero_voice

  • hero_agent: add TTS to the SSE response or WebSocket — after event: done, generate audio via aibroker (gpt-4o-mini-tts) and send base64 audio
  • hero_archipelagos AI island: receive audio in the SSE/WS handler, play via AudioContext (already created in voice.rs)
  • No browser gesture needed — audio data comes from server
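The extra SSE event after `event: done` is just a text frame. A minimal sketch of the framing, assuming the payload is base64-encoded audio (the sample string below is a placeholder, not real MP3 data):

```rust
/// Frame a base64 TTS payload as an SSE event appended after `event: done`.
/// SSE events are an "event:" line, one or more "data:" lines, then a blank
/// line terminating the event.
fn sse_audio_event(base64_audio: &str) -> String {
    format!("event: audio\ndata: {}\n\n", base64_audio)
}

fn main() {
    let frame = sse_audio_event("SUQzBAAAAA==");
    assert!(frame.starts_with("event: audio\n"));
    assert!(frame.ends_with("\n\n"));
}
```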

3. Rustpotter wake word — hero_voice

  • Add rustpotter crate to hero_voice
  • Train/download .rpw model for "Hero" / "Hey Hero"
  • Detect keyword in WebSocket audio stream server-side
  • Send {"type":"wake_word"} to client
  • Works ALL browsers (Chrome, Firefox, Safari, Brave)

4. Local Whisper STT (ONNX) — hero_voice

  • Add ort crate + Whisper tiny ONNX model
  • Fallback chain: local ONNX → Groq cloud
  • Pre-download model in Docker image (like BGE embedder)

Architecture

Browser                         Server
  ├─ mic audio (PCM) ──────────► hero_voice WebSocket
  │                              ├─ Rustpotter: wake word?
  │                              ├─ Whisper ONNX: transcribe
  │◄── {"transcription":".."} ──┤
  ├─ chat (HTTP SSE) ──────────► hero_agent /api/chat
  │◄── SSE tokens ─────────────┤
  │◄── event: done + audio ────┤ (base64 TTS audio)
  ├─ AudioContext.play() ──────│ (no gesture needed)

Files to modify

  • hero_archipelagos/archipelagos/intelligence/ai/src/views/message_list.rs — auto-scroll
  • hero_archipelagos/archipelagos/intelligence/ai/src/island.rs — TTS audio playback from SSE
  • hero_archipelagos/archipelagos/intelligence/ai/src/voice.rs — play received audio
  • hero_agent/crates/hero_agent_server/src/routes.rs — add TTS to done event
  • hero_voice/crates/hero_voice/src/ — Rustpotter, Whisper ONNX
  • hero_services/docker/build-local.sh — model downloads

Build

make dist-clean-wasm (island changes) + model downloads in Docker

Pipeline

branch → code → build → make test-local 20/20 → squash merge → deploy → verify

Signed-off-by: mik-tf

Author
Owner

Implementation Design — Decided & In Progress

Full voice AI pipeline for v0.7.5-dev. All 6 deliverables coded, pending build + test.

Architecture

Browser (any)                          Server
  │                                      │
  ├─ mic audio (PCM 16kHz) ────────────► hero_voice WebSocket
  │                                      ├─ Rustpotter: "Hey Hero"? → wake_word msg
  │◄── {"type":"wake_word"} ────────────┤
  │  → activates conversation mode       │
  │                                      ├─ VAD → Whisper local/Groq → text
  │◄── {"type":"transcription"} ────────┤
  │  → stops recording                   │
  │  → injects text, sends to chat       │
  │                                      │
  ├─ POST /api/chat {voice:"alloy"} ───► hero_agent
  │◄── SSE event: token ───────────────┤ (streaming)
  │◄── SSE event: done ────────────────┤ (final response)
  │◄── SSE event: audio ──────────────┤ (base64 MP3 TTS)
  │  → AudioContext.play()              │
  │  → onended → resume recording       │
  │  → loop continues                   │

Deliverables

| # | What | Files | Status |
|---|------|-------|--------|
| D1 | Auto-scroll fix (rAF) | message_list.rs | Done |
| D2 | Inline TTS via SSE stream | routes.rs, ai_service.rs, voice.rs, island.rs | Done |
| D3 | Rustpotter server-side wake word | wakeword.rs (new), ws.rs, island.rs | Done |
| D4 | Local Whisper STT (whisper-rs) | local_transcriber.rs (new), ws.rs, build-local.sh | Done |
| D5 | Conversation mode end-to-end | island.rs, voice.rs | Done |
| D6 | Read-aloud integration | voice.rs, island.rs | Done |

Key Decisions

| Decision | Choice | Why |
|----------|--------|-----|
| Wake word engine | Rustpotter 3.x (pure Rust) | Tiny model, Apache licensed, no C deps |
| Wake word training | Self-training at startup via TTS | Zero manual steps — generates 5 voice samples via aibroker TTS, trains model programmatically, caches .rpw on disk |
| Local STT | whisper-rs (whisper.cpp bindings) | Proven, fast, 75MB model (ggml-tiny.en) |
| STT fallback | Local Whisper → Groq cloud | HERO_VOICE_STT_LOCAL=true enables local |
| TTS delivery | Inline via SSE event: audio | No extra HTTP round-trip, audio arrives with response |
| Conversation loop | stop recording → chat → TTS → onended → resume recording | Echo-free, natural flow |
| Auto-scroll | requestAnimationFrame wrapper | Fires after DOM paint, no race condition |

Repos Touched

  • hero_archipelagos — 4 files (island, voice, ai_service, message_list)
  • hero_agent — 1 file (routes.rs)
  • hero_voice — 5 files + 2 new (wakeword.rs, local_transcriber.rs)
  • hero_services — 1 file (build-local.sh)

Build

make dist-clean-wasm required (island changes + new server modules).

Signed-off-by: mik-tf

Author
Owner

Updated Design — Wake Word UX (Industry Standard Pattern)

Two distinct voice modes

| Mode | Trigger | Behavior | End |
|------|---------|----------|-----|
| Wake command | "Hey Hero" | Chime → one command → one response → done | After TTS plays |
| Conversation | Click Convo button | Continuous loop: listen → respond → listen | User clicks OFF |

Wake command flow

User: "Hey Hero, what services are running?"
  → VAD + Whisper transcribes full utterance
  → Server detects "hey hero" prefix
  → Strips prefix → command: "what services are running?"
  → Sends to chat as message
  → TTS plays the answer
  → Back to passive wake listening

User: "Hey Hero" (pause)
  → Server detects wake word, no command attached
  → Plays short audio chime
  → Listens for next speech segment (5s timeout)
  → Sends that as chat message
  → TTS plays the answer
  → Back to passive wake listening
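The prefix handling above can be sketched as a small parser. This is a hypothetical illustration of the logic, not the ws.rs code; `parse_wake` and its normalization rules are assumptions:

```rust
/// Parse a transcript for the wake phrase. Returns:
/// - None                → no wake word, stay passive
/// - Some(None)          → wake word alone → chime, listen 5s
/// - Some(Some(command)) → wake word + command → send command to chat
fn parse_wake(transcript: &str) -> Option<Option<String>> {
    // Normalize: lowercase, drop punctuation so "Hey Hero," matches.
    let normalized: String = transcript
        .to_lowercase()
        .chars()
        .filter(|c| c.is_alphanumeric() || c.is_whitespace())
        .collect();
    let trimmed = normalized.trim();
    // Check the longer phrase first so "hey hero" isn't split by "hero".
    let rest = trimmed
        .strip_prefix("hey hero")
        .or_else(|| trimmed.strip_prefix("hero"))?;
    let command = rest.trim();
    if command.is_empty() {
        Some(None)
    } else {
        Some(Some(command.to_string()))
    }
}

fn main() {
    assert_eq!(
        parse_wake("Hey Hero, what services are running?"),
        Some(Some("what services are running".to_string()))
    );
    assert_eq!(parse_wake("Hey Hero"), Some(None));
    assert_eq!(parse_wake("unrelated speech"), None);
}
```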

Key decisions

| Decision | Choice | Why |
|----------|--------|-----|
| Confirmation | Short chime (not spoken greeting) | Fast, non-verbal, matches Alexa/Google/Siri |
| "Hey Hero + command" | Process immediately | User says it as one phrase, no round-trip wait |
| "Hey Hero" alone | Chime + listen 5s | Matches industry pattern |
| After response | Back to passive listening | NOT conversation mode — that's separate |
| Detection engine | Whisper (Groq cloud + local) | Already have it, all browsers, no new deps |

Server protocol (ws.rs)

  • New client message: {type: "listen"} — passive mode, VAD+transcribe, only wake detection
  • Wake detected with command: {type: "wake_word", command: "what services are running"}
  • Wake detected alone: {type: "wake_word", command: null}
  • After wake: auto-switches to recording mode for follow-up if no command
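The two `wake_word` variants can be serialized as below. A dependency-free sketch with hand-rolled JSON (the real server would presumably use serde); only the message shapes come from the protocol above:

```rust
/// Serialize the wake_word server message, with or without an attached
/// command. The `command` field is null when "Hey Hero" was said alone.
fn wake_word_reply(command: Option<&str>) -> String {
    match command {
        Some(cmd) => format!(
            r#"{{"type":"wake_word","command":"{}"}}"#,
            cmd.replace('"', "\\\"") // minimal JSON string escaping
        ),
        None => r#"{"type":"wake_word","command":null}"#.to_string(),
    }
}

fn main() {
    assert_eq!(
        wake_word_reply(None),
        r#"{"type":"wake_word","command":null}"#
    );
    assert_eq!(
        wake_word_reply(Some("what services are running")),
        r#"{"type":"wake_word","command":"what services are running"}"#
    );
}
```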

Implementation status

| # | What | Status |
|---|------|--------|
| D1 | Auto-scroll (rAF) | Built |
| D2 | Inline TTS via SSE | Built |
| D3 | Wake word via Whisper (all browsers) | Building now |
| D4 | Local Whisper STT | Built |
| D5 | Conversation mode loop | Built |
| D6 | Read-aloud | Built |
| D7 | Wake command UX (chime + one-shot) | Building now |

Build: 21/22 (only hero_compute fails, pre-existing #83)

Signed-off-by: mik-tf

Author
Owner

v0.7.6-dev Status — Voice Pipeline

Deployed on herodev

  • Build: 21/22 (hero_compute #83)
  • Integration: 20/20
  • Smoke: 111/118 (5 pre-existing)

What's built and compiles

| Feature | Status | Files |
|---------|--------|-------|
| Auto-scroll (rAF) | Built | message_list.rs |
| Inline TTS via SSE | Built | routes.rs, ai_service.rs, voice.rs |
| Wake word (browser) | Built | island.rs — Chrome/Edge |
| Wake word (server Whisper) | Built | ws.rs listen mode — all browsers |
| Local Whisper STT | Built | local_transcriber.rs (whisper-rs) |
| Conversation mode loop | Built | island.rs, voice.rs |
| Read-aloud | Built | voice.rs, island.rs |
| Wake command UX | Built | "Hey Hero + command" or chime |

What's broken — TTS runtime

hero_agent calls Groq Orpheus TTS via reqwest but gets 401 Unauthorized inside Docker. The same API key works with curl from the same container. Suspected root cause: reqwest's hyper-rustls TLS backend behaves differently from curl/OpenSSL against the Groq API's auth handling.

Fix options for next session:

  1. Switch hero_agent reqwest to native-tls (uses OpenSSL)
  2. Route TTS through aibroker (fix aibroker TTS service for Groq)
  3. Add TTS endpoint to hero_voice_ui (its HTTP client already works for Groq Whisper)

What's blocked — dependency conflicts

| Feature | Dep | Issue | Fix |
|---------|-----|-------|-----|
| Rustpotter wake word | candle-core 0.2.2 | half/rand conflict | Fork, update candle to 0.8+ |
| Kokoro local TTS | kokoro-micro | ort version conflict with VAD | Update voice_activity_detector to ort rc.11 |

Next steps

  1. Fix TTS (option 1 or 3 — ~30 min)
  2. Verify read-aloud, wake word, conversation mode end-to-end
  3. Fork voice_activity_detector for ort alignment → unlocks Kokoro + Rustpotter

Signed-off-by: mik-tf

Author
Owner

Updated Plan — Voice Pipeline Final Architecture

Settings UX

| Tab | Content | Status |
|-----|---------|--------|
| Appearance | Theme, borders, background | Exists |
| Voice & Audio | TTS provider, voice, speed, auto-read, wake word | New tab |
| Environment | API keys (Groq, OpenRouter, etc.) — powers voice when Groq selected | Exists |

API keys stay ONLY in Environment tab (already has Groq). Voice tab picks which provider to USE. No duplication.

Voice & Audio tab layout

VOICE & AUDIO

TTS Provider    [Kokoro (Local)] ▾
                 Kokoro (Local) — free, private, no API key
                 Groq Orpheus — needs Groq API key (set in Environment)
                 Browser — built-in, no setup

Voice           [Diana] ▾  (changes per provider)
Speed           [1.0x] ▾
Auto-read       [ON/OFF]  (AI responses spoken aloud)
Wake word       [ON/OFF]  (say "Hey Hero" to activate)

TTS priority: local first, cloud fallback

| Priority | Provider | Cost | Quality | Privacy |
|----------|----------|------|---------|---------|
| 1st | Kokoro (local ONNX) | Free | Near-human | 100% private |
| 2nd | Groq Orpheus (cloud) | Free tier | Expressive | Text sent to Groq |
| 3rd | Browser speechSynthesis | Free | Robotic | 100% local |

OS-wide voice service

hero_voice_ui becomes the voice gateway. Any island calls POST /hero_voice_ui/api/tts with text + provider preference → gets audio back.

| App | Voice feature | Status |
|-----|---------------|--------|
| AI Assistant | Read-aloud, wake word, conversation mode | This issue |
| Hero Books | Read chapters aloud | Future issue (after voice confirmed working) |
| Communication | Voice messages, transcription | Exists |

Implementation steps

| # | What | Effort | Unblocks |
|---|------|--------|----------|
| 1 | Fix reqwest native-tls in hero_agent | 10 min | TTS works NOW |
| 2 | Fork voice_activity_detector → pin ort to rc.11 | 1-2 hours | Kokoro + Rustpotter |
| 3 | Integrate Kokoro TTS in hero_voice_ui | 2 hours | Local TTS, zero-config |
| 4 | Add Voice & Audio tab to Settings page | 2-3 hours | User preferences |
| 5 | Wire settings (localStorage) → AI island + voice service | 1 hour | OS-wide |
| 6 | Fix aibroker TTS (remove unsafe cast) | 2 hours | Clean routing |
| 7 | Add read-aloud to Hero Books | Separate issue | Reuses voice service |

Signed-off-by: mik-tf

Author
Owner

Consolidated Plan — Complete Voice Pipeline

Root cause: ort version alignment

One dependency fix unblocks BOTH Kokoro local TTS and Rustpotter server-side wake word:

voice_activity_detector 0.2 → uses ort 2.0.0-rc.6
kokoro-micro 1.0           → uses ort 2.0.0-rc.11
rustpotter 3.0 → candle-core → half/rand conflict (separate but related)

Fix: Fork voice_activity_detector, pin to ort 2.0.0-rc.11. This unblocks Kokoro immediately. Rustpotter needs a separate fork (candle-core → 0.8+).
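The pin could look like a Cargo `[patch]` override in the workspace manifest. A sketch under stated assumptions: the fork URL and branch name below are hypothetical placeholders, not an existing repo.

```toml
# Hypothetical Cargo.toml patch: point the workspace at a fork of
# voice_activity_detector that depends on ort 2.0.0-rc.11, so it resolves
# to the same ort version as kokoro-micro.
[patch.crates-io]
voice_activity_detector = { git = "https://github.com/OUR-FORK/voice_activity_detector", branch = "ort-rc11" }
```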

Full local voice pipeline (target)

Mic → Silero VAD (local) → Whisper STT (local) → LLM (cloud) → Kokoro TTS (local) → speaker

Zero API calls for voice processing. Only the LLM requires cloud.

Settings page — Voice & Audio tab (NEW)

Added between Appearance and Environment tabs:

```
VOICE & AUDIO

Text-to-Speech
  Provider    [Kokoro (Local)] ▾
               Kokoro (Local) — free, private
               Groq Orpheus — uses Groq API key (set in Environment tab)
               Browser — built-in
  Voice       [Diana] ▾
  Speed       [1.0x] ▾

Speech-to-Text
  Provider    [Whisper (Local)] ▾
               Whisper (Local) — free, private, 75MB model
               Groq Whisper — uses Groq API key (faster)
  Language    [English] ▾

General
  Auto-read   [ON/OFF]
  Wake word   [ON/OFF]
```

API keys stay in Environment tab only (already has Groq). No duplication.

### Implementation steps (ordered)

| # | What | Status | Unblocks |
|---|------|--------|----------|
| 1 | Fix reqwest native-tls in hero_agent | **Done** | Groq TTS works |
| 2 | Fix voice names (diana/austin/hannah/autumn) | **Done** | Groq Orpheus voices |
| 3 | WASM rebuild with all voice code | **Deploying now** | Browser-side features |
| 4 | Fork voice_activity_detector → ort rc.11 | TODO | Kokoro + Rustpotter |
| 5 | Integrate Kokoro TTS in hero_voice_ui | TODO | Local TTS, zero-config |
| 6 | Enable local Whisper STT (`HERO_VOICE_STT_LOCAL=true`) | Code built, needs env var | Local STT |
| 7 | Add Voice & Audio tab to Settings | TODO | User preferences |
| 8 | Wire settings to AI island + voice service | TODO | OS-wide voice |
| 9 | Fix aibroker TTS (remove unsafe cast) | TODO | Clean routing |
| 10 | Fork rustpotter → candle-core 0.8+ | TODO | Server-side wake word (all browsers) |

### Related issues

- #83 hero_compute build failure (pre-existing)
- #85 Hero Books read-aloud (created, depends on this)
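Step 6's "needs env var" amounts to a boolean read of `HERO_VOICE_STT_LOCAL` at startup. A minimal sketch of the preference check — the function name is hypothetical, not the actual hero_voice code:

```rust
use std::env;

/// Truthy check for the HERO_VOICE_STT_LOCAL toggle.
/// None means "unset", which falls back to the cloud backend.
fn stt_local_preferred(raw: Option<&str>) -> bool {
    matches!(raw, Some("true") | Some("1") | Some("yes"))
}

fn main() {
    // Read the toggle once at startup.
    let local = stt_local_preferred(env::var("HERO_VOICE_STT_LOCAL").ok().as_deref());
    let backend = if local { "local Whisper" } else { "Groq Whisper (cloud)" };
    println!("STT backend: {backend}");
}
```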

Signed-off-by: mik-tf


## Session End Status — 2026-03-24

### What's done (code written, compiles, on disk)

| Item | Files | Compiles |
|------|-------|----------|
| Auto-scroll (rAF) | message_list.rs | Yes |
| Inline TTS via SSE event:audio | routes.rs, ai_service.rs, voice.rs, island.rs | Yes |
| play_base64_audio with onended | voice.rs | Yes |
| stop_tts (closes AudioContext) | voice.rs | Yes |
| Wake word (browser SpeechRecognition) | island.rs | Yes |
| Wake word (server Whisper listen mode) | ws.rs, island.rs | Yes |
| Conversation mode loop | island.rs | Yes |
| Read-aloud (auto + per-message) | island.rs, voice.rs | Yes |
| Groq Orpheus TTS (direct, native-tls) | routes.rs | Yes, curl verified |
| Local Whisper STT (whisper-rs) | local_transcriber.rs | Yes |
| Wakeword stub (for future Rustpotter) | wakeword.rs | Yes |
| Builder image with libclang+cmake | Dockerfile.base | Built |
| Whisper model download in build | build-local.sh | Yes |
| modelsconfig with TTS models | build-local.sh | Yes |

### What's NOT done

| Item | Why |
|------|-----|
| Voice & Audio settings tab | Discussed design, never coded |
| Kokoro local TTS | ort version conflict (need to fork voice_activity_detector) |
| Rustpotter server wake word | candle-core version conflict (need to fork rustpotter) |
| Dynamic voice names per provider | Currently shows OpenAI names mapped to Groq — need provider-specific dropdowns |
| AIBroker TTS fix (unsafe cast) | Not started |
| Verify WASM has all voice code in browser | Build infra issues prevented clean verification |

### Known build issues

1. `build.rs` in hero_voice regenerates `lib.rs` — modules must be added in `build.rs`, not `lib.rs`
2. `modelsconfig.yml` must be manually copied to `dist/var/hero_aibroker/` (SKIP_WASM builds don't run the copy step)
3. Docker pack must run from the `lhumina_code/` dir, not `hero_services/`
4. `BUILD_IMAGE=hero-builder:bookworm` required for whisper-rs (libclang)
5. hero_compute always fails (#83)

### Voice names — need fixing

Current: OpenAI names (Alloy/Echo/Fable/Shimmer) mapped to Groq (diana/austin/hannah/autumn)
Target: dynamic dropdown from the active provider:

- Kokoro: af_heart, af_bella, am_adam, etc. (54 voices)
- Groq: diana, hannah, autumn, austin, daniel, troy
- Browser: OS voices

### Next session should

1. Clean verify all code diffs
2. One clean dist-clean-wasm build
3. Pack + deploy + verify each feature in browser
4. Fork voice_activity_detector → ort rc.11
5. Integrate Kokoro TTS
6. Add Voice & Audio settings tab
7. Wire dynamic voice names per provider
8. Enable local Whisper STT

Signed-off-by: mik-tf


## Kokoro Unblocked — No Fork Needed

The `ort` version conflict is resolved by replacing the VAD crate, not forking anything.

### Root cause

```
voice_activity_detector 0.2.1 → ort 2.0.0-rc.6
kokoro-micro 1.0              → ort 2.0.0-rc.11
Cargo can't resolve both → build fails
```

### Solution: `earshot 1.0.0` (pure Rust VAD)

Replace `voice_activity_detector` with `earshot`:

- **Pure Rust** — zero ONNX dependency, zero `ort` dependency
- 16kHz i16 input (same as current)
- f32 probability output (same as current)
- Frame size: 256 samples (vs 512 current — faster)
- No fork, no maintenance burden

`silero-vad-rust` was considered but uses `ort 2.0.0-rc.10` — still conflicts. `earshot` is the only conflict-free option.
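The frame-size difference is easy to quantify: at 16 kHz, a 256-sample frame is 16 ms versus 32 ms for a 512-sample frame, so VAD decisions arrive twice as often. A quick sanity check in plain Rust (no crate APIs involved):

```rust
/// Duration of one VAD frame in milliseconds for a given sample rate.
fn frame_ms(samples_per_frame: u32, sample_rate_hz: u32) -> f64 {
    samples_per_frame as f64 * 1000.0 / sample_rate_hz as f64
}

fn main() {
    // earshot: 256 samples @ 16 kHz
    println!("earshot frame: {} ms", frame_ms(256, 16_000)); // 16 ms
    // current voice_activity_detector: 512 samples @ 16 kHz
    println!("current frame: {} ms", frame_ms(512, 16_000)); // 32 ms
}
```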

### Verified compatibility

```
earshot 1.0.0    → pure Rust, no ort       (VAD)
kokoro-micro 1.0 → ort 2.0.0-rc.11         (TTS)
whisper-rs 0.16  → whisper.cpp C bindings  (STT)
```

Zero shared dependencies. Zero conflicts.

### Full local pipeline (no API calls for voice)

```
Mic → earshot VAD (pure Rust) → whisper-rs STT (local) → LLM (cloud) → Kokoro TTS (local) → speaker
```
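The stage boundaries in that pipeline can be sketched as plain functions to make the data flow concrete. This is a sketch with stubbed stages — the real pipeline streams audio over a WebSocket and calls earshot, whisper-rs, and kokoro-micro; every name below is illustrative:

```rust
/// Stub VAD stage: keep only frames flagged as speech (here: non-empty).
fn vad_filter(frames: Vec<Vec<i16>>) -> Vec<Vec<i16>> {
    frames.into_iter().filter(|f| !f.is_empty()).collect()
}

/// Stub STT stage: speech frames in, transcript out.
fn stt(frames: &[Vec<i16>]) -> String {
    format!("<{} speech frames transcribed>", frames.len())
}

/// Stub LLM stage — the only cloud hop in the pipeline.
fn llm(prompt: &str) -> String {
    format!("reply to: {prompt}")
}

/// Stub TTS stage: text back to audio samples.
fn tts(text: &str) -> Vec<i16> {
    vec![0; text.len()]
}

fn main() {
    let mic_frames = vec![vec![0i16; 256], vec![], vec![1i16; 256]];
    let speech = vad_filter(mic_frames); // earshot VAD
    let transcript = stt(&speech);       // local Whisper
    let reply = llm(&transcript);        // cloud LLM
    let audio = tts(&reply);             // local Kokoro
    println!("{} samples to speaker", audio.len());
}
```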

Signed-off-by: mik-tf


## Phase 1-2 Progress: v0.7.0-dev deployed to herodev

### Code review + fixes across 4 repos

**hero_archipelagos:**

- Replaced fake OpenAI voice names (Alloy/Echo/Shimmer) with real Groq Orpheus voices (Diana/Hannah/Autumn/Austin/Daniel/Troy)
- Fixed double-TTS race condition (inline audio + fallback both firing)
- Fixed wake word WebSocket reconnect (was logging but never reconnecting)
- Fixed base64 decode bug (`binary.bytes()` → `binary.chars()` for bytes > 127)
- Default voice changed from "alloy" to "diana"
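The `binary.bytes()` bug is a classic WASM/JS interop trap: `atob` hands back a JS string whose char codes are the raw byte values 0–255, but calling `.bytes()` on the resulting Rust `String` re-encodes it as UTF-8, so any byte above 127 becomes two bytes. Mapping `chars()` to `u8` recovers the original bytes. A standalone demonstration of the pitfall:

```rust
fn main() {
    // Simulate what atob() returns for the single raw byte 0xFF:
    // a one-char string containing U+00FF.
    let s = "\u{FF}";

    // WRONG: .bytes() yields the UTF-8 encoding — two bytes, not one.
    let wrong: Vec<u8> = s.bytes().collect();
    assert_eq!(wrong, vec![0xC3, 0xBF]);

    // RIGHT: each char code IS the byte value, so cast char → u8.
    let right: Vec<u8> = s.chars().map(|c| c as u8).collect();
    assert_eq!(right, vec![0xFF]);

    println!("bytes(): {wrong:?}, chars(): {right:?}");
}
```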

**hero_agent:**

- Fixed SSE audio event format mismatch (hardcoded "mp3" when Groq returns WAV)
- Extracted `orpheus_voice()` helper to deduplicate voice mapping
- Updated voice_chat endpoint to use Groq directly (was only using aibroker)
- All 6 Orpheus voices verified working: diana, hannah, autumn, austin, daniel, troy

**hero_voice:**

- Fixed blocking Whisper inference on async runtime (`tokio::spawn` → `spawn_blocking`)
- Cleaned up unused imports and variables

**hero_services:**

- No changes needed (build infra was already correct)

### Build & deploy

- 21/22 builds pass (only hero_compute fails — pre-existing #83)
- WASM builds pass (Dioxus shell + all islands including AI)
- Remote smoke: 49/55 pass (5 failures all hero_foundry_ui redirect — pre-existing)
- TTS endpoint verified: all 6 Groq Orpheus voices return WAV audio
- Voice WebSocket healthy

### Next: browser verification

- Hard refresh herodev; test auto-scroll, read aloud, speaker button, wake word, convo mode
- Then Phase 3: Kokoro + earshot + local Whisper

Signed-off-by: mik-tf


## Phase 1-2 Complete — v0.7.0-dev released

### Repos squash-merged to development

- hero_archipelagos: `8a7a4fe`
- hero_agent: `df95b38`
- hero_voice: `ada6867`
- hero_services: `b9e6e3d`
- hero_compute: `3b89040` (lockfile fix for #83)

### Release

https://forge.ourworld.tf/lhumina_code/hero_services/releases/tag/v0.7.0-dev

### Verified working

- Read aloud (auto-read + per-message speaker)
- 6 Groq Orpheus voices
- SSE keepalive 5s for long tool calls
- TTS text truncation for long responses
- 22/22 builds, 49/55 smoke tests
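TTS text truncation has one sharp edge in Rust: slicing a `&str` at an arbitrary byte index panics if it lands mid-UTF-8-character, so the cut has to snap to a char boundary. A minimal sketch of the idea — a hypothetical helper, not the actual truncation code:

```rust
/// Truncate `text` to at most `max_bytes`, never splitting a UTF-8 char.
fn truncate_for_tts(text: &str, max_bytes: usize) -> &str {
    if text.len() <= max_bytes {
        return text;
    }
    // Walk back until the cut lands on a char boundary.
    let mut end = max_bytes;
    while !text.is_char_boundary(end) {
        end -= 1;
    }
    &text[..end]
}

fn main() {
    println!("{}", truncate_for_tts("hello world", 5)); // "hello"
    // 'é' is 2 bytes; cutting at byte 1 would split it, so we get "".
    println!("{:?}", truncate_for_tts("é", 1));
}
```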

### Remaining (tracked in #87)

- Stop button visibility
- Convo mode / wake word browser testing
- Auto-scroll tuning
- Phase 3: Kokoro local TTS + earshot VAD
- Phase 4: Settings UI

Signed-off-by: mik-tf


## v0.7.1-dev deployed to herodev

Phase 3 complete:

- earshot VAD (pure Rust, replaces ONNX voice_activity_detector)
- kokoro-micro TTS with 54+ voices, local, no API key
- 3-tier TTS routing: Kokoro → Groq Orpheus → aibroker
- Local Whisper STT enabled (`HERO_VOICE_STT_LOCAL=true`)
- `/tts` + `/tts/voices` endpoints on hero_voice_ui
- Settings Voice & Audio tab (TTS/STT provider, voice, speed)
- Dynamic voice dropdown in AI assistant
- Stop button, auto-scroll, convo mode fixes
- Builder upgraded to trixie (glibc 2.41 for ort/kokoro)
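The 3-tier routing is a first-success-wins chain: try each provider in priority order and fall through on failure. A minimal sketch with stub providers — the names are illustrative; the real service calls kokoro-micro, Groq, and the aibroker:

```rust
/// A TTS attempt either yields audio bytes or an error message.
type TtsResult = Result<Vec<u8>, String>;

/// Try each provider in priority order; return the first success,
/// or the last error if every tier fails.
fn synthesize(text: &str, providers: &[(&str, fn(&str) -> TtsResult)]) -> TtsResult {
    let mut last_err = String::from("no providers configured");
    for (name, provider) in providers {
        match provider(text) {
            Ok(audio) => return Ok(audio),
            Err(e) => last_err = format!("{name}: {e}"),
        }
    }
    Err(last_err)
}

// Stub providers: Kokoro fails here, so the chain falls through to Groq.
fn kokoro(_t: &str) -> TtsResult { Err("model not loaded".into()) }
fn groq(t: &str) -> TtsResult { Ok(t.as_bytes().to_vec()) }
fn aibroker(t: &str) -> TtsResult { Ok(t.as_bytes().to_vec()) }

fn main() {
    let chain: &[(&str, fn(&str) -> TtsResult)] =
        &[("kokoro", kokoro), ("groq", groq), ("aibroker", aibroker)];
    println!("{:?}", synthesize("hi", chain).is_ok()); // true
}
```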

Repos: hero_voice, hero_agent, hero_archipelagos, hero_os, hero_services
Tests: 112 smoke + 20 integration, 0 failures

Signed-off-by: mik-tf


Complete in v0.7.1-dev: earshot VAD (pure Rust), kokoro-micro TTS (54+ voices), local Whisper STT, 3-tier routing (Kokoro→Groq→aibroker), sentence-level streaming, trackbar with pause/play/stop, Settings Voice & Audio tab. 164 tests, 0 failures.

Signed-off-by: mik-tf
