Voice AI — full implementation strategy (hero_agent + hero_voice integration) #74

Closed
opened 2026-03-22 01:47:13 +00:00 by mik-tf · 5 comments
Owner

Voice AI — Complete Implementation Spec

Context

hero_agent has basic voice input working (Level 1). This issue tracks Levels 2-4 for a complete voice AI experience, including real-time conversation mode like ChatGPT voice.

Current State (Level 1 — Done)

  • Mic button in AI Assistant input bar (Hero Books toggle pattern)
  • Production STT via herolib_ai (Groq Whisper + ffmpeg WebM→MP3)
  • hero_agent self-contained (no service dependency for STT)
  • Read-aloud speaker button on AI responses (browser Speech API fallback)
  • AI Agent admin: Voice tab with 3 modes (push-to-talk, VAD, conversation)
  • /api/voice/chat endpoint (STT→Agent→TTS pipeline)
  • /api/voice/transcribe endpoint (STT only)
  • /api/voice/tts endpoint (TTS only, gpt-4o-mini-tts model)

Level 2: Enhanced Read Aloud

Goal: Every AI response can be heard. Optional auto-read mode.

  • Auto-read toggle in AI Assistant header (speaker icon, reads every response)
  • Voice selector dropdown (alloy, echo, fable, onyx, nova, shimmer) — persisted in localStorage
  • Reading indicator on message bubble while TTS plays (pulsing speaker icon)
  • Stop playback button (click speaker again while playing)
  • Queue management — if multiple read-alouds requested, play sequentially

Implementation:

  • Add auto_read signal to AI island state
  • After SSE done event, if auto_read is on, trigger TTS via eval
  • Voice selector: store in localStorage, pass to TTS endpoint
  • Browser Speech API remains as fallback when no OpenAI key
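The queue-management bullet above can be sketched as a promise-chained FIFO queue. Class and method names here are illustrative, not from the hero_agent codebase; each enqueued job would be an async function that fetches TTS audio and plays it to completion:

```javascript
// Minimal sketch of a sequential read-aloud queue. Jobs run one at a
// time in FIFO order, so overlapping read-aloud requests never talk
// over each other. A failing job is logged and must not stall the queue.
class PlaybackQueue {
  constructor() {
    this.tail = Promise.resolve(); // chain of pending jobs
  }
  enqueue(job) {
    const next = this.tail.then(job).catch((err) => {
      console.error("playback job failed:", err);
    });
    this.tail = next;
    return next; // resolves once this job (and all before it) finish
  }
}
```

In the auto-read flow, the SSE `done` handler would simply `enqueue` a play-TTS job instead of starting playback directly.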

Level 3: Voice Conversation Mode

Goal: Talk to the AI naturally — like a phone call or ChatGPT voice mode.

3a. WebSocket STT via hero_voice

  • hero_agent opens WebSocket to hero_voice for real-time STT
  • Silero VAD (server-side, 350ms speech detection) replaces browser-side VAD
  • Binary PCM streaming (16kHz i16) from browser → hero_voice → Groq Whisper
  • Transcription segments arrive in real-time (no wait for full recording)

Architecture:

Browser (AudioContext + ScriptProcessor)
  → WebSocket binary PCM → hero_voice_server
    → Silero VAD (local, detects speech/silence)
    → Speech segment → Groq Whisper (cloud STT)
    → Transcript text → WebSocket back to browser
  → Browser sends transcript to hero_agent /api/chat
  → SSE streaming response
  → Response text → /api/voice/tts → audio playback
  → Auto-restart listening
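One concrete step in the flow above is the PCM conversion: the Web Audio API hands the ScriptProcessor callback Float32 samples in [-1, 1], while the hero_voice WebSocket expects 16-bit little-endian PCM. A sketch of that conversion (buffer handling and the exact hero_voice framing are assumptions):

```javascript
// Convert a Float32 sample buffer (range -1.0..1.0, as produced by the
// Web Audio API) into 16-bit signed PCM for binary WebSocket transport.
// Clamping first avoids integer overflow on hot input signals.
function floatToPcm16(float32Samples) {
  const out = new Int16Array(float32Samples.length);
  for (let i = 0; i < float32Samples.length; i++) {
    const s = Math.max(-1, Math.min(1, float32Samples[i]));
    // Scale negatives by 0x8000 and positives by 0x7FFF so both ends
    // of the range map onto the full i16 span.
    out[i] = s < 0 ? s * 0x8000 : s * 0x7fff;
  }
  return out; // send out.buffer over the WebSocket as a binary frame
}
```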

Why hero_voice over browser VAD:

  • Silero VAD is more reliable than volume-threshold detection
  • 350ms silence detection vs 800ms in our browser JS
  • Server-side: works consistently across all browsers
  • Already proven in hero_voice app (production-quality)

3b. Conversation Mode UI

  • Toggle button in AI Assistant header (phone/headset icon)
  • When active: shows audio waveform/level meter, status text (Listening/Processing/Speaking)
  • Text input hidden by default but can be shown (toggle)
  • Continuous loop: listen → VAD detects end → transcribe → agent → TTS → auto-listen
  • Three sub-modes via dropdown:
    • Push-to-talk (default): hold mic button to record
    • Voice activity: auto-detect speech start/stop
    • Conversation: VAD + auto-restart after TTS playback
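The continuous loop above maps naturally onto a small state machine; a sketch with illustrative state and event names (the real island code may organize this differently):

```javascript
// Illustrative state machine for conversation mode. Transitions mirror
// the loop: listening → (VAD detects end) → processing → (agent reply
// ready) → speaking → (TTS playback done) → back to listening.
const TRANSITIONS = {
  listening:  { speech_ended: "processing" },
  processing: { reply_ready:  "speaking" },
  speaking:   { tts_done:     "listening" },
};

function nextState(state, event) {
  const next = (TRANSITIONS[state] || {})[event];
  if (!next) throw new Error(`invalid event ${event} in state ${state}`);
  return next;
}
```

Rejecting out-of-order events (e.g. a stray `tts_done` while listening) keeps the UI status text (Listening/Processing/Speaking) from desynchronizing.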

3c. Wake Word Detection ("Hero, ...")

  • Browser's webkitSpeechRecognition for local, free wake word detection
  • Always listening in background (low resource, no cloud cost)
  • Detects "Hero" → starts full recording via MediaRecorder
  • VAD detects end of speech → transcribe via Groq Whisper (more accurate)
  • Send to agent → TTS response → return to wake word listening
  • Visual indicator: subtle microphone icon in header showing wake word mode is active
  • Chrome-only initially (webkitSpeechRecognition), Firefox partial support
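Since webkitSpeechRecognition delivers interim transcripts as plain text, the wake-word check itself is just string matching on those transcripts. A sketch (function name hypothetical):

```javascript
// Hypothetical wake-word check run on each interim transcript from
// webkitSpeechRecognition. Matches "hero" as a standalone word,
// case-insensitively, so e.g. "heroic" does not trigger recording.
function containsWakeWord(transcript, wakeWord = "hero") {
  const pattern = new RegExp(`\\b${wakeWord}\\b`, "i");
  return pattern.test(transcript);
}
```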

Level 4: Shared Audio Infrastructure (Future)

Goal: Unified audio library for all Hero services.

  • Extract into herolib_audio crate (NOT a service dependency):
    • Silero VAD bindings
    • WebSocket binary PCM streaming handler
    • Groq Whisper transcription (via herolib_ai)
    • ffmpeg audio conversion utilities
    • Audio format detection (WebM, MP4, WAV, OGG)
  • hero_books imports herolib_audio (removes duplicate transcription code)
  • hero_voice imports herolib_audio (shares VAD + transcription)
  • hero_agent imports herolib_audio (replaces current inline code)
  • Local Whisper option (whisper.cpp or candle-whisper) — no cloud dependency
  • Wake word model (Porcupine or custom Silero keyword detector)

Architecture principle: herolib_audio is a LIBRARY crate, not a service. Each service imports it independently. No service-to-service dependency for audio.
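The audio format detection listed for herolib_audio can be done with container magic bytes alone. The real crate would be Rust; this JavaScript sketch only illustrates the header checks involved:

```javascript
// Sketch of audio container detection via magic bytes, the approach a
// herolib_audio format-detection helper could take. `bytes` is a
// Uint8Array holding at least the first 12 bytes of the file.
function detectAudioFormat(bytes) {
  const startsWith = (sig, offset = 0) =>
    sig.every((b, i) => bytes[offset + i] === b);
  if (startsWith([0x1a, 0x45, 0xdf, 0xa3])) return "webm"; // EBML header
  if (startsWith([0x4f, 0x67, 0x67, 0x53])) return "ogg";  // "OggS"
  if (startsWith([0x52, 0x49, 0x46, 0x46]) &&
      startsWith([0x57, 0x41, 0x56, 0x45], 8)) return "wav"; // "RIFF"…"WAVE"
  if (startsWith([0x66, 0x74, 0x79, 0x70], 4)) return "mp4"; // …"ftyp"
  return "unknown";
}
```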


Audio Landscape Reference

| Service | STT provider | VAD | Transport | TTS |
|---------|--------------|-----|-----------|-----|
| hero_voice | Groq Whisper (cloud) | Silero VAD V5 (local) | WebSocket binary PCM | None |
| hero_books | Groq Whisper (cloud) + LLM cleanup | None (user stops) | HTTP multipart | None |
| hero_agent | Groq Whisper via herolib_ai | Browser JS (admin), none (AI Assistant) | HTTP multipart | gpt-4o-mini-tts (planned), browser Speech API (fallback) |

Dependencies

  • hero_voice (Silero VAD, WebSocket) — Level 3
  • hero_aibroker (TTS endpoint) — Level 2 (when OpenAI key added)
  • herolib_ai (Groq Whisper STT) — Level 1 (done)
  • herolib_audio (new crate) — Level 4

Signed-off-by: mik-tf

Author
Owner

Added: Wake word detection ("Hero, ...")

Level 3 sub-feature: in conversation mode, the AI listens continuously (using browser's webkitSpeechRecognition for local, free wake word detection). When it detects the word "Hero", it starts recording the full message via MediaRecorder, then sends to Groq Whisper for accurate transcription.

Flow:

Always listening (browser Speech Recognition, local, free)
  → Detects "Hero" wake word
  → Start recording (MediaRecorder)
  → VAD detects end of speech (or 350ms silence)
  → Transcribe via Groq Whisper (more accurate than browser STT)
  → Send to agent → TTS response → play
  → Return to listening for "Hero"

Browser-side approach is preferred over server-side because:

  • Free (no cloud cost for wake word detection)
  • Low latency (no network round trip for the keyword)
  • Privacy (audio only sent to cloud after wake word confirmed)
  • Works in Chrome (webkitSpeechRecognition), Firefox has partial support

Signed-off-by: mik-tf

Author
Owner

Completed — v0.6.0-dev

Levels 2 and 3 implemented (Level 1 was already done):

Level 2: Enhanced Read-Aloud

  • Auto-read toggle in AI Assistant header
  • Voice selector dropdown (alloy, echo, fable, onyx, nova, shimmer) persisted in localStorage
  • Stop playback button when reading
  • Per-message read-aloud uses selected voice

Level 3a: WebSocket STT via hero_voice

  • Conversation mode connects to hero_voice /ws endpoint
  • Streams PCM audio (16kHz, 16-bit LE, mono) via WebSocket
  • Receives transcriptions and auto-submits to agent

Level 3b: Conversation Mode UI

  • Toggle button in AI Assistant header
  • Continuous loop: listen → VAD → transcribe → agent → TTS → repeat
  • Auto-enables auto-read when activated

Level 3c: Wake Word Detection

  • Browser webkitSpeechRecognition detects "Hero" keyword
  • Activates conversation mode automatically
  • Chrome-only (webkitSpeechRecognition requirement)
  • Toggle button in header with localStorage persistence

Repos touched

  • hero_archipelagos (2 files: island.rs, message_bubble.rs)

Release

  • v0.6.0-dev: https://forge.ourworld.tf/lhumina_code/hero_services/releases/tag/v0.6.0-dev
Author
Owner

Remaining fixes needed (v0.6.9-dev testing)

Core features work (SSE chat, STT, MCP, system prompt) but voice UI has issues:

This round:

  1. Read aloud broken — browser blocks speechSynthesis/AudioContext when not triggered by direct user gesture. Fix: pre-warm on Read button click.
  2. Conversation persistence broken — /api/conversations returns {"conversations":[...]} but the client expects a bare array. Conversations are lost on page navigation.
  3. uv not installed — MCP execute_code tool fails. Add to Dockerfile.
  4. Convo AudioContext blocked — same user gesture issue + deprecated ScriptProcessorNode. Fix: resume AudioContext on click + use AudioWorkletNode.
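For item 2, a tolerant client-side normalizer accepts both response shapes while the server and client are brought into agreement (function name illustrative):

```javascript
// Tolerant parser for the /api/conversations response: accepts both the
// wrapped shape {"conversations":[...]} the server currently returns and
// the bare array the client originally expected.
function parseConversations(body) {
  if (Array.isArray(body)) return body;
  if (body && Array.isArray(body.conversations)) return body.conversations;
  throw new Error("unexpected /api/conversations response shape");
}
```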

Next round (issue #78):

  1. Phase 2: server-side wake word via rustpotter in hero_voice
  2. Proper server TTS via aibroker (instead of browser speechSynthesis fallback)
mik-tf reopened this issue 2026-03-23 15:46:49 +00:00
Author
Owner

v0.7.0-dev status — remaining issues from browser console

Still broken:

  1. Read aloud: AudioContext was not allowed to start — Dioxus spawn(async { document::eval(...) }) runs outside user gesture context even when triggered from onclick. The pre-warm approach does not work because Dioxus async spawns are detached from the click event. Fix needed: either use onclick JS directly (not through Dioxus eval), or use a JS-side click listener that pre-warms immediately.

  2. Create conversation 405: POST /api/conversations returns 405 Method Not Allowed. The route exists for GET (list) but POST (create) is not registered or uses wrong method. Check hero_agent routes.rs for the conversations POST handler.

  3. Convo AudioContext blocked: Same user gesture issue as read aloud — AudioContext created in async context gets blocked.

What works:

  • SSE streaming chat ✓
  • Voice input (STT) ✓
  • MCP tools (62 discovered) ✓
  • System prompt with Hero OS context ✓
  • Skills tab ✓
  • OpenRPC spec ✓
  • uv + python3 in container ✓
  • Per-message speaker icon (needs testing — may work since it is a direct click)
  • 20/20 integration tests ✓

Key insight for read aloud fix:

The browser user gesture requirement cannot be satisfied through Dioxus document::eval() in a spawn(). The speech synthesis must be triggered DIRECTLY in the JS onclick handler, not through Rust async. Options:

  • (a) Use dangerous_inner_html to add a raw <button onclick="..."> that handles speechSynthesis directly in JS
  • (b) Use eval() synchronously in the Dioxus onclick (not in spawn/async)
  • (c) Store a global JS flag when Read is clicked, and have a MutationObserver or SSE listener in JS that auto-speaks new messages
Author
Owner

Superseded by #80 which has the complete remaining spec. Original Level 1 done, Levels 2-3 partially done with issues documented in #80.
