Voice AI — full implementation strategy (hero_agent + hero_voice integration) #74

Closed
opened 2026-03-22 01:47:13 +00:00 by mik-tf · 5 comments
Owner

Voice AI — Complete Implementation Spec

Context

hero_agent has basic voice input working (Level 1). This issue tracks Levels 2-4 for a complete voice AI experience, including real-time conversation mode like ChatGPT voice.

Current State (Level 1 — Done)

  • Mic button in AI Assistant input bar (Hero Books toggle pattern)
  • Production STT via herolib_ai (Groq Whisper + ffmpeg WebM→MP3)
  • hero_agent self-contained (no service dependency for STT)
  • Read-aloud speaker button on AI responses (browser Speech API fallback)
  • AI Agent admin: Voice tab with 3 modes (push-to-talk, VAD, conversation)
  • /api/voice/chat endpoint (STT→Agent→TTS pipeline)
  • /api/voice/transcribe endpoint (STT only)
  • /api/voice/tts endpoint (TTS only, gpt-4o-mini-tts model)

Level 2: Enhanced Read Aloud

Goal: Every AI response can be heard. Optional auto-read mode.

  • Auto-read toggle in AI Assistant header (speaker icon, reads every response)
  • Voice selector dropdown (alloy, echo, fable, onyx, nova, shimmer) — persisted in localStorage
  • Reading indicator on message bubble while TTS plays (pulsing speaker icon)
  • Stop playback button (click speaker again while playing)
  • Queue management — if multiple read-alouds requested, play sequentially

Implementation:

  • Add auto_read signal to AI island state
  • After SSE done event, if auto_read is on, trigger TTS via eval
  • Voice selector: store in localStorage, pass to TTS endpoint
  • Browser Speech API remains as fallback when no OpenAI key
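The queue-management bullet above can be sketched as a promise-chained FIFO queue. Class and method names here are illustrative, not from the hero_agent codebase; each enqueued job would be an async function that fetches TTS audio and plays it to completion:

```javascript
// Minimal sketch of a sequential read-aloud queue. Jobs run one at a
// time in FIFO order, so overlapping read-aloud requests never talk
// over each other. A failing job is logged and must not stall the queue.
class PlaybackQueue {
  constructor() {
    this.tail = Promise.resolve(); // chain of pending jobs
  }
  enqueue(job) {
    const next = this.tail.then(job).catch((err) => {
      console.error("playback job failed:", err);
    });
    this.tail = next;
    return next; // resolves once this job (and all before it) finish
  }
}
```

In the auto-read flow, the SSE `done` handler would simply `enqueue` a play-TTS job instead of starting playback directly.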

Level 3: Voice Conversation Mode

Goal: Talk to the AI naturally — like a phone call or ChatGPT voice mode.

3a. WebSocket STT via hero_voice

  • hero_agent opens WebSocket to hero_voice for real-time STT
  • Silero VAD (server-side, 350ms speech detection) replaces browser-side VAD
  • Binary PCM streaming (16kHz i16) from browser → hero_voice → Groq Whisper
  • Transcription segments arrive in real-time (no wait for full recording)

Architecture:

Browser (AudioContext + ScriptProcessor)
  → WebSocket binary PCM → hero_voice_server
    → Silero VAD (local, detects speech/silence)
    → Speech segment → Groq Whisper (cloud STT)
    → Transcript text → WebSocket back to browser
  → Browser sends transcript to hero_agent /api/chat
  → SSE streaming response
  → Response text → /api/voice/tts → audio playback
  → Auto-restart listening
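One concrete step in the flow above is the PCM conversion: the Web Audio API hands the ScriptProcessor callback Float32 samples in [-1, 1], while the hero_voice WebSocket expects 16-bit little-endian PCM. A sketch of that conversion (buffer handling and the exact hero_voice framing are assumptions):

```javascript
// Convert a Float32 sample buffer (range -1.0..1.0, as produced by the
// Web Audio API) into 16-bit signed PCM for binary WebSocket transport.
// Clamping first avoids integer overflow on hot input signals.
function floatToPcm16(float32Samples) {
  const out = new Int16Array(float32Samples.length);
  for (let i = 0; i < float32Samples.length; i++) {
    const s = Math.max(-1, Math.min(1, float32Samples[i]));
    // Scale negatives by 0x8000 and positives by 0x7FFF so both ends
    // of the range map onto the full i16 span.
    out[i] = s < 0 ? s * 0x8000 : s * 0x7fff;
  }
  return out; // send out.buffer over the WebSocket as a binary frame
}
```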

Why hero_voice over browser VAD:

  • Silero VAD is more reliable than volume-threshold detection
  • 350ms silence detection vs 800ms in our browser JS
  • Server-side: works consistently across all browsers
  • Already proven in hero_voice app (production-quality)

3b. Conversation Mode UI

  • Toggle button in AI Assistant header (phone/headset icon)
  • When active: shows audio waveform/level meter, status text (Listening/Processing/Speaking)
  • Text input hidden by default but can be shown (toggle)
  • Continuous loop: listen → VAD detects end → transcribe → agent → TTS → auto-listen
  • Three sub-modes via dropdown:
    • Push-to-talk (default): hold mic button to record
    • Voice activity: auto-detect speech start/stop
    • Conversation: VAD + auto-restart after TTS playback
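The continuous loop above maps naturally onto a small state machine; a sketch with illustrative state and event names (the real island code may organize this differently):

```javascript
// Illustrative state machine for conversation mode. Transitions mirror
// the loop: listening → (VAD detects end) → processing → (agent reply
// ready) → speaking → (TTS playback done) → back to listening.
const TRANSITIONS = {
  listening:  { speech_ended: "processing" },
  processing: { reply_ready:  "speaking" },
  speaking:   { tts_done:     "listening" },
};

function nextState(state, event) {
  const next = (TRANSITIONS[state] || {})[event];
  if (!next) throw new Error(`invalid event ${event} in state ${state}`);
  return next;
}
```

Rejecting out-of-order events (e.g. a stray `tts_done` while listening) keeps the UI status text (Listening/Processing/Speaking) from desynchronizing.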

3c. Wake Word Detection ("Hero, ...")

  • Browser's webkitSpeechRecognition for local, free wake word detection
  • Always listening in background (low resource, no cloud cost)
  • Detects "Hero" → starts full recording via MediaRecorder
  • VAD detects end of speech → transcribe via Groq Whisper (more accurate)
  • Send to agent → TTS response → return to wake word listening
  • Visual indicator: subtle microphone icon in header showing wake word mode is active
  • Chrome-only initially (webkitSpeechRecognition), Firefox partial support
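Since webkitSpeechRecognition delivers interim transcripts as plain text, the wake-word check itself is just string matching on those transcripts. A sketch (function name hypothetical):

```javascript
// Hypothetical wake-word check run on each interim transcript from
// webkitSpeechRecognition. Matches "hero" as a standalone word,
// case-insensitively, so e.g. "heroic" does not trigger recording.
function containsWakeWord(transcript, wakeWord = "hero") {
  const pattern = new RegExp(`\\b${wakeWord}\\b`, "i");
  return pattern.test(transcript);
}
```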

Level 4: Shared Audio Infrastructure (Future)

Goal: Unified audio library for all Hero services.

  • Extract into herolib_audio crate (NOT a service dependency):
    • Silero VAD bindings
    • WebSocket binary PCM streaming handler
    • Groq Whisper transcription (via herolib_ai)
    • ffmpeg audio conversion utilities
    • Audio format detection (WebM, MP4, WAV, OGG)
  • hero_books imports herolib_audio (removes duplicate transcription code)
  • hero_voice imports herolib_audio (shares VAD + transcription)
  • hero_agent imports herolib_audio (replaces current inline code)
  • Local Whisper option (whisper.cpp or candle-whisper) — no cloud dependency
  • Wake word model (Porcupine or custom Silero keyword detector)

Architecture principle: herolib_audio is a LIBRARY crate, not a service. Each service imports it independently. No service-to-service dependency for audio.
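The audio format detection listed for herolib_audio can be done with container magic bytes alone. The real crate would be Rust; this JavaScript sketch only illustrates the header checks involved:

```javascript
// Sketch of audio container detection via magic bytes, the approach a
// herolib_audio format-detection helper could take. `bytes` is a
// Uint8Array holding at least the first 12 bytes of the file.
function detectAudioFormat(bytes) {
  const startsWith = (sig, offset = 0) =>
    sig.every((b, i) => bytes[offset + i] === b);
  if (startsWith([0x1a, 0x45, 0xdf, 0xa3])) return "webm"; // EBML header
  if (startsWith([0x4f, 0x67, 0x67, 0x53])) return "ogg";  // "OggS"
  if (startsWith([0x52, 0x49, 0x46, 0x46]) &&
      startsWith([0x57, 0x41, 0x56, 0x45], 8)) return "wav"; // "RIFF"…"WAVE"
  if (startsWith([0x66, 0x74, 0x79, 0x70], 4)) return "mp4"; // …"ftyp"
  return "unknown";
}
```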


Audio Landscape Reference

| Service | STT provider | VAD | Transport | TTS |
|---------|--------------|-----|-----------|-----|
| hero_voice | Groq Whisper (cloud) | Silero VAD V5 (local) | WebSocket binary PCM | None |
| hero_books | Groq Whisper (cloud) + LLM cleanup | None (user stops) | HTTP multipart | None |
| hero_agent | Groq Whisper via herolib_ai | Browser JS (admin), none (AI Assistant) | HTTP multipart | gpt-4o-mini-tts (planned), browser Speech API (fallback) |

Dependencies

  • hero_voice (Silero VAD, WebSocket) — Level 3
  • hero_aibroker (TTS endpoint) — Level 2 (when OpenAI key added)
  • herolib_ai (Groq Whisper STT) — Level 1 (done)
  • herolib_audio (new crate) — Level 4

Signed-off-by: mik-tf

Author
Owner

Added: Wake word detection ("Hero, ...")

Level 3 sub-feature: in conversation mode, the AI listens continuously (using browser's webkitSpeechRecognition for local, free wake word detection). When it detects the word "Hero", it starts recording the full message via MediaRecorder, then sends to Groq Whisper for accurate transcription.

Flow:

Always listening (browser Speech Recognition, local, free)
  → Detects "Hero" wake word
  → Start recording (MediaRecorder)
  → VAD detects end of speech (or 350ms silence)
  → Transcribe via Groq Whisper (more accurate than browser STT)
  → Send to agent → TTS response → play
  → Return to listening for "Hero"

Browser-side approach is preferred over server-side because:

  • Free (no cloud cost for wake word detection)
  • Low latency (no network round trip for the keyword)
  • Privacy (audio only sent to cloud after wake word confirmed)
  • Works in Chrome (webkitSpeechRecognition), Firefox has partial support

Signed-off-by: mik-tf

Author
Owner

Completed — v0.6.0-dev

Levels 2 and 3 implemented (Level 1 was already done):

Level 2: Enhanced Read-Aloud

  • Auto-read toggle in AI Assistant header
  • Voice selector dropdown (alloy, echo, fable, onyx, nova, shimmer) persisted in localStorage
  • Stop playback button when reading
  • Per-message read-aloud uses selected voice

Level 3a: WebSocket STT via hero_voice

  • Conversation mode connects to hero_voice /ws endpoint
  • Streams PCM audio (16kHz, 16-bit LE, mono) via WebSocket
  • Receives transcriptions and auto-submits to agent

Level 3b: Conversation Mode UI

  • Toggle button in AI Assistant header
  • Continuous loop: listen → VAD → transcribe → agent → TTS → repeat
  • Auto-enables auto-read when activated

Level 3c: Wake Word Detection

  • Browser webkitSpeechRecognition detects "Hero" keyword
  • Activates conversation mode automatically
  • Chrome-only (webkitSpeechRecognition requirement)
  • Toggle button in header with localStorage persistence

Repos touched

  • hero_archipelagos (2 files: island.rs, message_bubble.rs)

Release

  • v0.6.0-dev: https://forge.ourworld.tf/lhumina_code/hero_services/releases/tag/v0.6.0-dev
Author
Owner

Remaining fixes needed (v0.6.9-dev testing)

Core features work (SSE chat, STT, MCP, system prompt) but voice UI has issues:

This round:

  1. Read aloud broken — browser blocks speechSynthesis/AudioContext when not triggered by direct user gesture. Fix: pre-warm on Read button click.
  2. Conversation persistence broken — /api/conversations returns {"conversations":[...]} but the client expects a bare array. Conversations are lost on page navigation.
  3. uv not installed — MCP execute_code tool fails. Add to Dockerfile.
  4. Convo AudioContext blocked — same user gesture issue + deprecated ScriptProcessorNode. Fix: resume AudioContext on click + use AudioWorkletNode.
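For item 2, a tolerant client-side normalizer accepts both response shapes while the server and client are brought into agreement (function name illustrative):

```javascript
// Tolerant parser for the /api/conversations response: accepts both the
// wrapped shape {"conversations":[...]} the server currently returns and
// the bare array the client originally expected.
function parseConversations(body) {
  if (Array.isArray(body)) return body;
  if (body && Array.isArray(body.conversations)) return body.conversations;
  throw new Error("unexpected /api/conversations response shape");
}
```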

Next round (issue #78):

  1. Phase 2: server-side wake word via rustpotter in hero_voice
  2. Proper server TTS via aibroker (instead of browser speechSynthesis fallback)
mik-tf reopened this issue 2026-03-23 15:46:49 +00:00
Author
Owner

v0.7.0-dev status — remaining issues from browser console

Still broken:

  1. Read aloud: AudioContext was not allowed to start — Dioxus spawn(async { document::eval(...) }) runs outside user gesture context even when triggered from onclick. The pre-warm approach does not work because Dioxus async spawns are detached from the click event. Fix needed: either use onclick JS directly (not through Dioxus eval), or use a JS-side click listener that pre-warms immediately.

  2. Create conversation 405: POST /api/conversations returns 405 Method Not Allowed. The route exists for GET (list) but POST (create) is not registered or uses wrong method. Check hero_agent routes.rs for the conversations POST handler.

  3. Convo AudioContext blocked: Same user gesture issue as read aloud — AudioContext created in async context gets blocked.

What works:

  • SSE streaming chat ✓
  • Voice input (STT) ✓
  • MCP tools (62 discovered) ✓
  • System prompt with Hero OS context ✓
  • Skills tab ✓
  • OpenRPC spec ✓
  • uv + python3 in container ✓
  • Per-message speaker icon (needs testing — may work since it is a direct click)
  • 20/20 integration tests ✓

Key insight for read aloud fix:

The browser user gesture requirement cannot be satisfied through Dioxus document::eval() in a spawn(). The speech synthesis must be triggered DIRECTLY in the JS onclick handler, not through Rust async. Options:

  • (a) Use dangerous_inner_html to add a raw <button onclick="..."> that handles speechSynthesis directly in JS
  • (b) Use eval() synchronously in the Dioxus onclick (not in spawn/async)
  • (c) Store a global JS flag when Read is clicked, and have a MutationObserver or SSE listener in JS that auto-speaks new messages
Author
Owner

Superseded by #80 which has the complete remaining spec. Original Level 1 done, Levels 2-3 partially done with issues documented in #80.
