Voice AI — full implementation strategy (hero_agent + hero_voice integration) #74
Voice AI — Complete Implementation Spec
Context
hero_agent has basic voice input working (Level 1). This issue tracks Levels 2-4 for a complete voice AI experience, including real-time conversation mode like ChatGPT voice.
Current State (Level 1 — Done)
- `/api/voice/chat` endpoint (STT→Agent→TTS pipeline)
- `/api/voice/transcribe` endpoint (STT only)
- `/api/voice/tts` endpoint (TTS only, gpt-4o-mini-tts model)

Level 2: Enhanced Read Aloud
Goal: Every AI response can be heard. Optional auto-read mode.
Implementation:
- Add an `auto_read` signal to the AI island state
- On the `done` event, if `auto_read` is on, trigger TTS via eval

Level 3: Voice Conversation Mode
Goal: Talk to the AI naturally — like a phone call or ChatGPT voice mode.
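Conceptually, conversation mode runs the same stages as the Level 1 pipeline (STT → Agent → TTS) in a loop. A stubbed sketch of that shape — all names here are illustrative, not actual hero_agent code; real wiring would use MediaRecorder, hero_voice/Groq Whisper STT, and the TTS endpoint:

```javascript
// Stubbed conversation-mode loop: capture an utterance, transcribe,
// get the agent reply, speak it, repeat. The four stages are injected
// so the control flow is visible (and testable) without browser APIs.
async function conversationLoop({ listen, stt, agent, speak }, turns) {
  const replies = [];
  for (let i = 0; i < turns; i++) {
    const audio = await listen();    // wait for an utterance
    const text = await stt(audio);   // speech-to-text
    const reply = await agent(text); // LLM response
    await speak(reply);              // text-to-speech playback
    replies.push(reply);
  }
  return replies;
}
```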
3a. WebSocket STT via hero_voice
Architecture:
Why hero_voice over browser VAD:
3b. Conversation Mode UI
3c. Wake Word Detection ("Hero, ...")
- `webkitSpeechRecognition` for local, free wake word detection

Level 4: Shared Audio Infrastructure (Future)
Goal: Unified audio library for all Hero services.
- `herolib_audio` crate (NOT a service dependency)

Architecture principle: `herolib_audio` is a LIBRARY crate, not a service. Each service imports it independently. No service-to-service dependency for audio.
Audio Landscape Reference
Dependencies
Signed-off-by: mik-tf
Added: Wake word detection ("Hero, ...")
Level 3 sub-feature: in conversation mode, the AI listens continuously (using the browser's `webkitSpeechRecognition` for local, free wake word detection). When it detects the word "Hero", it starts recording the full message via MediaRecorder, then sends it to Groq Whisper for accurate transcription.

Flow:
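The wake-word check itself can be a pure function run over the interim transcripts that `webkitSpeechRecognition` emits. A minimal sketch — the helper name and normalization are assumptions, not the shipped code:

```javascript
// Hypothetical wake-word check applied to each interim transcript.
// Lowercases and splits on non-letters so "Hero, ..." matches but
// "heroic" does not.
function detectsWakeWord(transcript, wakeWord = "hero") {
  const words = transcript.toLowerCase().split(/[^a-z]+/).filter(Boolean);
  return words.includes(wakeWord);
}
```

On a match, the real flow would stop recognition, start MediaRecorder, and hand the recorded audio to Groq Whisper.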
Browser-side approach is preferred over server-side because:
Signed-off-by: mik-tf
Completed — v0.6.0-dev
Levels 2 and 3 implemented (Level 1 was already done):
Level 2: Enhanced Read-Aloud
Level 3a: WebSocket STT via hero_voice
Level 3b: Conversation Mode UI
Level 3c: Wake Word Detection
Repos touched
Release
Remaining fixes needed (v0.6.9-dev testing)
Core features work (SSE chat, STT, MCP, system prompt) but voice UI has issues:
This round:
- `/api/conversations` returns `{"conversations":[...]}` but the client expects a bare array. Conversations are lost on page navigation.

Next round (issue #78):
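Until the server and client agree on a shape for `/api/conversations`, a client-side shim can accept both the wrapped object and the bare array. A hedged sketch (the helper is hypothetical, not existing hero_agent code):

```javascript
// Hypothetical shim: normalize either response shape from
// GET /api/conversations — the wrapped {"conversations":[...]} the
// server currently returns, or the bare array the client expects.
function normalizeConversations(body) {
  const parsed = typeof body === "string" ? JSON.parse(body) : body;
  if (Array.isArray(parsed)) return parsed;
  if (parsed && Array.isArray(parsed.conversations)) return parsed.conversations;
  return []; // unknown shape: fail soft with an empty list
}
```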
v0.7.0-dev status — remaining issues from browser console
Still broken:
Read aloud:
`AudioContext was not allowed to start` — Dioxus `spawn(async { document::eval(...) })` runs outside the user-gesture context even when triggered from onclick. The pre-warm approach does not work because Dioxus async spawns are detached from the click event. Fix needed: either use onclick JS directly (not through Dioxus eval), or use a JS-side click listener that pre-warms immediately.

Create conversation 405: POST `/api/conversations` returns 405 Method Not Allowed. The route exists for GET (list) but POST (create) is not registered or uses the wrong method. Check hero_agent routes.rs for the conversations POST handler.

Convo AudioContext blocked: same user-gesture issue as read aloud — an AudioContext created in an async context gets blocked.
What works:
Key insight for read aloud fix:
The browser user-gesture requirement cannot be satisfied through Dioxus `document::eval()` in a `spawn()`. The speech synthesis must be triggered DIRECTLY in the JS onclick handler, not through Rust async. Options:

- `dangerous_inner_html` to add a raw `<button onclick="...">` that handles speechSynthesis directly in JS
- `eval()` synchronously in the Dioxus onclick (not in spawn/async)

Superseded by #80, which has the complete remaining spec. Original Level 1 done, Levels 2-3 partially done with issues documented in #80.
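The first option can be sketched as pure string-building: the Rust side would inject this HTML via `dangerous_inner_html`, so `speechSynthesis` runs synchronously inside the native click handler, i.e. inside the user gesture. Helper name and markup are illustrative assumptions, not the shipped fix:

```javascript
// Hypothetical helper: build raw HTML for dangerous_inner_html so
// speech starts synchronously in the native onclick, satisfying the
// browser's user-activation/autoplay policy.
function readAloudButtonHtml(elementId) {
  const js =
    `var t = document.getElementById('${elementId}').innerText;` +
    "speechSynthesis.cancel();" +
    "speechSynthesis.speak(new SpeechSynthesisUtterance(t));";
  return `<button onclick="${js}">Read aloud</button>`;
}
```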