Enhanced AI Vocal Responsive #73

Closed
opened 2026-03-21 05:19:59 +00:00 by mik-tf · 3 comments
Owner

Situation

  • We have the AI assistant, hero shrimp, ai broker, and hero voice services
  • We can add Parakeet and/or refine hero voice to support STT and TTS

Deliverable

  • Users can talk to the AI vocally and it answers vocally, as a discussion
  • Use MCP to do anything we want on Hero Os via hero services
Author
Owner

Backend endpoint implemented

POST /hero_agent/api/voice/chat — server-side STT → Agent → TTS pipeline:

  1. Accepts multipart audio upload (field: audio)
  2. Sends to hero_aibroker /v1/audio/transcriptions for STT (Groq Whisper)
  3. Passes transcribed text to hero_agent for processing
  4. Sends response text to hero_aibroker /v1/audio/speech for TTS
  5. Returns JSON with transcript, response text, and base64-encoded MP3 audio

Optional fields: conversation_id, voice (default: alloy)
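A minimal sketch of consuming this endpoint's reply on the client side. The issue only says the JSON carries a transcript, the response text, and base64-encoded MP3 audio; the field names `transcript`, `response`, and `audio` below are assumptions, not the confirmed schema:

```python
import base64
import json

def decode_voice_reply(raw: bytes) -> tuple[str, str, bytes]:
    """Split a /voice/chat JSON reply into transcript, answer text, and MP3 bytes.

    Key names are assumptions; the issue only states the reply contains
    "transcript, response text, and base64-encoded MP3 audio".
    """
    payload = json.loads(raw)
    audio = base64.b64decode(payload["audio"])
    return payload["transcript"], payload["response"], audio

# Synthetic reply in the assumed shape (no server required):
fake = json.dumps({
    "transcript": "what time is it",
    "response": "It is noon.",
    "audio": base64.b64encode(b"\xff\xfbMP3DATA").decode(),
}).encode()

transcript, answer, mp3 = decode_voice_reply(fake)
```

The decoded `mp3` bytes would then be handed to an audio player; base64 keeps the binary audio embeddable in a single JSON response at the cost of ~33% size overhead.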

Still needed for full deliverable

  • Frontend record button in AI island (browser MediaRecorder API)
  • Audio playback of TTS response in the UI
  • Real-time WebSocket mode for continuous conversation
  • Parakeet/local STT integration (currently uses Groq cloud)

Endpoint is live at https://herodev.gent04.grid.tf/hero_agent/api/voice/chat

Signed-off-by: mik-tf

Author
Owner

Progress — voice AI backend + frontend mic button

Done

  • /api/voice/chat endpoint (STT → Agent → TTS pipeline via hero_aibroker)
  • /api/voice/transcribe endpoint (STT only, for mic button)
  • AI Agent admin dashboard: Voice tab with 3 modes (push-to-talk, VAD, conversation)
  • AI Assistant: mic button in input bar (click to record → transcribe → auto-send)
  • AI Broker endpoint fixed (localhost:9997 proxy, not TCP:8080)
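The mic-button flow above (click to record → transcribe → auto-send) can be sketched as a small client helper. The `/api/voice/transcribe` path comes from the list; the multipart field name `audio` is carried over from the /voice/chat description, and the response key `text` is an assumption:

```python
from typing import Callable

def transcribe_then_send(
    post: Callable[[str, dict], dict],
    send_message: Callable[[str], None],
    audio_webm: bytes,
) -> str:
    """Upload a recorded clip to /api/voice/transcribe, then auto-send the
    transcript as a chat message. `post` abstracts the HTTP client so the
    flow can be exercised without a live server."""
    reply = post("/api/voice/transcribe", {"audio": audio_webm})
    text = reply["text"].strip()
    if text:  # don't auto-send empty or whitespace-only transcripts
        send_message(text)
    return text

# Stubbed usage (fake transport, placeholder WebM bytes):
sent = []
result = transcribe_then_send(
    post=lambda url, files: {"text": " hello hero "},
    send_message=sent.append,
    audio_webm=b"\x1aE\xdf\xa3",
)
```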

Architecture findings

  • hero_voice has excellent WebSocket streaming + Silero VAD (local, 350ms latency)
  • All STT uses cloud Groq Whisper (hero_voice, hero_books, hero_agent); no STT runs locally yet
  • Silero VAD is the only local component (server-side voice activity detection)
  • hero_aibroker has TTS via OpenAI /v1/audio/speech endpoint
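Silero VAD is a neural model; for orientation only, here is a much simpler energy-threshold VAD sketch illustrating what server-side voice activity detection does. This is not Silero, and the threshold is arbitrary:

```python
import struct

def energy_vad(pcm16: bytes, frame_len: int = 320, threshold: float = 500.0) -> list[bool]:
    """Flag each 20 ms frame (16 kHz mono s16le => 320 samples) as speech or
    silence by mean absolute amplitude. A crude stand-in for a real VAD."""
    samples = struct.unpack(f"<{len(pcm16) // 2}h", pcm16)
    flags = []
    for i in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[i:i + frame_len]
        energy = sum(abs(s) for s in frame) / frame_len
        flags.append(energy >= threshold)
    return flags

# One silent frame followed by one loud frame:
silence = struct.pack("<320h", *([0] * 320))
burst = struct.pack("<320h", *([4000] * 320))
flags = energy_vad(silence + burst)
```

A real VAD like Silero classifies frames with a learned model rather than a fixed amplitude cutoff, which is why it stays robust to background noise.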

Remaining (see issue #74 for full plan)

  • Level 1: Polish mic button UX, add read-aloud on responses
  • Level 2: Speaker icon on each AI response + auto-read toggle
  • Level 3: Voice conversation mode via WebSocket (hero_voice integration)
  • Level 4: Shared audio library (herolib_audio) for all services

Signed-off-by: mik-tf

Author
Owner

Voice AI — Level 1 complete

Working

  • Mic button in AI Assistant (Hero Books toggle pattern, no race conditions)
  • Production STT via herolib_ai (direct Groq Whisper, ffmpeg WebM→MP3 conversion)
  • hero_agent self-contained (no dependency on hero_books or hero_aibroker for STT)
  • Read-aloud speaker button on AI responses (server TTS → browser Speech API fallback)
  • AI Agent admin: Voice tab with 3 modes (push-to-talk, VAD, conversation)
  • TTS endpoint with gpt-4o-mini-tts model (ready when OpenAI key added)
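The WebM→MP3 step above presumably shells out to ffmpeg; a hedged sketch of building that command follows. The flags are standard ffmpeg options, but the actual arguments hero_agent uses are not shown in this issue, and the 64k bitrate is an assumption:

```python
def ffmpeg_webm_to_mp3_cmd(src: str, dst: str) -> list[str]:
    """Build an ffmpeg argv converting a browser MediaRecorder WebM/Opus clip
    to MP3 for the Whisper transcription API. -y overwrites the output file,
    -vn drops any video stream."""
    return [
        "ffmpeg", "-y",
        "-i", src,
        "-vn",                    # audio only
        "-acodec", "libmp3lame",  # MP3 encoder
        "-b:a", "64k",            # speech-quality bitrate; an assumption
        dst,
    ]

cmd = ffmpeg_webm_to_mp3_cmd("clip.webm", "clip.mp3")
# run with: subprocess.run(cmd, check=True)  (requires ffmpeg on PATH)
```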

Remaining (tracked in #74)

  • Level 2: Auto-read toggle, voice selector
  • Level 3: WebSocket conversation mode via hero_voice, wake word "Hero"
  • Level 4: Shared herolib_audio crate

Signed-off-by: mik-tf

Reference: lhumina_code/home#73