Enhanced AI Vocal Responsive #73

Closed
opened 2026-03-21 05:19:59 +00:00 by mik-tf · 3 comments
Owner

Situation

  • We have the AI assistant, hero shrimp, ai broker, and hero voice services
  • We can add Parakeet and/or refine hero voice to support STT and TTS

Deliverable

  • Users can talk to the AI vocally and it answers vocally, as a discussion
  • Use MCP to do anything we want on Hero Os via hero services
Author
Owner

Backend endpoint implemented

POST /hero_agent/api/voice/chat — server-side STT → Agent → TTS pipeline:

  1. Accepts multipart audio upload (field: audio)
  2. Sends to hero_aibroker /v1/audio/transcriptions for STT (Groq Whisper)
  3. Passes transcribed text to hero_agent for processing
  4. Sends response text to hero_aibroker /v1/audio/speech for TTS
  5. Returns JSON with transcript, response text, and base64-encoded MP3 audio

Optional fields: conversation_id, voice (default: alloy)
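A minimal sketch of consuming this endpoint's reply on the client side. The issue only says the JSON carries a transcript, the response text, and base64-encoded MP3 audio; the field names `transcript`, `response`, and `audio` below are assumptions, not the confirmed schema:

```python
import base64
import json

def decode_voice_reply(raw: bytes) -> tuple[str, str, bytes]:
    """Split a /voice/chat JSON reply into transcript, answer text, and MP3 bytes.

    Key names are assumptions; the issue only states the reply contains
    "transcript, response text, and base64-encoded MP3 audio".
    """
    payload = json.loads(raw)
    audio = base64.b64decode(payload["audio"])
    return payload["transcript"], payload["response"], audio

# Synthetic reply in the assumed shape (no server required):
fake = json.dumps({
    "transcript": "what time is it",
    "response": "It is noon.",
    "audio": base64.b64encode(b"\xff\xfbMP3DATA").decode(),
}).encode()

transcript, answer, mp3 = decode_voice_reply(fake)
```

The decoded `mp3` bytes would then be handed to an audio player; base64 keeps the binary audio embeddable in a single JSON response at the cost of ~33% size overhead.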

Still needed for full deliverable

  • Frontend record button in AI island (browser MediaRecorder API)
  • Audio playback of TTS response in the UI
  • Real-time WebSocket mode for continuous conversation
  • Parakeet/local STT integration (currently uses Groq cloud)

Endpoint is live at https://herodev.gent04.grid.tf/hero_agent/api/voice/chat

Signed-off-by: mik-tf

Author
Owner

Progress — voice AI backend + frontend mic button

Done

  • /api/voice/chat endpoint (STT → Agent → TTS pipeline via hero_aibroker)
  • /api/voice/transcribe endpoint (STT only, for mic button)
  • AI Agent admin dashboard: Voice tab with 3 modes (push-to-talk, VAD, conversation)
  • AI Assistant: mic button in input bar (click to record → transcribe → auto-send)
  • AI Broker endpoint fixed (localhost:9997 proxy, not TCP:8080)
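The mic-button flow above (click to record → transcribe → auto-send) can be sketched as a small client helper. The `/api/voice/transcribe` path comes from the list; the multipart field name `audio` is carried over from the /voice/chat description, and the response key `text` is an assumption:

```python
from typing import Callable

def transcribe_then_send(
    post: Callable[[str, dict], dict],
    send_message: Callable[[str], None],
    audio_webm: bytes,
) -> str:
    """Upload a recorded clip to /api/voice/transcribe, then auto-send the
    transcript as a chat message. `post` abstracts the HTTP client so the
    flow can be exercised without a live server."""
    reply = post("/api/voice/transcribe", {"audio": audio_webm})
    text = reply["text"].strip()
    if text:  # don't auto-send empty or whitespace-only transcripts
        send_message(text)
    return text

# Stubbed usage (fake transport, placeholder WebM bytes):
sent = []
result = transcribe_then_send(
    post=lambda url, files: {"text": " hello hero "},
    send_message=sent.append,
    audio_webm=b"\x1aE\xdf\xa3",
)
```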

Architecture findings

  • hero_voice has excellent WebSocket streaming + Silero VAD (local, 350ms latency)
  • All STT uses cloud Groq Whisper (hero_voice, hero_books, hero_agent); no STT runs locally yet
  • Silero VAD is the only local component (server-side voice activity detection)
  • hero_aibroker has TTS via OpenAI /v1/audio/speech endpoint
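Silero VAD is a neural model; for orientation only, here is a much simpler energy-threshold VAD sketch illustrating what server-side voice activity detection does. This is not Silero, and the threshold is arbitrary:

```python
import struct

def energy_vad(pcm16: bytes, frame_len: int = 320, threshold: float = 500.0) -> list[bool]:
    """Flag each 20 ms frame (16 kHz mono s16le => 320 samples) as speech or
    silence by mean absolute amplitude. A crude stand-in for a real VAD."""
    samples = struct.unpack(f"<{len(pcm16) // 2}h", pcm16)
    flags = []
    for i in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[i:i + frame_len]
        energy = sum(abs(s) for s in frame) / frame_len
        flags.append(energy >= threshold)
    return flags

# One silent frame followed by one loud frame:
silence = struct.pack("<320h", *([0] * 320))
burst = struct.pack("<320h", *([4000] * 320))
flags = energy_vad(silence + burst)
```

A real VAD like Silero classifies frames with a learned model rather than a fixed amplitude cutoff, which is why it stays robust to background noise.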

Remaining (see issue #74 for full plan)

  • Level 1: Polish mic button UX, add read-aloud on responses
  • Level 2: Speaker icon on each AI response + auto-read toggle
  • Level 3: Voice conversation mode via WebSocket (hero_voice integration)
  • Level 4: Shared audio library (herolib_audio) for all services

Signed-off-by: mik-tf

Author
Owner

Voice AI — Level 1 complete

Working

  • Mic button in AI Assistant (Hero Books toggle pattern, no race conditions)
  • Production STT via herolib_ai (direct Groq Whisper, ffmpeg WebM→MP3 conversion)
  • hero_agent self-contained (no dependency on hero_books or hero_aibroker for STT)
  • Read-aloud speaker button on AI responses (server TTS → browser Speech API fallback)
  • AI Agent admin: Voice tab with 3 modes (push-to-talk, VAD, conversation)
  • TTS endpoint with gpt-4o-mini-tts model (ready when OpenAI key added)
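The WebM→MP3 step above presumably shells out to ffmpeg; a hedged sketch of building that command follows. The flags are standard ffmpeg options, but the actual arguments hero_agent uses are not shown in this issue, and the 64k bitrate is an assumption:

```python
def ffmpeg_webm_to_mp3_cmd(src: str, dst: str) -> list[str]:
    """Build an ffmpeg argv converting a browser MediaRecorder WebM/Opus clip
    to MP3 for the Whisper transcription API. -y overwrites the output file,
    -vn drops any video stream."""
    return [
        "ffmpeg", "-y",
        "-i", src,
        "-vn",                    # audio only
        "-acodec", "libmp3lame",  # MP3 encoder
        "-b:a", "64k",            # speech-quality bitrate; an assumption
        dst,
    ]

cmd = ffmpeg_webm_to_mp3_cmd("clip.webm", "clip.mp3")
# run with: subprocess.run(cmd, check=True)  (requires ffmpeg on PATH)
```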

Remaining (tracked in #74)

  • Level 2: Auto-read toggle, voice selector
  • Level 3: WebSocket conversation mode via hero_voice, wake word "Hero"
  • Level 4: Shared herolib_audio crate

Signed-off-by: mik-tf

Reference: lhumina_code/home#73