Voice-to-text widgets backed by local STT daemon with Groq cloud fallback.
Kristof 0710b48dab
feat: stream VAD segments to silence-stripped Ogg/Opus archiver
Replace the full-session WAV buffer approach (decode everything, write on `end`)
with a background `SpeechOggArchiver` that receives completed VAD speech segments
over a channel and re-encodes them incrementally to one Ogg/Opus file. Because
only speech segments are appended — never inter-segment silence — the archived
file is silence-stripped. Adds `OggOpusRecorder::start_at` (caller-controlled
path), the new `SpeechOggArchiver` struct with `append`/`finish` API and a round-
trip test, replaces `service::save_wav_recording` with `new_recording_path`, and
updates the WebSocket handler to send `topic` in the config frame and refresh the
tree after archival.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-24 11:46:46 +02:00

Hero Voice

Drop-in voice-to-text widgets for any host UI in the Hero ecosystem, backed by a local STT/TTS provider with a cloud Groq fallback and an AI text-transform pipeline.

Features

  • Drop-in browser widgets - <hero-voice-input>, <hero-voice-floating>, <hero-voice-button>, and a data-hero-voice boost for any text input. See Browser widgets below.
  • Click-bounded one-shot capture - MediaRecorder posts an Opus blob; the server transcodes to 16 kHz mono WAV.
  • Streaming dictation — full-duplex WebSocket sends Opus/PCM frames up, VAD-segmented transcript segments stream back in real time.
  • Local-first STT/TTS - hero_voice_provider runs sherpa-onnx Parakeet (STT) and Kokoro (TTS) locally; Groq cloud takes over on failure.
  • Text transformations — 14 built-in AI transformation styles:
    • spellcheck - Grammar and spelling correction
    • specs - Technical specifications
    • code - Software architecture documentation
    • docs - User-friendly documentation
    • legal - Legal document formatting
    • story - Creative narrative
    • summary - Bullet-point summary
    • technical - Technical documentation
    • business - Business analysis
    • meeting - Meeting minutes
    • email - Professional email
    • Language translations: Dutch, French, Arabic
  • Topic organization — Hierarchical folder structure for transcriptions
  • Audio archival — Saves recordings as WAV and compressed OGG
  • Settings & presets — Switch between local and Groq providers, per-user custom vocabulary

Browser widgets

The widgets live at crates/hero_voice_web/static/voice-widget/ (and crates/hero_voice_admin/static/voice-widget/), served under /hero_voice/web/voice-widget/ and /hero_voice/admin/voice-widget/. Each one is a vanilla custom element with no framework dependency.

<!-- 1. Mic button bound to a target field -->
<hero-voice-input target="#desc"></hero-voice-input>
<textarea id="desc"></textarea>

<!-- 2. Boost any input — mic appears on hover/focus -->
<input data-hero-voice />

<!-- 3. Fixed corner mic — fills the last-focused input -->
<hero-voice-floating position="bottom-right"></hero-voice-floating>

<!-- 4. Event-only mic — emits `hero:voice-text`, calls window.fn -->
<hero-voice-button on-text="window.onVoiceText"></hero-voice-button>

Pull in the scripts (use /hero_voice/web/ or /hero_voice/admin/ depending on which UI you're embedding in):

<!-- Paths shown for the web UI; replace /hero_voice/web/ with /hero_voice/admin/ when embedding in the admin UI -->
<link rel="stylesheet" href="/hero_voice/web/voice-widget/voice-widget.css" />
<script src="/hero_voice/web/voice-widget/components.js"></script>
<script src="/hero_voice/web/voice-widget/floating.js"></script>
<script src="/hero_voice/web/voice-widget/boost.js"></script>

A standalone demo lives at /hero_voice/admin/voice-widget/test.html.
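
For the event-only button, the host page has to supply the function named in on-text. The README does not document the payload shape, so the sketch below assumes it is either a plain string or an event-like object carrying the transcript in `detail.text`; `extractVoiceText` and `appendTranscript` are hypothetical helper names, and `#desc` is borrowed from the first widget example.

```javascript
// Assumption: the payload is a plain string or an object with `detail.text`.
// Check the widget source for the real shape before relying on this.
function extractVoiceText(payload) {
  if (typeof payload === "string") return payload;
  if (payload && payload.detail && typeof payload.detail.text === "string") {
    return payload.detail.text;
  }
  return "";
}

// Append a transcript to existing field content with one separating space.
function appendTranscript(current, text) {
  const addition = text.trim();
  if (!addition) return current;
  return current ? `${current.trimEnd()} ${addition}` : addition;
}

// Browser-only wiring; skipped when the script runs outside a page.
if (typeof window !== "undefined" && typeof document !== "undefined") {
  window.onVoiceText = (payload) => {
    const field = document.querySelector("#desc");
    if (field) {
      field.value = appendTranscript(field.value, extractVoiceText(payload));
    }
  };
}
```

Keeping the text handling in pure functions makes it easy to unit-test without a microphone or a DOM.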

Requirements

  • Rust 1.96+
  • A running hero_proc (service lifecycle and secret store for AI keys)
  • GROQ_API_KEY environment variable — required for the cloud STT/TTS fallback
  • OPENROUTER_API_KEY environment variable — required for transform_content / transform_text
  • Modern browser with MediaRecorder + microphone support (see Browser Support)

Configuration

AI provider keys are stored in the hero_proc secret store:

hero_proc secret set GROQ_API_KEY gsk_...
hero_proc secret set OPENROUTER_API_KEY sk-or-...

Optional environment variables:

| Var | Default | Purpose |
| --- | --- | --- |
| RUST_LOG | (unset) | Tracing filter, e.g. hero_voice=info |
| HERO_VOICE_SHERPA_DIR | ~/hero/share/hero_voice/voice-widget/sherpa | Browser-side sherpa WASM/data dir (parked wake-word bundle) |
| HERO_VOICE_VAD_DIR | ~/hero/share/hero_voice/vad | Silero VAD model dir for live streaming dictation |
| HERO_VOICE_PROVIDER_URL | (unset) | Delegate STT/TTS to a shared admin hero_voice_provider engine |
| HERO_VOICE_PROVIDER_TOKEN | (unset) | Bearer token for the shared engine |
| HERO_VOICE_PROVIDER_CONTEXT | (unset) | Per-tester context string for the shared engine |
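
The precedence these variables imply (an explicit value wins, otherwise the default applies) can be sketched as follows. The server itself is written in Rust, so this JavaScript is only an illustration: `resolveVoiceConfig` and `expandHome` are hypothetical names, and treating the token and context as meaningful only when a provider URL is set is an assumption.

```javascript
// Defaults copied from the variable list above; "~" is expanded against `home`.
const DEFAULT_DIRS = {
  HERO_VOICE_SHERPA_DIR: "~/hero/share/hero_voice/voice-widget/sherpa",
  HERO_VOICE_VAD_DIR: "~/hero/share/hero_voice/vad",
};

function expandHome(path, home) {
  return path.startsWith("~/") ? `${home}/${path.slice(2)}` : path;
}

// Hypothetical helper: effective settings for a given environment object.
function resolveVoiceConfig(env, home) {
  const dir = (name) => expandHome(env[name] || DEFAULT_DIRS[name], home);
  return {
    sherpaDir: dir("HERO_VOICE_SHERPA_DIR"),
    vadDir: dir("HERO_VOICE_VAD_DIR"),
    // Assumption: no URL means no shared engine, so token/context are ignored.
    sharedProvider: env.HERO_VOICE_PROVIDER_URL
      ? {
          url: env.HERO_VOICE_PROVIDER_URL,
          token: env.HERO_VOICE_PROVIDER_TOKEN || null,
          context: env.HERO_VOICE_PROVIDER_CONTEXT || null,
        }
      : null,
  };
}
```

Under Node this could be called as `resolveVoiceConfig(process.env, require("os").homedir())`.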

Streaming dictation needs the Silero VAD model

The streaming transcription path (/hero_voice/rest/transcribe/ws/..., used by the browser mic button) segments speech locally with a small (~0.6 MB) Silero VAD model before sending each segment to the STT engine. hero_voice_server fetches this model on startup via vad::ensure_silero_bundle() (idempotent, a no-op once present, non-fatal if the download fails). This matters on a member instance that uses a shared STT engine: it still needs its own copy of this VAD model to do live segmentation. Without it the server falls back to transcribing the whole utterance only when the user presses Stop. Manual recovery: drop silero_vad.onnx into ~/hero/share/hero_voice/vad/ and restart hero_voice_server.

Usage

lab service voice --start

Services listen on Unix sockets only (no TCP). Use hero_router for browser access.

Sockets

| Socket | Mount via hero_router | Purpose |
| --- | --- | --- |
| hero_voice/rpc.sock | /hero_voice/rpc/ | JSON-RPC 2.0 (26 domain methods + SSE events) |
| hero_voice/rest.sock | /hero_voice/rest/ | WebSocket streaming transcription |
| hero_voice/web.sock | /hero_voice/web/ | End-user web UI, widget assets, file downloads, MCP |
| hero_voice/admin.sock | /hero_voice/admin/ | Admin dashboard UI |

Architecture

hero_voice/
├── crates/
│   ├── hero_voice/           # CLI lifecycle manager (start/stop via hero_proc)
│   ├── hero_voice_server/    # JSON-RPC 2.0 server (rpc.sock) + WebSocket streaming (rest.sock)
│   ├── hero_voice_web/       # End-user web UI (web.sock)
│   ├── hero_voice_admin/     # Admin dashboard UI (admin.sock)
│   ├── hero_voice_sdk/       # Generated OpenRPC client SDK
│   ├── hero_voice_app/       # Dioxus WASM app (experimental)
│   └── hero_voice_examples/  # Example programs using the SDK
├── docs/schemas/             # Domain schema documentation
├── wasm/                     # Browser-side WASM build (parked wake-word bundle)
└── Cargo.toml

Data flow

Browser widget (MediaRecorder → Opus blob)
  │
  ├─ One-shot STT ───────────▶ rpc.sock (VoiceService.transcribe via JSON-RPC)
  │                              │
  ├─ Streaming dictation ────▶ rest.sock (WebSocket: /transcribe/ws/{session_id})
  │                              │
  │                              ├── hero_voice_provider (local, priority 0)
  │                              └── Groq Whisper (cloud fallback)
  │
  ├─ Web UI / widgets ───────▶ web.sock (hero_voice_web)
  │     ├── /voice-widget/*       widget bundle
  │     ├── /files/audio/*        audio downloads
  │     ├── /files/transforms/*   transform downloads
  │     └── /mcp                  MCP-to-OpenRPC proxy
  │
  └─ Admin dashboard ────────▶ admin.sock (hero_voice_admin)

hero_voice_server (rpc.sock + rest.sock)
  ├── VoiceService.*              domain methods (26 total)
  ├── /transcribe/ws/{sid}        full-duplex WebSocket streaming
  └── /events?session_id=...      SSE transcript events

API

JSON-RPC Endpoint

All data operations use JSON-RPC 2.0 at /hero_voice/rpc/rpc — served by hero_router directly from rpc.sock.

VoiceService methods (26 total):

| Category | Methods |
| --- | --- |
| Topics | create_topic, list_all_topics, rename_topic, move_topic, delete_topic, reset_topic |
| Folders | create_folder, list_all_folders, rename_folder, move_folder, delete_folder |
| Content | save_content, transform_content, transform_text |
| Audio | register_audio, delete_audio, get_audio_path |
| STT/TTS | transcribe, transcribe_path, synthesize_speech |
| Settings | get_settings, set_settings, list_presets, apply_preset |
| Dictionary | get_user_dictionary, set_user_dictionary |
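
A call to one of these methods is a standard JSON-RPC 2.0 envelope posted to the endpoint above. The sketch below is a minimal client under two assumptions: the `VoiceService.` method prefix is taken from the data-flow diagram, and the shape of `params` is not documented here, so it is left as a caller-supplied object. All three function names are hypothetical.

```javascript
let nextId = 1;

// Build a JSON-RPC 2.0 request envelope.
// Assumption: methods are addressed as "VoiceService.<name>".
function buildRpcRequest(method, params = {}) {
  return { jsonrpc: "2.0", id: nextId++, method: `VoiceService.${method}`, params };
}

// Unwrap a JSON-RPC 2.0 response: return `result`, or throw on `error`.
function unwrapRpcResponse(response) {
  if (response.error) {
    throw new Error(`RPC error ${response.error.code}: ${response.error.message}`);
  }
  return response.result;
}

// POST to the documented endpoint; needs a runtime that provides fetch.
async function callVoiceService(method, params) {
  const res = await fetch("/hero_voice/rpc/rpc", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(buildRpcRequest(method, params)),
  });
  return unwrapRpcResponse(await res.json());
}
```

For example, `await callVoiceService("list_all_topics")` would request the topic list, assuming that method takes no required parameters.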

Streaming transcription

GET /hero_voice/rest/transcribe/ws/{session_id} — full-duplex WebSocket. Send a config text frame, then binary Opus/PCM frames, then {"type":"end"}. The backend Silero VAD segments speech and streams {"text":…} back per segment, finishing with {"done":true}.
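
The frame sequence described above can be sketched as a small browser client. The server frames (`{"text":…}` per segment, `{"done":true}` at the end) and the `{"type":"end"}` marker come from the description; the contents of the config frame are not specified here, so `{ type: "config" }` is a placeholder assumption, and `applyServerFrame` and `streamDictation` are hypothetical names. For brevity the sketch sends pre-collected chunks rather than live microphone audio.

```javascript
// Fold one server text frame into the running transcript state.
// Frames that are neither a segment nor the done marker are ignored.
function applyServerFrame(state, raw) {
  const msg = JSON.parse(raw);
  if (typeof msg.text === "string") {
    return { ...state, segments: [...state.segments, msg.text] };
  }
  if (msg.done === true) return { ...state, done: true };
  return state;
}

// Browser-side sketch: config frame, binary audio frames, then the end marker.
function streamDictation(sessionId, chunks, onUpdate) {
  const scheme = location.protocol === "https:" ? "wss" : "ws";
  const path = `/hero_voice/rest/transcribe/ws/${encodeURIComponent(sessionId)}`;
  const ws = new WebSocket(`${scheme}://${location.host}${path}`);
  let state = { segments: [], done: false };

  ws.onopen = () => {
    ws.send(JSON.stringify({ type: "config" })); // assumed config shape
    for (const chunk of chunks) ws.send(chunk); // binary Opus/PCM frames
    ws.send(JSON.stringify({ type: "end" }));
  };
  ws.onmessage = (event) => {
    state = applyServerFrame(state, event.data);
    onUpdate(state);
    if (state.done) ws.close();
  };
  return ws;
}
```

Separating the reducer from the socket wiring keeps the transcript logic testable without a live connection.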

Static Files

  • GET /hero_voice/{web,admin}/files/audio/{filename} — Audio file downloads
  • GET /hero_voice/{web,admin}/files/transforms/{filename} — Transform file downloads

Audio Processing

  • Capture: Browser MediaRecorder, Opus in Ogg (Firefox) or WebM (Chromium / Safari 18.4+) container, click-bounded one-shot per recording.
  • Server-side transcode: Opus → 16 kHz mono WAV before STT.
  • Archival: Saved recordings stored as WAV plus compressed OGG Vorbis (~10% of WAV size).
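
Because the server only decodes Opus, a client may want to probe for an Opus-capable container before recording. The sketch below uses the standard `MediaRecorder.isTypeSupported` check; the candidate order and the helper names `pickOpusMimeType` and `createRecorder` are assumptions, not taken from the widget source.

```javascript
// Candidate containers in preference order; both carry Opus audio.
const OPUS_MIME_CANDIDATES = ["audio/webm;codecs=opus", "audio/ogg;codecs=opus"];

// Return the first supported Opus MIME type, or null when none is available.
// `isSupported` is injected so the choice can be tested outside a browser.
function pickOpusMimeType(isSupported) {
  return OPUS_MIME_CANDIDATES.find((type) => isSupported(type)) || null;
}

// Browser usage: fail early instead of recording a format the server cannot decode.
function createRecorder(stream) {
  const mimeType = pickOpusMimeType((type) => MediaRecorder.isTypeSupported(type));
  if (!mimeType) {
    throw new Error("No Opus-capable MediaRecorder format available");
  }
  return new MediaRecorder(stream, { mimeType });
}
```

In a page, `stream` would come from `navigator.mediaDevices.getUserMedia({ audio: true })`.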

Browser Support

  • Chrome 120+
  • Firefox 120+
  • Safari 18.4+ (Opus in MediaRecorder; older Safari falls back to MP4 which the server doesn't currently decode)
  • Edge 120+

Requires microphone permission.

Embedding & CORS

Hero Voice allows iframe embedding (no X-Frame-Options restrictions), cross-origin API calls, and WebSocket connections from any origin.

License

Apache-2.0