Hero Voice
Drop-in voice-to-text widgets for any host UI in the Hero ecosystem, backed by a local STT/TTS provider with a cloud Groq fallback and an AI text-transform pipeline.
Features
- Drop-in browser widgets: `<hero-voice-input>`, `<hero-voice-floating>`, `<hero-voice-button>`, and a `data-hero-voice` boost for any text input. See Browser widgets below.
- Click-bounded one-shot capture: `MediaRecorder` posts an Opus blob; the server transcodes to 16 kHz mono WAV.
- Streaming dictation: full-duplex WebSocket sends Opus/PCM frames up, VAD-segmented transcript segments stream back in real time.
- Local-first STT/TTS: `hero_voice_provider` runs sherpa-onnx Parakeet (STT) and Kokoro (TTS) locally; Groq cloud takes over on failure.
- Text transformations: 14 built-in AI transformation styles:
  - `spellcheck` - Grammar and spelling correction
  - `specs` - Technical specifications
  - `code` - Software architecture documentation
  - `docs` - User-friendly documentation
  - `legal` - Legal document formatting
  - `story` - Creative narrative
  - `summary` - Bullet-point summary
  - `technical` - Technical documentation
  - `business` - Business analysis
  - `meeting` - Meeting minutes
  - `email` - Professional email
  - Language translations: Dutch, French, Arabic
- Topic organization: Hierarchical folder structure for transcriptions
- Audio archival: Saves recordings as WAV and compressed OGG
- Settings & presets: Switch between local and Groq providers, per-user custom vocabulary
Browser widgets
The widgets live at crates/hero_voice_web/static/voice-widget/ (and
crates/hero_voice_admin/static/voice-widget/), served under
/hero_voice/web/voice-widget/ and /hero_voice/admin/voice-widget/.
Each one is a vanilla custom element with no framework dependency.
<!-- 1. Mic button bound to a target field -->
<hero-voice-input target="#desc"></hero-voice-input>
<textarea id="desc"></textarea>
<!-- 2. Boost any input — mic appears on hover/focus -->
<input data-hero-voice />
<!-- 3. Fixed corner mic — fills the last-focused input -->
<hero-voice-floating position="bottom-right"></hero-voice-floating>
<!-- 4. Event-only mic — emits `hero:voice-text`, calls window.fn -->
<hero-voice-button on-text="window.onVoiceText"></hero-voice-button>
Pull in the scripts (use /hero_voice/web/ or /hero_voice/admin/ depending on which UI you're embedding in):
<link rel="stylesheet" href="/hero_voice/{web,admin}/voice-widget/voice-widget.css" />
<script src="/hero_voice/{web,admin}/voice-widget/components.js"></script>
<script src="/hero_voice/{web,admin}/voice-widget/floating.js"></script>
<script src="/hero_voice/{web,admin}/voice-widget/boost.js"></script>
A standalone demo lives at
/hero_voice/admin/voice-widget/test.html.
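The `{web,admin}` placeholder in the tags above stands for one of the two mounts. As a minimal sketch, a host page that builds its tags dynamically could expand it with a small helper; `widgetAssets` is a hypothetical name, not part of the widget bundle.

```javascript
// Hypothetical helper: expands the {web,admin} placeholder into concrete
// asset URLs for one UI mount. File names are taken from the tags above.
function widgetAssets(ui) {
  if (ui !== "web" && ui !== "admin") {
    throw new Error(`unknown UI mount: ${ui}`);
  }
  const base = `/hero_voice/${ui}/voice-widget`;
  return {
    stylesheet: `${base}/voice-widget.css`,
    // Same order as the <script> tags listed above.
    scripts: ["components.js", "floating.js", "boost.js"].map(
      (file) => `${base}/${file}`
    ),
  };
}
```

Calling `widgetAssets("admin")` returns the stylesheet URL plus the three script URLs under `/hero_voice/admin/voice-widget/`.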
Requirements
- Rust 1.96+
- A running `hero_proc` (service lifecycle and secret store for AI keys)
- `GROQ_API_KEY` environment variable, required for the cloud STT/TTS fallback
- `OPENROUTER_API_KEY` environment variable, required for `transform_content` / `transform_text`
- Modern browser with `MediaRecorder` + microphone support (see Browser Support)
Configuration
AI provider keys are stored in the hero_proc secret store:
hero_proc secret set GROQ_API_KEY gsk_...
hero_proc secret set OPENROUTER_API_KEY sk-or-...
Optional environment variables:
| Var | Default | Purpose |
|---|---|---|
| `RUST_LOG` | (unset) | Tracing filter, e.g. `hero_voice=info` |
| `HERO_VOICE_SHERPA_DIR` | `~/hero/share/hero_voice/voice-widget/sherpa` | Browser-side sherpa WASM/data dir (parked wake-word bundle) |
| `HERO_VOICE_VAD_DIR` | `~/hero/share/hero_voice/vad` | Silero VAD model dir for live streaming dictation |
| `HERO_VOICE_PROVIDER_URL` | (unset) | Delegate STT/TTS to a shared admin `hero_voice_provider` engine |
| `HERO_VOICE_PROVIDER_TOKEN` | (unset) | Bearer token for the shared engine |
| `HERO_VOICE_PROVIDER_CONTEXT` | (unset) | Per-tester context string for the shared engine |
Streaming dictation needs the Silero VAD model
The streaming transcription path (/hero_voice/rest/transcribe/ws/..., used by
the browser mic button) segments speech locally with a small (~0.6 MB) Silero
VAD model before sending each segment to the STT engine. hero_voice_server
fetches this model on startup via vad::ensure_silero_bundle() (idempotent, a
no-op once present, non-fatal if the download fails). This matters on a member
instance that uses a shared STT engine: it still needs its own copy of this VAD
model to do live segmentation. Without it the server falls back to transcribing
the whole utterance only when the user presses Stop. Manual recovery: drop
silero_vad.onnx into ~/hero/share/hero_voice/vad/ and restart
hero_voice_server.
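The lookup and fallback described above can be sketched as two pure functions. This is an illustrative JavaScript model only: the real logic lives in the Rust server, and `vadModelPath`, `segmentationMode`, and the two mode names are hypothetical.

```javascript
// Where the Silero model is expected, given the environment and home dir.
// The default directory and file name come from the text above.
function vadModelPath(env, home) {
  const dir = env.HERO_VOICE_VAD_DIR || `${home}/hero/share/hero_voice/vad`;
  // Strip trailing slashes so the join never produces "//".
  return `${dir.replace(/\/+$/, "")}/silero_vad.onnx`;
}

// Assumed mode names: with the model present, speech is segmented live;
// without it, the whole utterance is transcribed when the user presses Stop.
function segmentationMode(modelPresent) {
  return modelPresent ? "live-segments" : "whole-utterance-on-stop";
}
```

For manual recovery, the path returned by `vadModelPath(process.env, os.homedir())` in Node is where `silero_vad.onnx` should be placed under this model.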
Usage
lab service voice --start
Services listen on Unix sockets only (no TCP). Use hero_router for browser access.
Sockets
| Socket | Mount via hero_router | Purpose |
|---|---|---|
| `hero_voice/rpc.sock` | `/hero_voice/rpc/` | JSON-RPC 2.0 (26 domain methods + SSE events) |
| `hero_voice/rest.sock` | `/hero_voice/rest/` | WebSocket streaming transcription |
| `hero_voice/web.sock` | `/hero_voice/web/` | End-user web UI, widget assets, file downloads, MCP |
| `hero_voice/admin.sock` | `/hero_voice/admin/` | Admin dashboard UI |
Architecture
hero_voice/
├── crates/
│ ├── hero_voice/ # CLI lifecycle manager (start/stop via hero_proc)
│ ├── hero_voice_server/ # JSON-RPC 2.0 server (rpc.sock) + WebSocket streaming (rest.sock)
│ ├── hero_voice_web/ # End-user web UI (web.sock)
│ ├── hero_voice_admin/ # Admin dashboard UI (admin.sock)
│ ├── hero_voice_sdk/ # Generated OpenRPC client SDK
│ ├── hero_voice_app/ # Dioxus WASM app (experimental)
│ └── hero_voice_examples/ # Example programs using the SDK
├── docs/schemas/ # Domain schema documentation
├── wasm/ # Browser-side WASM build (parked wake-word bundle)
└── Cargo.toml
Data flow
Browser widget (MediaRecorder → Opus blob)
│
├─ One-shot STT ───────────▶ rpc.sock (VoiceService.transcribe via JSON-RPC)
│ │
├─ Streaming dictation ────▶ rest.sock (WebSocket: /transcribe/ws/{session_id})
│ │
│ ├── hero_voice_provider (local, priority 0)
│ └── Groq Whisper (cloud fallback)
│
├─ Web UI / widgets ───────▶ web.sock (hero_voice_web)
│ ├── /voice-widget/* widget bundle
│ ├── /files/audio/* audio downloads
│ ├── /files/transforms/* transform downloads
│ └── /mcp MCP-to-OpenRPC proxy
│
└─ Admin dashboard ────────▶ admin.sock (hero_voice_admin)
hero_voice_server (rpc.sock + rest.sock)
├── VoiceService.* domain methods (26 total)
├── /transcribe/ws/{sid} full-duplex WebSocket streaming
└── /events?session_id=... SSE transcript events
API
JSON-RPC Endpoint
All data operations use JSON-RPC 2.0 at /hero_voice/rpc/rpc — served by
hero_router directly from rpc.sock.
VoiceService methods (26 total):
| Category | Methods |
|---|---|
| Topics | create_topic, list_all_topics, rename_topic, move_topic, delete_topic, reset_topic |
| Folders | create_folder, list_all_folders, rename_folder, move_folder, delete_folder |
| Content | save_content, transform_content, transform_text |
| Audio | register_audio, delete_audio, get_audio_path |
| STT/TTS | transcribe, transcribe_path, synthesize_speech |
| Settings | get_settings, set_settings, list_presets, apply_preset |
| Dictionary | get_user_dictionary, set_user_dictionary |
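A minimal sketch of a JSON-RPC 2.0 envelope for these methods follows. The `VoiceService.` prefix matches the data-flow diagram, but parameter shapes are not documented here, so the empty `params` object, the HTTP POST transport, and the helper names are assumptions.

```javascript
// Build a JSON-RPC 2.0 request object for a VoiceService method.
// `params` defaults to {}; the real parameter shapes are assumed, not documented.
function rpcRequest(method, params = {}, id = 1) {
  return { jsonrpc: "2.0", id, method: `VoiceService.${method}`, params };
}

// Unwrap a JSON-RPC 2.0 response: return `result`, or throw on `error`.
function rpcResult(response) {
  if (response.error) {
    throw new Error(`RPC ${response.error.code}: ${response.error.message}`);
  }
  return response.result;
}

// Browser usage (assumes the endpoint accepts POST with a JSON body):
//   const res = await fetch("/hero_voice/rpc/rpc", {
//     method: "POST",
//     headers: { "Content-Type": "application/json" },
//     body: JSON.stringify(rpcRequest("list_all_topics")),
//   });
//   const topics = rpcResult(await res.json());
```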
Streaming transcription
GET /hero_voice/rest/transcribe/ws/{session_id} — full-duplex WebSocket.
Send a config text frame, then binary Opus/PCM frames, then {"type":"end"}.
The backend Silero VAD segments speech and streams {"text":…} back per
segment, finishing with {"done":true}.
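The frame sequence above can be sketched as a pair of pure helpers: one that orders the outgoing frames and one that folds incoming messages into a transcript. The config fields are not documented in this README, so the `topic` field in the usage note is only an assumed example, and both helper names are hypothetical.

```javascript
// Order the client frames: config text frame, binary audio, then the end marker.
function clientFrames(config, audioChunks) {
  return [
    JSON.stringify(config),
    ...audioChunks, // binary Opus/PCM frames, passed through untouched
    JSON.stringify({ type: "end" }),
  ];
}

// Fold one server text message into the running transcript state.
function reduceTranscript(state, rawMessage) {
  const msg = JSON.parse(rawMessage);
  if (msg.done === true) return { ...state, done: true };
  if (typeof msg.text === "string") {
    return { ...state, segments: [...state.segments, msg.text] };
  }
  return state; // ignore message shapes this sketch does not know about
}
```

In a browser, each entry from `clientFrames({ topic: "notes" }, chunks)` would be passed to `WebSocket.send`, and each `message` event's `data` fed through `reduceTranscript`, starting from `{ segments: [], done: false }`.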
Static Files
- `GET /hero_voice/{web,admin}/files/audio/{filename}`: Audio file downloads
- `GET /hero_voice/{web,admin}/files/transforms/{filename}`: Transform file downloads
Audio Processing
- Capture: Browser `MediaRecorder`, Opus in Ogg (Firefox) or WebM (Chromium / Safari 18.4+) container, click-bounded one-shot per recording.
- Server-side transcode: Opus → 16 kHz mono WAV before STT.
- Archival: Saved recordings stored as WAV plus compressed OGG Vorbis (~10% of WAV size).
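To put the archival figure in concrete terms, here is a rough size estimate. It assumes 16-bit PCM samples and a canonical 44-byte WAV header, neither of which is stated above, and it applies the ~10% ratio as a flat factor.

```javascript
// Size of a PCM WAV file in bytes. 16 kHz mono comes from the text above;
// 2 bytes per sample and the 44-byte header are assumptions.
function wavBytes(seconds, sampleRate = 16000, channels = 1, bytesPerSample = 2) {
  return 44 + seconds * sampleRate * channels * bytesPerSample;
}

// Rough OGG size using the "~10% of WAV size" figure as a flat ratio.
function estimatedOggBytes(seconds) {
  return Math.round(wavBytes(seconds) * 0.1);
}
```

Under these assumptions a one-minute recording is 1,920,044 bytes as WAV (about 1.9 MB) and roughly 192 KB as OGG.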
Browser Support
- Chrome 120+
- Firefox 120+
- Safari 18.4+ (Opus in MediaRecorder; older Safari falls back to MP4 which the server doesn't currently decode)
- Edge 120+
Requires microphone permission.
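A host page can check for Opus capture before showing a mic button. The sketch below is not taken from the widget source: the candidate list and the `pickOpusMimeType` name are assumptions, while `MediaRecorder.isTypeSupported` is the standard browser API.

```javascript
// Candidate Opus containers, matching the capture formats listed under Audio Processing.
const OPUS_CANDIDATES = ["audio/webm;codecs=opus", "audio/ogg;codecs=opus"];

// Return the first supported Opus MIME type, or null when none is supported.
// The support check is injected so the function also runs outside a browser.
function pickOpusMimeType(isTypeSupported) {
  return OPUS_CANDIDATES.find((type) => isTypeSupported(type)) ?? null;
}

// Browser usage:
//   const mimeType = pickOpusMimeType((t) => MediaRecorder.isTypeSupported(t));
//   if (mimeType === null) { /* hide the mic button or show a notice */ }
```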
Embedding & CORS
Hero Voice allows iframe embedding (no X-Frame-Options restrictions), cross-origin API calls, and WebSocket connections from any origin.
License
Apache-2.0