Hero Voice
Voice-to-markdown transcription server with real-time speech recognition, AI-powered text transformation, and live preview.
Features
- Real-time voice transcription - Stream audio from browser to server via WebSocket
- Voice Activity Detection - Silero VAD V5 neural network detects speech/silence transitions
- Automatic segmentation - Transcribes on natural pauses (350ms silence threshold)
- AI transcription - Uses Groq WhisperLargeV3Turbo with automatic failover
- Live markdown preview - Split-screen editor with real-time rendered HTML
- Text transformations - 14 built-in AI transformation styles:
  - spellcheck - Grammar and spelling correction
  - specs - Technical specifications
  - code - Software architecture documentation
  - docs - User-friendly documentation
  - legal - Legal document formatting
  - story - Creative narrative
  - summary - Bullet-point summary
  - technical - Technical documentation
  - business - Business analysis
  - meeting - Meeting minutes
  - email - Professional email
  - Language translations: Dutch, French, Arabic
- Topic organization - Hierarchical folder structure for transcriptions
- Audio archival - Saves recordings as WAV and compressed OGG
Requirements
- Rust 1.92+
- Groq API key (required for transcription)
- Modern browser with Web Audio API and microphone support
Configuration
# Required
export GROQ_API_KEY=your-groq-api-key
# Optional fallback providers
export OPENROUTER_API_KEY=your-openrouter-key
export SAMBANOVA_API_KEY=your-sambanova-key
# Server configuration (optional)
export RUST_LOG=hero_voice=info # Log level
Usage
service voice start --update --reset
The server and UI listen on Unix sockets only (no TCP). Use hero_proxy for external access.
Sockets
| Service | Socket Path |
|---|---|
| Server (OpenRPC) | ~/hero/var/sockets/hero_voice/rpc.sock |
| UI (HTTP + /rpc proxy + WebSocket) | ~/hero/var/sockets/hero_voice/web.sock |
hero_voiced — local OpenAI-compatible STT/TTS daemon
hero_voiced is a stateless TCP daemon that loads sherpa-onnx Parakeet (STT) and
Kokoro (TTS) once and exposes them over an OpenAI-compatible API. It's
designed to be auto-discovered and registered as a priority-0 backend by
hero_aibroker, so that any consumer using herolib_ai against the broker
gets local inference for free, with cloud fallback (Groq, etc.) handled by
the broker — not the daemon.
Endpoints:
- POST /v1/audio/transcriptions - multipart form (file, model, language, prompt, response_format). Default response: {"text": "..."}.
- POST /v1/audio/speech - JSON {model, input, voice, response_format, speed}. Supports response_format of wav (default) and pcm.
- GET /v1/models - local engine identifiers.
- GET /health - {status, service, version, models_ready}.
- GET /.well-known/heroservice.json - discovery manifest.
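As a quick way to probe the daemon, the GET /health endpoint can be called with nothing but the Rust standard library. The sketch below assumes the default loopback port 8094 and plain HTTP/1.1; the helper names `health_request` and `parse_status_code` are hypothetical and not part of hero_voiced.

```rust
use std::io::{Read, Write};
use std::net::TcpStream;
use std::time::Duration;

/// Build a minimal HTTP/1.1 GET request for the /health endpoint.
fn health_request(port: u16) -> String {
    format!("GET /health HTTP/1.1\r\nHost: 127.0.0.1:{port}\r\nConnection: close\r\n\r\n")
}

/// Extract the numeric status code from the first line of an HTTP response.
fn parse_status_code(response: &str) -> Option<u16> {
    let status_line = response.lines().next()?;
    let mut parts = status_line.split_whitespace();
    let version = parts.next()?;
    if !version.starts_with("HTTP/") {
        return None;
    }
    parts.next()?.parse().ok()
}

fn main() {
    // Assumption: the daemon runs locally on its default port.
    let port = 8094;
    let addr = format!("127.0.0.1:{port}");
    match TcpStream::connect(&addr) {
        Ok(mut stream) => {
            let _ = stream.set_read_timeout(Some(Duration::from_secs(5)));
            if let Err(err) = stream.write_all(health_request(port).as_bytes()) {
                eprintln!("write failed: {err}");
                return;
            }
            let mut response = String::new();
            match stream.read_to_string(&mut response) {
                Ok(_) => println!("status: {:?}\n{response}", parse_status_code(&response)),
                Err(err) => eprintln!("read failed: {err}"),
            }
        }
        Err(err) => eprintln!("hero_voiced not reachable at {addr}: {err}"),
    }
}
```

The response body should carry the status, service, version, and models_ready fields listed for the endpoint; parsing that JSON is left out here to keep the sketch dependency-free.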
Environment:
| Var | Default | Purpose |
|---|---|---|
| HERO_VOICED_PORT | 8094 | Loopback TCP port |
| HERO_VOICED_ADDRESS | (unset) | Optional second bind (e.g. mycelium IPv6) |
| HERO_VOICE_STT_SHERPA_DIR | ~/hero/share/hero_voice/stt/parakeet | Parakeet bundle dir |
| HERO_VOICE_TTS_KOKORO_DIR | ~/hero/share/hero_voice/kokoro-en-v0_19 | Kokoro bundle dir |
Both bundle dirs auto-populate on first hero_voiced start (~770 MB combined
download from the sherpa-onnx GitHub releases). make parakeet-deps /
make tts-deps / make model-deps remain available for offline pre-bake on
images and CI.
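To make the defaults in the environment table concrete, here is a minimal sketch of how such settings could be resolved. It is not the daemon's actual code: the `VoicedConfig` struct, `resolve_config`, and `expand_home` are hypothetical, and falling back to the default on an unparsable port is an assumption.

```rust
use std::env;
use std::path::PathBuf;

/// Resolved hero_voiced settings (hypothetical struct, for illustration only).
#[derive(Debug, PartialEq)]
struct VoicedConfig {
    port: u16,
    extra_address: Option<String>,
    stt_dir: PathBuf,
    tts_dir: PathBuf,
}

/// Expand a leading "~/" using the given home directory.
fn expand_home(path: &str, home: &str) -> PathBuf {
    match path.strip_prefix("~/") {
        Some(rest) => PathBuf::from(home).join(rest),
        None => PathBuf::from(path),
    }
}

/// Build the config from a lookup function so the logic can be exercised
/// without touching the real process environment.
fn resolve_config<F>(lookup: F, home: &str) -> VoicedConfig
where
    F: Fn(&str) -> Option<String>,
{
    // Assumption: an invalid port value silently falls back to the default.
    let port = lookup("HERO_VOICED_PORT")
        .and_then(|v| v.parse::<u16>().ok())
        .unwrap_or(8094);
    let extra_address = lookup("HERO_VOICED_ADDRESS").filter(|v| !v.is_empty());
    let stt_dir = lookup("HERO_VOICE_STT_SHERPA_DIR")
        .unwrap_or_else(|| "~/hero/share/hero_voice/stt/parakeet".to_string());
    let tts_dir = lookup("HERO_VOICE_TTS_KOKORO_DIR")
        .unwrap_or_else(|| "~/hero/share/hero_voice/kokoro-en-v0_19".to_string());
    VoicedConfig {
        port,
        extra_address,
        stt_dir: expand_home(&stt_dir, home),
        tts_dir: expand_home(&tts_dir, home),
    }
}

fn main() {
    let home = env::var("HOME").unwrap_or_else(|_| "/".to_string());
    let cfg = resolve_config(|key| env::var(key).ok(), &home);
    println!("{cfg:?}");
}
```

Passing the lookup as a closure keeps the resolution logic testable without mutating process-wide environment variables.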
Run standalone:
make voiced
# or
cargo run -p hero_voiced
Architecture
Hero Voice follows the standard Hero three-crate model (core library, server, SDK), extended with UI and examples crates:
hero_voice/
├── crates/
│ ├── hero_voice/ # Core library (types, domain logic, audio, transcription)
│ ├── hero_voice_server/ # JSON-RPC 2.0 server over Unix socket
│ ├── hero_voice_sdk/ # Generated client SDK
│ ├── hero_voice_ui/ # Admin UI (Axum HTTP + /rpc proxy + WebSocket)
│ └── hero_voice_examples/ # Example programs using the SDK
├── schemas/voice/voice.oschema # Domain schema (source of truth)
├── data/ # Runtime data (OTOML storage, audio, transforms)
├── Cargo.toml
├── Makefile
└── buildenv.sh
Data flow
Browser (WebSocket)
│
▼
hero_voice_ui (Unix socket)
├── /rpc endpoint → proxies JSON-RPC to hero_voice_server.sock
├── /mcp endpoint → MCP-to-OpenRPC translation
├── /ws endpoint → WebSocket audio streaming
└── /* fallback → embedded static assets
hero_voice_server (Unix socket)
├── rpc.health → {"status":"ok"}
├── rpc.discover → OpenRPC spec
└── domain methods (folder.*, topic.*, voiceservice.*)
API
JSON-RPC Endpoint
All data operations use JSON-RPC 2.0 via the /rpc proxy on the UI socket.
Auto-generated CRUD (Topic and Folder root objects):
- topic.new, topic.get, topic.set, topic.delete, topic.list
- folder.new, folder.get, folder.set, folder.delete, folder.list
Custom service methods (VoiceService):
- voiceservice.create_topic / voiceservice.create_folder
- voiceservice.rename_topic / voiceservice.rename_folder
- voiceservice.move_topic / voiceservice.move_folder
- voiceservice.delete_topic / voiceservice.delete_folder
- voiceservice.save_content / voiceservice.transform_content
- voiceservice.register_audio / voiceservice.delete_audio
- voiceservice.reset_topic / voiceservice.get_audio_path
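For readers who want to try the /rpc proxy without the generated SDK, the following sketch sends a JSON-RPC 2.0 call over the UI Unix socket using only the standard library. It assumes the proxy accepts an HTTP POST with a JSON body and that rpc.health is reachable through it; `jsonrpc_body` and `http_post` are hypothetical helpers, and parameter encoding for the domain methods is not shown because this README does not specify it.

```rust
use std::io::{Read, Write};
use std::os::unix::net::UnixStream;

/// Build a JSON-RPC 2.0 request body for a method that takes no parameters.
/// The method name is inserted verbatim, so it must not need JSON escaping.
fn jsonrpc_body(method: &str, id: u64) -> String {
    format!(r#"{{"jsonrpc":"2.0","method":"{method}","id":{id}}}"#)
}

/// Wrap a body in a minimal HTTP/1.1 POST request.
fn http_post(path: &str, body: &str) -> String {
    format!(
        "POST {path} HTTP/1.1\r\nHost: localhost\r\nContent-Type: application/json\r\nContent-Length: {}\r\nConnection: close\r\n\r\n{body}",
        body.len()
    )
}

fn main() -> std::io::Result<()> {
    let home = std::env::var("HOME").unwrap_or_default();
    // Socket path taken from the Sockets table.
    let socket = format!("{home}/hero/var/sockets/hero_voice/web.sock");
    let request = http_post("/rpc", &jsonrpc_body("rpc.health", 1));

    match UnixStream::connect(&socket) {
        Ok(mut stream) => {
            stream.write_all(request.as_bytes())?;
            // "Connection: close" lets us read until the server hangs up.
            let mut response = String::new();
            stream.read_to_string(&mut response)?;
            println!("{response}");
        }
        Err(err) => eprintln!("could not connect to {socket}: {err}"),
    }
    Ok(())
}
```

In real use, the generated client in hero_voice_sdk is likely the better entry point; the raw request is shown only to make the wire format visible.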
WebSocket
GET /ws - Audio streaming endpoint
Client to Server:
{ "type": "start", "topic": "optional-topic-sid", "audio_dir": "optional-dir" }
{ "type": "stop" }
Plus binary audio data (16-bit PCM, 16kHz, mono, little-endian)
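The binary frames are raw samples, so a client capturing floating-point audio has to convert it first. The sketch below shows one way to produce 16-bit little-endian PCM from f32 samples in the range [-1.0, 1.0], plus the resulting byte rate at 16 kHz mono. The function names are hypothetical, and scaling by `i16::MAX` is one common convention rather than something this README prescribes.

```rust
/// Sample rate expected by the server, in Hz.
const SAMPLE_RATE: usize = 16_000;
/// Bytes per 16-bit mono sample.
const BYTES_PER_SAMPLE: usize = 2;

/// Convert f32 samples in [-1.0, 1.0] to 16-bit little-endian PCM bytes.
/// Out-of-range input is clamped instead of wrapping around.
fn encode_pcm16_le(samples: &[f32]) -> Vec<u8> {
    let mut out = Vec::with_capacity(samples.len() * BYTES_PER_SAMPLE);
    for &sample in samples {
        let clamped = sample.clamp(-1.0, 1.0);
        let value = (clamped * i16::MAX as f32).round() as i16;
        out.extend_from_slice(&value.to_le_bytes());
    }
    out
}

/// Number of payload bytes produced by `millis` milliseconds of audio.
fn bytes_for_millis(millis: usize) -> usize {
    millis * SAMPLE_RATE / 1000 * BYTES_PER_SAMPLE
}

fn main() {
    let frame = encode_pcm16_le(&[0.0, 0.5, -0.5, 1.0]);
    println!("{} bytes: {:?}", frame.len(), frame);
    println!("one second of audio is {} bytes", bytes_for_millis(1000));
}
```

At this format one second of audio is 32,000 bytes, so a 512-sample chunk occupies 1,024 bytes and covers 32 ms.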
Server to Client:
{ "type": "transcription", "text": "...", "is_final": true }
{ "type": "status", "message": "..." }
{ "type": "error", "message": "..." }
Static Files
- GET /files/audio/{filename} - Audio file downloads
- GET /files/transforms/{filename} - Transform file downloads
Audio Processing
- Sample rate: 16kHz (required for Silero VAD)
- Chunk size: 512 samples for VAD analysis
- Silence threshold: 350ms triggers transcription
- Speech threshold: 0.20 probability
- Maximum buffer: 30 seconds before forced transcription
- Compression: OGG Vorbis at quality 0.4 (~10% of WAV size)
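The parameters above can be read as a small state machine: each 512-sample chunk gets a speech probability from the VAD, and the buffer is flushed to transcription after 350 ms of trailing silence or once 30 seconds have accumulated. The sketch below illustrates that logic only. The `Segmenter` type is hypothetical, the Silero model itself is not included, and details such as skipping leading silence and using `>=` for the threshold comparison are assumptions.

```rust
const SAMPLE_RATE: usize = 16_000;
const CHUNK_SAMPLES: usize = 512;
const SPEECH_THRESHOLD: f32 = 0.20;
const SILENCE_MS: usize = 350;
const MAX_BUFFER_SECS: usize = 30;

#[derive(Debug, PartialEq)]
enum Action {
    /// Keep buffering audio.
    Continue,
    /// Send the buffered audio to transcription and start over.
    Flush,
}

/// Tracks how much audio is buffered and how long the trailing silence is.
struct Segmenter {
    buffered_samples: usize,
    silent_samples: usize,
    heard_speech: bool,
}

impl Segmenter {
    fn new() -> Self {
        Segmenter { buffered_samples: 0, silent_samples: 0, heard_speech: false }
    }

    /// Feed the VAD speech probability for one 512-sample chunk.
    fn push_chunk(&mut self, speech_prob: f32) -> Action {
        if speech_prob >= SPEECH_THRESHOLD {
            self.heard_speech = true;
            self.silent_samples = 0;
        } else if self.heard_speech {
            self.silent_samples += CHUNK_SAMPLES;
        } else {
            // Assumption: leading silence is not worth buffering.
            return Action::Continue;
        }
        self.buffered_samples += CHUNK_SAMPLES;

        let silence_limit = SILENCE_MS * SAMPLE_RATE / 1000; // 5,600 samples
        let max_samples = MAX_BUFFER_SECS * SAMPLE_RATE; // 480,000 samples
        if self.silent_samples >= silence_limit || self.buffered_samples >= max_samples {
            *self = Segmenter::new();
            return Action::Flush;
        }
        Action::Continue
    }
}

fn main() {
    // Fake probabilities: three speech chunks followed by a long pause.
    let probs = std::iter::repeat(0.9).take(3).chain(std::iter::repeat(0.05).take(20));
    let mut segmenter = Segmenter::new();
    for (index, prob) in probs.enumerate() {
        if segmenter.push_chunk(prob) == Action::Flush {
            println!("flush after chunk {index}");
        }
    }
}
```

Because a chunk lasts 32 ms, the 350 ms threshold is crossed on the eleventh consecutive silent chunk (352 ms), and the 30-second cap is reached on the 938th buffered chunk.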
Browser Support
- Chrome 120+
- Firefox 120+
- Safari 17+
- Edge 120+
Requires microphone permission and WebSocket support.
Embedding & CORS
Hero Voice allows iframe embedding (no X-Frame-Options restrictions), cross-origin API calls, and WebSocket connections from any origin.
License
Apache-2.0