add voice #2

Closed
opened 2026-03-25 05:51:04 +00:00 by despiegk · 4 comments
Owner


add a voice recorder to it

use the ai_client sdk from herolib

convert speech to text using the fast voice-to-text model: whisper_large_v3_turbo

use the mercury2 model to clean up the raw transcription into usable text, without changing the meaning or structure much; the cleanup is only to make it easier to edit

then in the editor, add a "Convert to Instructions" button, which uses Gemini3_1FlashLitePreview
make a clear prompt for how to do this

Explain we need clean instructions for a coding agent

Author
Owner

Implementation Spec for Issue #2 — Add Voice

Objective

Add a browser-based voice recorder to the slide editor that captures audio, sends it to the server for a two-stage AI pipeline (Whisper transcription → Mercury2 cleanup), and inserts the cleaned transcription into the slide markdown editor. A standalone "Convert to Instructions" button is added to the editor toolbar, which uses Gemini3_1FlashLitePreview to convert slide text into structured coding-agent instructions.


Requirements

  • A microphone/record button appears in the editor toolbar.
  • The browser records audio using the Web MediaRecorder API and sends raw audio bytes (base64-encoded) to the server via a new RPC method voice.transcribe.
  • The server pipeline runs two sequential AI steps:
    1. whisper_large_v3_turbo — raw speech-to-text.
    2. Mercury2 — light cleanup (fix filler words, punctuation, fragmented sentences) without changing meaning or structure.
  • The cleaned transcription is appended to the editor textarea.
  • A "Convert to Instructions" button in the editor toolbar triggers slide.toInstructions RPC, which uses Gemini3_1FlashLitePreview to convert current textarea text into a structured list of coding-agent instructions.
  • All AI calls go through herolib_ai::AiClient (from herolib).
  • Two new JSON-RPC methods are added: voice.transcribe and slide.toInstructions.
  • openrpc.json is updated to document both methods.
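To make the wire format concrete, the client-side call might be built as follows. The JSON-RPC 2.0 envelope and the audio_data/filename parameter names follow the spec above, but the project's existing rpc() helper may shape the request differently, so treat this as a sketch:

```javascript
// Build a JSON-RPC 2.0 request body for the new voice.transcribe method.
// The envelope shape is an assumption based on the existing /rpc endpoint;
// audio_data and filename come from the Step 1 parameter list below.
function rpcPayload(method, params, id = 1) {
  return JSON.stringify({ jsonrpc: '2.0', id, method, params });
}

const body = rpcPayload('voice.transcribe', {
  audio_data: '<base64-encoded audio bytes>',
  filename: 'recording.webm',
});
```

The same helper shape would serve slide.toInstructions with a `{ text: ... }` params object.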

Files to Modify / Create

Server — crates/hero_slides_server/

| File | Action | Purpose |
|---|---|---|
| src/voice.rs | Create | handle_voice_transcribe — receives base64 audio, calls Whisper + Mercury2, returns cleaned text |
| src/instructions.rs | Create | handle_slide_to_instructions — receives text, calls Gemini3_1FlashLitePreview with a coding-agent prompt |
| src/rpc.rs | Modify | Add match arms for voice.transcribe and slide.toInstructions |
| src/main.rs | Modify | Declare mod voice and mod instructions |
| Cargo.toml | Modify | Add base64 = "0.22" dependency |
| openrpc.json | Modify | Add descriptors for both new methods |

UI — crates/hero_slides_ui/

| File | Action | Purpose |
|---|---|---|
| templates/index.html | Modify | Add mic, stop, and "Convert to Instructions" buttons to the editor toolbar |
| static/js/dashboard.js | Modify | Add the voice recording state machine and a convertToInstructions() function |

Implementation Plan

Step 1 — Create src/voice.rs in the server

Files: crates/hero_slides_server/src/voice.rs, crates/hero_slides_server/Cargo.toml

  • Add base64 = "0.22" to Cargo.toml.
  • Create handle_voice_transcribe(params: &serde_json::Value) -> Result<serde_json::Value, String>.
  • Extract audio_data (base64 string) and filename (default "recording.webm") from params.
  • Decode base64 bytes using base64::engine::general_purpose::STANDARD.decode(...).
  • Create AiClient::from_env().
  • Call client.transcribe_bytes(TranscriptionModel::WhisperLargeV3Turbo, &audio_bytes, filename, TranscriptionOptions::new()) — wrap in tokio::task::spawn_blocking since herolib_ai is sync.
  • Take response.text as raw transcription.
  • Call Mercury2 cleanup:
    system: "You are a transcription cleanup assistant. Fix filler words, punctuation errors, and fragmented sentences in the following voice transcription. Do NOT change the meaning, add new ideas, or restructure the content. Output only the cleaned transcription text with no commentary."
    user: <raw_text>
    model: Model::Mercury2
    
    Also wrapped in spawn_blocking.
  • Return json!({ "raw": raw_text, "cleaned": cleaned_text }).
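On success the handler's JSON result would carry both stages, for example (the sample text is invented purely for illustration of the Mercury2 cleanup):

```json
{
  "raw": "so um we need to uh add a voice recorder to the editor",
  "cleaned": "We need to add a voice recorder to the editor."
}
```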

Dependencies: none (Step 1 is independent)

Step 2 — Create src/instructions.rs in the server

Files: crates/hero_slides_server/src/instructions.rs

  • Create handle_slide_to_instructions(params: &serde_json::Value) -> Result<serde_json::Value, String>.
  • Extract text: &str from params.
  • Call AiClient::from_env().
  • Use Gemini3_1FlashLitePreview with this system prompt:
You are converting spoken notes or rough text into clean, structured instructions
for a coding agent (such as Claude Code or a similar AI programmer).

Rules:
- Produce a numbered or bulleted list of actionable instructions.
- Each instruction must be self-contained and unambiguous enough for a coding agent to execute.
- Do NOT add extra explanations, apologies, or meta-commentary.
- Preserve all technical terms, file names, function names, and identifiers exactly as given.
- If the input contains a question, convert it into an instruction ("Implement X so that Y").
- Output only the final instruction list — no preamble, no closing remarks.
  • Return json!({ "instructions": result_text }).
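An illustrative response shape (the instruction text is a made-up example of what the Gemini prompt above would produce, not real model output):

```json
{
  "instructions": "1. Add a microphone button to the editor toolbar.\n2. Send the recorded audio to the voice.transcribe RPC method."
}
```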

Dependencies: none (Step 2 is independent from Step 1)

Step 3 — Wire handlers into src/rpc.rs and src/main.rs

Files: crates/hero_slides_server/src/rpc.rs, crates/hero_slides_server/src/main.rs

  • Add mod voice; and mod instructions; in main.rs.
  • Add imports and match arms in rpc.rs:
    "voice.transcribe"      => voice::handle_voice_transcribe(&req.params).await,
    "slide.toInstructions"  => instructions::handle_slide_to_instructions(&req.params).await,
    

Dependencies: Steps 1 and 2

Step 4 — Update openrpc.json

Files: crates/hero_slides_server/openrpc.json

Add two new method objects to the "methods" array consistent with existing style.
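For example, a descriptor for voice.transcribe could take the following shape; the schema details are assumptions and should be adapted to the conventions already used in openrpc.json:

```json
{
  "name": "voice.transcribe",
  "summary": "Transcribe recorded audio and return raw and cleaned text",
  "params": [
    { "name": "audio_data", "schema": { "type": "string", "description": "Base64-encoded audio bytes" } },
    { "name": "filename", "schema": { "type": "string", "description": "Original filename, e.g. recording.webm" } }
  ],
  "result": {
    "name": "transcription",
    "schema": {
      "type": "object",
      "properties": {
        "raw": { "type": "string" },
        "cleaned": { "type": "string" }
      }
    }
  }
}
```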

Dependencies: none (independent)

Step 5 — Add voice recording UI in templates/index.html

Files: crates/hero_slides_ui/templates/index.html

Inside .editor-actions in the editor overlay toolbar, add:

  • #btn-record-start — mic button (bootstrap icon bi-mic)
  • #btn-record-stop — stop button (shown only while recording, bi-stop-circle)
  • #btn-to-instructions — "To Instructions" button (bi-magic)
  • #voice-status — small status text area below the editor pane
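A minimal sketch of the markup, assuming Bootstrap button classes and Bootstrap Icons consistent with the rest of the template (only the IDs and icon names come from the list above; the button classes and titles are assumptions):

```html
<button id="btn-record-start" class="btn btn-sm btn-outline-secondary" title="Record voice note">
  <i class="bi bi-mic"></i>
</button>
<button id="btn-record-stop" class="btn btn-sm btn-outline-danger" hidden title="Stop recording">
  <i class="bi bi-stop-circle"></i>
</button>
<button id="btn-to-instructions" class="btn btn-sm btn-outline-primary" title="Convert to instructions">
  <i class="bi bi-magic"></i> To Instructions
</button>
<small id="voice-status" class="text-muted"></small>
```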

Dependencies: none (independent)

Step 6 — Add voice JS logic in static/js/dashboard.js

Files: crates/hero_slides_ui/static/js/dashboard.js

Add:

  • startRecording() — requests mic, creates MediaRecorder, starts recording
  • stopRecording() — stops recorder (triggers sendAudioToServer() via stop event)
  • sendAudioToServer() — converts audio blob to base64, calls rpc('voice.transcribe', ...), appends cleaned text to textarea, updates preview
  • convertToInstructions() — reads textarea, calls rpc('slide.toInstructions', ...), replaces textarea content
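The trickiest part of sendAudioToServer() is converting the recorded Blob to base64 without overflowing the call stack on long recordings. One chunked approach (the function name matches the flow above, but the chunking strategy is an implementation suggestion, not prescribed by the spec):

```javascript
// Convert a recorded audio Blob to a base64 string in chunks, so large
// recordings don't overflow String.fromCharCode's argument list.
// Works in browsers and in Node >= 18 (Blob and btoa are global).
async function blobToBase64(blob) {
  const bytes = new Uint8Array(await blob.arrayBuffer());
  let binary = '';
  const CHUNK = 0x8000; // 32 KiB of bytes per fromCharCode call
  for (let i = 0; i < bytes.length; i += CHUNK) {
    binary += String.fromCharCode(...bytes.subarray(i, i + CHUNK));
  }
  return btoa(binary);
}
```

sendAudioToServer() would then pass the result as audio_data to rpc('voice.transcribe', ...).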

Dependencies: Step 5 (buttons must exist in HTML)


Acceptance Criteria

  • Server compiles without warnings.
  • voice.transcribe RPC returns raw and cleaned text fields.
  • slide.toInstructions RPC returns structured instructions text.
  • Mercury2 system prompt explicitly forbids meaning/structure changes.
  • Gemini system prompt explicitly frames task as coding-agent instructions.
  • Mic button visible in editor toolbar.
  • Recording toggles buttons correctly (start/stop).
  • Cleaned transcription is appended to the textarea after recording stops.
  • "Convert to Instructions" replaces textarea content with Gemini result.
  • openrpc.json documents both new methods.
  • No existing RPC methods are broken.
  • Errors surface as toast notifications.

Notes

  • herolib_ai is sync (ureq based) — all AI calls must be wrapped in tokio::task::spawn_blocking in async Axum handlers.
  • Audio format: Chrome records audio/webm;codecs=opus, Firefox records audio/ogg;codecs=opus. Both are accepted by Groq's Whisper endpoint. Pass filename with correct extension from browser.
  • base64: Check if already pulled in transitively; if so, pin to same version.
  • Mercury2 and Gemini are routed via OpenRouter — requires OPENROUTER_API_KEY.
  • No new Axum routes needed — both features dispatch through the existing /rpc POST endpoint.
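Since the audio-format note above says the filename extension must match the browser's recording container, a small helper can derive it from MediaRecorder.mimeType (the helper name is an assumption; only the two mime strings come from the note):

```javascript
// Map the MediaRecorder mime type to a filename whose extension matches
// the recorded container: webm on Chrome, ogg on Firefox.
function recordingFilename(mimeType) {
  if (mimeType.startsWith('audio/ogg')) return 'recording.ogg';
  // Default to webm, matching the spec's fallback filename.
  return 'recording.webm';
}
```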
Author
Owner

Build & Test Results

Build: Success
Tests: 12 passed / 0 failed

Details

  • All workspace crates compiled successfully (hero_slides, hero_slides_sdk, hero_slides_server, hero_slides_ui, hero_slides_examples)
  • 12 unit tests passed in hero_slides_server:
    • generator: 3 tests (prompt building)
    • parser: 3 tests (front matter / intent parsing)
    • discovery: 3 tests (slide dir / theme discovery)
    • hashing: 3 tests (hash consistency, metadata roundtrip, theme change detection)
  • 0 tests failed

Warnings (non-breaking)

  • hero_slides_ui: LogBroadcast struct and its new/send methods are defined but never used
  • hero_slides_examples: unused variable spec in basic_usage.rs

Build time: ~52s (cold), ~7s (incremental for tests)

Author
Owner

Implementation Complete

Files Created

  • crates/hero_slides_server/src/voice.rs — handle_voice_transcribe: receives base64 audio, runs Whisper Large V3 Turbo transcription, then Mercury2 cleanup (preserving meaning/structure). Returns { raw, cleaned }.
  • crates/hero_slides_server/src/instructions.rs — handle_slide_to_instructions: converts text to structured coding-agent instructions using Gemini 3.1 Flash Lite Preview. Returns { instructions }.

Files Modified

  • crates/hero_slides_server/src/rpc.rs — Added dispatch for voice.transcribe and slide.toInstructions
  • crates/hero_slides_server/src/main.rs — Declared mod voice and mod instructions
  • crates/hero_slides_server/Cargo.toml — Added base64 = "0.22" dependency
  • crates/hero_slides_server/openrpc.json — Added full method descriptors for both new RPC methods
  • crates/hero_slides_ui/templates/index.html — Added mic button, stop button, "To Instructions" button, and voice status indicator to editor toolbar
  • crates/hero_slides_ui/static/js/dashboard.js — Added startRecording(), stopRecording(), sendAudioToServer(), convertToInstructions() functions

AI Pipeline

  1. Record → Browser MediaRecorder captures audio (webm/ogg)
  2. Transcribe → Server calls whisper_large_v3_turbo via Groq
  3. Cleanup → Server calls Mercury2 to fix filler words/punctuation without changing meaning
  4. Insert → Cleaned text appended to slide editor textarea

"Convert to Instructions" Button

  • Takes current textarea content
  • Sends to Gemini3_1FlashLitePreview with a coding-agent-specific prompt
  • Replaces textarea with a structured, numbered instruction list

Test Results

  • Build: All 5 crates compiled successfully
  • Tests: 12/12 passed (0 failures)

Author
Owner

Implementation committed: 7ac2996

Browse: https://forge.ourworld.tf/lhumina_code/hero_slides/commit/7ac2996
Reference
lhumina_code/hero_slides#2