add voice #2
Reference: `lhumina_code/hero_slides#2`
- add a voice recorder to it
- use the ai_client SDK from herolib
- convert speech to text with the fast model `whisper_large_v3_turbo`
- use the `mercury2` model to clean up the raw transcription into usable text — cleanup only, to make it easier to edit, without changing its meaning or structure much
- then in the editor, add a "Convert to Instructions" button, which uses `Gemini3_1FlashLitePreview`
- write a clear prompt for that step, explaining that we need clean instructions for a coding agent
## Implementation Spec for Issue #2 — Add Voice

### Objective

Add a browser-based voice recorder to the slide editor that captures audio, sends it to the server for a two-stage AI pipeline (Whisper transcription → Mercury2 cleanup), and inserts the cleaned transcription into the slide markdown editor. A standalone "Convert to Instructions" button is added to the editor toolbar, which uses Gemini3_1FlashLitePreview to convert slide text into structured coding-agent instructions.

### Requirements

- `voice.transcribe` RPC: `whisper_large_v3_turbo` for raw speech-to-text, then `Mercury2` for light cleanup (fix filler words, punctuation, fragmented sentences) without changing meaning or structure.
- `slide.toInstructions` RPC, which uses `Gemini3_1FlashLitePreview` to convert the current textarea text into a structured list of coding-agent instructions.
- Use `herolib_ai::AiClient` (from herolib).
- Expose `voice.transcribe` and `slide.toInstructions`.
- `openrpc.json` is updated to document both methods.

### Files to Modify / Create
**Server** — `crates/hero_slides_server/`

- `src/voice.rs` — `handle_voice_transcribe`: receives base64 audio, calls Whisper + Mercury2, returns cleaned text
- `src/instructions.rs` — `handle_slide_to_instructions`: receives text, calls Gemini3_1FlashLitePreview with the coding-agent prompt
- `src/rpc.rs` — dispatch `voice.transcribe` and `slide.toInstructions`
- `src/main.rs` — declare `mod voice` and `mod instructions`
- `Cargo.toml` — add the `base64 = "0.22"` dependency
- `openrpc.json` — document both methods

**UI** — `crates/hero_slides_ui/`

- `templates/index.html`
- `static/js/dashboard.js` — `convertToInstructions()` function

### Implementation Plan
#### Step 1 — Create `src/voice.rs` in the server

Files: `crates/hero_slides_server/src/voice.rs`, `crates/hero_slides_server/Cargo.toml`

- Add `base64 = "0.22"` to Cargo.toml.
- Implement `handle_voice_transcribe(params: &serde_json::Value) -> Result<serde_json::Value, String>`.
- Extract `audio_data` (base64 string) and `filename` (default `"recording.webm"`) from params.
- Decode with `base64::engine::general_purpose::STANDARD.decode(...)`.
- Create the client with `AiClient::from_env()`.
- Call `client.transcribe_bytes(TranscriptionModel::WhisperLargeV3Turbo, &audio_bytes, filename, TranscriptionOptions::new())` — wrap it in `tokio::task::spawn_blocking` since `herolib_ai` is sync.
- Use `response.text` as the raw transcription.
- Run the Mercury2 cleanup call, also inside `spawn_blocking`.
- Return `json!({ "raw": raw_text, "cleaned": cleaned_text })`.

Dependencies: none (Step 1 is independent)
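The two-stage flow in Step 1 can be sketched with a local stub standing in for the real SDK — the `AiClient` below and its `transcribe`/`cleanup` methods are hypothetical placeholders, not the actual `herolib_ai` API:

```rust
// Sketch of the Step 1 pipeline. This AiClient is a local stub: the real
// herolib_ai client, method names, and signatures differ.
struct AiClient;

impl AiClient {
    // Stand-in for the whisper_large_v3_turbo transcription call.
    fn transcribe(&self, _audio: &[u8], _filename: &str) -> String {
        "um, so, add a, uh, voice recorder to the editor".to_string()
    }
    // Stand-in for the Mercury2 cleanup call: strip fillers, keep meaning.
    fn cleanup(&self, raw: &str) -> String {
        raw.replace("um, ", "").replace("uh, ", "")
    }
}

// Raw transcription first, then light cleanup; both results are returned,
// mirroring the { "raw": ..., "cleaned": ... } JSON response of the handler.
fn transcribe_and_clean(client: &AiClient, audio: &[u8], filename: &str) -> (String, String) {
    let raw = client.transcribe(audio, filename);
    let cleaned = client.cleanup(&raw);
    (raw, cleaned)
}
```

In the real handler both stubbed calls are blocking network requests, which is why the spec wraps them in `tokio::task::spawn_blocking`.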
#### Step 2 — Create `src/instructions.rs` in the server

Files: `crates/hero_slides_server/src/instructions.rs`

- Implement `handle_slide_to_instructions(params: &serde_json::Value) -> Result<serde_json::Value, String>`.
- Extract `text: &str` from params.
- Create the client with `AiClient::from_env()`.
- Call `Gemini3_1FlashLitePreview` with the coding-agent system prompt.
- Return `json!({ "instructions": result_text })`.

Dependencies: none (Step 2 is independent of Step 1)
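The exact system prompt is not reproduced in this issue, so the constant below is one plausible phrasing, not the shipped prompt; `build_messages` sketches how the handler could assemble the request for the model:

```rust
// Hypothetical system prompt for slide.toInstructions — illustrative only,
// the actual prompt used by the implementation is not shown in the issue.
const TO_INSTRUCTIONS_PROMPT: &str = "\
You convert free-form slide text into clean, numbered instructions for a coding agent.\n\
Rules:\n\
- One concrete action per instruction.\n\
- Keep file names, function names, and model names exactly as written.\n\
- Do not invent requirements that are not in the input.\n\
- Output only the numbered list, no commentary.";

// Assemble (role, content) chat messages for the Gemini3_1FlashLitePreview call.
fn build_messages(slide_text: &str) -> Vec<(String, String)> {
    vec![
        ("system".to_string(), TO_INSTRUCTIONS_PROMPT.to_string()),
        ("user".to_string(), slide_text.to_string()),
    ]
}
```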
#### Step 3 — Wire handlers into `src/rpc.rs` and `src/main.rs`

Files: `crates/hero_slides_server/src/rpc.rs`, `crates/hero_slides_server/src/main.rs`

- Declare `mod voice;` and `mod instructions;` in `main.rs`.
- Add dispatch arms for the two new methods in `rpc.rs`.

Dependencies: Steps 1 and 2
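The `rpc.rs` wiring amounts to two extra match arms. A minimal dependency-free sketch — the real handlers take `serde_json::Value` params, while the `String`-based signatures here are simplified for illustration:

```rust
// Simplified stand-ins for the Step 1 and Step 2 handlers.
fn handle_voice_transcribe(params: &str) -> Result<String, String> {
    Ok(format!("transcribed: {params}"))
}

fn handle_slide_to_instructions(params: &str) -> Result<String, String> {
    Ok(format!("instructions for: {params}"))
}

// The two new arms sit alongside the existing methods; unknown names error out.
fn dispatch(method: &str, params: &str) -> Result<String, String> {
    match method {
        "voice.transcribe" => handle_voice_transcribe(params),
        "slide.toInstructions" => handle_slide_to_instructions(params),
        other => Err(format!("unknown method: {other}")),
    }
}
```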
#### Step 4 — Update `openrpc.json`

Files: `crates/hero_slides_server/openrpc.json`

- Add two new method objects to the `"methods"` array, consistent with the existing style.

Dependencies: none (independent)
#### Step 5 — Add voice recording UI in `templates/index.html`

Files: `crates/hero_slides_ui/templates/index.html`

Inside `.editor-actions` in the editor overlay toolbar, add:

- `#btn-record-start` — mic button (bootstrap icon `bi-mic`)
- `#btn-record-stop` — stop button (shown only while recording, `bi-stop-circle`)
- `#btn-to-instructions` — "To Instructions" button (`bi-magic`)
- `#voice-status` — small status text area below the editor pane

Dependencies: none (independent)
#### Step 6 — Add voice JS logic in `static/js/dashboard.js`

Files: `crates/hero_slides_ui/static/js/dashboard.js`

Add:

- `startRecording()` — requests the mic, creates a `MediaRecorder`, starts recording
- `stopRecording()` — stops the recorder (triggers `sendAudioToServer()` via the `stop` event)
- `sendAudioToServer()` — converts the audio blob to base64, calls `rpc('voice.transcribe', ...)`, appends the cleaned text to the textarea, updates the preview
- `convertToInstructions()` — reads the textarea, calls `rpc('slide.toInstructions', ...)`, replaces the textarea content

Dependencies: Step 5 (buttons must exist in the HTML)
### Acceptance Criteria

- The `voice.transcribe` RPC returns `raw` and `cleaned` text fields.
- The `slide.toInstructions` RPC returns structured `instructions` text.
- `openrpc.json` documents both new methods.

### Notes

- `herolib_ai` is synchronous (`ureq`-based) — all AI calls must be wrapped in `tokio::task::spawn_blocking` in async Axum handlers.
- Chrome records `audio/webm;codecs=opus`, Firefox records `audio/ogg;codecs=opus`. Both are accepted by Groq's Whisper endpoint. Pass a `filename` with the correct extension from the browser.
- The AI calls require the `OPENROUTER_API_KEY` environment variable.
- The UI talks to the server's `/rpc` POST endpoint.
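The browser-format note above implies the server-side filename must track the recorder's MIME type. A hypothetical helper (`filename_for_mime` is not part of the spec, just an illustration of the mapping):

```rust
// Hypothetical helper: derive the upload filename from the MediaRecorder
// mimeType so the Whisper endpoint sees a matching file extension.
fn filename_for_mime(mime: &str) -> &'static str {
    if mime.starts_with("audio/ogg") {
        "recording.ogg" // Firefox: audio/ogg;codecs=opus
    } else {
        "recording.webm" // Chrome (and the spec's default): audio/webm;codecs=opus
    }
}
```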
## Build & Test Results

- Build: ✅ Success
- Tests: ✅ 12 passed / 0 failed

### Details

- All crates build (`hero_slides`, `hero_slides_sdk`, `hero_slides_server`, `hero_slides_ui`, `hero_slides_examples`)
- `hero_slides_server`:
  - generator: 3 tests (prompt building)
  - parser: 3 tests (front matter / intent parsing)
  - discovery: 3 tests (slide dir / theme discovery)
  - hashing: 3 tests (hash consistency, metadata roundtrip, theme change detection)

### Warnings (non-breaking)

- `hero_slides_ui`: the `LogBroadcast` struct and its `new`/`send` methods are defined but never used
- `hero_slides_examples`: unused variable `spec` in `basic_usage.rs`

Build time: ~52s (cold), ~7s (incremental for tests)
## Implementation Complete ✅

### Files Created

- `crates/hero_slides_server/src/voice.rs` — `handle_voice_transcribe`: receives base64 audio, runs Whisper Large V3 Turbo transcription, then Mercury2 cleanup (preserving meaning/structure). Returns `{ raw, cleaned }`.
- `crates/hero_slides_server/src/instructions.rs` — `handle_slide_to_instructions`: converts text to structured coding-agent instructions using Gemini 3.1 Flash Lite Preview. Returns `{ instructions }`.

### Files Modified

- `crates/hero_slides_server/src/rpc.rs` — added dispatch for `voice.transcribe` and `slide.toInstructions`
- `crates/hero_slides_server/src/main.rs` — declared `mod voice` and `mod instructions`
- `crates/hero_slides_server/Cargo.toml` — added the `base64 = "0.22"` dependency
- `crates/hero_slides_server/openrpc.json` — added full method descriptors for both new RPC methods
- `crates/hero_slides_ui/templates/index.html` — added the mic button, stop button, "To Instructions" button, and a voice status indicator to the editor toolbar
- `crates/hero_slides_ui/static/js/dashboard.js` — added the `startRecording()`, `stopRecording()`, `sendAudioToServer()`, and `convertToInstructions()` functions

### AI Pipeline

- Transcription: `whisper_large_v3_turbo` via Groq
- Cleanup: `Mercury2` fixes filler words/punctuation without changing meaning

### "Convert to Instructions" Button

- Uses `Gemini3_1FlashLitePreview` with a coding-agent-specific prompt

### Test Results

Implementation committed: `7ac2996`
Browse: `7ac2996`