Voice transcription is messy. Even the best models like Whisper faithfully reproduce every “um”, “uh”, and rambling run-on sentence. That’s correct behavior for transcription, but not what you want when texting someone.
I added a “polish mode” to my macOS speech-to-text app that optionally sends Whisper’s output through an LLM to clean it up. The interaction model: hold Fn to record, tap Ctrl anytime during recording to enable polish, release to transcribe and paste.
The Modifier Key Challenge
The obvious approach - requiring Ctrl to be held at the same time as Fn - felt clunky in testing. You’d have to coordinate two fingers before speaking, and the physical position is awkward.
A “latch” pattern works better: pressing Ctrl at any point while Fn is held latches the polish flag. You can press Ctrl before you start speaking, while speaking, or just before releasing Fn. The flag resets when you start a new recording.
let ctrl_latched = Arc::new(AtomicBool::new(false));

// In the event tap callback:
if key_pressed && !prev_pressed {
    // Recording started - reset latch
    ctrl_latched.store(false, Ordering::SeqCst);
    start_recording(&state);
} else if !key_pressed && prev_pressed {
    // Recording stopped - check if Ctrl was ever pressed
    let polish = ctrl_latched.load(Ordering::SeqCst);
    stop_recording(&state, polish);
}

// Latch Ctrl if pressed anytime during recording
if key_pressed && ctrl_pressed {
    ctrl_latched.store(true, Ordering::SeqCst);
}
The macOS CGEventFlags expose modifier state as bitmasks. Control is 0x40000:
const CONTROL_KEY_FLAG: u64 = 0x40000;
let flags = event.get_flags().bits();
let ctrl_pressed = (flags & CONTROL_KEY_FLAG) != 0;
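For completeness: the Fn key that drives key_pressed lives in the same flags field (its mask is 0x800000, kCGEventFlagMaskSecondaryFn). Assuming it’s read the same way as Ctrl, the check mirrors the one above:

const FN_KEY_FLAG: u64 = 0x800000; // "secondary fn" modifier bit

// Assumption: the Fn state is read off the same flags bitmask as Ctrl.
let key_pressed = (flags & FN_KEY_FLAG) != 0;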
The Polish Function
The polish step is a straightforward LLM API call. I’m using Groq’s hosted llama-3.3-70b-versatile because I’m already using Groq for Whisper transcription - one API key, one vendor.
use std::time::Duration;

fn polish_text(text: &str, api_key: &str) -> Option<String> {
    let client = reqwest::blocking::Client::new();

    let body = serde_json::json!({
        "model": "llama-3.3-70b-versatile",
        "messages": [
            {
                "role": "system",
                "content": "Clean up this voice message for texting. Remove filler words (um, uh, like, you know). Fix punctuation and sentence structure. Break up run-on sentences. Keep it casual. No trailing period. Output ONLY the cleaned text - no explanations, no quotes."
            },
            {
                "role": "user",
                "content": text
            }
        ],
        "temperature": 0.2
    });

    let response = client
        .post("https://api.groq.com/openai/v1/chat/completions")
        .header("Authorization", format!("Bearer {}", api_key))
        .header("Content-Type", "application/json")
        .json(&body)
        .timeout(Duration::from_secs(30))
        .send()
        .ok()?;

    if !response.status().is_success() {
        return None;
    }

    let chat_response: ChatResponse = response.json().ok()?;
    chat_response.choices.first().map(|c| c.message.content.clone())
}
The function returns Option<String> - this matters for the fallback logic.
Parsing the Response
Groq uses the OpenAI-compatible chat completions format. The response structure:
#[derive(serde::Deserialize)]
struct ChatResponse {
    choices: Vec<ChatChoice>,
}

#[derive(serde::Deserialize)]
struct ChatChoice {
    message: ChatMessage,
}

#[derive(serde::Deserialize)]
struct ChatMessage {
    content: String,
}
Using serde to parse into typed structs catches malformed responses at parse time rather than panicking on field access later.
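As a quick illustration (the JSON bodies below are made up for the example), a response that doesn’t match the expected shape fails at the json() call and becomes None via ok()?, instead of panicking later when a field is accessed:

// Well-formed response: parses cleanly into the typed structs.
let good = r#"{"choices":[{"message":{"content":"hey, running about 10 min late"}}]}"#;
assert!(serde_json::from_str::<ChatResponse>(good).is_ok());

// An error payload with no `choices` field is a parse error, not a panic later.
let bad = r#"{"error":{"message":"rate limit exceeded"}}"#;
assert!(serde_json::from_str::<ChatResponse>(bad).is_err());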
Prompt Engineering Lessons
The system prompt went through several iterations:
First attempt: “Clean up this transcription.”
Problem: The LLM would respond conversationally. “Sure! Here’s the cleaned up version: …”
Second attempt: “Output only the cleaned text.”
Problem: It would wrap the output in quotes: "Here's what I meant to say"
Third attempt: Added explicit prohibitions.
Output ONLY the cleaned text - no explanations, no quotes.
This worked. The key insight: LLMs default to being helpful and conversational. For tool use, you need to explicitly tell them to suppress that behavior.
Other prompt decisions:
- “Keep it casual” - prevents the LLM from making the text overly formal
- “No trailing period” - texting convention; a period at the end feels curt
- “Break up run-on sentences” - spoken language naturally runs together
Low temperature (0.2) keeps output consistent. Higher temperatures occasionally produced creative reinterpretations of what I said.
Graceful Degradation
The polish step can fail: network issues, rate limits, API changes. The user still expects their transcription to paste.
let final_text = if polish {
    polish_text(text, api_key).unwrap_or_else(|| text.to_string())
} else {
    text.to_string()
};
Option::unwrap_or_else is the right pattern here. If polish fails for any reason, fall back to the raw Whisper transcription. The user gets something rather than nothing.
This is a general principle for LLM features: treat them as enhancements, not requirements. The core functionality should work without them.
Latency Considerations
Polish adds a second API call, roughly 200-400ms on Groq. For a texting use case, this is acceptable - you’re not in a real-time conversation. For live captioning or dictation into a text field, it would be too slow.
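If you want to check that number on your own machine, wrapping the call with std::time::Instant is enough (a minimal sketch; the logging is just an example):

use std::time::Instant;

let started = Instant::now();
let polished = polish_text(text, api_key);
eprintln!("polish round-trip: {} ms", started.elapsed().as_millis());
// `polished` then feeds the same fallback logic shown above.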
The transcription already happens in a background thread:
thread::spawn(move || {
    transcribe_and_paste(audio_data, sample_rate, &api_key, polish);
});
Both the Whisper call and the polish call happen sequentially in this thread. The UI remains responsive; the user just waits slightly longer for paste.
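For context, the shape of that function is roughly the following. Treat it as a sketch rather than the real implementation: transcribe and paste_text are hypothetical stand-ins for the app’s actual Whisper request and paste logic, and the audio buffer type is an assumption.

// Sketch: `transcribe` and `paste_text` are hypothetical helpers; the real app's
// Whisper call and pasteboard/keystroke code go in their place.
fn transcribe_and_paste(audio_data: Vec<f32>, sample_rate: u32, api_key: &str, polish: bool) {
    // 1. Whisper transcription (blocking Groq call).
    let Some(raw) = transcribe(&audio_data, sample_rate, api_key) else {
        return; // nothing to paste if transcription itself failed
    };

    // 2. Optional polish, falling back to the raw transcript on any failure.
    let final_text = if polish {
        polish_text(&raw, api_key).unwrap_or_else(|| raw.clone())
    } else {
        raw
    };

    // 3. Paste into the frontmost app.
    paste_text(&final_text);
}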
Trade-offs
When polish helps:
- Texting, where filler words and run-ons look sloppy
- Drafting messages you want to sound more coherent
- Quick notes that benefit from basic cleanup
When to skip it:
- Dictating into forms or code comments
- When you want exact transcription (quotes, interviews)
- Low-latency scenarios
What polish can break:
- Proper nouns and technical terms may get “corrected”
- The LLM might misinterpret intent on ambiguous input
- Short inputs (“ok”, “yes”) sometimes get expanded unnecessarily
The latch pattern makes this an explicit user choice. Default is raw transcription; polish is opt-in.
Conclusion
- Latch pattern beats simultaneous press - let users enable modes at any point during an action
- Explicit prompt constraints - tell the LLM what NOT to do (no explanations, no quotes)
- Low temperature for tools - you want consistency, not creativity
- Graceful fallback is mandatory - LLM features should enhance, not gate, core functionality
- Choose your latency budget - 200-400ms is fine for async use cases, not for real-time
Related: Building an AI-Powered Changelog GitHub Action - Similar prompt engineering patterns for developer tooling.