A developer's journey from typing fatigue to a polished macOS speech-to-text app, built entirely through AI-assisted pair programming.
The Spark: “How Could I Talk to Claude Instead of Typing?”
It was late on March 12, 2026. I was deep into an intense coding session — evaluating ByteDance’s DeerFlow project, running multi-AI audits, battling context window limits — and my fingers were tired. I had been chatting with Claude Code in VS Code for hours, typing everything out, when I asked the question that started it all:

I shot it down immediately:

But I had something more specific in mind:

V0 The Ten-Minute Prototype
Claude built the first version in minutes: a simple Python script that
records from the mic using sox, stops when you press Enter, runs
Whisper, and copies the result to clipboard.

V1 The Persistent Loop — And Its Fatal Flaw
Claude converted it to a persistent loop: press Enter to start recording, press Enter again to stop, repeat forever. I saw the flaw instantly:

Then I delivered the design spec that defined the entire project:

This is exactly how WeChat handles voice messages — hold to talk, release to send. Simple, intuitive, no conflict with other keyboard shortcuts.
V1.5 The Fn Key Daemon
Claude built a background daemon using Python’s Quartz framework to
create a CGEventTap — a low-level macOS event listener that detects Fn
key press and release. sox handles recording, Whisper handles
transcription, pbcopy handles clipboard.

I tested it with mixed Mandarin and English:

The core was done. But it was rough — a terminal daemon with no UI, no auto-start, and occasionally missing punctuation.
V2 From CLI Hack to Polished Mac App
I wanted two things: fix the missing punctuation, and turn it into a
proper macOS menu bar app that starts on login. Claude drafted a plan
using rumps (a Python library for menu bar apps), --initial_prompt
for Whisper punctuation hints, and a LaunchAgent plist for auto-start.
The Multi-AI Audit
Before implementing, I did something I do routinely for important features: I sent the plan to multiple AI reviewers for independent audit. Claude CLI and Kimi CLI both reviewed the plan — and both caught the same critical bug.
The bug: My plan assumed --language en would translate Chinese
speech to English. It doesn’t. It forces Whisper to treat all audio as
English, which means Chinese speech produces garbled nonsense. Actual
translation requires --task translate.
Furthermore, there is no Whisper flag for translating TO Chinese at all. The plan had a “Chinese translation” mode that was technically impossible. Both auditors flagged it independently.
The audit also surfaced 12 other issues: race conditions with overlapping recordings, deprecated notification APIs on macOS 14+, unsafe UI updates from background threads, silent CGEventTap disabling, and more. Every one was addressed before implementation.
The Bug Parade
The Swedish Ghost. I spoke a short phrase in Chinese and got back:
“Vad sager du?” — Swedish. With very short audio clips, Whisper’s
language detection goes haywire. The --initial_prompt helps bias it,
but sub-second clips can still confuse the model.
The Zombie PID. After a system reboot, voice-input refused to start.
The PID file contained PID 717, which was valid — but it belonged to
itunescloudd, Apple’s iCloud sync daemon. After a reboot, macOS reuses
PIDs, so os.kill(717, 0) returned success because a process existed
at that PID, just not ours.

The things you learn while debugging.
The Duplicate Menu Bar Icons. After enabling the LaunchAgent, I saw two microphone icons. I’d forgotten to quit the manually-launched instance before the LaunchAgent started a second one. I quit the wrong one, killing the working instance. Lesson learned: the LaunchAgent handles everything now.
Auto-Paste: Going One Step Further
Once basic transcription was solid, I pushed for more:

“Perfect, it’s running as we expected.”
Voice-Triggered Screenshots
The most creative feature came from a simple idea:

First test failed — “Can you take a screenshot for the moment?” didn’t match because the matching was too strict. Fixed by matching on just the word “screenshot” anywhere in the text. Second test failed — needed Screen Recording permission. After granting it, screenshots worked perfectly.
V3 Cloud STT — From 5 Seconds to 1
For five days, voice-input ran flawlessly with local Whisper. But the 3-5 second transcription delay nagged at me. On March 18, I decided to add cloud-based speech recognition.
The plan: try DashScope’s qwen3-asr-flash API first (Alibaba’s cloud
STT, ~1 second latency), fall back to local Whisper if offline or the
API fails.
The API key treasure hunt. The DashScope API key in GCP Secret Manager turned out to be expired. Claude searched the GCP VM, Docker containers, and various config files — all dead ends:

The implementation was clean: a new transcribe_cloud() function that
base64-encodes the WAV and calls DashScope’s multimodal API, with the
existing Whisper logic extracted into transcribe_local() as the
fallback. Notifications now show a prefix — ☁️ cloud or 💻 local — so I
always know which engine handled my voice.

From 5 seconds to 1. From local-only to cloud-first with offline fallback. From typing everything to just… talking.
The Architecture
What started as a 30-line script evolved into a 550-line macOS menu bar app:

What I Learned
AI-assisted development is genuinely powerful when iterative. The key wasn’t getting Claude to write the whole thing at once — it was the rapid feedback loop: describe what I want → Claude builds it → I test → describe what’s wrong → Claude fixes it → repeat. The entire V2 app was built in a single session.
Multi-AI audits catch real bugs. Having Claude and Kimi independently review the plan caught a Whisper API misunderstanding that would have shipped broken. Different AI models spot different things.
The best tools are born from frustration. I didn’t set out to build a speech-to-text app. I was tired of typing during a long coding session and asked a simple question. Six days later, I have a cloud-accelerated, bilingual, auto-pasting voice input tool that starts on boot and handles mixed English and Chinese.
Hold Fn. Speak. Release. Text appears.
That’s it. That’s the whole app.
Built March 12–18, 2026, across 7 Claude Code sessions.
~550 lines of Python. Zero typing required to use it.