I Built My First Chrome Extension — DictAda, a Private Voice-to-Text Tool Powered by My Own GPU
Today I deployed my first-ever browser extension, and I'm genuinely excited about it. Not because it's flashy — it's actually pretty minimal — but because of what it represents in my coding journey and the problem it solves.
The Problem: Voice Input Without Selling Your Soul
I dictate a lot. Detailed instructions, long-form thoughts, task descriptions for my AI agents. The built-in speech-to-text options all have one thing in common: your voice data goes to Google, Apple, or Microsoft servers. You have zero visibility into what happens to it after that.
I wanted something different — something where my voice data never touches infrastructure I don't control.
What I Built: DictAda
DictAda is a Chrome extension (works in Opera and other Chromium browsers too) that records your voice and transcribes it using OpenAI's Whisper model — but running on my own GPU container, not OpenAI's servers.
The architecture is dead simple:
🎙️ Your mic → Browser Extension → Audio blob → My GPU (Modal.com T4) → Text → Inserted at cursor
That's it. No account creation. No data retention. The audio hits my container, gets transcribed, and the container forgets it ever existed.
The Stack
Backend: Modal.com + faster-whisper
- `faster-whisper`: a CTranslate2-based reimplementation of Whisper that runs up to 4x faster than the original
- Model: `small` (good balance of speed vs. accuracy, handles multiple languages)
- GPU: NVIDIA T4 via Modal.com
- Framework: FastAPI for the HTTP endpoint
- Cost: Essentially free. Modal's free tier includes $30/month of compute, and each transcription costs ~$0.0003 in GPU time. That's roughly 100,000 dictations per month before I'd pay anything.
The backend is a single Python file. Modal handles all the containerization, GPU allocation, and auto-scaling. When nobody's using it, the container sleeps. When I speak, it wakes up, transcribes, and goes back to sleep.
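For the curious, here's a minimal sketch of what that single file can look like. To be clear, this is not DictAda's actual source: the class name, the endpoint name, and the multipart field `audio` are placeholders of mine, and the decorator names follow Modal's current docs (older Modal releases call `fastapi_endpoint` `web_endpoint`).

```python
# backend.py: a sketch of a Modal + faster-whisper + FastAPI transcription endpoint.
import modal
from fastapi import File, UploadFile

app = modal.App("dictada-backend")

# Container image with the transcription and web dependencies baked in.
image = modal.Image.debian_slim().pip_install(
    "faster-whisper", "fastapi[standard]", "python-multipart"
)

@app.cls(gpu="T4", image=image)
class Transcriber:
    @modal.enter()  # runs once per container boot; this is the cold-start cost
    def load_model(self):
        from faster_whisper import WhisperModel
        self.model = WhisperModel("small", device="cuda", compute_type="float16")

    @modal.fastapi_endpoint(method="POST")
    async def transcribe(self, audio: UploadFile = File(...)):
        # faster-whisper reads the file-like object directly: the audio is
        # never written to disk and is gone once the request returns.
        segments, _info = self.model.transcribe(audio.file)
        return {"text": "".join(seg.text for seg in segments).strip()}
```

Deploying is a single `modal deploy backend.py`; Modal prints the endpoint's public URL, and that URL is what the extension POSTs audio to.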
Extension: Manifest V3
- Clean popup UI with a mic button
- Keyboard shortcut: `Ctrl+Shift+S` (or `⌘+Shift+S` on Mac)
- Records via the Web Audio API
- Inserts the transcribed text directly at your cursor position
- Works with `<input>`, `<textarea>`, and contentEditable elements (so basically everywhere)
The Journey
I'm not going to pretend this was a months-long odyssey. The actual coding was fast — I used Claude Code to help scaffold both the backend and the extension. But that's kind of the point.
What took longer was the decision to build it. I'd been using various voice-to-text tools for a while, always slightly uncomfortable with where my audio was going. One day I realized: I already know Python, I already use Modal for other things, and Chrome extensions are just HTML/JS. Why am I not doing this myself?
The hardest part was honestly getting microphone permissions working in Opera. Not a code problem — a browser UX problem. But once that was sorted, hearing my own words appear on screen via my own infrastructure? That felt different.
Why This Matters to Me
This is my first Chrome extension. Ever. After years of building web apps, APIs, and backends, I'd never shipped a browser extension. It feels like unlocking a new category.
But more importantly, this fits into a bigger picture I'm building: autonomous AI workflows. My main use case for DictAda isn't casual dictation — it's speaking detailed instructions that get sent to AI coding agents. I describe what I want built, DictAda transcribes it, and the agents go to work. Voice → text → code → deployed. All while I'm away from the keyboard.
Privacy isn't a feature here — it's the foundation. When your voice commands control your development pipeline, you don't want that audio sitting on someone else's server.
What I'd Improve
- Cold start latency: First request after the container sleeps takes 5-10 seconds (GPU boot + model loading). Subsequent requests are near-instant. I could keep a container warm 24/7, but the current tradeoff is fine for my use case.
- Model size: The `small` model is good, but `medium` or `large-v3` would be more accurate for technical jargon. That's a future upgrade.
- API authentication: Right now it's open. I'll add a Bearer token before sharing the URL with anyone (sketched below).
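That last item is a small change. Here's roughly how I'd do it as a FastAPI dependency; `DICTADA_TOKEN` is a made-up env var name (on Modal the value would come from a `modal.Secret`), not something from the current code:

```python
import os
import secrets

from fastapi import Header, HTTPException

def require_token(authorization: str = Header(default="")) -> None:
    # Expect an "Authorization: Bearer <token>" header on every request.
    # secrets.compare_digest keeps the comparison constant-time.
    expected = "Bearer " + os.environ["DICTADA_TOKEN"]
    if not secrets.compare_digest(authorization.encode(), expected.encode()):
        raise HTTPException(status_code=401, detail="invalid or missing token")
```

Declared as a parameter default on the endpoint (`_: None = Depends(require_token)`), it rejects unauthenticated requests with a 401 before any GPU time is spent.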
Try It Yourself
The full source is private for now, but the approach is completely reproducible:
- Sign up for Modal.com (generous free tier)
- Deploy a `faster-whisper` model on a T4 GPU (a quick way to smoke-test it follows this list)
- Write a Manifest V3 extension that records audio and POSTs it to your endpoint
- Load it as an unpacked extension in your browser
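Once the backend is up, you can smoke-test it from Python before writing a line of extension code. The URL below is a placeholder (Modal prints the real one at deploy time), and the `audio` field name just has to match whatever your endpoint expects:

```python
# Smoke test: POST a local recording to the transcription endpoint.
import requests

ENDPOINT = "https://<your-workspace>--dictada-backend.modal.run"  # placeholder

with open("sample.wav", "rb") as f:
    resp = requests.post(ENDPOINT, files={"audio": ("sample.wav", f, "audio/wav")})

resp.raise_for_status()
print(resp.json()["text"])  # the transcription, assuming a {"text": ...} response
```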
The whole thing is maybe 200 lines of Python and 150 lines of JavaScript. No build tools, no frameworks, no complexity.
The Takeaway
Sometimes the most satisfying projects aren't the biggest ones. They're the ones that scratch an itch you've been ignoring. I wanted private voice-to-text. Now I have it, running on my own GPU, in my own browser, under my complete control.
First extension shipped. Feels good. 🎙️✨
I'm Christian, a software developer building Afrotomation — an AI automation agency. I write about building tools, shipping fast, and automating the boring stuff.