Building the Fastest AI Audio Transcription Service
Why speed changes the product, not just the user experience, and the stack that makes 200x real-time transcription feel like a single click.
There are things you need. There are things you do not need. There are things you do not know you need. And there are things you do not know you do not need. Fast audio transcription sits in a weird place across all four.
Most people do not wake up wanting a transcription tool. They wake up with a voice memo, a lecture recording, a client call they need to quote back accurately, or a 90-minute YouTube explainer they cannot justify watching in full. Transcription is the boring middle step between "I have audio" and "I have text I can search, edit, and share." The less they notice it, the better it is doing its job.
That is the bet behind Typist: if transcription is fast enough, cheap enough, and clean enough to disappear into the background, people stop rationing it. They stop asking "is this recording worth transcribing?" and just do it. This post is a short build log on how the product got there, what I skipped on purpose, and the stack that runs it today.
The thing that clicked
A few months before starting Typist, I was researching something for another side project. There was a one-hour YouTube explainer I needed to understand. I had zero interest in watching it in real time. My instinct was to drop the transcript into Claude and have a focused conversation instead.
I went looking for a transcription tool. Most consumer services were still running Whisper v2. A few charged per minute for output that was not meaningfully better than the open-source baseline. A couple were fast, but locked behind a Zoom integration I did not need. The pattern was the same everywhere: the underlying model had moved on, but the UX was frozen two years back.
Around the same time, Groq started hosting whisper-large-v3 on their LPU inference stack. One-hour recordings came back in seconds instead of minutes. Every other tool I tried felt slow by comparison. That was the trigger. All the lectures, client calls, and unwatched YouTube guides I had quietly given up on suddenly looked transcribable. The moat was no longer the model. It was the product built around the speed.
What I cut on purpose
This was my fifth solo project that year, after things like a Supabase MCP server and a Slack bot that runs Claude Code on demand. Experience taught me what not to ship in v1: no multi-feature dashboard, no unfamiliar infra, no clever architecture for problems I did not have yet.
The rules I wrote down before starting:
- There has to be one killer reason for this to exist.
- The stack should stay mostly familiar so I can ship in weeks, not months.
- One core flow has to work flawlessly before anything else gets built.
- The UX has to be decent enough that I use it myself without wincing.
Looking back after launch, the discipline held up. Typist transcribes an hour of audio in well under a minute. That is roughly 200x real-time on the Turbo model. Accuracy holds up across 99+ languages thanks to whisper-large-v3. The free tier gives you 3 transcriptions with files up to 100 MB, and Pro unlocks 5 GB files with all seven export formats (TXT, DOCX, PDF, SRT, Markdown, WebVTT, JSON). The UI is clean and opinionated. Not pretty yet. Clean.
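As a quick sanity check on the 200x figure, here is the arithmetic as a small TypeScript sketch. The function names are made up for illustration and are not part of Typist's code. It only covers processing time, not upload or queueing.

```typescript
// Real-time factor: seconds of audio processed per second of wall-clock time.
function realTimeFactor(audioSeconds: number, processingSeconds: number): number {
  if (processingSeconds <= 0) {
    throw new RangeError("processingSeconds must be positive");
  }
  return audioSeconds / processingSeconds;
}

// Inverse: how long a file should take at a given real-time factor.
function expectedProcessingSeconds(audioSeconds: number, factor: number): number {
  if (factor <= 0) {
    throw new RangeError("factor must be positive");
  }
  return audioSeconds / factor;
}

const oneHour = 60 * 60; // 3600 seconds of audio

console.log(expectedProcessingSeconds(oneHour, 200)); // 18
console.log(realTimeFactor(oneHour, 30)); // 120
```

So 200x real-time puts a one-hour file at about 18 seconds of processing, and anything that finishes inside 30 seconds is running at 120x or better.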
What I did not know I needed
After building the thing, I kept a list of lessons I wish someone had forced on me before day one.
Distribution from day one. You can build the fastest transcription service in the world, but if nobody knows it exists, it does not exist. Landing pages that target searches like "free speech to text software" and "audio transcription software free" are not marketing fluff. They are how users with a very specific job-to-be-done find a tool like Typist in the first place.
Speed as the feature, not a spec. "Fast" is not a bullet point. It is a product decision that changes user behavior. When a transcript comes back in 30 seconds instead of 10 minutes, people stop saving recordings for later and start transcribing them immediately. That one thing cascades into how people use the product, what formats they ask for, and what else they want around it.
Time honesty. I genuinely thought I could ship v1 in three days. It took three weeks. Every solo builder I know has this same delusion about time. The fix is not working faster. The fix is scoping smaller.
Quiet product decisions. No watermark on free exports. No "upgrade to remove" banner on the dashboard. No email drip that punishes you for not paying. The free tier has to feel like a product, not a trap, or people do not trust the paid one.
The stack
In 2026, the stack is not the moat. Anyone can clone the infra decisions below. The moat is taste, speed, and shipping. That said, here is the full build so you do not have to reverse-engineer it:
- TanStack Start for the React app with SSR and hydration, file-based routing, no RSC headaches
- Hono on Cloudflare Workers for the API and server functions
- Cloudflare Workflows for the multi-step transcription pipeline (upload, probe, chunk, transcribe, stitch, persist)
- Groq hosting whisper-large-v3 for the core transcription step
- Cloudflare D1 as the primary relational database
- Cloudflare R2 for audio, video, and export artifact storage
- Better Auth for sessions, OAuth, and later Stripe integration
Nothing revolutionary. You could copy this stack tomorrow. A handful of people will. Most will not ship.
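To make the chunk, transcribe, and stitch steps from the pipeline bullet a little more concrete, here is a minimal TypeScript sketch of the two pure-logic ends of that flow. Everything in it is an assumption for illustration: the `Chunk` and `Segment` shapes, the function names, and the fixed non-overlapping windows are hypothetical, not Typist's actual implementation, and the transcription call itself is left out.

```typescript
// Hypothetical shapes; none of these names come from Typist's codebase.
interface Chunk {
  index: number;
  startSec: number; // offset of the chunk on the full recording's timeline
  endSec: number;
}

interface Segment {
  startSec: number; // relative to the chunk when returned by the model
  endSec: number;
  text: string;
}

// Chunk step: split a recording of known duration into fixed-length windows.
function planChunks(durationSec: number, chunkSec: number): Chunk[] {
  if (durationSec < 0 || chunkSec <= 0) {
    throw new RangeError("durationSec must be >= 0 and chunkSec must be > 0");
  }
  const chunks: Chunk[] = [];
  let index = 0;
  for (let start = 0; start < durationSec; start += chunkSec) {
    chunks.push({
      index,
      startSec: start,
      endSec: Math.min(start + chunkSec, durationSec),
    });
    index += 1;
  }
  return chunks;
}

// Stitch step: shift chunk-relative timestamps onto the global timeline
// and return all segments in chronological order.
function stitch(chunks: Chunk[], perChunk: Segment[][]): Segment[] {
  if (chunks.length !== perChunk.length) {
    throw new Error("expected exactly one segment list per chunk");
  }
  const merged: Segment[] = [];
  chunks.forEach((chunk, i) => {
    for (const seg of perChunk[i]) {
      merged.push({
        startSec: seg.startSec + chunk.startSec,
        endSec: seg.endSec + chunk.startSec,
        text: seg.text,
      });
    }
  });
  return merged.sort((a, b) => a.startSec - b.startSec);
}

// Usage: a one-hour file in ten-minute windows yields six chunks.
const plan = planChunks(3600, 600);
console.log(plan.length); // 6
```

Because each chunk carries its own offset, the chunks can in principle be transcribed independently and in any order before stitching. A production pipeline would likely also need overlapping windows or silence-aware cuts so that words are not split at chunk boundaries. This sketch ignores that.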
Want to see the 200x real-time claim in practice? Upload a one-hour MP3 or MP4 and watch it finish before your coffee cools.
Things you do not need anymore
Now that the tool exists, here is what you can stop doing:
Waiting ten minutes for a transcript. Whatever you were using before, it was probably good enough for the occasional meeting. It was not good enough to change how you work. Upload a file to Typist and see what "good enough to change behavior" feels like.
Typing out "just this one quote" by hand. Everyone does this. Everyone regrets it. A two-minute section turns into a 20-minute scrub-and-type slog. There is a better use of your afternoon.
Paying $30 a month for six transcripts. Flat-rate subscriptions make sense for heavy users. For everyone else, three free transcriptions per account plus a flat $10/month Pro plan (billed yearly) is saner than renting minutes you will never use.
What is next
The honest answer is that I do not know. The most interesting part of a launch is watching what users actually do with the speed advantage. Early patterns so far:
- Students transcribing entire semesters of lectures in a single sitting.
- Podcasters building searchable archives of their back catalogue to reuse clips.
- Researchers feeding interview transcripts straight into qualitative analysis pipelines.
- Non-native speakers reading transcripts alongside audio to close the comprehension gap.
Each of those nudges the roadmap somewhere. AI Insight Packs (summaries, chapters, action items, quotes) came out of the researcher and student signal. The seven-format export matrix came out of the podcaster signal. The next few features will come from whatever the users who actually paid tell me is still annoying.
The one thing I will not do is let Typist drift into being another Swiss-army AI tool. It stays focused on the core promise: the fastest, most accurate audio and video transcription you can put in front of a human.
Upload any audio or video file. Get a transcript in seconds. No credit card.
Try it yourself
The fastest way to understand what "fast" means in this context is to try it. Record something with your phone, drag the file into the dashboard, and see how the output compares to whatever you use today. If you hate it, you have lost 30 seconds. If you like it, you have a new default.