
I thought picking a voice for my app would take a day. It rebuilt everything.

Wrote up one of the harder solo-founder decisions I made building yogakosh: choosing a voice for an audio-first yoga app, and what that choice forced me to rebuild.

Spoiler: it wasn't the voice selection that was hard. It was the architecture shift nobody talks about — from a stateless TTS call to a full pre-generation pipeline.
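To make that shift concrete, here is a rough sketch of what pre-generation can look like. It is not the actual yogakosh code: every name in it (`clip_key`, `pregenerate`, `assemble`, the `synthesize` callback, the dict standing in for cloud storage) is a made-up placeholder, and the real TTS call is left to whatever provider you use.

```python
import hashlib
from typing import Callable, Dict, List


def clip_key(voice_id: str, text: str) -> str:
    """Stable storage key: the same voice and text always map to the same clip."""
    digest = hashlib.sha256(f"{voice_id}\n{text}".encode("utf-8")).hexdigest()
    return f"{voice_id}/{digest[:16]}.mp3"


def pregenerate(
    poses: Dict[str, str],
    voice_id: str,
    synthesize: Callable[[str, str], bytes],  # hypothetical wrapper around a TTS provider
    store: Dict[str, bytes],                  # stand-in for a cloud storage bucket
) -> Dict[str, str]:
    """Synthesize each pose cue at most once and return a pose -> clip key manifest."""
    manifest: Dict[str, str] = {}
    for pose, text in poses.items():
        key = clip_key(voice_id, text)
        if key not in store:  # only pay for clips that are not stored yet
            store[key] = synthesize(voice_id, text)
        manifest[pose] = key
    return manifest


def assemble(sequence: List[str], manifest: Dict[str, str]) -> List[str]:
    """At runtime a session is just an ordered list of stored clip keys."""
    return [manifest[pose] for pose in sequence]
```

The point of the sketch is the shape of the cost: with a stateless call you pay per playback, while here you pay once per unique (voice, text) pair and every later session reuses the stored clips.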

If you're building anything audio-first, the cost math section alone is worth a read.

article link: https://www.yogakosh.com/blog/i-thought-finding-a-yoga-voice-would-be-easy-it-rebuilt-my-entire-app

on June 24, 2026
  1. 1

    Interesting angle — most people underestimate how quickly “simple TTS” turns into a full content pipeline problem once you care about scale and consistency.

    The real insight here is that the voice choice is almost the easy part. The harder shift is exactly what you said: moving from on-demand generation to pre-generation + architecture planning around cost, latency, and reuse.

    Would be curious how much your cost model changed after the switch — that’s usually where it becomes real.

  2. 1

    Really interesting read. It's a great reminder that seemingly simple product decisions can have major architectural implications. The discussion around moving from real-time TTS to a pre-generated audio pipeline and the associated cost considerations was especially valuable. Thanks for sharing the lessons learned and the behind-the-scenes challenges of building an audio-first application.

  3. 1

    This matches something I keep running into. The moment audio stops being a live call and becomes something you pre-generate, the whole app quietly turns into a content pipeline problem, and caching and cost start driving the roadmap. I build a lot of small audio and offline-first stuff, and pre-generating also buys you offline playback and far more predictable bills. Did you end up caching per user or generating one shared voice set for everyone?

  4. 1

    What's interesting is that the voice decision sounds like a product decision on the surface, but it ended up becoming an infrastructure decision.

    Those often create very different validation signals.

    Users may tell you they're choosing based on the voice, while the economics and scalability of the business end up being determined by everything that decision forced underneath it.

    1. 1

      That framing is exactly right. The voice felt like a UX/product call. It turned out to be an infra call. The moment ElevenLabs won on quality, the entire architecture followed — pre-generation pipeline, pose-level audio structure, cloud storage, runtime assembly. None of that existed before. Users experience a voice. What they don't see is everything that voice forced into existence underneath it.

      1. 1

        That's exactly why I found your post interesting.

        I think there's a broader decision underneath that I wouldn't do justice to in a thread.

        If you're open to it, what's the best email to reach you on?
