
I thought picking a voice for my app would take a day. It rebuilt everything.

Wrote up one of the harder solo founder decisions I made building yogakosh — choosing a voice for an audio-first yoga app, and what that choice forced me to rebuild.

Spoiler: it wasn't the voice selection that was hard. It was the architecture shift nobody talks about — from a stateless TTS call to a full pre-generation pipeline.

If you're building anything audio-first, the cost math section alone is worth a read.

article link: https://www.yogakosh.com/blog/i-thought-finding-a-yoga-voice-would-be-easy-it-rebuilt-my-entire-app

on June 24, 2026
  1. 1

    This resonates more than I expected.

    People usually think voice is branding, but it quietly changes who feels understood by the product.

    Was there one customer conversation that made you realize the original voice wasn't working?

    1. 1

      Exactly! I tried an initial alpha release with a few users, and they all complained that the original voice was too robotic, which confirmed my hypothesis.

  2. 1

    This is such a great example of how a seemingly small UX choice cascades into infrastructure decisions. The shift from stateless TTS to pre-generation pipelines is exactly the kind of constraint that forces better architecture thinking - consistency, latency, and cost all improve.

    The fact that this was forced by voice selection rather than obvious from the start shows how much builders underestimate audio-first requirements. Did the pre-generation pipeline also improve your deployment complexity, or did it mostly just fix the cost/consistency side?

    1. 1

      Exactly this — and honestly, deployment complexity got worse before it got better. Pre-generation meant building a pipeline I hadn't planned for: batch jobs, cloud storage, a data structure to map audio to poses at runtime. More moving parts upfront.
      Caching and loading were their own undertaking. Before I got that right, audio just wouldn't load, which completely breaks the experience for an audio-first app. That forced me to get serious about how and when assets were fetched and held in memory.
      But all of it forced clarity. When you're not generating on the fly, every decision about the flow — structure, sequencing, cue timing — has to be made in advance. That constraint made the product more intentional.
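
A minimal sketch of the "data structure to map audio to poses at runtime" plus the in-memory caching described above. It is not the actual yogakosh code: the `CueAsset` and `AudioLibrary` names, the `(pose_id, cue)` key, and the injected `fetch` function are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class CueAsset:
    """One pre-generated clip. All field names are hypothetical."""
    pose_id: str
    cue: str
    url: str  # assumed location of the clip in cloud storage


class AudioLibrary:
    """Maps (pose_id, cue) to a clip and holds fetched clips in memory."""

    def __init__(self, manifest: List[CueAsset], fetch: Callable[[str], bytes]):
        self._by_key: Dict[Tuple[str, str], CueAsset] = {
            (asset.pose_id, asset.cue): asset for asset in manifest
        }
        self._fetch = fetch  # injected so storage/network code stays swappable
        self._cache: Dict[str, bytes] = {}

    def get(self, pose_id: str, cue: str) -> bytes:
        """Return the clip bytes, downloading only on the first request."""
        asset = self._by_key[(pose_id, cue)]
        if asset.url not in self._cache:
            self._cache[asset.url] = self._fetch(asset.url)
        return self._cache[asset.url]

    def prefetch(self, flow: List[Tuple[str, str]]) -> None:
        """Warm the cache for every cue in a flow before playback starts."""
        for pose_id, cue in flow:
            self.get(pose_id, cue)
```

Calling `prefetch` with a flow's cue list before the session starts is one way to avoid the "audio just wouldn't load" failure: by the time playback begins, every clip the flow needs is already in memory.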

  3. 1

    Interesting angle — most people underestimate how quickly “simple TTS” turns into a full content pipeline problem once you care about scale and consistency.

    The real insight here is that the voice choice is almost the easy part. The harder shift is exactly what you said: moving from on-demand generation to pre-generation + architecture planning around cost, latency, and reuse.

    Would be curious how much your cost model changed after the switch — that’s usually where it becomes real.

    1. 1

      Cost became real fast. It went from $0 (experimenting with free tiers) to paying per word generated — and that changes how you think about every decision in the pipeline.
      What made it more unpredictable: generations sometimes come out garbled, which means re-runs. So you're not just budgeting for the happy path.
      Pre-generation meant I could batch everything upfront and know exactly what I was spending — rather than costs creeping up with every user session.
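
The budgeting point above (paying per word, plus re-runs for garbled output) can be sketched as a small estimate. This is an illustration under stated assumptions, not the author's cost model: the price per word and the garble rate are hypothetical inputs, and the sketch assumes each re-run can itself come out garbled with the same probability, so the expected number of attempts per clip is 1 / (1 - garble_rate).

```python
from typing import Iterable


def estimate_batch_cost(
    word_counts: Iterable[int],
    price_per_word: float,
    garble_rate: float = 0.0,
) -> float:
    """Expected cost of pre-generating a batch of cues, including re-runs.

    price_per_word and garble_rate are hypothetical inputs; plug in numbers
    from your own provider's pricing and your own observed failure rate.
    """
    if not 0.0 <= garble_rate < 1.0:
        raise ValueError("garble_rate must be in [0, 1)")
    total_words = sum(word_counts)
    # Geometric retries: on average 1 / (1 - p) attempts until a clean take.
    expected_attempts = 1.0 / (1.0 - garble_rate)
    return total_words * price_per_word * expected_attempts
```

Because the cue list is fixed before generation, this number can be computed once per batch, instead of growing with every user session as it would with on-demand TTS.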

  4. 1

    Really interesting read. It's a great reminder that seemingly simple product decisions can have major architectural implications. The discussion around moving from real-time TTS to a pre-generated audio pipeline and the associated cost considerations was especially valuable. Thanks for sharing the lessons learned and the behind-the-scenes challenges of building an audio-first application.

    1. 1

      Thanks! Glad you liked it.

  5. 1

    This matches something I keep running into. The moment audio stops being a live call and becomes something you pre-generate, the whole app quietly turns into a content pipeline problem, and caching and cost start driving the roadmap. I build a lot of small audio and offline-first stuff, and pre-generating also buys you offline playback and way more predictable bills. Did you end up caching per user or generating one shared voice set for everyone?

    1. 1

      Shared voice — one generation per pose cue, reused across all users and flows. That's actually what made the caching tractable. No per-user state to manage, just a clean asset library fetched at the right time.
      Although I'm thinking of adding more voices, so I'll have to revisit this too :)
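
A rough sketch of "one generation per pose cue, reused across all users and flows": collect every cue that any flow needs and generate each one only once. The flow names, the `(pose_id, cue)` pairs, and the `voices` parameter are hypothetical; the voice dimension is included only to show how adding voices would multiply the job list.

```python
from typing import Dict, List, Sequence, Tuple

Cue = Tuple[str, str]       # (pose_id, cue), a hypothetical key
Job = Tuple[str, str, str]  # (voice, pose_id, cue)


def generation_jobs(
    flows: Dict[str, List[Cue]],
    voices: Sequence[str] = ("default",),
) -> List[Job]:
    """List each (voice, pose, cue) once, however many flows reuse it."""
    seen = set()
    jobs: List[Job] = []
    for voice in voices:
        for flow in flows.values():
            for pose_id, cue in flow:
                key = (voice, pose_id, cue)
                if key not in seen:  # shared cues are generated only once
                    seen.add(key)
                    jobs.append(key)
    return jobs
```

With a single shared voice, the job count depends only on the number of distinct cues, not on users or flows. Each extra voice adds one more full pass over that same cue set.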

  6. 1

    What's interesting is that the voice decision sounds like a product decision on the surface, but it ended up becoming an infrastructure decision.

    Those often create very different validation signals.

    Users may tell you they're choosing based on the voice, while the economics and scalability of the business end up being determined by everything that decision forced underneath it.

    1. 1

      That framing is exactly right. The voice felt like a UX/product call. It turned out to be an infra call. The moment ElevenLabs won on quality, the entire architecture followed — pre-generation pipeline, pose-level audio structure, cloud storage, runtime assembly. None of that existed before. Users experience a voice. What they don't see is everything that voice forced into existence underneath it.

      1. 1

        That's exactly why I found your post interesting.

        I think there's a broader decision underneath that I wouldn't do justice to in a thread.

        If you're open to it, what's the best email to reach you on?
