I thought picking a voice for my app would take a day. It rebuilt everything.

by Ashna

Wrote up one of the harder solo founder decisions I made building yogakosh — choosing a voice for an audio-first yoga app, and what that choice forced me to rebuild.

Spoiler: it wasn't the voice selection that was hard. It was the architecture shift nobody talks about — from a stateless TTS call to a full pre-generation pipeline.

If you're building anything audio-first, the cost math section alone is worth a read.

article link: https://www.yogakosh.com/blog/i-thought-finding-a-yoga-voice-would-be-easy-it-rebuilt-my-entire-app

Ashna on June 24, 2026

This resonates more than I expected.

People usually think voice is branding, but it quietly changes who feels understood by the product.

Was there one customer conversation that made you realize the original voice wasn't working?

FounderFlow_57 · 5 hours ago
  
  Exactly! I tried an initial alpha release with a few users, and they all complained that the original voice was too robotic, which confirmed my hypothesis.
  
  AshnaKothari · 4 hours ago

This is such a great example of how a seemingly small UX choice cascades into infrastructure decisions. The shift from stateless TTS to pre-generation pipelines is exactly the kind of constraint that forces better architecture thinking - consistency, latency, and cost all improve.

The fact that this was forced by voice selection rather than obvious from the start shows how much builders underestimate audio-first requirements. Did the pre-generation pipeline also reduce your deployment complexity, or did it mostly just fix the cost/consistency side?

galdayan · 7 hours ago
  
  Exactly this — and honestly, deployment complexity got worse before it got better. Pre-generation meant building a pipeline I hadn't planned for: batch jobs, cloud storage, a data structure to map audio to poses at runtime. More moving parts upfront.
  Caching and loading were their own undertaking. Before I got that right, audio just wouldn't load, which completely breaks the experience for an audio-first app. That forced me to get serious about how and when assets were fetched and held in memory.
  But all of it forced clarity. When you're not generating on the fly, every decision about the flow — structure, sequencing, cue timing — has to be made in advance. That constraint made the product more intentional.
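
One way to picture the "data structure to map audio to poses at runtime" mentioned above is a manifest keyed by pose and cue. The sketch below is only an illustration in Python, not the yogakosh implementation: `AudioCue`, `MANIFEST`, `assemble_flow`, the pose names, the durations, and the `example.com` URLs are all assumptions made up for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AudioCue:
    """One pre-generated clip. All field names are hypothetical."""
    pose_id: str
    cue: str           # e.g. "enter" or "hold"
    url: str           # placeholder location of the stored clip
    duration_s: float  # clip length in seconds


# Hypothetical manifest, as a batch job might emit it after generation.
MANIFEST = {
    ("mountain", "enter"): AudioCue(
        "mountain", "enter", "https://example.com/audio/mountain_enter.mp3", 4.0),
    ("mountain", "hold"): AudioCue(
        "mountain", "hold", "https://example.com/audio/mountain_hold.mp3", 6.5),
    ("warrior_1", "enter"): AudioCue(
        "warrior_1", "enter", "https://example.com/audio/warrior_1_enter.mp3", 5.0),
}


def assemble_flow(sequence, manifest):
    """Resolve ordered (pose_id, cue) keys into (start_offset_s, clip) pairs."""
    timeline = []
    offset = 0.0
    for key in sequence:
        clip = manifest.get(key)
        if clip is None:
            # Nothing is generated on the fly, so a missing clip is a hard error.
            raise KeyError(f"no pre-generated audio for {key}")
        timeline.append((offset, clip))
        offset += clip.duration_s
    return timeline
```

Calling `assemble_flow([("mountain", "enter"), ("mountain", "hold")], MANIFEST)` returns the two clips with their start offsets. A key with no entry raises immediately, which mirrors the point that every sequencing decision has to be made before runtime.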
  
  AshnaKothari · 3 hours ago

Interesting angle — most people underestimate how quickly “simple TTS” turns into a full content pipeline problem once you care about scale and consistency.

The real insight here is that the voice choice is almost the easy part. The harder shift is exactly what you said: moving from on-demand generation to pre-generation + architecture planning around cost, latency, and reuse.

Would be curious how much your cost model changed after the switch — that’s usually where it becomes real.

quill_ai · 10 hours ago
  
  Cost became real fast. It went from $0 (experimenting with free tiers) to paying per word generated — and that changes how you think about every decision in the pipeline.
  What made it more unpredictable: generations sometimes come out garbled, which means re-runs. So you're not just budgeting for the happy path.
  Pre-generation meant I could batch everything upfront and know exactly what I was spending — rather than costs creeping up with every user session.
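
The cost reasoning above can be sketched as simple arithmetic. This is a rough model under stated assumptions, not actual pricing: it assumes a flat per-word price, that each generation is garbled independently with a fixed probability, and that a garbled clip is re-run until it succeeds (so the expected number of attempts is 1 / (1 - garble_rate)). The function names and every number passed to them are hypothetical.

```python
def pregeneration_cost(total_words, price_per_word, garble_rate):
    """Expected one-off cost of generating the whole script library once."""
    if not 0 <= garble_rate < 1:
        raise ValueError("garble_rate must be in [0, 1)")
    # Re-running until success means 1 / (1 - p) attempts on average.
    expected_attempts = 1 / (1 - garble_rate)
    return total_words * price_per_word * expected_attempts


def on_demand_cost(words_per_session, sessions, price_per_word, garble_rate):
    """Expected cost if every user session triggers fresh generation."""
    return pregeneration_cost(
        words_per_session * sessions, price_per_word, garble_rate
    )
```

Under this model the pre-generation cost depends only on the size of the script library, while the on-demand cost grows linearly with the number of sessions. That is the difference between knowing the spend upfront and watching it creep up with usage.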
  
  AshnaKothari · 3 hours ago

Really interesting read. It's a great reminder that seemingly simple product decisions can have major architectural implications. The discussion around moving from real-time TTS to a pre-generated audio pipeline and the associated cost considerations was especially valuable. Thanks for sharing the lessons learned and the behind-the-scenes challenges of building an audio-first application.

morgann5238 · 12 hours ago
  
  Thanks! Glad you liked it.
  
  AshnaKothari · 3 hours ago

This matches something I keep running into. The moment audio stops being a live call and becomes something you pre-generate, the whole app quietly turns into a content pipeline problem, and caching and cost start driving the roadmap. I build a lot of small audio and offline-first stuff, and pre-generating also buys you offline playback and way more predictable bills. Did you end up caching per user or generating one shared voice set for everyone?

IAMJARL · 20 hours ago
  
  Shared voice — one generation per pose cue, reused across all users and flows. That's actually what made the caching tractable. No per-user state to manage, just a clean asset library fetched at the right time.
  Although I'm thinking of adding more voices, so I'll have to revisit this too :)
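
A minimal sketch of the "one generation per pose cue, reused across all users" idea, assuming a hypothetical `SharedAudioLibrary` class and a caller-supplied `generate` function standing in for whatever TTS call is actually used. The cache key deliberately contains no user identifier. Including the voice in the key is one possible way to accommodate additional voices later, not something the thread confirms.

```python
class SharedAudioLibrary:
    """Caches one clip per (voice, cue_id), shared by every user."""

    def __init__(self, generate):
        # generate is a hypothetical callable: (voice, cue_text) -> bytes
        self._generate = generate
        self._clips = {}
        self.generation_count = 0

    def get(self, voice, cue_id, cue_text):
        key = (voice, cue_id)  # no user id, so there is no per-user state
        if key not in self._clips:
            self._clips[key] = self._generate(voice, cue_text)
            self.generation_count += 1
        return self._clips[key]
```

With this shape, two users requesting the same cue in the same voice trigger a single generation. A second voice adds one more generation per cue rather than one per user.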
  
  AshnaKothari · 3 hours ago

What's interesting is that the voice decision sounds like a product decision on the surface, but it ended up becoming an infrastructure decision.

Those often create very different validation signals.

Users may tell you they're choosing based on the voice, while the economics and scalability of the business end up being determined by everything that decision forced underneath it.

aryan_sinh · a day ago
  
  That framing is exactly right. The voice felt like a UX/product call. It turned out to be an infra call. The moment ElevenLabs won on quality, the entire architecture followed — pre-generation pipeline, pose-level audio structure, cloud storage, runtime assembly. None of that existed before. Users experience a voice. What they don't see is everything that voice forced into existence underneath it.
  
  AshnaKothari · a day ago
    
    That's exactly why I found your post interesting.
    
    I think there's a broader decision underneath that I wouldn't do justice to in a thread.
    
    If you're open to it, what's the best email to reach you on?
    
    aryan_sinh · 14 hours ago