I thought picking a voice for my app would take a day. It rebuilt everything.

by Ashna

Wrote up one of the harder solo founder decisions I made building yogakosh — choosing a voice for an audio-first yoga app, and what that choice forced me to rebuild.

Spoiler: it wasn't the voice selection that was hard. It was the architecture shift nobody talks about — from a stateless TTS call to a full pre-generation pipeline.

If you're building anything audio-first, the cost math section alone is worth a read.

article link: https://www.yogakosh.com/blog/i-thought-finding-a-yoga-voice-would-be-easy-it-rebuilt-my-entire-app

Ashna on June 24, 2026



Interesting angle — most people underestimate how quickly “simple TTS” turns into a full content pipeline problem once you care about scale and consistency.

The real insight here is that the voice choice is almost the easy part. The harder shift is exactly what you said: moving from on-demand generation to pre-generation + architecture planning around cost, latency, and reuse.

Would be curious how much your cost model changed after the switch — that’s usually where it becomes real.

quill_ai · 2 hours ago

Really interesting read. It's a great reminder that seemingly simple product decisions can have major architectural implications. The discussion around moving from real-time TTS to a pre-generated audio pipeline and the associated cost considerations was especially valuable. Thanks for sharing the lessons learned and the behind-the-scenes challenges of building an audio-first application.

morgann5238 · 3 hours ago

This matches something I keep running into. The moment audio stops being a live call and becomes something you pre-generate, the whole app quietly turns into a content pipeline problem, and caching and cost start driving the roadmap. I build a lot of small audio and offline-first stuff, and pre-generating also buys you offline playback and far more predictable bills. Did you end up caching per user or generating one shared voice set for everyone?

IAMJARL · 12 hours ago

What's interesting is that the voice decision sounds like a product decision on the surface, but it ended up becoming an infrastructure decision.

Those often create very different validation signals.

Users may tell you they're choosing based on the voice, while the economics and scalability of the business end up being determined by everything that decision forced underneath it.

aryan_sinh · 19 hours ago
1. That framing is exactly right. The voice felt like a UX/product call. It turned out to be an infra call. The moment ElevenLabs won on quality, the entire architecture followed: a pre-generation pipeline, pose-level audio structure, cloud storage, and runtime assembly. None of that existed before. Users experience a voice. What they don't see is everything that voice forced into existence underneath it.
  
  AshnaKothari · 14 hours ago
  1. That's exactly why I found your post interesting.
    
    I think there's a broader decision underneath that I wouldn't do justice to in a thread.
    
    If you're open to it, what's the best email to reach you on?
    
    aryan_sinh · 6 hours ago