
How the top health company is building AI

I sat down with Roman Bugaev (CTO) and Vlad Nedosekin (Director of AI) from Flo Health, the biggest women's health app in the world, with 80M users and growing by 1M a month. The thing that stuck with me most wasn't a model or an architecture diagram. It was a rule: "Whenever it's possible, we are not doing AI." They start with if-then-else. Only when the conditions become unmanageable do they reach for ML. For a company that runs 400 A/B tests a quarter, fine-tunes Llama 70B on 10,000 H100 hours, and builds its own router that triages user questions like a GP sending patients to specialists, that restraint is the real moat.

Users prefer talking to the Flo logo over a photo of a human doctor; engagement goes up when it looks less like a person. In their three-person blind tests, the AI often turns out to be right and the human clinician wrong, because humans get tired and AI doesn't. They refuse to fine-tune on real user data, so they generate synthetic women's health data with their medical team in the loop, specifically to counter the bias baked into general models trained on male-default medicine. Inference now costs them more than training. Prompts have ballooned to 100k tokens because medical context matters. And their defense against copycats isn't secrecy, it's that by the time you finish cloning today's Flo, they've already shipped the next 400 experiments.

The meta-lesson for anyone building an AI product: evaluation is the work. Roman said it three times in a row, "evaluation, evaluation, evaluation." There is no single best model; they test every model against every use case on medical safety, usefulness, cost, and latency. They won't touch proprietary models that silently update, because stability matters more than benchmark wins when you're giving health advice. Build your competitive advantage, buy everything else. Full episode here if you want the whole thing.
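Since "there is no single best model," the evaluation work above implies a matrix: every candidate model scored against every use case on each axis, with a winner picked per use case rather than globally. Here's a toy sketch of that selection step (my own illustration, assuming equal-weighted 0-to-1 scores; Flo's actual harness and weighting are not described in the interview):

```python
# Axes named in the interview: medical safety, usefulness, cost, latency.
CRITERIA = ("safety", "usefulness", "cost", "latency")


def best_model_per_use_case(scores):
    """scores maps (model, use_case) -> {criterion: 0..1, higher is better}.

    Returns the top-scoring model for each use case independently,
    rather than crowning one global winner.
    """
    best = {}  # use_case -> (model, total score)
    for (model, use_case), axes in scores.items():
        total = sum(axes[c] for c in CRITERIA)
        if use_case not in best or total > best[use_case][1]:
            best[use_case] = (model, total)
    return {uc: model for uc, (model, _total) in best.items()}
```

In practice you'd weight the axes (safety presumably dominates for health advice) and gate on hard minimums before summing, but the per-use-case structure is the point.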

posted to Developers on April 18, 2026

    Really like that principle — “don’t use AI until you actually need it.” It’s easy to forget how far simple logic can go.

    The part about evaluation being the real work hits hard too. Most teams obsess over models, but not enough over measuring what actually matters (especially in something as sensitive as health).

    Also interesting that users trust the logo more than a human face — kind of counterintuitive, but makes sense in terms of perceived neutrality.

    Curious — how do they structure those evaluations at scale with 400+ experiments? That sounds like the real secret sauce.
