
How to Build ML/AI-Based Infrastructure That Is Self-Improving Over Time

Most AI failures are not failures of intelligence. They are failures of memory. The model makes a mistake, a human corrects it, and the model forgets by morning. This cycle repeats thousands of times across the industry every day. 95% of generative AI pilots yield no tangible business impact, with only 5% of organizations successfully integrating AI tools into production at scale. The root cause is not bad algorithms or insufficient data. It is broken feedback loops.

Pratyusha Singaraju is a senior software engineer with more than 13 years of experience across Microsoft and Netflix. She builds knowledge graph infrastructure and ML-powered data pipelines serving hundreds of millions of users. At Microsoft, she worked on the knowledge graph powering Bing, building web-scale ingestion pipelines. Today, she focuses on ML-driven, LLM-backed data pipelines that improve data discoverability. Her core belief: systems must improve themselves over time, or they will eventually break trust. Beyond her engineering work, she serves as a judge for the ICLR 2026 Workshop on Symbolic and Probabilistic Integration (SPOT), evaluating research that bridges cutting-edge AI with real-world systems.

We spoke with her about why most ML systems never actually learn from their mistakes, how editorial overrides became the secret weapon at Microsoft, and what it takes to build feedback loops that turn human corrections into automatic model improvements.

Pratyusha, most teams track model accuracy and assume that is enough. What actually tells you whether an ML system is improving over time, and where do most teams miss the signal entirely?

Thank you for having me. In my experience, teams fixate on model accuracy as their primary metric, often chasing 95% or 99% on static test sets. But accuracy is a lagging indicator. It tells you how the model performed on data from last month. It does not tell you whether the model is learning from today's mistakes.

The real signal is whether corrections made by humans feed back into retraining. A model could be 95% accurate, but if that 5% of mistakes never makes it back to the data science team, the model never evolves. It just repeats the same wrong answers forever. I have seen teams celebrate 99% accuracy while their model drifts further from production reality every week. They are measuring the wrong thing.

What separates self-improving systems from stagnant ones is a simple question: when a reviewer corrects a model's mistake, where does that correction go? If the answer is nowhere, you do not have a feedback loop. You have a treadmill.

You mentioned building an editorial path alongside automated ingestion at Microsoft. How did that work for the Bing knowledge graph, and what made it faster than waiting for engineering fixes?

At Microsoft, I worked on the knowledge graph that powered Bing. We had automated ingestion pipelines pulling from hundreds of sources. When they broke, they broke in strange ways. A source would change its schema overnight. A partial refresh would corrupt a subset of entities. Rules written six months earlier could not anticipate a new type of relationship.

The wrong response would have been treating every failure as a pipeline bug requiring engineering time to fix. That does not scale. Instead, we built an editorial path alongside automated ingestion. When ingestion failed for a specific entity or relationship, editors could intervene directly. They would fix a misclassified entity or add a missing relationship. The correction took effect immediately.
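
To make the pattern concrete, here is a minimal sketch of how such an override layer can work. The names (KnowledgeStore, apply_override) are illustrative, not the actual Bing implementation; the point is that corrections live in their own store and take precedence over pipeline output at read time, so a fix is live the moment an editor saves it.

```python
from typing import Optional

class KnowledgeStore:
    """Illustrative store with an editorial path alongside automated ingestion."""

    def __init__(self) -> None:
        self.ingested: dict = {}   # entity_id -> record from automated ingestion
        self.overrides: dict = {}  # entity_id -> editor-corrected record

    def apply_override(self, entity_id: str, corrected: dict, editor: str) -> None:
        # An editor's fix takes effect immediately; no pipeline redeploy.
        self.overrides[entity_id] = {**corrected, "_edited_by": editor}

    def get(self, entity_id: str) -> Optional[dict]:
        # Editorial corrections win over automated ingestion at read time.
        return self.overrides.get(entity_id, self.ingested.get(entity_id))
```

The same shape works for relationships as well as entities: the override store sits beside the pipeline output rather than inside it, so nothing has to be rolled back.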

This had two benefits. First, user impact dropped to nearly zero. Instead of waiting days for a pipeline fix, editors resolved the issue in minutes. Second, engineering teams gained breathing room. They could diagnose root causes without a live incident burning down around them. No rollbacks, no emergency deploys. Just clean, auditable corrections.

Critically, those editorial corrections also gave us a signal. We could see which parts of the pipeline generated the most overrides. That data shaped engineering priorities. We did not have to guess where the system was weakest. The editorial path told us.

What made that editorial path different from a typical override system, and why do most teams fail to build something similar for their ML workflows?

The difference is how the override is treated. In most systems, an override is an exception. It sits outside the normal data flow. It fixes the immediate problem, but then it is forgotten. The system learns nothing.

In our design, overrides were first-class data. They were logged with full context: what the model predicted, what the correct answer was, which version of the model made the prediction, who made the override, and when. That data fed into dashboards that showed exactly where the pipeline was failing most often.
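
A minimal sketch of what logging an override as first-class data might look like, assuming a simple in-memory record; the field names are illustrative rather than the actual Microsoft schema:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class OverrideRecord:
    entity_id: str
    predicted: str        # what the model said
    corrected: str        # what the reviewer said instead
    model_version: str    # which model version made the prediction
    reviewer: str         # who made the override
    created_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )

override_log: list[OverrideRecord] = []

def log_override(record: OverrideRecord) -> None:
    # Append-only: the log doubles as a failure heat map for dashboards
    # and as raw material for the next training set.
    override_log.append(record)
```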

Most teams fail to build this because they treat feedback as an afterthought. They build the model, build the inference pipeline, then realize they need a way to collect corrections. By then, the data structures are locked, the logging is an afterthought, and adding feedback requires rearchitecture. I have seen this pattern repeat across multiple organizations. The teams that succeed are the ones that design the feedback loop before they write a single line of model inference code.

Your recent work has involved ML-powered pipelines that surface concepts across two levels of resolution: fine-grained inference units and higher-order entity aggregations. Can you walk us through how you designed human review into that workflow without creating a bottleneck?

The model emits raw outputs per input unit, each with a probability score. Humans cannot review these directly at scale. I built a translation layer that structures model outputs into reviewable evidence units before surfacing them to reviewers.

The key design decision was that algorithmic suggestions cannot advance until they survive human validation. Every ML-generated evidence unit meeting a threshold criterion is reviewed. The reviewer can accept, reject, or modify. That decision is logged. The system tracks exactly which data points originated from the model versus which were corrected by a human.
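
A rough sketch of such a review gate, with hypothetical names; the key properties are that nothing advances without a decision, and that every decision, including the provenance of the resulting data point, is logged:

```python
from enum import Enum
from typing import Optional

class Decision(Enum):
    ACCEPT = "accept"
    REJECT = "reject"
    MODIFY = "modify"

decision_log: list[dict] = []

def review(unit: dict, decision: Decision, reviewer: str,
           corrected_label: Optional[str] = None) -> Optional[dict]:
    """Return the unit that advances downstream, or None if rejected."""
    decision_log.append({
        "unit_id": unit["id"],
        "model_label": unit["label"],
        "decision": decision.value,
        "corrected_label": corrected_label,
        "reviewer": reviewer,
    })
    if decision is Decision.REJECT:
        return None
    if decision is Decision.MODIFY:
        # Human correction: provenance flips from model to human.
        return {**unit, "label": corrected_label, "source": "human"}
    return {**unit, "source": "model"}  # accepted as-is
```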

To prevent bottlenecks, we built three things. First, annotation caching snapshots ML outputs when a task becomes ready, so the reviewer works against a stable snapshot regardless of upstream activity. Second, we hid ML confidence scores from reviewers to prevent anchoring bias; that forced them to evaluate the evidence itself rather than defer to the model. Third, we built a priority queue so time-sensitive work advances ahead of lower-priority items. The system never treats all inputs equally.
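
Here is an illustrative sketch of the first and third mechanisms, a frozen snapshot plus a priority queue, under assumed data shapes; it also strips confidence scores before the reviewer sees anything, per the second mechanism:

```python
import copy
import heapq
import itertools

_order = itertools.count()          # tie-breaker keeps heap ordering stable
review_queue: list = []

def enqueue_task(ml_outputs: list, priority: int) -> None:
    # Annotation caching: freeze a deep copy so later upstream runs
    # cannot mutate what the reviewer is looking at.
    snapshot = copy.deepcopy(ml_outputs)
    # Hide confidence scores to avoid anchoring on the model's certainty.
    for unit in snapshot:
        unit.pop("confidence", None)
    # Lower priority number = more urgent.
    heapq.heappush(review_queue, (priority, next(_order), snapshot))

def next_task() -> list:
    # Time-sensitive work is always popped ahead of lower-priority items.
    _, _, snapshot = heapq.heappop(review_queue)
    return snapshot
```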

Organizations implementing continuous feedback loops across the ML lifecycle achieve 2x to 3x faster model iteration cycles compared to those without automated feedback mechanisms. Our infrastructure operated on exactly this principle.

Once reviewers made corrections, what actually happened to that data? How did it flow back to the data science team and into model retraining?

Every correction became training signal for the next model version. If a reviewer rejected a prediction at a specific decision point, that rejection was logged as a negative example. If they modified a label, that became a corrected training example.

Data science teams received a structured feed of all corrections, segmented by concept type, reviewer, and confidence bucket, which made it easy to identify where the model struggled and which reviewers produced the most reliable signal. This enabled significant growth in training data and measurable improvement in the model over time.
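
A minimal sketch of how such a feed might be assembled from logged review decisions. It reuses the decision-log shape from the earlier sketch and assumes two extra illustrative fields, concept_type and confidence, that the real system would carry through:

```python
from collections import defaultdict

def build_training_feed(decision_log: list) -> dict:
    """Group corrections by (concept_type, reviewer, confidence bucket)."""
    feed = defaultdict(list)
    for d in decision_log:
        if d["decision"] == "reject":
            # Rejection becomes a negative example for retraining.
            example = {"unit_id": d["unit_id"],
                       "label": d["model_label"], "is_correct": False}
        elif d["decision"] == "modify":
            # Modification becomes a corrected training example.
            example = {"unit_id": d["unit_id"],
                       "label": d["corrected_label"], "is_correct": True}
        else:
            continue  # accepted predictions need no new signal here
        bucket = "high" if d.get("confidence", 0.0) >= 0.8 else "low"
        feed[(d.get("concept_type", "unknown"), d["reviewer"], bucket)].append(example)
    return dict(feed)
```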

New algorithm versions were deployed and picked up automatically for future ingestions. No manual intervention, no redeployment ceremonies. The more the system ran, the better it got.
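
One plausible way to get that behavior, sketched with hypothetical registry entries: each ingestion run resolves the latest approved version at startup rather than pinning a model, so a newly promoted model is picked up automatically.

```python
MODEL_REGISTRY = {
    "concept-extractor": [
        {"version": "1.4.0", "status": "approved", "uri": "s3://models/ce-1.4.0"},
        {"version": "1.5.0", "status": "approved", "uri": "s3://models/ce-1.5.0"},
        {"version": "1.6.0", "status": "candidate", "uri": "s3://models/ce-1.6.0"},
    ]
}

def resolve_model(name: str) -> str:
    """Return the URI of the newest approved version; candidates are skipped."""
    approved = [m for m in MODEL_REGISTRY[name] if m["status"] == "approved"]
    newest = max(approved, key=lambda m: tuple(map(int, m["version"].split("."))))
    return newest["uri"]

# Each run calls resolve_model("concept-extractor") at startup, so the
# moment 1.6.0 is promoted to "approved", future ingestions use it.
```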

What was the measurable impact of that system in production?

Consuming ML suggestions unlocked the ability to capture granular structured data at a scale that manual processes could not reach. For analysts, this meant working against a scoped set of suggestions rather than processing raw inputs end to end, a significant reduction in per-unit review time. The feedback loop made the system progressively more accurate, further reducing analyst workload over time.

You have worked on knowledge graphs and ML-powered data pipelines across major tech companies. What is the single most important lesson that applies across both, and how should leaders think differently about building AI infrastructure today?

The core lesson is that feedback cannot be an afterthought. Whether you are ingesting web-scale entity data or running ML models on structured inputs, the same principles apply: traceable decisions, reversible actions, automated feedback collection.

AI systems add complexity because outputs are probabilistic, not deterministic. But the foundation is still systems engineering. Without strong infrastructure, even the most advanced algorithms cannot deliver reliable outcomes. A great model running on bad infrastructure will fail every time. I have seen 99% accurate models become useless in production because the infrastructure around them could not capture the 1% of failures.

Leaders should not just ask about model accuracy; they should also ask about feedback loops. How does the system learn from its mistakes? Where do corrections go? Who reviews the outputs? How long does it take for a correction to become a training example? Those questions determine whether an ML system improves over time or stagnates. Model accuracy is a vanity metric. Feedback loop completeness is a survival metric.

Research on telemetry-driven auto-tuning for AI cloud infrastructure demonstrates that closed-loop feedback systems can achieve up to 40% improvement in GPU efficiency and a 30% reduction in job duration compared to fixed resource allocations.

Singaraju's infrastructure at Netflix spans all eligible data at scale, collecting granular, structured metadata that improves data discoverability while generating the labeled examples needed to make the next model version more accurate.

Her perspective is further shaped by serving on the program committee for the AGI 26 conference, where she evaluates cutting-edge research on artificial general intelligence and its implications for real-world systems.

The work of building self-improving ML infrastructure rarely receives the attention given to new model architectures or flashy demos. Yet its impact is measured in what does not happen: errors not repeated, corrections not lost, models not drifting. As AI systems become more embedded in critical workflows, the ability to build feedback loops that turn human judgment into automatic improvement will define the next standard of ML engineering.

Feedback loops are not optional. Built in from the start, they let the system improve on its own. Left for later, they force the team to eventually rebuild everything just to add them. Singaraju has seen both. One is much easier.
