Much of the public conversation around artificial intelligence still gravitates toward model size, benchmark gains, and new algorithmic techniques. In production systems, that framing only captures part of the problem. The harder constraint often sits below the model layer: whether the infrastructure can serve increasingly complex predictions within strict latency, throughput, and reliability budgets. Stanford HAI's 2025 AI Index underscored that reality from the broader market side, noting steep declines in inference cost and steady gains in hardware performance and energy efficiency, developments that are pushing more ambitious AI workloads into production.
Brooke Xiaoxi Bian works at that infrastructure layer. A staff software engineer and technical lead, she has built large-scale ranking and personalization systems designed to operate under extreme production constraints, where model quality alone is never enough. Her work spans product-aware ad ranking, large-scale retrieval, and GPU-accelerated inference, all in environments where a few milliseconds can determine whether a system is commercially viable. She is also a co-author of Closing the Online-Offline Gap: A Scalable Framework for Composed Model Evaluation, presented at RecSys 2025, which describes a production-aware evaluation framework that improved correlation with top-line results by up to 18%.
In this interview, Bian explains why the next phase of AI personalization will be shaped as much by systems architecture as by model design, and why GPUs are changing what ranking systems can realistically do in production.
Most people describe AI progress as a model problem. From your perspective, what is actually limiting real-world AI systems today?
Honestly, I think the biggest challenge in real-world AI isn't just about building better models anymore; it's about whether our infrastructure can actually keep up. You can have the most impressive model offline, but once you try to put it into production, where it needs to respond in tens of milliseconds, handle massive traffic, and stay reliable even as demand shifts, the real test begins. If the serving path, memory movement, retrieval pipeline, or cost envelope can't support it, that model just isn't going to make it.
This is especially true for personalization systems. These aren't simple, one-off inference jobs. They're complex, tightly-coupled decision systems that need to pull candidates, score them, combine a bunch of signals, and make a decision instantly. In those cases, the real bottleneck is almost always how compute, latency, and systems design interact, not the model itself. That's why production-aware evaluation has become so important in large-scale recommendation systems.
A lot of folks in the industry still talk as if a better algorithm automatically means a better product. But in reality, what matters is whether the whole system can absorb those algorithmic gains without blowing past the latency, hardware, or reliability budgets. That's where I see most real-world AI systems hitting their limits.
At internet scale, what tends to break first when those production limits get pushed in live ranking systems?
At production scale, the first thing to break is usually not model accuracy. It is the production budget surrounding the model. Latency expands, memory traffic spikes, candidate evaluation becomes too expensive, or the serving architecture cannot keep up with the number of decisions the system needs to make per second. Once that happens, teams start backing off complexity even if the model itself is better.
That constraint becomes more severe in ranking systems because the model is only one stage inside a much larger pipeline. Retrieval, candidate pruning, ranking, and downstream utilization still have to work together under very tight timing windows. In my Product-Centric Ranking work, that environment required a multi-stage architecture that could support roughly 100× traffic growth, operate against a corpus of about 3 billion items, handle production throughput around 400 million QPS, and still target latency around 35 milliseconds. In settings like that, architectural discipline becomes non-negotiable. It is not a problem that can be solved by dropping in a larger model and hoping the rest of the system adapts around it.
According to Deloitte's 2026 technology predictions report, inference workloads now account for roughly two-thirds of all AI compute, up from a third in 2023, reflecting how dramatically the pressure on production infrastructure has grown. That shift does not remove the engineering problem. It intensifies it, forcing teams to redesign the stack underneath to keep pace.
Beyond raw compute speed, what specifically made CPU-based systems a structural bottleneck for modern ranking and personalization?
Modern ranking systems rely heavily on GPUs rather than CPUs precisely because of the compound costs that accumulate at scale. Ranking systems are dominated by embedding movement, repeated scoring across large candidate pools, and the operational need to keep latency predictable even during heavy traffic. When models become substantially more expressive, those costs grow faster than many traditional CPU-based serving paths can tolerate. That creates a ceiling on what can be deployed even when the research case for a better model is strong.
My work on product-centric ranking reflects that exact pressure. In environments where the system must retrieve personalized candidates from billions of products, rerank them, and inject product-level signals into ad ranking in real time, the architecture itself determines what is feasible. That is why the system relied on staged retrieval, lightweight reranking, and clustering rather than brute-force expansion. The lesson, at least from what I have seen, is broader than any one project: once latency is a hard production requirement, system design becomes part of the model design.
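To make that staged structure concrete, here is a minimal sketch of the general pattern rather than Bian's production system: the function names, cluster layout, candidate counts, and toy dot-product scoring are placeholders, and a real deployment would back each stage with an ANN index and learned models.

```python
import heapq
from typing import Dict, List, Tuple

Vector = List[float]

def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))

def retrieve_clusters(query: Vector,
                      centroids: Dict[int, Vector],
                      top_clusters: int = 2) -> List[int]:
    """Stage 1: cheap scoring against cluster centroids, not individual items."""
    scored = ((dot(query, c), cid) for cid, c in centroids.items())
    return [cid for _, cid in heapq.nlargest(top_clusters, scored)]

def prune(cluster_ids: List[int],
          members: Dict[int, List[Tuple[int, Vector]]],
          per_cluster: int = 2) -> List[Tuple[int, Vector]]:
    """Stage 2: keep only a short slate of items from each selected cluster."""
    return [item for cid in cluster_ids for item in members[cid][:per_cluster]]

def rerank(query: Vector,
           slate: List[Tuple[int, Vector]],
           k: int = 3) -> List[int]:
    """Stage 3: spend the expensive scoring only on the pruned slate."""
    scored = ((dot(query, emb), item_id) for item_id, emb in slate)
    return [item_id for _, item_id in heapq.nlargest(k, scored)]

# Tiny end-to-end run with made-up data.
centroids = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [0.7, 0.7]}
members = {cid: [(cid * 10 + i, [c + 0.1 * i for c in vec]) for i in range(5)]
           for cid, vec in centroids.items()}
query = [0.9, 0.4]
print(rerank(query, prune(retrieve_clusters(query, centroids), members)))
```

The shape is the point: the expensive final scoring only ever sees a slate that the earlier, cheaper stages have already narrowed, which is what keeps the whole pipeline inside a fixed latency budget as the corpus grows.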
How does GPU-based inference change what ranking systems can actually do in production?
I think GPU inference changes the conversation from incremental tuning to capability expansion. Once the serving architecture is rebuilt for accelerators, the system can support classes of models that were previously impractical in production. That does not mean every ranking problem automatically becomes easier. It means the feasible design space becomes much larger.
In a recent GPU-accelerated product ranking effort that I oversaw, that shift enabled roughly 1000× more model compute and about 100× more candidate scaling, while also reducing embedding fetch latency from more than 10 milliseconds on CPU paths to under 2 milliseconds on GPU. The system also scaled candidate evaluation to around 25 million products per second per GPU host. Those numbers matter because they change what the model can afford to know before a ranking decision is made. Instead of trimming ambition to fit an older serving path, teams can begin redesigning ranking logic around much more expressive inference.
What is important here is that the GPU does not act as a standalone speedup switch. It requires architectural changes around data movement, candidate management, memory efficiency, and model structure. In our work, that included latent encoding to reduce embedding dimensionality and cluster-first ranking to keep compute under control as candidate coverage expanded. That is the real pattern. GPU adoption in ranking systems is not just a hardware swap. It is a systems rewrite.
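As a rough illustration of those two ideas, latent encoding and cluster-first scoring, the sketch below compresses embeddings with a linear projection and scores candidates in large per-cluster batches on the accelerator. The dimensions, the projection, and the dot-product scoring are assumptions made for the example, not the production design.

```python
import torch

# Illustrative only: compress item embeddings into a smaller latent space,
# then score candidates cluster by cluster in batched matmuls so that
# memory traffic and per-request compute stay bounded.

device = "cuda" if torch.cuda.is_available() else "cpu"

FULL_DIM, LATENT_DIM = 256, 64   # hypothetical sizes
project = torch.nn.Linear(FULL_DIM, LATENT_DIM, bias=False).to(device)

@torch.no_grad()
def encode(embs: torch.Tensor) -> torch.Tensor:
    """Latent encoding: in practice precomputed offline for the item corpus."""
    return project(embs.to(device))

@torch.no_grad()
def score_cluster(query_latent: torch.Tensor,
                  cluster_latents: torch.Tensor,
                  k: int = 128) -> torch.Tensor:
    """Score one cluster's candidates in a single matmul and keep the top-k."""
    scores = cluster_latents @ query_latent            # (n_items_in_cluster,)
    return torch.topk(scores, min(k, scores.numel())).indices

# Toy corpus: 100k items split into clusters of 10k.
item_latents = encode(torch.randn(100_000, FULL_DIM))
query_latent = encode(torch.randn(1, FULL_DIM)).squeeze(0)
per_cluster_topk = [score_cluster(query_latent, item_latents[i:i + 10_000])
                    for i in range(0, 100_000, 10_000)]
```

Even in this toy form, the pattern matches the answer above: the projection shrinks what has to move through memory, and cluster-level batching keeps the per-request compute predictable as candidate coverage expands.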
How do you balance model complexity, latency, and cost without compromising production usefulness?
For me, the balance starts with accepting that no single metric can govern the system. Teams often over-optimize for model lift in isolation, then discover that the deployment cost or latency inflation erases the practical value of that gain. In production ranking, the right question is not whether a model is better in a vacuum. It is whether the improvement survives the entire serving pathway.
That is why decomposition matters so much. In my work, the architecture separates retrieval, reranking, and downstream utilization so each stage can spend compute where it adds the most value. Retrieval narrows the space. Reranking focuses on the candidates that justify more expensive scoring. Clustering reduces waste in repeated operations. On the GPU side, compression and serving-aware model design make it possible to scale complexity without letting memory bandwidth or latency dominate the system.
My article on composed model evaluation reinforces the same idea from another angle. The framework described in the RecSys 2025 paper measures model behavior in a production-like environment where multiple predictions are recomposed into final business outcomes, rather than treating standalone prediction quality as the final answer. That approach improved correlation with top-line results by up to 18%, which is significant because it brings offline development closer to what the live system actually values.
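The RecSys 2025 paper spells out the actual framework; the snippet below is only a schematic of the underlying idea, with hypothetical model names, a simplified composition function, and a plain correlation check standing in for the paper's methodology.

```python
import numpy as np

# Schematic of composed evaluation: instead of judging each model alone,
# recompose their predictions the way the serving stack would and compare
# that composed score against the online outcome.

rng = np.random.default_rng(0)
n = 10_000  # hypothetical logged requests

# Per-model offline predictions (names and composition are illustrative).
p_click = rng.uniform(0.001, 0.2, n)      # predicted click probability
p_convert = rng.uniform(0.001, 0.1, n)    # predicted conversion given click
bid = rng.uniform(0.1, 5.0, n)            # advertiser bid

# Standalone metric: evaluate one prediction in isolation.
standalone_score = p_click

# Composed metric: recombine predictions into the quantity the auction
# and the business actually act on (expected value here, as a stand-in).
composed_score = p_click * p_convert * bid

# Simulated online outcome that depends on the full composition.
online_value = composed_score * rng.lognormal(0.0, 0.3, n)

def corr(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.corrcoef(a, b)[0, 1])

print("standalone vs online:", corr(standalone_score, online_value))
print("composed   vs online:", corr(composed_score, online_value))
```

Because the simulated online value here is built from the composed score, the gap is exaggerated by construction; the point is only to show where the comparison happens, not to reproduce the 18% result reported in the paper.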
The deeper point, in my view, is that mature AI engineering is becoming less about chasing isolated model gains and more about preserving useful gains through the full decision stack. That is where cost, latency, and quality stop being competing objectives and start becoming one design problem.
What does this infrastructure shift mean for the future of AI personalization?
It means the next advances in personalization will come from teams that treat infrastructure as a first-order product problem. Better models still matter. They will continue to matter. But at production scale, the systems that win will be the ones that can operationalize richer intelligence without collapsing under their own computational weight. By early 2026, inference workloads were already consuming over 55% of AI-optimized infrastructure spending, with projections pointing toward 70–80% of total AI compute costs by year end, according to industry analysis from Unified AI Hub, a trajectory that signals how rapidly the center of gravity in AI has shifted from building models to running them.
That is the transition Brooke Xiaoxi Bian's work helps illustrate. Product-centric ranking showed why system architecture had to be redesigned before product-level relevance could be introduced meaningfully into live ranking. GPU-accelerated ranking showed what becomes possible once the serving layer stops forcing the model to stay small. Together, those efforts point to the same conclusion: the future of AI personalization will not be determined only by who builds the strongest model. It will be determined by who builds the infrastructure that makes stronger models usable in real time.
In other words, the next major gains in AI personalization may look like algorithmic progress from the outside. Underneath, they are just as likely to be infrastructure breakthroughs.