Gift cards look simple to the person using them. Type a code, see a balance, move on. Underneath, they behave like money, and the systems behind them have to be correct every time, in every region, under real load. That load is not theoretical. In a National Retail Federation survey, gift cards were among the top gifts shoppers had already picked up, cited by 27% of respondents.
Saumya Tyagi is a Staff Engineer at Coupang and a software engineering leader with over a decade of experience building and scaling distributed systems at Google, Amazon, and Coupang. He also serves on the editorial board of the International Journal of AI, Big Data, Computational and Management Studies, a role that reinforces a simple standard: if a system cannot be explained clearly, it usually cannot be operated safely. In this interview, Saumya shares practical lessons from high traffic platforms, using one high stakes project as a concrete case study.
Saumya, thanks for joining us. When you say “resilient distributed systems at scale,” what does resilience actually mean in practice?
Resilience is the ability to stay correct and available while the system is under pressure, and while it is changing. At scale, you do not get to choose between performance, reliability, and governance. You design so the system stays dependable, the data stays trustworthy, and the operational story stays clear when something goes wrong or when you need to evolve the architecture.
For readers outside infrastructure, what is your elevator pitch for the kind of systems you build?
I build the systems that keep large digital experiences fast, reliable, and efficient. The work is usually invisible when it is done well, but it shapes whether customers can complete basic actions without friction and whether teams can operate services without constant firefighting.
What is the big problem your work tries to solve, and who benefits most from solving it?
The problem is scaling complex distributed systems so they remain reliable and cost efficient while supporting high growth, mission critical operations. End users benefit because they get services that behave predictably. Organizations benefit because they can grow without being trapped by technical debt and operational overhead.
You led a major project in 2017 to 2018 to modernize Amazon’s global gift card claim code storage. What was the system responsible for?
It stored and served gift card claim codes globally across four regions: North America, Europe, Japan, and China. The system held around 2 billion claim codes representing about $50 billion in financial value, and it served roughly 500 transactions per second in writes and about 1,000 to 2,000 transactions per second in reads. Those claim codes function as digital currency, so security, availability, and integrity are not optional.
What made this initiative especially difficult compared to a typical migration?
The existing system was tightly coupled to an Oracle backed monolithic service, and dozens of dependent services assumed that behavior. At the same time, the data had to meet strict financial compliance requirements around secure storage, auditability, fraud prevention, and access control. The scale was global, and the migration had to be done with zero downtime and without rewriting a large number of downstream systems.
Instead of building a new service from scratch, you extended Amazon’s Promotions Platform. Why was that the right move?
Building a new service would have increased risk and time because it would have required rewriting extensive logic across multiple codebases that were coupled to the existing Oracle patterns. I looked for a solution that could meet the compliance and availability bar while reducing the amount of new surface area we introduced. The Promotions Platform already supported large scale code storage and validation, so the path was to generalize it to support gift card claim code semantics, including single use behavior and the financial sensitivity of the data.
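To make the single use semantic concrete: the core invariant is that a code flips from issued to claimed exactly once, no matter how many requests race. The interview does not describe Amazon's actual implementation, but a minimal sketch of the pattern, assuming a DynamoDB-style store and a hypothetical claim_codes table with a status attribute, could look like this:

```python
import boto3
from botocore.exceptions import ClientError

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("claim_codes")  # hypothetical table name

def claim_code(code: str, account_id: str) -> bool:
    """Atomically flip a claim code from ISSUED to CLAIMED.

    The ConditionExpression is what makes the code single use: if two
    requests race, the store accepts exactly one write and the loser
    sees a ConditionalCheckFailedException.
    """
    try:
        table.update_item(
            Key={"code": code},
            UpdateExpression="SET #s = :claimed, claimed_by = :acct",
            ConditionExpression="#s = :issued",
            ExpressionAttributeNames={"#s": "status"},  # "status" is reserved
            ExpressionAttributeValues={
                ":claimed": "CLAIMED",
                ":issued": "ISSUED",
                ":acct": account_id,
            },
        )
        return True
    except ClientError as e:
        if e.response["Error"]["Code"] == "ConditionalCheckFailedException":
            return False  # already claimed, or the code does not exist
        raise
```

Pushing the race into a single conditional write keeps the invariant in the datastore rather than in application-level locks, which is what lets the pattern survive multi-region read and write traffic.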
How did you execute the migration without downtime at that scale?
The approach was incremental and controlled. New claim codes were generated through the enhanced platform, legacy codes were progressively backfilled, and read traffic was routed across both systems until full cutover. We used a configuration controlled dual read and dual write strategy so we could transition safely without disrupting customers or dependent services.
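As an illustration of what a configuration controlled dual read and dual write wrapper can look like; the store clients, flag names, and routing rules below are assumptions made for the sketch, not the production design:

```python
from dataclasses import dataclass

class InMemoryStore:
    """Stand-in for a real datastore client (legacy Oracle or new NoSQL)."""
    def __init__(self):
        self._data = {}
    def put(self, code, record):
        self._data[code] = record
    def get(self, code):
        return self._data.get(code)

@dataclass
class MigrationConfig:
    # In production these flags would come from a dynamic configuration
    # service so routing can change without a deploy.
    dual_write: bool = True          # mirror writes into the legacy store
    read_from_new: bool = False      # prefer the new store on reads
    fallback_to_legacy: bool = True  # fall back when the new store misses

class ClaimCodeStore:
    def __init__(self, legacy, new, config: MigrationConfig):
        self.legacy, self.new, self.config = legacy, new, config

    def write(self, code, record):
        # The new platform owns new codes; the mirrored legacy write
        # keeps dependent services working until cutover.
        self.new.put(code, record)
        if self.config.dual_write:
            self.legacy.put(code, record)

    def read(self, code):
        if self.config.read_from_new:
            record = self.new.get(code)
            if record is not None or not self.config.fallback_to_legacy:
                return record
        return self.legacy.get(code)  # legacy serves until backfill completes

store = ClaimCodeStore(InMemoryStore(), InMemoryStore(), MigrationConfig())
store.write("GC-EXAMPLE", {"status": "ISSUED"})
assert store.read("GC-EXAMPLE") == {"status": "ISSUED"}
```

Flipping the flags in stages mirrors the sequence described above: dual write first, then reads that prefer the new store with a legacy fallback while backfill completes, then full cutover and retirement of the legacy path.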
A lot of companies are trying to modernize core systems at once. How did the broader market reality shape your decision making?
Modernization pressure is becoming the default. Gartner forecasts worldwide public cloud end user spending will total $723.4 billion in 2025, and that kind of investment level pushes teams to move faster while still being accountable. In that environment, reuse and consolidation can be a disciplined strategy. You reduce risk by improving a proven platform instead of inventing a new one, as long as you do the governance work to make the platform fit the problem.
What helps you tell the difference between an engineering approach that is merely impressive and one that is genuinely dependable in production?
It comes down to evidence and clarity. I look for what changed, which trade-offs were made explicit, and what proof exists that the system behaves correctly under real conditions, not just in a design review. In my role as a judge for the Business Intelligence Group, I am trained to evaluate outcomes based on measurable impact and explainable decisions. That discipline carries directly into distributed systems work, because at scale you need designs you can defend with controls, metrics, and operational traceability, not designs that only look elegant on paper.
What was the impact once the project was complete?
Amazon unified the gift card claim code and employee discount code systems into a single NoSQL backed, financially compliant platform. The migration enabled Amazon to retire two legacy services and consolidate claim code creation, storage, and validation into one scalable system. It also reduced engineering load across teams, improved operational efficiency, and removed a major bottleneck that had constrained regional independence and service growth.
When systems represent stored value, security always becomes part of the conversation. How do you frame that risk?
I treat security and auditability as first order requirements. IBM reports the global average cost of a data breach is $4.4 million. When you are operating systems that store financial value, the cost of getting governance wrong is not just financial; it is a loss of trust. That is why you design for encryption, access control, and auditability from the start, and you build migration plans that preserve correctness while the system evolves.
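One way to make "auditability from the start" concrete, sketched here as a generic pattern rather than the system's actual design: every state change emits an append-only audit record, and the raw code never appears in it, only a salted hash, so the trail can prove what happened without itself becoming stored value:

```python
import hashlib, time, uuid

def hash_code(raw_code: str, salt: bytes) -> str:
    """Derive a loggable identifier; the raw claim code never leaves the hot path."""
    return hashlib.sha256(salt + raw_code.encode()).hexdigest()

def audit_event(actor: str, action: str, code_hash: str) -> dict:
    """Build an append-only audit record for a claim code state change."""
    return {
        "event_id": str(uuid.uuid4()),  # unique, so records can be deduplicated
        "ts": time.time(),
        "actor": actor,                 # service or operator identity
        "action": action,               # e.g. "CLAIM", "BACKFILL", "VOID"
        "code_hash": code_hash,         # never the raw claim code
    }

event = audit_event("claims-service", "CLAIM",
                    hash_code("GC-EXAMPLE", b"per-env-salt"))
```

The point of the sketch is the discipline, not the exact fields: if every mutation produces a record that can be attributed and reviewed after the fact, migrations like the one above stay explainable long after they ship.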
How do you use writing and explanation as part of your engineering practice, especially when you are working on systems that have to stay reliable under real load?
Writing forces you to make the system legible. When you can explain scale mechanics clearly, you usually design them more cleanly too, because the gaps show up immediately. I have published engineering work on HackerNoon, including “Building a Distributed Timer Service at Scale: Handling 100K Timers Per Second,” and the reason I write is the same reason I build: predictable behavior under pressure. Whether it is a timer service or a claim code system, reliability comes from disciplined design choices and migration strategies that treat correctness as the main feature.
If you had to leave readers with the “lessons from high traffic platforms” from this story, what would they be?
First, treat correctness as the performance feature, especially when the data represents value. Second, migrations should be engineered as controlled transitions, not as single events, so customers do not experience the churn behind the scenes. Third, scaling is not only about throughput; it is about clarity: clear ownership, auditable behavior, and platforms that reduce complexity instead of multiplying it. When those are true, the system earns the right to grow.