When a backbone system goes down, the cost is rarely limited to IT overtime. One widely cited benchmark pegs the average cost of downtime at $540,000 per hour, which is why resilience work only counts when it survives the day something breaks for real.
Vamshi Krishna Jeksani, a Senior Cloud Solutions Architect at a leading global cloud services provider, has spent years pressure-testing what HA and DR look like when the stakes are real. He brings an evaluator’s discipline to that work as a Peer Papers reviewer for the SARC manuscript entitled “Enhancing SAP S/4HANA and Salesforce Quality Assurance & Quality Control through AI Driven Test Automation Frameworks for Regulatory Compliance.” He has seen resilience work in two very different shapes: the complexity of a large North American automotive enterprise’s heterogeneous SAP landscape, and the pace of a global enterprise data center exit program, where SAP ECC and eleven satellite applications had to move to AWS inside a fixed 16-month window. Those experiences, one defined by scale and one defined by the clock, are what shaped the automation-first playbook he shares here.
What ties those efforts together is not a single architecture diagram. It is a repeatable operating playbook: define what must be standardized, document what must be tailored, and prove the whole system through drills and automation until the results stop depending on who is on call.
What does “automation-first HA and DR” actually mean in your day-to-day work?
It means you do not treat recovery as a separate project that starts after migration. You bake it into the way you plan, build, and validate from the beginning.
For this large-scale automotive enterprise, the first reality check was complexity. The landscape spanned SAP ECC, BW, PO, and S/4HANA, across multiple databases and operating systems, and across three countries. If you try to handle that with one-off procedures, you end up with fragile recovery that looks fine on paper and fails in the details.
For the data center exit program, the pressure looked different. The constraint was time. They had a fixed window to move SAP ECC and eleven connected SAP applications, while improving security, availability, and operational resilience. Under that kind of deadline, manual work is not just slower. It is unpredictable. Automation-first means you remove uncertainty early, so recovery behavior does not become a last-minute guess.
You talk a lot about a “repeatable playbook.” What are the pieces you insist on every time?
I treat the playbook as a set of artifacts that a team can execute without improvising. First, you translate business goals into practical recovery behaviors. Not just targets in a document, but what fails over first, what must be validated, and what the team checks before they declare success.
Second, you write down the steps in a way that survives handoffs. When teams are under stress, they do not need to be clever. They need to be clear. That is why I push for runbooks that specify ownership, sequencing, and validation, plus the guardrails that prevent people from skipping the steps that matter.
Third, you prove it with drills. A playbook is not a promise until it has been rehearsed. That is where you find the gaps that only show up when dependencies fail in the wrong order, or when a small configuration detail changes the system’s behavior.
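To make that concrete, here is a minimal runbook-as-code sketch in Python. It is an illustration of the pattern, not actual tooling from either program: every step carries a named owner, steps execute in a fixed sequence, and a validation gate must pass before the drill moves on. The step names, owners, and checks are hypothetical placeholders.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class RunbookStep:
    name: str                      # what the step does
    owner: str                     # who is accountable during the drill
    execute: Callable[[], None]    # the action itself
    validate: Callable[[], bool]   # the check that must pass before moving on

def run_drill(steps: list[RunbookStep]) -> None:
    """Execute runbook steps in order; halt at the first failed validation."""
    for step in steps:
        print(f"[{step.owner}] executing: {step.name}")
        step.execute()
        if not step.validate():
            # Guardrail: a failed check stops the drill instead of being skipped.
            raise RuntimeError(f"validation failed at '{step.name}' (owner: {step.owner})")
        print(f"[{step.owner}] validated: {step.name}")

# Hypothetical two-step drill: promote the DR database, then repoint the app tier.
drill = [
    RunbookStep("promote DR database", "dba-oncall",
                execute=lambda: None,    # placeholder for the real failover call
                validate=lambda: True),  # placeholder for a replication-status check
    RunbookStep("repoint application tier", "basis-oncall",
                execute=lambda: None,
                validate=lambda: True),
]
run_drill(drill)
```

The structural point is the guardrail: because validation is part of the step itself, skipping it under pressure is no longer an option someone can quietly take.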
This mindset also shows up in how I engage with broader technical evaluation. I served as a Peer Papers reviewer for the manuscript entitled “Accelerating SAP Fiori Development with SAP Business Application Studio and BTP Services,” and I take the same stance in delivery: claims are only useful when the evidence matches the claim. A similar perspective carried into my role as a judge for the Dallas Regional Science and Engineering Fair 2026, where evaluating projects across disciplines reinforced the importance of clarity, validation, and real-world applicability in how systems are designed and assessed.
Where do you standardize, and where do you intentionally tailor?
Standardize anything that reduces cognitive load during an incident. Provisioning patterns, naming conventions, baseline controls, and the way you validate health checks should not change from system to system. Teams move faster when they recognize the mechanics instantly.
Tailor the parts that are genuinely workload specific: the dependency order, the criticality tiers, and the recovery workflow details that reflect what the business can tolerate. The automotive enterprise’s landscape demanded disciplined assessment across different databases and operating systems, plus benchmarking and enterprise-scale sizing for SAP HANA to ensure performance and stability. The time-bound migration program demanded sequencing and readiness practices that could keep pace with a portfolio move.
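One way to picture that split, as a sketch under stated assumptions rather than anything lifted from these programs: the health-check mechanics are a single standardized function, while the dependency order and criticality tiers live as workload-specific data. The hosts, ports, and tiers below are invented for illustration.

```python
import socket

# Standardized mechanics: one reachability check, identical for every workload.
def check_endpoint(host: str, port: int, timeout: float = 3.0) -> bool:
    """Uniform TCP health check; the mechanics never vary per system."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Tailored data: dependency order and criticality tiers are workload specific.
RECOVERY_ORDER = [
    {"system": "erp-db",  "host": "erp-db.internal",  "port": 30015, "tier": 1},
    {"system": "erp-app", "host": "erp-app.internal", "port": 3200,  "tier": 1},
    {"system": "bw-db",   "host": "bw-db.internal",   "port": 30015, "tier": 2},
]

# Walk the tiers in order, so critical dependencies are confirmed first.
for entry in sorted(RECOVERY_ORDER, key=lambda e: e["tier"]):
    ok = check_endpoint(entry["host"], entry["port"])
    print(f"tier {entry['tier']} {entry['system']}: {'up' if ok else 'DOWN'}")
```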
This split is also aligned with what many enterprises say they value most in cloud programs. In a recent consulting study commissioned by HashiCorp, 75% of respondents highlighted uptime and availability as important to cloud strategy success. You do not get there through bespoke heroics. You get there by standardizing the mechanics and tailoring only what truly needs it.
The migration program had a hard deadline and a portfolio scope. How did you enable execution at scale without burning people out?
You cannot ask a team to sprint for 16 months on manual effort. The only sustainable move is to teach patterns and make automation the default. In this case, the program involved migrating SAP ECC and eleven satellite SAP applications, with a footprint that included 72 EC2 instances and large-scale storage. My role focused on high availability, disaster recovery, and automation, and a big part of that was mentorship. I worked with customer and partner teams on infrastructure automation using Terraform, on Sybase HADR configuration, and on backup automation. We also built setup and testing procedures so the HA and DR configurations were not just implemented, but validated.
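As a hedged sketch of what “implemented and validated” can mean in practice, the check below turns two manual verifications into an automated gate. The Sybase HADR and backup specifics are not from the program: hadr_status.sh is a hypothetical stand-in for whatever command or query reports replication state in a given landscape, and the manifest path is invented.

```python
import os
import subprocess
import sys
import time

def replication_healthy(status_cmd=("./hadr_status.sh",)) -> bool:
    """Treat exit code 0 plus 'SYNCHRONIZED' in the output as healthy."""
    try:
        result = subprocess.run(list(status_cmd), capture_output=True, text=True)
    except OSError:
        return False  # a missing command is a failed check, not a crash
    return result.returncode == 0 and "SYNCHRONIZED" in result.stdout

def backup_recent(manifest_path="backups/latest.manifest", max_age_hours=24) -> bool:
    """Check that the newest backup manifest is younger than the allowed window."""
    try:
        age_hours = (time.time() - os.path.getmtime(manifest_path)) / 3600
    except OSError:
        return False  # no manifest means no recent backup
    return age_hours <= max_age_hours

if __name__ == "__main__":
    checks = {"replication": replication_healthy(), "backup freshness": backup_recent()}
    for name, ok in checks.items():
        print(f"{name}: {'PASS' if ok else 'FAIL'}")
    # Exit nonzero so a scheduler or pipeline can alert on any failed check.
    sys.exit(0 if all(checks.values()) else 1)
```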
When you mentor well, you stop being the bottleneck. People learn how to execute the pattern, how to test it, and how to explain it. That is when the program becomes resilient in a different way: it can keep moving even when the original experts are not in every meeting.
On the automotive enterprise side, that same execution discipline earned strong customer feedback for me individually and for the broader team, which I view as a signal that the work landed not just technically, but operationally.
The automotive enterprise’s environment was complex in a different way. What did “automation-first” unlock there beyond recovery readiness?
Once you remove fragility, you can focus on performance and cost without compromising stability. That environment had meaningful performance bottlenecks in SAP BW, and part of the stabilization work involved implementing BW on HANA with Active/Active (read enabled) capabilities, plus improving infrastructure placement for cost and performance. When you treat the platform as something you can build consistently, you can tune it consistently too.
The financial impact was also material. The estimated overall impact reached $750K to $1M within the first year, including an over 30% reduction in compute infrastructure costs, roughly $200K in savings from SAP HANA optimization, and $150K in annual savings from consolidation through co-hosting staging and passive DR nodes. Those are not savings you get by chasing discounts. They come from designing the environment so it is sized correctly, operated consistently, and improved without introducing new risk.
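A back-of-envelope reading of those figures, as an inference rather than reported data: the two itemized savings account for roughly $350K, which suggests the compute reduction supplied the remainder.

```python
# Rough consistency check on the stated first-year figures (inference only).
hana_optimization = 200_000                  # stated: roughly $200K
consolidation = 150_000                      # stated: $150K annual from co-hosting
total_low, total_high = 750_000, 1_000_000   # stated overall range

# Remainder attributable to the 30%+ compute cost reduction:
compute_low = total_low - hana_optimization - consolidation    # $400K
compute_high = total_high - hana_optimization - consolidation  # $650K

# At a 30% reduction, the implied prior annual compute spend:
print(f"${compute_low / 0.30:,.0f} to ${compute_high / 0.30:,.0f}")  # ~$1.3M to ~$2.2M
```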
If you had to give one piece of advice to teams building HA and DR for SAP today, what would it be?
Treat recovery as a product, not a document. If your runbooks are vague, your drills are optional, and your automation is incomplete, you do not have recovery. You have a plan that depends on the right people being awake at the right time.
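One illustrative way to make “recovery as a product” enforceable, with a hypothetical schema: keep the runbook metadata under version control and fail a readiness gate whenever a step lacks a named owner or the last drill has gone stale.

```python
from datetime import date, timedelta

# Hypothetical runbook metadata; in practice this might live in YAML alongside
# the automation and be evaluated in CI on every change.
RUNBOOK = {
    "last_drill": date(2025, 1, 15),
    "steps": [
        {"name": "promote DR database", "owner": "dba-oncall"},
        {"name": "repoint application tier", "owner": None},  # missing owner
    ],
}

MAX_DRILL_AGE = timedelta(days=90)

def readiness_failures(runbook: dict) -> list[str]:
    """Return every reason this runbook would fail the readiness gate."""
    failures = []
    if date.today() - runbook["last_drill"] > MAX_DRILL_AGE:
        failures.append("last drill older than 90 days")
    for step in runbook["steps"]:
        if not step["owner"]:
            failures.append(f"step '{step['name']}' has no named owner")
    return failures

problems = readiness_failures(RUNBOOK)
print("READY" if not problems else "NOT READY: " + "; ".join(problems))
```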
And expectations are only rising. Gartner has projected that by 2028, 75% of enterprises will prioritize backup of SaaS applications as a critical requirement. The headline is about SaaS, but the direction is broader: businesses are moving from hoping data is recoverable to demanding proof that it is.
That is why I keep coming back to drills, guardrails, and automation. Recovery that is not tested is not real. Recovery that is not repeatable will not scale.
The real takeaway is simple. Recovery only counts when it is repeatable by the team that will actually be on call, not just explainable by the person who designed it. A clear runbook with named owners, drills that force the ugly edge cases to show up early, and automation that makes rebuilds consistent are what turn HA and DR from a promise into an operating habit. When those pieces are in place, resilience stops being a heroic moment and becomes normal execution.