AI agents have evolved from experimental workflows to foundational infrastructure in the hiring ecosystem. They now operate inside thousands of employer systems that were never designed for automation, where unpredictable behaviour remains the norm rather than the exception. Industry research in 2025 captured the scale of the challenge: a global survey found that while nearly 90% of organizations report AI use in at least one business function, only a small fraction have successfully scaled agent-based automation beyond pilot stages. Reliability has become a systems problem first, and an AI problem second.
This is the environment in which Cornelius Renken, Technical Product Leader for AI Applications at Kombo and an editorial board member for the SARC Journal of Innovative Science, builds. His work focuses on designing agents that complete job applications across employers with radically different validation rules, dynamic UI states, and authentication flows. His leadership became particularly visible through internal reliability initiatives that gave Kombo its first unified visibility into customer onboarding and reshaped how reliability is measured across different integrations.
We sat down with Renken to understand what determinism requires, how real-world automation failures surface at scale, and how AI Apply has evolved into one of Kombo’s most dependable systems.
AI agents now operate inside systems that change daily. From your perspective, what makes reliability uniquely difficult in this environment?
Reliability becomes difficult the moment automation interacts with systems it does not control. Employer platforms change without warning—new fields appear, validation rules adjust, and authentication paths diverge based on configuration. Even small variations can alter how an agent behaves.
Traditional automation assumes stability, which is why scripts and rigid workflows tend to fail as soon as the environment shifts. AI agents can interpret changes more flexibly, but interpretation alone does not guarantee consistency. An agent may understand a form perfectly yet still produce different outcomes across runs if the surrounding system behaves unpredictably.
In AI Apply, we treat every interaction as a dynamic environment. The agent builds a structured representation of each interface, validates it, and chooses a path based on behavioural patterns that remain stable even when the interface does not. Determinism is achieved by engineering predictable responses to unpredictable inputs.
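The pattern Renken describes (build a structured representation, validate it, then act on stable behavioural categories rather than surface details) can be sketched roughly as follows. The field names, kinds, and action set here are illustrative assumptions, not Kombo's actual data model:

```python
from dataclasses import dataclass
from enum import Enum

class FieldKind(Enum):
    TEXT = "text"
    SELECT = "select"
    FILE = "file"

@dataclass(frozen=True)
class FormField:
    """Structured representation of one field in an employer's form."""
    name: str
    kind: FieldKind
    required: bool

def choose_action(field: FormField, answers: dict) -> str:
    """Decide by the field's stable behavioural pattern (kind, requiredness),
    not by its exact label or position, which may change between runs."""
    if field.name not in answers:
        # Missing data on a required field is escalated, never guessed.
        return "escalate" if field.required else "skip"
    if field.kind is FieldKind.FILE:
        return "upload"
    return "fill"

fields = [
    FormField("resume", FieldKind.FILE, required=True),
    FormField("nickname", FieldKind.TEXT, required=False),
]
answers = {"resume": "resume.pdf"}
plan = [choose_action(f, answers) for f in fields]
# plan: ["upload", "skip"] -- the same inputs always yield the same plan
```

Because the decision keys off behavioural categories rather than labels, a renamed or reordered form produces the same plan as long as its structure is equivalent.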
Your early work on AI Apply challenged the idea that interface understanding is enough. What did that teach you about determinism?
It showed us that understanding is necessary but not sufficient. Many automation systems focus on the model’s interpretation, assuming that if it recognises the form, reliability will follow. At enterprise scale, reliability is measured not by one successful execution but by thousands that behave identically. That perspective was shaped both by building AI Apply and by reviewing production systems as a judge for the Globee Awards for Impact.
One of our first major accomplishments was formalising the separation between reasoning and execution. The model interprets the environment, but the system determines how decisions are validated, how recovery works, and how errors are escalated. This prevents behavioural drift even when the interpretation layer varies slightly.
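A minimal sketch of that separation, under the assumption that the model only proposes a mapping and a deterministic execution layer decides what is applied or escalated (the field names and allow-list are hypothetical):

```python
# Interpretation layer: stands in for a model call. Its output may vary
# slightly between runs; that variance must not reach the execution layer.
def reason(form) -> dict:
    return {"email": "a@example.com", "phone": "555-0100"}

ALLOWED_FIELDS = {"email", "phone"}

def execute(proposal: dict) -> dict:
    """Execution layer: deterministic validation and escalation,
    independent of how the proposal was produced."""
    validated, escalated = {}, []
    for field, value in sorted(proposal.items()):  # fixed iteration order
        if field in ALLOWED_FIELDS and value:
            validated[field] = value
        else:
            escalated.append(field)
    return {"apply": validated, "escalate": escalated}
```

Any field the model invents that is outside the allow-list is escalated rather than written, so interpretation drift cannot become behavioural drift.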
We also learned the importance of self-diagnosis. When the environment presents something unexpected, the agent must recognise it, pause, or choose a safe fallback rather than improvising. Deterministic systems do not try to be clever; they try to be consistent.
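That self-diagnosis rule (recognise the unexpected and pause instead of improvising) reduces to a very small sketch. The state names are invented for illustration:

```python
# States the agent has been engineered and tested against.
KNOWN_STATES = {"form", "login", "confirmation"}

def step(observed_state: str) -> tuple:
    """Unknown state -> pause and report; known state -> proceed.
    The agent never invents behaviour for states it cannot classify."""
    if observed_state not in KNOWN_STATES:
        return ("pause", f"unknown state: {observed_state}")
    return ("proceed", observed_state)
```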
Was there a moment when you realised that model accuracy alone would not be enough to make AI Apply enterprise-ready?
In the early stages, the focus was on whether the agent could understand and complete an application correctly. That worked well enough for an MVP. But as usage increased, a different issue surfaced. The same form, executed multiple times, could lead to slightly different outcomes. Each run looked reasonable, but the lack of consistency became obvious once the system operated at scale.
That was the point where correctness stopped being the goal, and repeatability became the standard. At enterprise scale, small variations compound quickly. A system that behaves differently from one execution to the next becomes difficult to trust, even if each decision appears valid.
From there, the work shifted toward designing AI Apply as a deterministic system rather than a purely interpretive one. The emphasis moved to enforcing consistent execution paths, clear validation rules, and predictable recovery behaviour. Reliability was no longer defined by whether the agent could succeed once, but by whether it could produce the same outcome across thousands of executions in changing environments.
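Predictable recovery behaviour of the kind described here can be sketched as a bounded, deterministic policy: the same failure class always takes the same path, with explicit escalation instead of open-ended retries. The exception taxonomy is an assumption for illustration:

```python
class TransientError(Exception):
    """Temporary failure, e.g. a slow page load."""

class ValidationError(Exception):
    """Permanent failure that a retry cannot fix."""

def run_with_recovery(task, max_attempts=3):
    """Same failure class -> same recovery path, bounded attempts,
    then a well-defined escalation. No improvised alternatives."""
    for attempt in range(1, max_attempts + 1):
        try:
            return {"status": "ok", "result": task(), "attempts": attempt}
        except TransientError:
            continue                      # retry the identical path
        except ValidationError as exc:
            return {"status": "escalated", "reason": str(exc)}
    return {"status": "escalated", "reason": "max attempts exceeded"}
```

A task that fails transiently twice and then succeeds produces the same outcome on every run, which is the repeatability property the interview describes.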
That shift fundamentally changed how the platform was engineered. Determinism became an intentional system property, bringing AI Apply closer to infrastructure-grade reliability rather than experimental automation.
Observability seems foundational to how AI Apply is designed. Why is it so central to deterministic behaviour?
Observability is the anchor for understanding every decision the agent makes. Systems that operate without visibility force teams to fix symptoms rather than causes. With the right observability tooling, every behavioural anomaly becomes traceable across agents, customers, and employer platforms.
Earlier work on Kombo’s Connection Flow gave the company its first analytics baseline. It linked transitions across each state of the onboarding flow to structured events, allowing us to identify failure patterns across tools, regions, and customer types. This shaped how we designed AI Apply: every workflow now emits signals that describe both interpretation and behaviour.
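The mechanism of linking state transitions to structured events can be sketched like this; the event schema, workflow, and state names are illustrative, not Kombo's actual telemetry format:

```python
from collections import Counter

events = []

def emit(workflow: str, from_state: str, to_state: str, ok: bool) -> None:
    """Record each state transition as a structured event that can later
    be aggregated across tools, regions, and customer types."""
    events.append({"workflow": workflow, "from": from_state,
                   "to": to_state, "ok": ok})

emit("onboarding", "start", "credentials", ok=True)
emit("onboarding", "credentials", "mfa", ok=False)
emit("onboarding", "start", "credentials", ok=True)

# Aggregating failed transitions surfaces where the flow actually breaks.
failures = Counter((e["from"], e["to"]) for e in events if not e["ok"])
```

Once every transition is an event, "where does onboarding fail?" becomes a query over data instead of a debugging session, which is the visibility shift described above.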
Public research from Splunk shows that organisations adopting unified observability can reduce mean time to resolution by as much as 37%, a shift that directly strengthens reliability in automated workflows. Greater visibility becomes the foundation for treating automation as infrastructure rather than experimentation.
How do you design agent behaviour that stays predictable when employer systems themselves remain inconsistent?
The industry increasingly recognises that platform inconsistency is a major source of automation failure, and spending patterns reflect that shift. Open research from MarketsandMarkets shows sustained investment in intelligent process automation technologies, projecting the market to grow from USD 13.6 billion in 2022 to USD 25.9 billion by 2027 at a CAGR of more than 13%. This rise signals a broader move toward automation systems built to handle variability reliably rather than experimentally.
Predictability starts with constraints. In AI Apply, agents follow strict behavioural patterns governing how they validate inputs, respond to ambiguous states, and recover from unexpected behaviour. These patterns ensure that even if the environment shifts, the agent’s behaviour remains stable.
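One concrete form such constraints can take is a fixed validation table with an explicit "ambiguous" verdict for anything outside it. The field names and patterns below are hypothetical examples, not Kombo's actual rules:

```python
import re

# Fixed constraint table: the agent's verdicts are defined here,
# not inferred per run from the employer's UI.
CONSTRAINTS = {
    "email": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
    "zip": re.compile(r"^\d{5}$"),
}

def validate(field: str, value: str) -> str:
    """Same input always yields the same verdict, regardless of how
    the employer platform labels or orders its fields."""
    pattern = CONSTRAINTS.get(field)
    if pattern is None:
        return "ambiguous"   # unknown field: defer to a human, never guess
    return "valid" if pattern.fullmatch(value) else "invalid"
```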
This work repositioned AI Apply: no longer merely an automation engine, but a stability layer for job boards and enterprise systems, shielding partners from variability across employer platforms. That shift from automation to stability is what makes the system infrastructure-grade.
Looking ahead, what will define the next generation of agentic enterprise automation?
The next phase will be defined by reliability, not novelty. Models will continue to improve, but at enterprise scale what matters is whether a system behaves consistently under change. An agent that works only when the environment is ideal is not an enterprise tool. The real test is how it behaves when the environment evolves.
AI Apply is moving toward deeper self-correction, runtime inspection, and ecosystem-aware design for that reason. The reliability work across the platform (shared visibility, measurable baselines, and a clearer understanding of failure modes) is aimed at making behaviour predictable rather than impressive. I have written about the same constraint in my HackerNoon article “Why Pure AI Agents Fail in B2B (and How To Build Deterministic Workflows)”, which argues that deterministic workflows are required for production systems.
The systems that succeed will not be those with the most advanced models, but those that create environments where intelligence behaves consistently and earns its place inside mission-critical processes. That is where the industry is heading, and where our work continues to focus.
Interesting point about reliability being a systems problem first and an AI problem second. Once agents hit real systems that change frequently, predictability drops fast. Curious if others found the same when moving beyond pilot workflows.
The point about separating reasoning from execution is the key insight here. Most AI agent teams treat the model as both the brain and the hands. The model decides what to do AND does it. That is where non-determinism creeps in, because the execution layer inherits all the variance of the reasoning layer.
We are seeing the exact same pattern in AI code generation. The model reasons about what code to write, then generates it. But generation is inherently non-deterministic. Same prompt, different code every time. For enterprise automation that is a dealbreaker, same as you describe. And for code, it creates a hidden review tax where developers have to manually verify every output because they cannot trust that it matches previous approved patterns.
The framing of determinism as "an intentional system property" rather than something you hope the model provides is exactly right. You have to engineer it into the system architecture, not pray for it from the model weights.
We are building a coding agent for frontend developers where this is the core design principle. Same input, same output, every time. If the deterministic layer comes first and the model operates inside those constraints, you get reliability that actually scales.