When I started building AI agents, I thought the main challenge would be making them actually work.
- Reasoning properly.
- Using tools correctly.
- Completing tasks end to end without breaking.
That part is still hard, but it was not the real problem.
Things changed when we scaled Agent37 and moved to multiple agents running in parallel. Research, coding, content, and background workflows all running at once.
Individually it was fine. Together, harder to manage than expected.
The question stopped being “can the agent do the task?”
It became:
What is running right now?
What is stuck?
What needs review?
What failed while I was away?
That is when it clicked.
The problem was not intelligence anymore. It was visibility and supervision.
Most of the work is now about building systems you can actually track and control, not just make smarter.
Did anyone else feel this shift from building agents to managing them?
At what point did supervision become harder than capability for you?
Yeah this hit. Different problem domain but the same shift happened to me as a solo founder running three live products at
once. The worst part wasn't that the products were hard to build. It was that once they were all live, I had no idea
which one needed attention on any given day.
What worked was dumb but effective. I forced all three to push the same 4 numbers into one sheet. Traffic in, engagement,
conversion, revenue. Same schema. Now I can scan in 30 seconds instead of opening three GA4 tabs and reloading mental
context for each product every time.
Yours sounds harder though, because at least my products share a basic shape underneath (user comes in, does a thing,
maybe comes back). Research vs coding vs content agents don't even produce the same type of output. Did you find a way to
flatten observability across them, or do you just accept that heterogeneous agents need per-type dashboards?
That's a great example. What you're describing is very similar to the problem we ran into: reducing context switching was often more valuable than adding more information. We ended up leaning toward a shared workflow view rather than per-agent dashboards. The outputs were different, but the questions were surprisingly similar: what's running, what's blocked, what's ready for review, and where does attention belong? The interesting part is that the more heterogeneous the work became, the more important those common states felt. Otherwise you end up managing every workflow differently and the cognitive load explodes.
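A minimal sketch of the shared workflow view described above, assuming every agent can report one of a few common states. The `State` values, the `Task` fields, and the sample tasks are illustrative assumptions, not Agent37's actual model.

```python
from collections import defaultdict
from dataclasses import dataclass
from enum import Enum


class State(Enum):
    # Assumed common states, taken from the questions in the thread.
    RUNNING = "running"
    BLOCKED = "blocked"
    NEEDS_REVIEW = "needs_review"
    DONE = "done"


@dataclass
class Task:
    agent_type: str  # e.g. "research", "coding", "content"
    title: str
    state: State


def workflow_view(tasks):
    """Group tasks by shared state, ignoring which kind of agent produced them."""
    view = defaultdict(list)
    for task in tasks:
        view[task.state].append(task)
    return view


# Hypothetical tasks from three different agent types.
tasks = [
    Task("research", "Competitor scan", State.RUNNING),
    Task("coding", "Fix login bug", State.BLOCKED),
    Task("content", "Launch post draft", State.NEEDS_REVIEW),
    Task("coding", "Refactor billing", State.NEEDS_REVIEW),
]

view = workflow_view(tasks)
for state in State:
    print(f"{state.value}: {[t.title for t in view[state]]}")
```

The outputs stay heterogeneous; only the state is flattened, which is what lets one screen cover every agent type.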
This tracks with what happens once any system stops being a toy and starts being depended on. Building one agent that works is a weekend project. Knowing what 5 agents are doing right now without staring at logs all day is an entirely different engineering problem.
The audit trail and approval gates point is the part most people skip until something breaks badly. Visibility feels like overhead when you're moving fast, until the day it's the only thing standing between you and a quiet disaster. What was the first failure that made you realize you needed real supervision instead of just trusting the output?
For us it wasn't a dramatic failure as much as a growing accumulation of uncertainty. We'd come back after a few hours and realize we couldn't quickly answer basic questions: Which tasks actually finished? Which outputs still needed review? Which workflows had quietly stalled? The agents were often doing the work correctly. The problem was that human attention didn't scale with the number of workflows, and that's when supervision started feeling more important than capability.
Felt this hard. For me the shift hit way before multi-agent scale though — it was the first real client call. The moment someone is paying and their reputation is on the line, "can the agent do it?" stops mattering and "what did it do while I wasn't watching?" becomes the whole job. I now spend more time on call summaries and review tooling than on the agent itself. Capability got cheap; trust didn't.
Capability got cheap; trust didn't. That's a great way to frame it. What's interesting is that your supervision problem showed up with a single agent and a real client, while ours showed up when multiple workflows started running in parallel. Different paths, same destination. Once outcomes matter, people stop asking "can it do the task?" and start asking "can I trust what happened while I wasn't looking?" That's where review, visibility and traceability start becoming the real product.
I think this may be a pattern amongst builders in general: ideas require action... sometimes till 3am. Keep building amazing things!
Absolutely. Building is usually the fun part, it's turning ideas into something real that keeps us going even during those late-night sessions. Thanks, and wishing you the same on your journey!
This hits exactly where I am right now. I spent weeks obsessing over whether my agent would reason correctly — then the moment I had two running in parallel, I realized I had no idea what either of them was actually doing.
The shift from “did it work?” to “what is it doing right now?” is huge and nobody talks about it enough. Capability is almost a solved problem compared to observability.
I’m currently building in the AI tools space and this is making me rethink what the real pain points are for people using agents day to day. Visibility might be a bigger gap than intelligence for most teams.
Great post Vishnu — following your Agent37 journey closely.
Really appreciate that, and honestly that's very close to the experience that led us down this path.
We spent a lot of time thinking about capability, but the moment multiple workflows started running at once, the questions changed completely. Instead of "can it do the task?" it became "what's running, what's blocked, and where does human attention actually belong?"
I'm still not sure whether visibility ends up being a bigger market than capability, but it's definitely a pain point that showed up much earlier than I expected.
What you're building in the AI tools space—have you started running into these coordination and supervision challenges yet, or are you still mostly focused on individual workflows?
The building part is usually the fun side of indie hacking — curious what caught you off guard?
For me, it was realizing that building the workflows wasn't the hard part. Once several were running at the same time, keeping track of what was happening became more challenging than getting them to work in the first place.
Felt this hard, just from a different angle. I build client-facing voice/WhatsApp agents (one at a time, not parallel), and even there the model doing the task was solved fast. The real work became supervision: knowing what it actually said on each call, the handoff when it doesn't know something, and catching the rare wrong answer before a real customer sees it. "Can it do the task" took a week. "Can I trust it unsupervised with a paying client" is the part that never fully ends.
What's interesting is that even in a single-agent setup, the supervision problem shows up surprisingly quickly once real users are involved.
I like your framing that "can it do the task?" gets solved relatively fast, while "can I trust it unsupervised?" becomes an ongoing process. It seems the challenge isn't capability alone, it's building enough visibility and guardrails that you're comfortable relying on it in production.
I felt this shift very clearly.
At first I also thought the hard part was “can the agent finish the task?” But once I started using agents across a real product, the harder question became “can I still understand the state of the system?”
I’m building an iOS app as a solo founder, with separate work across product, iOS, backend, frontend, SEO, analytics, and launch operations. A single agent doing one task is manageable. Multiple agents touching different layers of the product quickly creates a new management problem:
The biggest lesson for me is that AI agents need an operating system around them: issue tracking, decision logs, clear ownership boundaries, review gates, deployment records, and a way to reconstruct why something happened.
So yes, supervision became harder than capability once agents moved from “assistant for a task” to “parallel workers inside the company.”
The strange part is that this starts to look less like prompt engineering and more like management design.
This resonates a lot. I particularly like the idea that agents eventually need an operating system around them, not just better prompts or models.
The questions you listed are very similar to the ones we started asking once workflows became long running and spread across multiple areas. At that point, understanding state, decisions, ownership and review history became just as important as execution itself.
And I agree with your last point. The further we go, the more this starts to feel like management and operations design rather than a pure AI problem. The technology gets the work done, but the challenge becomes coordinating and supervising it effectively.
This is a really interesting insight.
Most people focus on making a single agent smarter, but the real challenge seems to be visibility and coordination once multiple agents are running.
I'm curious — what tools or workflows have helped you the most for monitoring and managing multiple agents in production?
That was actually the problem that pushed us to build a Mission Control layer around our workflows. The biggest improvement came from having a single place to see task state, review queues, progress and blockers rather than jumping between individual sessions. We found that reducing context switching was often more valuable than making any single agent slightly smarter.
Are you already running multiple agents in production, or are you still mostly working with individual workflows?
This matches what I lived running a managed services business for almost 20 years. The hard part was never any single server doing its job, it was knowing the state across hundreds of them when something failed quietly at 2am. Agents make it worse because a crashed process pages you, but an agent that confidently does the wrong thing just sits there looking healthy. One thing that helped us: stop monitoring everything equally and build a single "needs a human right now" queue, so the real signal isn't buried under green dashboards. Capability you can buy, supervision you build. That line is right.
That's a great perspective. The idea of a single "needs a human right now" queue resonates a lot with what we experienced. I also like your point about agents appearing healthy while producing the wrong outcome. Those cases are often much harder to detect than outright failures, which is why supervision ended up becoming such a big focus for us.
ran into this with my agent stack. I've spent a month trying to review everything - just got alert fatigue. switched to exceptions-only: agent pings me only if something's outside a threshold I defined. the hard part is calibrating that threshold, takes a few weeks of real runs before you can trust the quiet
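A rough sketch of the exceptions-only idea above: the agent stays quiet unless a metric falls outside a range you defined. The metric names and limits are made-up assumptions; as the comment says, real thresholds only settle after a few weeks of runs.

```python
from dataclasses import dataclass


@dataclass
class Threshold:
    low: float
    high: float


# Hypothetical metrics and limits, for illustration only.
THRESHOLDS = {
    "runtime_minutes": Threshold(0, 30),
    "retries": Threshold(0, 2),
    "cost_usd": Threshold(0, 5.0),
}


def exceptions(run_metrics, thresholds=THRESHOLDS):
    """Return only the metrics that fall outside their allowed range."""
    out = {}
    for name, value in run_metrics.items():
        limit = thresholds.get(name)
        if limit is not None and not (limit.low <= value <= limit.high):
            out[name] = value
    return out


quiet_run = {"runtime_minutes": 12, "retries": 0, "cost_usd": 1.2}
noisy_run = {"runtime_minutes": 45, "retries": 1, "cost_usd": 7.5}

print(exceptions(quiet_run))  # empty dict, so no ping
print(exceptions(noisy_run))  # only the out-of-range metrics
```

An empty result means no notification at all, which is the whole point: the quiet has to be trustworthy.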
I went through the exact same realization. Getting the agent to do the thing is one thing, but when you have a handful of them running in parallel it turns into a completely different problem. The stuff that keeps me up now isn't the agent logic, it's the questions around what state each one is in and whether they actually finished what they started. We threw together a simple status view just to see what is running and what is stuck, and honestly that single view made more difference than any prompt engineering I have done. The infrastructure side is where the real work is once you move past prototypes.
We had a very similar experience where a simple view of what was running, blocked, ready for review, or completed ended up being more valuable than expected. It's interesting how quickly the focus shifts from improving individual workflows to understanding the overall state of the system. Once multiple agents are involved, visibility starts to feel like a prerequisite for everything else.
ran into this. capability isn't the hard part. 10 agents in parallel and the real question becomes who owns oversight - what gets reviewed, by whom, and when. nobody has a good answer yet.
I agree. The oversight question feels much less solved than the capability question right now. Once multiple workflows are running, deciding what deserves human attention and when to intervene becomes a challenge of its own. It seems like the real goal isn't reviewing everything but making sure the right things surface at the right time.
Great read! Your content is clear, engaging, and valuable. Wishing you continued success with your blog.
Thanks for the encouragement. Much appreciated!
this matches what we're seeing too. the individual agent problem is mostly solved at this point — the hard part is knowing whether the output is actually good when you have 5 of them running. execution scales easily, quality evaluation doesn't.
That's a great way to frame it. We found that the challenge was not whether the workflows could produce output but whether we could efficiently identify which outputs needed attention and which were ready to trust. Execution got easier as the systems improved. Evaluation and supervision didn't scale nearly as cleanly.
This is very relatable. The agent usually is not the problem anymore, keeping track of multiple workflows is. Once you have several things running at the same time, knowing what's finished, what's stuck, and what needs attention becomes a challenge on its own.
That was the surprising part for us as well. The workflows were generally doing what they were supposed to do but maintaining visibility across everything became increasingly difficult as usage grew.
This may be a dumb question: isn't the inherent idea behind agentic AI that execution will be no concern and only oversight is left?
Not a dumb question at all. I think that's the direction many of us are aiming for. What surprised me was that even as execution improved, oversight didn't disappear; it just became a different problem.
Instead of asking "can the agent do this?", we started asking "what is it doing right now?", "what needs review?" and "where should I focus my attention?". That's where the supervision challenge started to show up for us.
Completely agree that the bottleneck shifts. Getting an agent to do work is one thing, managing ten of them is another. Coordination and oversight start taking more effort than the actual task execution.
Exactly. A single workflow is usually manageable but once multiple agents are running across different projects, coordination becomes a challenge of its own. We found ourselves spending more time tracking work than actually initiating it.
Feels similar to traditional operations problems. Once the system scales, visibility and control become more important than raw capability. The technology improves but the management layer becomes the new challenge.
Yeah we spent a lot of time thinking about agent capabilities but once multiple workflows were running in parallel, visibility became the bigger problem. The challenge shifted from execution to supervision.
This is the same shift that happens when you scale a team, not just agents. With one person you are the supervision. With ten, you can't be in every room, so you stop managing tasks and start managing exceptions. The systems that scale surface "here is what deviated from expected," not "here is everything that ran." Status dashboards don't scale, exception alerts do. The trap I watch founders fall into is building observability that shows them more, when the real win is observability that shows them less and flags only what needs a human. For me supervision got harder than capability the moment I couldn't personally eyeball every output.
The shift I hit is one layer past visibility: knowing what ran isn't knowing it was any good. A dashboard says ten agents finished, but the expensive failure is the one that finished, looks done, and is quietly wrong, and that never trips a status light. So supervision became less about live monitoring and more about a cheap per-task check against what the output should have been.
That's a really good distinction. A workflow can be perfectly visible and still be wrong. We ran into something similar where "completed" turned out to be a much weaker signal than we initially assumed. The hard cases weren't failed tasks, they were tasks that looked successful enough to move on, but still needed validation before anyone was comfortable acting on the output. I like your framing of supervision as checking outcomes rather than monitoring activity. In a lot of cases, the real question isn't "did the agent finish?" but "did it produce something trustworthy?"
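One way the "cheap per-task check" from this exchange could look, as a sketch. The checks for a research summary are hypothetical; the idea is only that "completed" gets downgraded to "needs review" when a cheap expectation about the output is not met.

```python
def check_output(output, checks):
    """Run cheap checks on a finished task; return the names of the checks that failed."""
    return [name for name, check in checks.items() if not check(output)]


# Hypothetical expectations for a research-summary task.
summary_checks = {
    "not_empty": lambda text: bool(text.strip()),
    "has_source_link": lambda text: "http" in text,
    "under_200_words": lambda text: len(text.split()) <= 200,
}

# The task finished and looks done, but it cites no source.
finished_output = "Competitor X raised prices by 10 percent last quarter."

failed = check_output(finished_output, summary_checks)
status = "needs_review" if failed else "ready"
print(status, failed)
```

Checks like these do not prove the output is right. They only catch outputs that finished without the shape a trustworthy one would have.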
When I started, I also felt that building AI agents was hard. Slowly I learned, and now I've run into the same problem of managing them.
I felt something similar, although in a very different domain.
While building Sleuth (an AI-assisted reconciliation investigation tool), I initially thought the hard part would be getting the model to identify the reason behind a ledger discrepancy.
It turned out the harder problem was making the investigation traceable.
A finance team doesn't just want an answer. They want to know:
The challenge shifted from generating explanations to generating explanations with evidence.
In a way, it sounds similar to the shift you're describing. Once capability becomes "good enough," visibility, supervision, and trust become the bigger engineering problems.
Curious how you're handling review and approval workflows when multiple agents disagree with each other.
This resonates a lot. In regulated or high-trust environments, an answer without evidence is often only half the solution.
I like your distinction between generating explanations and generating explanations with evidence. That feels very similar to the shift from execution to supervision. Once capability reaches a certain threshold, the focus moves toward traceability, verification, and trust.
On the disagreement question, we've generally found that disagreement itself is often a useful signal. If multiple agents arrive at different conclusions, that usually becomes a review event rather than something we'd try to resolve automatically.
It sounds like you're dealing with a similar challenge in finance. If two investigation paths point to different root causes, do you surface both to the reviewer, or do you have a way of ranking confidence while still preserving the underlying evidence trail?
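A small sketch of treating disagreement as a review event rather than resolving it automatically. The agent names and conclusions are invented for illustration. The function accepts a unanimous answer and otherwise hands every candidate to a reviewer, along with which agents proposed it.

```python
def resolve(conclusions):
    """Accept a unanimous answer; treat any disagreement as a review event.

    `conclusions` maps an agent name to the conclusion it reached.
    """
    distinct = set(conclusions.values())
    if len(distinct) == 1:
        return {"status": "accepted", "answer": distinct.pop()}
    # Keep every candidate and who proposed it, so the reviewer sees all paths.
    candidates = {}
    for agent, answer in conclusions.items():
        candidates.setdefault(answer, []).append(agent)
    return {"status": "needs_review", "candidates": candidates}


# Hypothetical investigation results.
agree = resolve({"agent_a": "duplicate payment", "agent_b": "duplicate payment"})
split = resolve({"agent_a": "duplicate payment", "agent_b": "fx rounding"})

print(agree)
print(split)
```

Nothing is ranked or discarded here. Any confidence scoring would sit on top of this and should keep the underlying candidates intact.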
Really well put. The shift from "can it do the task?" to "what is it doing right now?" is one of those things you don't see coming until you're already in it.
We hit the same wall running parallel AI workflows for clients. What surprised me most wasn't the complexity of the agents, it was how quickly a simple status view (running, blocked, needs review, completed) became more valuable than any prompt optimization.
Curious: for your Mission Control layer, did you find a specific signal that reliably separates "let it run" from "needs human now"? That calibration seems to be the part that takes the longest to get right.
That's been one of the hardest parts for us too. We haven't found a single reliable signal yet. In practice, anything with uncertainty, conflicting outputs, or meaningful downstream impact tends to go to review. Most interventions aren't caused by failures, they're caused by ambiguity. Curious if you've found any signals that consistently predict when human review is needed?
Honestly the calibration took longer than building the agent itself. What ended up working wasn't a single signal but a few layers:
Everything else runs unattended until the daily summary. The threshold tuning happens in production - you start conservative and relax as you build trust in the patterns. The real unlock was making 'uncertainty' itself a review trigger rather than trying to define every possible error.
One thing we've noticed as well is that ambiguity tends to be a much better indicator of human involvement than outright errors. Most errors are obvious. The harder cases are the ones where the agent could continue, but a quick human decision would save a lot of wasted effort. It feels like the long-term challenge isn't detecting failures, it's deciding when human attention creates more value than additional agent execution.
The "fails quiet" part is the whole thing. I build monitoring for Meta ad accounts and the failures that cost real money never throw an error, the account just keeps spending confidently on the wrong thing until someone reads the numbers. Loud failures basically solve themselves. The hard part is judging output quality when there's no clean pass/fail, and gating the agent's actions so a confident-wrong move needs a human before it executes. "Capability you buy, supervision you build" is exactly right.
That's a great example. The most expensive failures are often the ones that look successful on the surface. I like your distinction between loud and quiet failures. Loud failures get attention automatically. Quiet failures are where supervision and review become critical. The challenge seems less about detecting crashes and more about detecting confident-but-wrong outcomes before they create downstream problems.
At least your hardest part is not marketing or getting it out for the public to see. I feel you need a couple of people to test and tell you what is lacking because to the best of my knowledge, nothing is perfect day one.
That's true. Feedback from real users is usually where the biggest gaps become obvious. Nothing is perfect on day one, and a lot of the insights behind this post only showed up after using the workflows in practice rather than just building them.
And honestly, marketing and distribution are their own challenge as well. Building the product is only part of the journey.
I think it's easy
It definitely seems simple on paper. Have you run into this with longer-running or parallel workflows yet, or are you mostly working with single-agent setups?
This resonates. We hit the same wall moving from a single “do everything” agent to multiple agents + background jobs—suddenly the hard part was observability: what’s running, what’s waiting on a dependency, what retried, what silently failed.
Curious: did you end up building a lightweight run dashboard/event log first, or did you go straight to a workflow engine/queue with tracing? We found even a simple state machine + per-step timestamps (and an “attention required” inbox) reduced the chaos a lot.
We actually started much closer to what you're describing. The first thing we needed wasn't a sophisticated workflow engine, it was simply a way to see task state, progress, and where attention was required. What surprised us was how much value came from having explicit states like running, review, blocked, and completed. Just making the work visible reduced a lot of the chaos. I like your "attention required" inbox idea. In many ways, that ended up feeling more important than detailed tracing because it answers the question humans care about most: where do I need to look right now?
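The "simple state machine + per-step timestamps + attention required inbox" from this exchange can be sketched in a few lines. The state names and allowed transitions below are assumptions for illustration, not a description of either commenter's system.

```python
import time
from dataclasses import dataclass, field

# Assumed states and allowed transitions.
TRANSITIONS = {
    "queued": {"running"},
    "running": {"blocked", "review", "completed", "failed"},
    "blocked": {"running", "failed"},
    "review": {"running", "completed"},
    "completed": set(),
    "failed": set(),
}
ATTENTION_STATES = {"blocked", "review", "failed"}


@dataclass
class Run:
    name: str
    state: str = "queued"
    history: list = field(default_factory=list)  # (state, timestamp) per step

    def __post_init__(self):
        self.history.append((self.state, time.time()))

    def move(self, new_state):
        """Apply a transition, rejecting any that the table does not allow."""
        if new_state not in TRANSITIONS[self.state]:
            raise ValueError(f"{self.state} -> {new_state} is not allowed")
        self.state = new_state
        self.history.append((new_state, time.time()))


def attention_inbox(runs):
    """Return the runs a human should look at right now."""
    return [r for r in runs if r.state in ATTENTION_STATES]


research = Run("research")
research.move("running")
coding = Run("coding")
coding.move("running")
coding.move("blocked")

inbox = attention_inbox([research, coding])
print([r.name for r in inbox])
```

The timestamps in `history` are what later make it possible to ask how long a run has sat in one state.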
Ok
Interesting shift.
The bottleneck is moving from intelligence to visibility.
An agent can complete tasks, but humans still need to understand:
What is repeating?
What is failing repeatedly?
What requires intervention?
I've started noticing that supervision itself creates patterns. The same failure loops, approval loops, and decision bottlenecks keep appearing across different workflows.
Curious: do you think future agent systems will need better reasoning, or better pattern visibility?
That's a great question. My intuition is that both will improve, but pattern visibility may end up being the bigger bottleneck for teams running real workflows. Better reasoning helps individual tasks succeed. Better pattern visibility helps you understand where time, attention, and trust are being lost across dozens of tasks. I think your point about recurring failure and approval loops is particularly interesting. Once those patterns become visible, they stop looking like isolated agent mistakes and start looking like system-level issues that can actually be improved. It wouldn't surprise me if future agent platforms spend as much effort surfacing operational patterns as they do improving model capabilities.
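As a hedged illustration of pattern visibility, counting repeated events across workflows is enough to surface the failure and approval loops mentioned above. The event log below is invented, and a real one would carry more context than a workflow name and an event type.

```python
from collections import Counter

# Hypothetical supervision log: (workflow, event) pairs.
events = [
    ("content", "approval_requested"),
    ("coding", "test_failed"),
    ("coding", "test_failed"),
    ("research", "approval_requested"),
    ("coding", "test_failed"),
    ("content", "approval_requested"),
]


def recurring(events, min_count=2):
    """Return (workflow, event) pairs seen at least `min_count` times, most frequent first."""
    counts = Counter(events)
    return [(key, n) for key, n in counts.most_common() if n >= min_count]


patterns = recurring(events)
for (workflow, event), n in patterns:
    print(f"{workflow}: {event} x{n}")
```

Once a pair shows up repeatedly, it reads less like an isolated agent mistake and more like something in the process worth fixing.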
Most teams think they're debugging agents.
Often they're actually debugging organizational patterns the agents simply make visible.
AI exposes the loop. Humans decide whether to change it.
TruthLoop AI — Find what you're avoiding.
That's a really interesting way to look at it.
The more workflows we ran, the more it felt like agents were exposing coordination and decision-making bottlenecks that already existed. The agent wasn't creating the problem, it was making it visible.
Makes me wonder how many "agent failures" are actually process failures in disguise.
This mirrors something I keep noticing too — the capability bar keeps
getting lower while the supervision bar keeps rising. What's tricky is
that agents fail differently than normal software: a crashed service
pages you, but an agent doing the wrong thing just sits there looking
healthy. I think the most underrated investment right now is building
a dead-simple "needs human attention" queue rather than trying to
monitor everything. Have you found a threshold or signal that reliably
separates the workflows worth interrupting for versus the ones you can
safely let run?
I’ve noticed the same shift. Building capable agents is challenging, but managing multiple agents, tracking failures, and maintaining visibility quickly becomes the bigger problem. Observability and supervision are now just as important as capability.
Agreed. We also expected most of the effort to go into improving agent capability, but once multiple workflows were running in parallel, visibility, review, and coordination started consuming far more attention than the agents themselves. It feels like capability gets you to a working demo, while observability and supervision are what make it usable in the real world.
this is the ops gap nobody talks about. building one agent that works is basically a weekend project now. building a system where you actually know what 10 agents are doing, whether the output is any good, and when to step in... that's a completely different discipline. most teams skip straight to 'more agents' without building the evaluation layer first and then wonder why everything feels fragile.
This matches what we've seen as well. It's surprisingly easy to focus on adding more capability because that's the most visible progress, but the moment you have multiple workflows running, the evaluation and supervision layer starts determining whether the system is actually usable. I like your point about fragility. A lot of systems don't fail because the agents can't do the work, they fail because nobody has a reliable way to tell when output quality is drifting or when intervention is needed. It feels like we're slowly discovering that scaling agents and scaling trust are two very different problems.
The “hard part moved somewhere else” lesson is familiar. With Kinetic Override, the engineering was only half of it; the harder piece is making Android users trust the workflow quickly: no-root, local profiles, records taps/swipes, replays loops, no ads. Clear boundaries beat a clever feature list.
I like that framing. "The hard part moved somewhere else" seems to be a pattern across a lot of products. Your point about trust resonates as well. Users often don't evaluate a system purely on what it can do, but on whether they understand its boundaries and feel confident using it. In many cases, that ends up mattering more than adding another feature. It's interesting how often the challenge shifts from capability to clarity once the core technology is working.
same experience from a slightly different angle — the visibility problem shows up on the deployment side too.
the "can the agent do the task?" question has a production equivalent: "did the agent configure the security layer?" the answer is almost always no, because security config doesn't break the demo and doesn't show up in UI feedback loops.
ran a URL-based scanner across 47 production lovable/bolt/cursor apps. 31 had Supabase RLS completely disabled, 11% had secret keys in the browser bundle. none of the founders knew. the agent built the app correctly — it just didn't harden it, because that's not what it's designed to do.
the visibility problem you're describing in agents maps directly onto the "what's actually running in prod and is it hardened?" problem. tooling for both is still pretty early.
That's a really interesting parallel. In both cases, the system can appear to be working perfectly while important issues remain completely invisible until someone goes looking for them.
I like your point that the agent isn't necessarily failing at its task—it’s optimizing for what it was asked to do, while things like security, review, and operational safeguards sit outside that objective.
It does feel like the common thread is visibility. Whether it's agent workflows or production systems, the challenge becomes understanding what state the system is actually in rather than whether it can perform a specific task. The silent failures are usually the ones that hurt the most.
Same experience here. The build part has a clear finish line — you ship and it's done. Distribution has no finish line, just experiments that work or don't. The humbling part for us was realizing that 'content marketing' doesn't mean 'write articles and wait.' It means building pages that answer the exact question someone types at 11pm when they have a problem. That shift from 'writing for readers' to 'writing for search intent' took way longer to internalize than any technical challenge.
That's a great way to put it. Building feels finite—you can point to a feature and say it's done. Distribution feels much more like an ongoing system that needs constant iteration.
I also like your distinction between writing for readers and writing for search intent. A lot of founders know they should create content, but the real challenge is understanding what people are actually searching for when they're trying to solve a problem.
Was there a particular change in your content strategy that made that lesson finally click for you?
As a beginner web developer, this concept of 'visibility over intelligence' really hits home, even on a much simpler scale.
I recently built a simple visual parser for regular expressions because I realized that writing the regex wasn't the hardest part—it was seeing and understanding what it was actually doing under the hood without feeling blind.
It seems like whether it's complex AI parallel workflows or just basic coding logic, we always underestimate how crucial visibility and supervision tools are compared to the core capability. Great insight!
That's a great example. I think the pattern shows up in a lot of different areas of software.
Your regex parser highlights the same underlying idea: capability matters, but being able to see and understand what's happening often ends up being just as important. Once systems become even slightly complex, visibility becomes the thing that makes them usable.
It's interesting how often the real challenge turns out to be reducing uncertainty rather than adding more capability.
Exactly! 'Reducing uncertainty' is the perfect way to put it.
Thanks for the encouragement.
This resonates. The supervision layer is where most of the actual work is now. On a related note - branding (logos, visual identity) is often the 'visibility problem' for indie hackers shipping products. We made https://www.ailogomaker.shop/ to help with that side of things. Would love to hear how others here handle the branding side while focusing on agent development.
That's an interesting way to think about it. Visibility definitely shows up in different forms across products.
For this post, I was mostly focused on visibility inside the workflows themselves: understanding what agents are doing, what needs review, and where attention is required as systems scale. That's the area that caught us by surprise the most.
I hit the same wall.
At first I was obsessed with making agents smarter:
better reasoning, better tool use, better task completion.
Then I started running multiple agents in parallel and realized intelligence wasn't the bottleneck anymore.
The real problem became:
Feels like we went from building AI to building management software for AI.
The more agents you have, the more observability matters.
This mirrors our experience pretty closely. We started out thinking most of the effort would go into improving the agents themselves, but once multiple workflows were running, the questions shifted almost entirely toward visibility and supervision.
I like your phrase "building management software for AI." That's honestly what it started to feel like for us as well. The challenge became less about whether an agent could do the work and more about knowing where attention was needed at any given moment.
Have you found a good way to surface the few workflows that actually need intervention, or are you still experimenting with that balance?
Still figuring it out.
My current thinking is that humans shouldn't monitor agents. Agents should monitor themselves and only escalate when confidence drops, progress stalls, or a decision is needed.
The hard part is getting those escalation thresholds right.
I like that direction. If humans end up watching every workflow all the time, we've probably just recreated a different kind of busywork. My only hesitation is that agents can be very confident when they're wrong, which makes escalation based purely on confidence a bit tricky. My guess is the best signals will end up being a mix of stalled progress, uncertainty, conflicting outputs, and business impact rather than any single metric. The threshold problem feels a lot harder than the execution problem.
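A minimal sketch of what "a mix of signals rather than any single metric" could look like in code. Everything here is an assumption for illustration: the `RunSignals` fields, the `escalation_reasons` function, and both thresholds are hypothetical, not anyone's production setup. The one design choice it tries to capture is that confidence is only one input, and high-impact work gets surfaced even when the agent looks healthy.

```python
from dataclasses import dataclass


@dataclass
class RunSignals:
    """Hypothetical snapshot of one agent run; all fields are illustrative."""
    minutes_since_progress: float    # time since the last meaningful output
    self_reported_confidence: float  # 0.0 to 1.0, as reported by the agent
    conflicting_outputs: bool        # e.g. two attempts disagreed
    high_business_impact: bool       # e.g. touches billing or production
    needs_decision: bool             # agent explicitly asked for a human call


def escalation_reasons(s: RunSignals,
                       stall_minutes: float = 30.0,
                       min_confidence: float = 0.6) -> list[str]:
    """Return every reason this run should be shown to a human.

    Thresholds are placeholder assumptions. Confidence is one signal
    among several, because an agent can be confidently wrong.
    """
    reasons = []
    if s.needs_decision:
        reasons.append("decision needed")
    if s.minutes_since_progress >= stall_minutes:
        reasons.append("progress stalled")
    if s.conflicting_outputs:
        reasons.append("conflicting outputs")
    if s.self_reported_confidence < min_confidence:
        reasons.append("low confidence")
    # High-impact work is escalated even when nothing else looks wrong.
    if s.high_business_impact and not reasons:
        reasons.append("high impact, verify before acting")
    return reasons
```

An empty list means "leave it alone"; anything else is a candidate for the human's attention, with the reasons attached so the threshold choices can be tuned later.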
Marketing is, now
Great breakdown. I’m also working on a free browser-based image tools platform, and one thing I’m learning is that long-tail tool pages like exact-KB compression can bring early SEO signals faster than broad keywords.
That's an interesting observation. I've seen a similar pattern where highly specific pages often attract users with much clearer intent than broader keywords.
Curious how you're deciding which long-tail tools to build next. Are you using search data, user requests, or just testing different ideas and seeing what gains traction?
I am using search data and some tools that I developed myself.
That makes sense. Combining search data with your own tooling is probably a strong advantage since you can spot opportunities that generic keyword tools might miss.
Yes. For me the supervision line got crossed when agent work stopped being interactive and started running in parallel.
The thing I now want surfaced is not just task status, but budget burn and repeated loops:
If I were designing the control room, every task would have a tiny health card: goal, last meaningful output, current blocker, estimated spend, and next human decision needed. That is usually enough to know whether to let it run, redirect it, or kill it.
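The health card described above could be sketched as a small record plus a triage rule. The five core fields come straight from the comment; `budget_usd`, `repeated_steps`, and the `triage` rules are assumptions added for illustration, and a real system would likely need richer logic.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class HealthCard:
    goal: str
    last_meaningful_output: str
    current_blocker: Optional[str]
    estimated_spend_usd: float
    next_human_decision: Optional[str]
    budget_usd: float        # assumption: each task has a spend cap
    repeated_steps: int = 0  # assumption: count of near-identical recent steps


def triage(card: HealthCard, max_repeats: int = 3) -> str:
    """Suggest 'kill', 'redirect', or 'let run'. The rules are illustrative."""
    # Budget burn wins over everything else.
    if card.estimated_spend_usd >= card.budget_usd:
        return "kill"
    # Repeated loops suggest the task is active but not progressing.
    if card.repeated_steps >= max_repeats:
        return "redirect"
    # A blocker or a pending human decision also needs attention.
    if card.current_blocker or card.next_human_decision:
        return "redirect"
    return "let run"
```

The point of keeping the card this small is that it can be scanned for many tasks at once; the triage result is a suggestion for the human, not an automatic action.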
I like the idea of a health card. A lot of the issues we ran into weren't hard failures, they were situations where a workflow was technically active but no longer making meaningful progress. The points around repeated loops, unclear ownership and spend visibility resonate a lot. Those signals often seem more useful than simply knowing whether a task is "running" or "completed."
Agreed, but I think this will make expectations much higher, even from ourselves.
Agreed. Better capabilities tend to raise the bar rather than lower it. The more we trust these systems, the more we expect them to be understandable and reliable as well.
Hit the same wall from a different angle.
Built a Telegram RAG chatbot — the AI reasoning part
worked great. The part I was totally wrong about:
distribution. Spent 2 weeks and $200, got zero users.
Reddit blocked me (new account, no karma).
Facebook groups — post in moderation for 3 days.
Paid ads — 128 impressions, 0 clicks.
The product wasn't the problem. I just had no idea
how to reach people.
Now building LaunchOSbot — a bot that gives developers
a step-by-step promotion plan with actual post texts
ready to copy. Because "build it and they will come"
is the biggest lie in indie hacking.
Curious — how are you handling the distribution side
after you solve the agent oversight problem?
I think this is one of the biggest lessons teams learn once they move beyond demos.
The first question is "Can the agent do the job?" The next question is "Can I trust and monitor it at scale?"
In a few multi-agent systems I've worked on, including some projects at IT Path Solutions, visibility ended up being a bigger engineering challenge than the agents themselves.
Completely agree. That shift from capability to trust and visibility was what surprised us the most. It's interesting how often the conversation starts with model performance, but once these systems are running in real workflows, supervision and observability become the bigger engineering challenge.
This is a very real shift, and a lot of people hit it once they move past single-agent demos.
At small scale, you’re debugging reasoning and tool use. At multi-agent scale, you’re basically running a distributed system — and suddenly the hard part isn’t “can it do the task,” it’s “can I understand what the system is doing at any moment.”
That’s where visibility becomes the real bottleneck: state tracking, failure recovery, async execution, partial success, silent errors, all of it becomes more important than model quality.
It’s also why a lot of agent systems quietly converge toward orchestration layers, logs, queues, and dashboards — basically everything traditional distributed systems already solved, just with AI in the middle.
So yeah, that shift is very common: intelligence stops being the limiting factor surprisingly early, and supervision + observability becomes the real product.
Well said. The comparison to distributed systems resonates a lot with what we experienced. We went into it thinking the main challenge would be improving agent performance, but once multiple workflows were running in parallel, visibility, coordination, and review quickly became the bigger concern. It's been interesting to see how many of the solutions start looking like classic orchestration and observability patterns.
nice one
This is the same shift I keep seeing: once the demo works, the hard problem becomes supervision and proof.
For a website agent, I do not think "can it answer?" is enough anymore. The better questions are:
I’m seeing this while working on AnveVoice for Shopify sites. The voice layer itself is not the whole product. The real product is the loop around it: page-specific context, transcripts, routing, uncertainty handling, and a metric the merchant actually trusts.
Capability gets the first demo. Supervision is what makes it sellable.
good one
Felt the same. The second you go from one agent to a bunch running at once, you're not building an AI product anymore, you're running a system. And the hard part of running any system was never one piece doing its job; it's knowing what's happening across all of it while everything runs and things quietly break in the background.
So half of what you're hitting is an old problem in a new coat. "What's running, what's stuck, what failed while I was gone" is something ops people have been fighting for years. It's the whole reason logs and alerts exist.
But agents are worse than normal services, and here's why. A normal service fails loud: it crashes, throws an error, the dashboard goes red. An agent fails quiet. It doesn't crash; it just does the wrong thing with full confidence, or burns money going in circles, or hands you output that looks fine until you actually read it. A health check won't catch that. You need something checking whether the output is actually good, and that has no clean yes or no. That is the genuinely new part, and no tool you can buy fits your exact workflow.
When did it flip for me? When I couldn't just sit and watch anymore. With one agent, you are the supervision: you're in the room, you see it. With five agents you can't be, so you have to build that watching into something real, and building good tooling for what needs your eyes right now is the boring work nobody hands you a library for. Capability you can buy; supervision you build yourself.
Currently working on it. Thanks!
The real challenge starts when agents stop being demos and become systems. Capability gets attention, but observability, orchestration, and supervision become the actual bottlenecks.
Right now I'm trying to work on the visibility part of things, and I'd love to know where you're struggling on visibility into your agents!
Makes sense. One agent is a capability problem. Multiple agents become a coordination problem.
It was nice content and a good point of view.
Thanks for giving it a read.
The flip happened for me when I had two agents touching the same resource and realized I couldn't tell which one had last modified it. A single-agent loop you can reason about linearly; parallel ones require explicit coordination that the model itself has no incentive to maintain. Your framing — the question shifts from "can it do the task?" to "what is it doing right now?" — is exactly the right way to describe it. Supervision became urgent not when we added more agents, but the moment agents could block each other silently.
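One common way to make "which agent last modified this?" answerable is optimistic versioning: every write states the version the agent last read, and the store records the writer. The sketch below is a single-process illustration under that assumption, not a description of what the commenter built. `SharedResource`, `StaleWriteError`, and the agent names are hypothetical, and it is not thread-safe as written.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


class StaleWriteError(Exception):
    """Raised when an agent writes against a version it did not last read."""


@dataclass
class SharedResource:
    value: str = ""
    version: int = 0
    history: list = field(default_factory=list)  # (version, agent_id, timestamp)

    def read(self) -> tuple[str, int]:
        return self.value, self.version

    def write(self, agent_id: str, new_value: str, expected_version: int) -> int:
        # Reject the write if someone else changed the resource in between.
        if expected_version != self.version:
            last_writer = self.history[-1][1] if self.history else "nobody"
            raise StaleWriteError(
                f"{agent_id} expected v{expected_version}, found v{self.version} "
                f"(last modified by {last_writer})"
            )
        self.value = new_value
        self.version += 1
        self.history.append((self.version, agent_id, datetime.now(timezone.utc)))
        return self.version

    def last_modified_by(self):
        return self.history[-1][1] if self.history else None
```

With this shape, two agents colliding on the same resource produce a loud, attributable error instead of a silent overwrite, which is the visibility gap the comment describes.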
That's a great example. Once workflows start interacting with shared resources, the challenge shifts from task execution to coordination and visibility very quickly.
I like your point about agents being able to block each other silently. That's the kind of issue that's hard to notice until you're running multiple workflows in parallel. Have you found that better coordination mechanisms solved most of the problem, or do you still rely heavily on human oversight when workflows start overlapping?
what does stuck actually look like for you in practice? an agent that's silently looping, one that's waiting on an external API that never responds, and one that completed but produced garbage output are three different failure modes that need three different detection strategies. curious which of those you're catching reliably right now and which ones still slip through
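To make the three failure modes concrete, here is a rough sketch with one check per mode. The `RunSnapshot` fields, the thresholds, and especially the output check are assumptions: a length test is only a crude stand-in, since judging whether output is garbage has no clean yes or no.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RunSnapshot:
    recent_actions: list                # recent steps as strings, oldest first
    seconds_waiting_on_external: float  # time spent waiting on the current call
    finished: bool
    output: Optional[str] = None


def detect_stuck(run: RunSnapshot,
                 loop_window: int = 3,
                 wait_timeout: float = 120.0,
                 min_output_chars: int = 40) -> Optional[str]:
    """Return a label for the suspected failure mode, or None."""
    # 1. Silent loop: the last few actions are all identical.
    tail = run.recent_actions[-loop_window:]
    if len(tail) == loop_window and len(set(tail)) == 1:
        return "looping"
    # 2. External wait: still running and blocked past the timeout.
    if not run.finished and run.seconds_waiting_on_external > wait_timeout:
        return "waiting on external call"
    # 3. Finished, but the output looks too thin to trust.
    #    Placeholder heuristic only; real quality checks are task-specific.
    if run.finished and len((run.output or "").strip()) < min_output_chars:
        return "completed, output needs review"
    return None
```

The first two checks are mechanical. The third is where most of the real design work would go, which matches the point that each mode needs its own detection strategy.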
Good distinction. For us, "stuck" was usually less about a specific failure mode and more about losing visibility into where attention was needed. API failures and stalled workflows are generally easier to identify. The harder case is often a workflow that technically completed but still requires review before anyone is comfortable acting on the output. Those are the situations that pushed us toward thinking more about supervision and review queues rather than just execution status.
the completed-but-needs-review case is the genuinely hard one because there's no error signal to hang the detection on, the agent thinks it succeeded. curious how you're deciding what gets routed to review versus what ships automatically. is it confidence-based, task-type based, or something the agent itself flags about its own output?
We ended up leaning much more toward task-type and workflow-stage based review than confidence scores. One thing we noticed is that agents are often confident when they shouldn't be, so using confidence as the primary signal felt risky. In practice, anything with meaningful downstream consequences tends to land in a review queue regardless of how successful the agent believes it was.
That's partly why we started thinking in terms of supervision rather than failure detection. The question became less "did the agent think it succeeded?" and more "would a human want to verify this before acting on it?"
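Routing by task type and workflow stage, with confidence as a secondary signal, might look roughly like this. The task types, stage names, and threshold are hypothetical examples, not the actual categories used by the commenter. The key property is that high confidence can never bypass review for consequential work; low confidence can only add review.

```python
# Hypothetical categories; a real list would come from the team's own workflows.
REVIEW_REQUIRED_TASK_TYPES = {"code_change", "outbound_email", "payment"}
REVIEW_REQUIRED_STAGES = {"publish", "deploy"}


def route(task_type: str, stage: str, agent_confidence: float,
          low_confidence: float = 0.5) -> str:
    """Return 'review_queue' or 'auto_ship'."""
    # Anything with meaningful downstream consequences is reviewed,
    # no matter how successful the agent believes it was.
    if task_type in REVIEW_REQUIRED_TASK_TYPES or stage in REVIEW_REQUIRED_STAGES:
        return "review_queue"
    # Confidence is only allowed to send more work to review, never less.
    if agent_confidence < low_confidence:
        return "review_queue"
    return "auto_ship"
```

For example, `route("code_change", "draft", 0.99)` still lands in the review queue, while a low-stakes draft with reasonable confidence ships automatically.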
I'm curious whether you've found confidence signals useful in production, or if you've run into the same issue where confidence and correctness drift apart.
Spot on. Kind of wild
Exactly. That shift happened much sooner than we expected.
Are you trying to validate the idea of building a GUI for all the agents running? If so you may find the following helpful:
I have tried that before. Didn't work out for me because many of the assumptions you have are built on the model's current capabilities. When the models become stronger, two things happen:
That's a fair point, and I can definitely see that becoming true for certain types of workflows as model capabilities improve.
The challenge we kept running into wasn't necessarily understanding every reasoning step, but maintaining visibility across multiple long-running tasks and projects at once. For us, it was less about inspecting agents and more about knowing what was in progress, what needed review, and where attention was required.
I do agree that as models get better, the supervision layer will probably evolve as well. The interesting question is how much visibility people will still want once agents become significantly more reliable.
I would say most average users don't care at all. For instance, if they were willing to go through that much trouble, most SaaS today would already be dead, because anyone can reproduce them with ClaudeCode given some time. As for those who do care, e.g. professional developers, I bet their population is small enough for the big companies to focus on them and integrate the visualization part into their coding platforms (I believe ClaudeCode is doing that now with the new Desktop version and the introduction of the "artifact" feature).
I think that's possible. If models become reliable enough, most users probably won't care about the details of execution.
The part I'm less certain about is whether visibility disappears entirely or just shifts levels. Most people don't inspect individual server processes either, but they still want dashboards, status indicators, approvals, notifications, and a way to understand what's happening when something goes wrong.
My guess is that as capabilities improve, people will care less about how agents work and more about which outcomes need attention. The supervision layer may become thinner, but I'm not convinced it goes away completely. What do you think happens once agents start managing dozens of long-running workflows across projects instead of a single task?
Well I think there is a narrow market space now but I remain pessimistic about it in the future. Here is the reasoning process:
Anyway, if you execute fast enough, I believe you will get some users, which will be enough for you to build up a reputation as a founder and gain some connections. But making it a large startup is unlikely.
That's a reasonable perspective, and honestly I wouldn't be surprised if a lot of coordination eventually gets abstracted behind a smaller number of interfaces or manager agents.
Where I keep getting stuck is that even in highly automated systems, someone still owns outcomes. The question may stop being "what are all the agents doing?" and become "what needs my attention right now?" but that still feels like a supervision problem to me.
I could definitely be wrong about the size of the market, though. The thing that made me curious was seeing how quickly these visibility and coordination issues showed up once workflows became long-running and crossed project boundaries. It felt less like a model limitation and more like an operational one.
It'll be interesting to see whether future systems eliminate that layer entirely or just compress it into a much smaller surface area for humans. I suspect we'll find out pretty quickly over the next few years.
Yes. I believe the human-in-the-loop paradigm will continue to exist for a really long time. But the solution could be in a form other than a dashboard. I'm actually reading the Agent37 website and I can see why you find this question relevant. I'm also curious how people manage their agents nowadays. Let me do some research and get back to you.
If you do end up researching how people are managing agents today, I'd genuinely be interested in what patterns you find. My impression so far is that most teams have figured out how to run agents, but far fewer have figured out how to supervise them once they become part of everyday workflows.
Yeah this hits. Same exact thing happened with distributed systems, the second you go from one process to a bunch running at once the hard part stops being "does it work" and becomes "can I even see what's happening and jump in when something breaks."
I'm in the cloud infra world and watching agents run into this is kinda wild. It's the same supervision/observability problem infra spent like a decade figuring out, just way faster now since agents are non-deterministic on top of being parallel.
What'd you end up building for the visibility side? Dashboards, logging, some human in the loop thing?
That's a great comparison. The more we worked with multiple workflows, the more it started to feel like an observability problem rather than an agent problem. We ended up focusing on a mission control approach where we could see task status, review queues and workflow progress in one place. The goal wasn't to control every step but to make it obvious where attention was needed and when human intervention made sense.
This is very relatable. The agent usually is not the problem anymore, keeping track of multiple workflows is. Once you have several things running at the same time, knowing what's finished, what's stuck, and what needs attention becomes a challenge on its own.
Felt this exact shift. The moment you go from one agent to a fleet running in parallel, the bottleneck stops being "is it smart enough" and becomes "can I trust what it did while I wasn't looking."
What helped me most was flipping it from monitoring to designing for supervision up front — building the agents so the dangerous parts can't fail silently:
Dashboards tell you what's stuck — but that design is what stops a stuck or confused agent from quietly doing damage. Capability you can improve later; a fleet you can't see into burns you on day one.
For me the line was crossed the moment agents could take real actions, not just generate text. Read-only parallelism is easy to babysit; the second they can do things, supervision becomes the whole game. Where did it tip for you: the number of agents, or the actions they were allowed to take?
I think it was a combination of both, but the number of concurrent workflows made the problem impossible to ignore. Individually, most tasks were manageable. The challenge appeared when multiple agents were running across different projects and we no longer had a clear picture of what was completed, blocked, or waiting for review. I completely agree on designing for supervision upfront. We found that visibility, review points and clear task status became just as important as the agent capabilities themselves. Once agents start taking meaningful actions, knowing when and where to intervene becomes critical.
This resonates. The shift you're describing — from "can the agent do the task" to "what's running, what's stuck, what failed while I was away" — is the moment an agent stops being a tool and becomes a team you manage. And managing always scales worse than building.
For us the inflection was that parallel runs make failure quiet. One agent failing is obvious; one of six failing silently at 2am is an ops problem, not an AI problem. What helped: treat every run like a job in a queue — explicit states (queued / running / needs-review / failed), one timeline view, and forcing each run to end in a reviewable artifact instead of just "done." Supervision got easier once "what happened" was a record, not a memory.
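The "job in a queue" idea above can be sketched as a small state machine. The four states are the ones named in the comment; the transition table, the `Run` class, and the artifact rule are assumptions about how one might enforce them, not the commenter's actual implementation.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum
from typing import Optional


class State(Enum):
    QUEUED = "queued"
    RUNNING = "running"
    NEEDS_REVIEW = "needs-review"
    FAILED = "failed"


# Assumed transitions: a run never goes straight from queued to "done".
ALLOWED = {
    State.QUEUED: {State.RUNNING, State.FAILED},
    State.RUNNING: {State.NEEDS_REVIEW, State.FAILED},
    State.NEEDS_REVIEW: set(),
    State.FAILED: set(),
}


@dataclass
class Run:
    name: str
    state: State = State.QUEUED
    artifact: Optional[str] = None
    timeline: list = field(default_factory=list)  # (timestamp, state, note)

    def move(self, new_state: State, artifact: Optional[str] = None, note: str = ""):
        if new_state not in ALLOWED[self.state]:
            raise ValueError(f"{self.state.value} -> {new_state.value} is not allowed")
        # A successful run must end in something a human can review.
        if new_state is State.NEEDS_REVIEW and not artifact:
            raise ValueError("a finished run must attach a reviewable artifact")
        self.state = new_state
        self.artifact = artifact or self.artifact
        self.timeline.append((datetime.now(timezone.utc), new_state.value, note))
```

Merging the `timeline` lists of all runs and sorting by timestamp gives the single timeline view, so "what happened" is read from a record rather than reconstructed from memory.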
Curious what you landed on for the review step — human-in-the-loop gating per task, or let it run and triage after?
"One of six failing silently at 2am is an ops problem, not an AI problem" is a great way to put it. That's very close to what we experienced.
We ended up leaning toward explicit task states and a review queue rather than treating completion as the end of the workflow. Having a clear view of what was running, blocked, waiting for review or completed turned out to be far more valuable than simply knowing a task had finished.
For review, we've generally found human approval works best at key decision points rather than on every step. Otherwise the overhead starts defeating the purpose of the automation.
We ran into something similar. Once you have several tasks running at the same time, visibility becomes a much bigger issue than execution. It's easy to underestimate how much context switching happens when you are monitoring everything manually.
I feel like this is where most teams eventually end up. The more capable the agents get, the more important monitoring becomes. Reliability matters but knowing what's happening across the system matters just as much.
At what point did supervision become a bigger challenge than the actual agent performance? Was there a specific workflow or project that made this problem obvious?
It became obvious once we started running multiple workflows in parallel across different projects. Individually they were manageable but keeping track of progress, reviews and blocked tasks across all of them quickly became the bigger challenge.