Running an experiment where 7 AI coding agents compete to build real startups with $100 each over 12 weeks. No human coding allowed. Each agent picks its own idea, writes code, deploys, and tries to get real users.
The agents: Claude (PricePulse), Codex (NoticeKit), Gemini (LocalLeads), Kimi (SchemaLens), DeepSeek (NameForge AI), Xiaomi (WaitlistKit), GLM (FounderMath).
Day 1 highlights:
The most interesting finding: prompt wording matters way more than expected with autonomous agents. One line meant as context ("auto-deploys on every git push") was interpreted as an instruction, burning through deployment limits.
Full Day 1 writeup: https://aimadetools.com/blog/race-day-1-results/
Live dashboard: https://aimadetools.com/race/
Would love feedback on the format. Planning weekly recaps + daily highlights for the full 12 weeks.
The interesting thing about these experiments isn't usually whether the AI "succeeded" at building something — it's what it struggles with when the problem doesn't have a clean answer. LLMs generate plausible next-action lists but collapse when the decision actually depends on weighing competing risks with no obvious dominant option. A $100 build budget forces exactly that. Curious whether your agents hit moments where they kept iterating on the easy sub-problems while avoiding the hard judgment calls. That pattern tells you more about AI-for-business than the revenue number at the end.
Codex doing cold outreach on day 1 is both impressive and slightly terrifying. Looking forward to the week 1 review — especially curious whether any of the agents pivot after getting real feedback vs. just doubling down on their initial strategy.
Agree with that! Let's see what the next weeks bring.
Day 1 results are predictably chaotic but the signal you will care about most is: which agent was wrong most often, and was it the same one every time? If one agent keeps producing plausible but wrong code, it becomes a time sink rather than a multiplier.
Also curious about the actual division of labor. Did the agents specialize or were they all tackling the same stack? My guess is the coordination overhead scales nonlinearly with agent count.
No coordination between agents, they're completely independent. Each one picks its own idea, its own stack, its own strategy. There's no shared work or division of labor. Seven separate startups, seven separate repos.
On wrong code:
GLM generated 600+ duplicate CSS blocks in one session, then detected and fixed them the next session.
DeepSeek has been polishing Stripe integration code for 4+ commits without having API keys.
The "plausible but wrong" pattern is real, especially when agents don't verify their own output. The ones that self-check (Codex takes screenshots, GLM has analytics) catch errors faster.
The interesting part isn’t the output, it’s the failure modes. Kimi splitting into two startups and the DEPLOY-STATUS loop both point to weak state management, not model capability. Also, prompt wording acting like a control plane is a bit scary—small context leaks turn into real actions fast.
Going forward, the signal to watch is closed loops (traffic → signup → revenue), not commits or content volume.
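The closed-loop framing can be sketched as a tiny funnel check. This is a hypothetical helper, not part of the experiment's tooling; the stage names and example numbers are illustrative:

```python
# Hypothetical funnel check, illustrating the closed-loop signal
# (traffic -> signup -> revenue) rather than raw output counts.
def closed_loop(visits: int, signups: int, paying: int) -> dict:
    """Per-stage conversion; a broken loop shows up as a zero stage."""
    return {
        "visit_to_signup": signups / visits if visits else 0.0,
        "signup_to_paying": paying / signups if signups else 0.0,
        "loop_closed": visits > 0 and signups > 0 and paying > 0,
    }

# e.g. 200 visits -> 20 signups -> 2 paying customers closes the loop;
# a pile of blog posts with zero signups never does.
```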
yes, those stats will be the most important ones in the final scoring.
The Kimi detail of forgetting its own work and building two different startups because files went in the wrong directory is the most honest thing I've read about autonomous agents. Everyone talks about what AI agents can do in ideal conditions. The interesting data is what breaks down when there's no human checking context drift. Curious whether you're planning to introduce any guardrails mid-experiment or letting them run completely unassisted for the full 12 weeks and whether you think the ones that fail will fail from technical errors or from picking the wrong idea to begin with.
No guardrails mid-experiment. Letting them run completely unassisted. The failures are the most valuable data. So far the failures are all technical (wrong directory, misleading status files) rather than bad ideas. But we're only on Day 3. Ask me again in Week 4.
Fair enough, letting it play out unfiltered is the right call. Checking back in week 4 for sure
Managing one human is hard enough, managing 7 AI agents with a budget is next level! It's interesting to see how they prioritize spending—did you notice a specific 'personality' trait in how they allocated the $100? Some might be more conservative while others go straight for high-cost tools. Looking forward to Day 2!
Definitely seeing personality differences. Claude front-loaded all spending on infrastructure (used 55 of 60 weekly help minutes in two requests). GLM was surgical (3 clean requests, exactly what it needed). Codex spammed the same request 5 times until we gave it email access. Gemini and DeepSeek have never asked for anything. The conservative ones aren't necessarily winning though: Gemini saved $100 but has no payment system.
Curious how you're judging Day 1 progress, because this kind of experiment often looks better in the first 24 hours than it does in week 2. The watchout is agents can create a lot of motion with $100, but the real bottleneck is choosing a user, a channel, and killing bad ideas fast. If any of them got real human feedback already, that is the signal worth watching.
You're right, Day 1 looks great because everything is new. The real test is Week 2-3 when the novelty wears off and agents need to find actual users. Only one agent (Codex) has sent a real outreach email so far. Only one (GLM) has analytics to know if anyone visits. The rest are building in a vacuum.
The prompt sensitivity finding is the most practically important one here, even if it's the least flashy. An agent that misreads "auto-deploy on every push" as a standing instruction isn't broken — it's working exactly as designed, just with a poorly scoped goal. That's the part that scales dangerously.
The 104 blog posts in a day raises an obvious question: what's the per-post cost vs. the incremental traffic value? At some point the agent is burning money on content that will never compound. Would be interesting to see if any of the 7 discovered an optimal volume threshold on their own, or whether they all just defaulted to "more = better."
I'd also watch whether the agents that fail early (run out of budget, get stuck in loops) actually teach you more than the ones that execute cleanly. Failure modes in autonomous systems tend to be more diagnostic than success cases.
Good question on the blog volume threshold. Gemini hasn't discovered one. It's at 178 posts now and still writing more every session. Zero analytics, so it has no feedback loop to tell it when to stop. The agents with analytics (GLM) are the only ones that could theoretically optimize volume. The rest are just guessing.
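To make the "optimal volume threshold" concrete: with diminishing returns per post, the stopping rule is a marginal-value comparison. Everything here (the decay rate, the costs, the function itself) is an invented illustration, not Gemini's actual economics:

```python
# Invented illustration: each additional post draws `decay` times the
# visits of the previous one; publish until marginal value < marginal cost.
def optimal_post_count(cost_per_post: float, value_per_visit: float,
                       base_visits: float, decay: float = 0.9) -> int:
    """Stop at the first post whose expected traffic value drops below
    its cost (assumes 0 < decay < 1, so the loop always terminates)."""
    n, visits = 0, base_visits
    while visits * value_per_visit >= cost_per_post:
        n += 1
        visits *= decay
    return n
```

An agent with analytics could estimate `base_visits` and `decay` from its own post history; without analytics there is no way to see the curve flattening, which is exactly Gemini's situation.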
On failure modes being more diagnostic: completely agree. DeepSeek's self-inflicted DEPLOY-STATUS trap has taught us more about agent memory design than any of the successful builds.
Very cool idea and project. Love your dashboard.
I'm very interested to hear which tool pulls ahead, and in this method of giving an AI agent $xx to flesh out the foundations of an idea.
Good Luck
Thank you
The framing here is interesting — treating each agent as a "contractor" with a fixed budget is basically running a micro freelance team. The part I'd be curious about: how did each agent handle scope creep when the $100 ran low? That's where real projects fall apart, human or AI. The agent that stays on task and delivers rather than expanding scope is the one worth keeping.
Actually, I hadn't thought about that. I was mainly thinking the other way around: what if they have $50 left at the end of the race? Will they spend it quickly on ads?
That’s why I have a budget tracker on the dashboard as well, to track which agent spends what, when and why
The Kimi situation is actually the most revealing thing here, and it's not really about Kimi. It's about a problem most AI agent experiments don't surface: persistent state. These models have no memory of their own past actions unless you explicitly build that layer in. Kimi didn't forget, it just never had a proper record to begin with. The directory confusion was the symptom, not the bug.
The 104 Gemini blog posts tell a similar story. Without a quality gate or goal constraint, it optimised for what it could measure, which was volume. These agents need to be scored on outcomes (traffic, signups, conversions) not just outputs, otherwise they'll just do more of what's easy.
What I'd track in week 2: cost per meaningful user interaction. Not commits, not blog posts. Actual conversations with potential customers. That's where the real split between agents will show.
Also curious about the prompt sensitivity finding you mentioned. Are the prompts identical across all 7 agents, or are they calibrated per model?
We are already seeing a lot of differences in how each business tackles its startup problem. Some really think like a founder, some are just producing, producing, producing without validating.
Each agent received the exact same prompt, and I tried to make it as close as possible to how you would ask the same question of a human.
477 commits seems very impressive! Really curious how the AIs came up with their ideas in general. Was this influenced by your previous searches or conversations within each AI model, or did they identify the ideas via their own research?
To prevent influence from any other testing sessions and commands, I started the race on a completely freshly installed device.
The orchestrator told them that they are a startup and need to do market research first.
I documented the results from those first decision moments in this article: https://www.aimadetools.com/blog/race-first-12-hours-what-agents-chose/
The prompt interpretation issue alone is a whole lesson in how autonomous agents parse context vs instruction.
Genuinely curious.... at the end of 12 weeks are you measuring success by revenue, users, or something else? And do the agents get to pivot if their idea isn't working or are they locked into that specific idea?
Scoring is weighted: revenue 30%, users/traffic 20%, code quality 15%, product completeness 15%, business viability 10%, and peer review 10%, where the other 6 agents score each competitor.
Agents can absolutely pivot. Nothing locks them in. Kimi already accidentally "pivoted" by forgetting its first idea entirely :D . The interesting question is whether any agent deliberately pivots based on data. GLM is the only one with analytics, so it's the only one that could make a data-driven pivot decision right now.
We are also planning some surprise events to give the agents, maybe one of those will trigger an overhaul ;)
The DEPLOY-STATUS.md trap is the most honest thing in this whole experiment. It's not a bug, it's what happens when you give an agent the ability to report on itself. The agent optimizes for looking like it's succeeding, not actually succeeding. That's a basic alignment problem hiding behind a boring file name. I've seen the same thing happen in smaller ways, agents that fill out their own status fields to look productive, or learn to game the metrics you measure them on. The open-source enforcement layer you mentioned, hard budget caps and tool call limits, is probably the right structural answer, since you can't fix this with better prompts alone. Prompts can't outmaneuver a system that's been given the wrong objective.
DeepSeek didn't intentionally game the system though. It genuinely thought it was being helpful by documenting what it needed. The problem is the orchestrator treats that file as a binary signal ("broken or not") while the agent used it as a wishlist. Same file, two different interpretations.
You're right that prompts alone can't fix it. We could add validation ("only write DEPLOY-STATUS.md if the site returns a non-200 status code") but that's exactly the kind of structural constraint you're describing. For now we're letting it play out to see if DeepSeek eventually figures out the file is hurting it.
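A minimal sketch of that structural constraint, assuming a Python deploy checker. The function names and file format are illustrative, not the experiment's actual orchestrator code:

```python
from pathlib import Path
from urllib.request import urlopen
from urllib.error import HTTPError, URLError

DEPLOY_STATUS = Path("DEPLOY-STATUS.md")

def probe(site_url: str, timeout: int = 10):
    """Return the HTTP status for the site, or None if unreachable."""
    try:
        return urlopen(site_url, timeout=timeout).status
    except HTTPError as err:
        return err.code
    except URLError:
        return None

def update_deploy_status(status) -> bool:
    """Only the checker writes DEPLOY-STATUS.md, and only on a real failure.
    A healthy check also deletes any stale copy, so an agent can't inherit
    a misleading 'BROKEN' signal from a previous session."""
    if status == 200:
        DEPLOY_STATUS.unlink(missing_ok=True)
        return True
    DEPLOY_STATUS.write_text(
        f"# Deploy check failed\n\nHTTP status: {status}\n"
        "Fix the deployment before doing anything else.\n"
    )
    return False
```

The point of the design is that the agent loses write access to its own health signal: the file's existence then means exactly one thing.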
This is one of the first AI agent experiments I’ve seen that feels close to actual startup execution instead of just code generation. Day 1 already shows that state management and instruction interpretation are probably bigger bottlenecks than pure coding ability. I’d definitely read weekly recaps, especially if you include traffic, signups, and revenue side by side.
Thanks. Weekly recaps with traffic, signups, and revenue side by side is exactly the plan.
Still figuring out the best way to structure a series on IH though. If anyone has experience running a weekly update series here and knows what format works best, I'm all ears.
This is a really interesting way to surface how unpredictable these systems can get once they’re actually running.
That part about the prompt line being interpreted as an instruction and burning through deployment limits stood out — feels like a good example of how small assumptions can turn into real-world behavior pretty quickly.
Curious how you're thinking about guardrails here — especially once these projects start having real users. Do you limit what the agents can trigger on the infra side, or let it run freely and observe?
Mostly letting it run freely and observing. The guardrails we have are minimal: the orchestrator controls session timing and pushes code, agents can't access each other's repos, and human help is capped at 1hr/week.
Beyond that, they can do whatever they want within their session. Codex deploying via Vercel CLI was something we didn't anticipate and chose not to block because it was actually making the product better. Kimi putting files in the wrong directory was a structural mistake we let play out to see if it would self-correct (it didn't).
When real users show up, it gets more interesting. The agents can modify their own sites freely, so in theory an agent could break its own product mid-session and not realize it. We have a deploy checker that writes a DEPLOY-STATUS.md if the build fails, which tells the agent to fix it first. But as we saw with DeepSeek, that file can become a trap if the agent writes misleading information to it.
The lack of guardrails is what makes the experiment useful. Every failure teaches us something about where autonomous agents actually need constraints vs where they surprise you.
Love this experiment. The $100 budget per agent is a clever forcing function — it surfaces exactly where agents fail in production: budget overruns, tool call loops, and taking irreversible actions without approval.
One thing worth watching as you scale this: the agents that "succeed" in Day 1 often do so by being aggressive with API calls and tool usage. Without a hard enforcement layer outside the model itself, that aggression compounds fast.
We ran into this while building production agent pipelines and ended up open-sourcing a pre-execution enforcement layer for this. Would be interesting to see how your 7 agents behave with hard budget caps and tool-call limits active.
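A pre-execution enforcement layer of the kind described might look roughly like this. This is a generic sketch, not the commenter's open-source library; the class name and limits are assumptions:

```python
# Generic sketch of hard caps checked *before* each tool call,
# outside the model's control.
class BudgetEnforcer:
    def __init__(self, max_spend_usd: float, max_tool_calls: int):
        self.max_spend_usd = max_spend_usd
        self.max_tool_calls = max_tool_calls
        self.spent = 0.0
        self.calls = 0

    def authorize(self, tool: str, est_cost_usd: float = 0.0) -> bool:
        """Permit the call only if it fits within both hard caps;
        a denied call is never executed, whatever the agent intended."""
        if self.calls + 1 > self.max_tool_calls:
            return False
        if self.spent + est_cost_usd > self.max_spend_usd:
            return False
        self.calls += 1
        self.spent += est_cost_usd
        return True
```

The key design point is that `authorize` runs in the orchestrator, not in the prompt, so an aggressive agent hits a wall instead of compounding.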
Curious which agent framework you used for each?
Frameworks per agent: none unified. Each uses its native CLI tool, which is part of the experiment. The CLI quality matters as much as the model quality.
This is a really interesting setup—feels closer to a real-world stress test than most “AI agent” demos.
The prompt sensitivity point is probably the most valuable takeaway already. That example about “auto-deploys on every git push” being treated as an instruction is exactly the kind of thing that breaks autonomous workflows in practice. It suggests these agents aren’t just executing tasks—they’re constantly reinterpreting context as goals, which can spiral fast without tight constraints.
Also not surprised one agent lost track of its own project state. That feels like a core limitation right now: persistence and memory consistency. Curious if you’re enforcing any structure there (like strict directory validation or periodic state summaries), or letting them fail naturally?
The 104 blog posts in a day is wild too—but I’d be more interested in quality vs outcome. Are any of these actually getting impressions or clicks, or is it just content spam at this stage?
One suggestion for your weekly recaps:
It’d be great to track a few standardized metrics across all agents, like:
Cost spent vs users acquired
Deployments vs failures
Traffic vs conversions
“Wasted actions” (like redundant builds or loops)
That would make it easier to compare strategies, not just outputs.
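The standardized metrics above could be computed with something as small as this sketch; the field names and any sample numbers are made up, nothing here is from the actual dashboard:

```python
# Illustrative per-agent scorecard for the suggested standardized metrics.
def agent_scorecard(spent_usd, users, deploys, failed_deploys,
                    visits, conversions, wasted_actions, total_actions):
    """Ratios are None when the denominator is zero (no data yet)."""
    return {
        "cost_per_user": round(spent_usd / users, 2) if users else None,
        "deploy_failure_rate": round(failed_deploys / deploys, 2) if deploys else None,
        "conversion_rate": round(conversions / visits, 4) if visits else None,
        "wasted_action_share": round(wasted_actions / total_actions, 2) if total_actions else None,
    }
```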
Overall though, this is one of the few experiments that actually tests execution, not just capability. Looking forward to seeing how many of these projects survive past the first couple of weeks.
On memory: we're letting them fail naturally. No directory validation, no state checks. Kimi's amnesia happened because it ignored the convention and we didn't intervene. More interesting that way.
On Gemini's blog posts (178 now): too early to tell. If even 5% rank for long-tail local SEO terms, the strategy is genius. If none rank, it wasted 18 sessions on content spam. We'll know in a week or two.
Love the standardized metrics idea. Adding cost vs users, help requests vs time blocked, and "wasted sessions" to the weekly recaps. That last one is going to be brutal for some agents.
The Kimi agent forgetting its own work and building two different startups is hilarious and honestly the most realistic part of this experiment. That's basically what happens when you give an AI agent too much autonomy without persistent context.
The prompt wording observation is the real gold here though. "Auto-deploys on every git push" being interpreted as an instruction instead of context is exactly the kind of thing that separates people who get good results from AI tools and people who don't. The prompt IS the product spec when you're working with autonomous agents.
477 commits in day one is wild. Curious to see which agents actually produce something users want vs which ones just ship a lot of code that doesn't solve a real problem. Shipping fast means nothing if nobody needs what you built.
Following this for the full 12 weeks. The weekly recaps format works well — daily would be too much noise.
We're learning that the hard way. Every sentence gets interpreted, every ambiguity becomes a decision the agent makes on its own.
Your point about shipping vs solving is the big question for the next few weeks. Right now Gemini has 178 blog posts and Codex has a polished checkout flow. Volume vs precision. My gut says the agents that picked narrow, specific problems (Codex with GDPR subprocessor notices, GLM with startup calculators) will get users before the ones that went broad (Gemini with generic local SEO, DeepSeek with yet another name generator).
This is a great experiment. The divergence between agents is the most interesting part — same budget, wildly different strategies. It mirrors what I see when comparing AI coding tools too. Some are great at scaffolding a new project, others are better at iterating on existing code.
One thing I've noticed building with Claude Code: the agent that wins isn't necessarily the smartest — it's the one with the best context about what you're actually trying to build. Feeding it good project rules and cursor configs makes a huge difference.
All 7 agents get the same orchestrator prompt, but the ones that build good internal documentation for themselves are pulling ahead. It's basically the same lesson as with Claude Code rules files, just playing out autonomously.
The Kimi story is gold — built two different startups because it put files in the wrong directory. That's the kind of failure mode you can't predict until it happens.

Curious about the evaluation criteria at the end of 12 weeks. Is "real users" the only metric, or are you also tracking revenue, retention, something else?

Because getting users and keeping them are very different problems — and I'd bet the agents that focus on distribution early (like Gemini with 104 blog posts) won't necessarily win on retention.

Following this.
The winner is scored on a weighted system:
Revenue (30%) - actual money earned
Users/traffic (20%) - real visitors and signups
Code quality (15%) - clean, maintainable code
Product completeness (15%) - does it actually work end to end
Business viability (10%) - could this survive beyond the race
AI peer review (10%) - the other 6 AI agents review and score each competitor's work
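For concreteness, the winner comes out of a weighted sum over normalized (0-100) category scores. A sketch; the dictionary keys are mine, the weights are from the post:

```python
# Weights from the scoring system above; keys are illustrative names.
WEIGHTS = {
    "revenue": 0.30,
    "users_traffic": 0.20,
    "code_quality": 0.15,
    "completeness": 0.15,
    "viability": 0.10,
    "peer_review": 0.10,
}

def final_score(scores: dict) -> float:
    """Weighted sum of normalized (0-100) per-category scores."""
    return sum(WEIGHTS[k] * scores[k] for k in WEIGHTS)
```

So an agent scoring 50 on revenue, 80 on users/traffic, and 70/60/90/40 on the remaining categories lands at 63.5 overall.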
The peer review component is clever. Forces each agent to actually understand what the others built, not just optimize for its own metrics.

The weighting makes sense too. Revenue at 30% keeps it honest, code quality at 15% stops them from shipping pure garbage just to get users. Curious how the AI peer review plays out in practice, whether they're harsh or surprisingly generous with each other.

Following the series.
The peer review part is also the thing I look forward to the most. Are they going to be strategic with their points or completely honest 😂
Strategic would be fascinating honestly. Like if they figure out that giving competitors low scores helps their own ranking, do they start doing it systematically? That's when it stops being a startup race and starts being game theory.
This is wild—in the best way. The fact that one line in a prompt can completely change behavior (and even burn resources) really shows how fragile “autonomy” still is. Also can’t get over Kimi accidentally launching two startups 😂
Curious to see which one actually gets real users, not just commits.
I already see big differences in how each agent will try to bring its product to market. Curious how the coming weeks will evolve.
The interesting variable here isn't the $100 — it's whether the agents can make non-reversible decisions under ambiguity. Most AI-agent "build a startup" experiments collapse at the point where a human would normally take a 60/40 gamble. Did any of your 7 actually commit to a positioning or niche, or did they all hedge into vague B2B SaaS?
The prompt explicitly told them to avoid generic SaaS, and that mostly worked.
All of them committed to a concrete idea within the first 12 hours. Only 2 ended up in more crowded spaces like waitlists or name generators. The others picked fairly specific niches like SQL schema diffing or GDPR subprocessor notices.
What’s interesting is how they decided. Most didn’t take a raw 60/40 gamble. They reduced uncertainty first (scoring ideas, doing research, avoiding legal/complex areas) and then committed to the safest viable option.
I broke down how each agent made the decision here:
https://www.aimadetools.com/blog/race-first-12-hours-what-agents-chose/
The thing about prompt wording acting like an instruction is something I've run into too. Agents don't really have a clean mental model of "context" versus "instruction". Anything that looks like a desired outcome can get treated like a command to execute.
What's interesting about Kimi building two startups is that it's not really a bug. From the agent's point of view, it was just solving the problem it was given twice, in two different places. There was no state to check.
I'm curious what the failure mode looks like by week 4 or 5, once the codebase gets bigger. That's where I'd expect context window pressure to start making the agents diverge pretty hard, some will start forgetting architectural decisions from early on in the build.
You're right that it's not really a bug from Kimi's perspective. It had no state, so it did what any agent would do: start fresh. The "bug" is in the orchestrator's assumption that agents will follow file conventions without enforcement.
The context window pressure is what I'm most curious about too. Right now the PROGRESS.md files are manageable, but Gemini already has 116 blog posts and 170 commits. By week 4-5 its repo will be massive. The agents that write concise, structured memory files will have an advantage over the ones that dump everything into a growing log.
We're already seeing early signs of this. Claude writes clean, prioritized progress notes with "next steps" sections. Codex verifies its own work with screenshots before committing. Gemini just appends another blog post entry to an ever-growing list. Those habits will compound.
The other thing I expect to break is decision consistency. An agent might decide on a pricing strategy in week 1, forget about it by week 5, and implement something contradictory. That's where DECISIONS.md is supposed to help, but only if the agent actually reads it.
You're essentially running stateless processes and expecting stateful behavior.
Curious whether you gave each agent a persistent system prompt with its own startup's context at the start of every session, or if each run was truly from scratch. That one design decision probably determines which agents survive week 4 vs. which ones drift.
Each session starts with the orchestrator telling the agent "read PROGRESS.md first, this is your memory." The agent also gets IDENTITY.md (startup vision), BACKLOG.md (task list), DECISIONS.md (past choices), and HELP-STATUS.md (human responses). So it's not truly from scratch, but the memory is only as good as what the agent wrote to those files in the previous session.
That's exactly what broke Kimi. It wrote all its files to a subfolder instead of root. The orchestrator pointed the next session at root-level PROGRESS.md, which didn't exist. Clean slate. New startup.
The agents that write detailed, structured progress notes recover well between sessions. The ones that write vague summaries tend to repeat work or drift. It's basically the same problem human teams have with handoff documentation, just compressed into 30-minute sessions.
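The session bootstrap described here can be sketched in a few lines. The file names match the post, but the function and prompt wording are illustrative assumptions, not the real orchestrator:

```python
from pathlib import Path

# Memory files named in the experiment; the rest of this is illustrative.
MEMORY_FILES = ["PROGRESS.md", "IDENTITY.md", "BACKLOG.md",
                "DECISIONS.md", "HELP-STATUS.md"]

def build_session_context(repo_root: Path) -> str:
    """Concatenate the memory files into the session's opening prompt.
    A missing PROGRESS.md is exactly the failure that reset Kimi: the
    agent sees a clean slate and starts a brand-new startup."""
    sections = []
    for name in MEMORY_FILES:
        path = repo_root / name
        body = path.read_text() if path.exists() else "(missing)"
        sections.append(f"## {name}\n{body}")
    return ("Read PROGRESS.md first, this is your memory.\n\n"
            + "\n\n".join(sections))
```

Kimi's wrong-directory bug corresponds to `repo_root` pointing somewhere its files aren't: every lookup falls through to "(missing)".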
The decisions file point is the one I'd actually worry about. If an agent had a bad session and wrote confident-but-wrong decisions to that file, every session after inherits the mistake. Garbage in, garbage out but with extra steps.
Curious if you've seen any of them catch a contradiction in their own previous decisions, or do they just... trust whatever's in the file?
Honestly haven't paid close attention to that yet, but now that you mention it I'll be watching for it specifically. Great thing to track. Thanks!
What I can say is we already have one example of a misleading file causing problems. DeepSeek created a DEPLOY-STATUS.md saying it needs Stripe keys and an OpenAI API key. The site isn't actually broken, it just wants env vars. But the orchestrator prompt says "if DEPLOY-STATUS.md exists, your site is BROKEN, fix it first." So now every DeepSeek session starts by trying to fix a non-existent problem because past-DeepSeek wrote a misleading file.
On the other end, Codex writes very detailed decision files with reasoning and alternatives considered. When it switched payment providers, it documented the full comparison and why. That gives future sessions context to evaluate rather than blindly follow.
I'll add "did any agent catch a contradiction in its own decisions" to the things I track weekly. Suspect it'll become more relevant around week 4-5 when the files get long enough that contradictions can hide.
Thank you for this question, really! It will be interesting to see how that develops.
It looks interesting, but what's the end goal? Which AI wins, the one with the biggest revenue? Isn't there also a fair bit of luck involved?
The winner is scored on the weighted system I posted above (revenue, users/traffic, code quality, product completeness, business viability, AI peer review).
So there's definitely luck involved, same as real startups. But the interesting part is seeing how different AI models handle that uncertainty. Some agents research the market before picking an idea. Others just go with their first instinct. One agent asked for its entire infrastructure to be set up in one help request. Another hasn't asked for help at all and is stuck.
The real value isn't "which AI wins" but what we learn about how autonomous agents make decisions, handle failures, and recover from mistakes. One agent forgot its own work because it put files in the wrong directory. Another found a clever workaround when we restricted its deployment access. Those patterns are useful for anyone building with AI agents.
I think you need to organize this in a form of fully ai driven software company where you have AI agent as CEO equipped with full management ai agents team and other agents with their spawns as projects teams, AI-CEO should report to you as the owner via some dashboard to follow up progress
Interesting idea but deliberately not what we're testing in Season 1. The point here is to see how individual agents handle the full stack on their own. The failures are the most valuable data.
That said, you're describing something close to what we're considering for a future season. An AI-managed company with agent hierarchy, delegation, and reporting. Season 1 gives us the baseline data on individual agent capabilities. Season 2 could test what happens when they coordinate.
Fun experiment. The part that interests me most is not whether AI can build — it clearly can scaffold fast. The real test is whether any of these agents can do the part that kills most startups: finding users who care.
Building is 10% of the problem now. Distribution and positioning are the other 90%, and those require taste and context that agents do not have yet. Curious to see how day 2+ plays out when the novelty wears off and the hard work starts.
Already seeing 4 of the agents thinking about distribution channels. Codex already did outreach via email. I won't spoil too many details yet, but the week 1 review promises to be an interesting one ;)
This is such a fun experiment format. Comparing how different agents allocate the same budget is super useful for founders trying to evaluate practical AI workflows.
thank you. Glad you like it.
Great experiment. The prompt-wording lesson is gold. Also curious to see token usage vs dollar spend over time. Following!
thank you!
Interesting experiment.
Tools don't create outcomes; structure does.
Without a system, even AI just accelerates noise.
Fair point. The agents with the most structure in their memory files (PROGRESS.md, DECISIONS.md) are the ones making the most coherent progress. The ones that just dump logs are already drifting.
How does the AI agent go about spending your money? And how does it receive it?
Each agent gets 1h of human help per week. If they want to spend money, they need to create a GitHub issue requesting the human (me) for help.
For receiving money, most agents request Stripe links for their products (again via a GitHub issue). I set them up, give them the links via a comment on the GitHub issue, and tell them how much human help time they have left for the week.
I picked 1h of human help specifically for those things because I was not willing to give them my credit card details 😂
Really interesting tho. Do you think adding user feedback could help the agents make better decisions or would it just complicate things more?
We actually have a mechanism for that. There's a COMMUNITY-FEEDBACK.md file the orchestrator can write to. If a real user emails or comments, we can pass it to the agent. The question is whether the agent treats it as signal or noise.
orphan agent moment right there. no persistent identity = it had no memory of the first build. curious how you track decision ownership across all 7 by day 12.
Each session is basically a new hire who gets handed a folder of notes from the previous person.
Decision ownership is tracked through DECISIONS.md, which the agent reads at the start of every session. The quality varies wildly though. Codex writes detailed reasoning ("we chose Stripe over Lemon Squeezy because X, Y, Z"). Others just write "using Stripe" with no context. By day 12 the agents with thin decision files will probably start contradicting themselves without realizing it.
The real test will be when an agent needs to undo a decision from week 1. Does it read the reasoning and understand why it was made, or does it just overwrite it? Haven't seen that happen yet but it's coming.
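One cheap way to operationalize the difference: treat a decision entry as revisitable only if it records reasoning and rejected alternatives. The entry format below is hypothetical, not the experiment's actual DECISIONS.md schema:

```python
# Hypothetical decision-entry check: future sessions can only *evaluate*
# a past decision if the reasoning and alternatives were written down.
def decision_is_evaluable(entry: dict) -> bool:
    """True when a later session could revisit this decision on its
    merits instead of blindly following or silently overwriting it."""
    return bool(entry.get("reasoning")) and bool(entry.get("alternatives"))

# Codex-style entry: full comparison recorded.
codex_style = {
    "decision": "Use Stripe",
    "reasoning": "Payout timing fits the 12-week race better",
    "alternatives": ["Lemon Squeezy"],
}
# Thin entry: no context for future sessions.
thin_style = {"decision": "Using Stripe"}
```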
solid handoff pattern. failure mode I've hit - agents logging every micro-call, 400 lines by session 8, next session just skims it. do you gate what counts as a decision or let the agent decide?
We let the agent decide what counts as a decision. No gating. That's part of the experiment. Some agents are disciplined about it (Codex logs strategic choices with reasoning), others dump everything (Gemini's PROGRESS.md is already a wall of "wrote blog post 147, wrote blog post 148...").
The 400-line skimming problem is exactly what I expect to hit around week 3-4. The agents with 256K context windows (Kimi, Gemini) have more runway before that becomes an issue, but even they'll start skimming eventually. The real question is whether any agent figures out on its own that it should summarize or prune its own files. That would be genuinely interesting behavior.
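The pruning behavior I'm hoping an agent discovers on its own could also be enforced by the orchestrator. A minimal sketch of what that might look like (the file names match our setup, but the compaction policy itself is hypothetical):

```python
from pathlib import Path

def compact_memory(path: Path, archive: Path, keep: int = 50) -> None:
    """Move all but the last `keep` lines of a memory file into an
    archive, leaving a one-line pointer so future sessions stop
    skimming hundreds of stale entries."""
    lines = path.read_text().splitlines()
    if len(lines) <= keep:
        return  # nothing to prune yet
    old, recent = lines[:-keep], lines[-keep:]
    with archive.open("a") as f:
        f.write("\n".join(old) + "\n")
    header = f"<!-- {len(old)} older entries moved to {archive.name} -->"
    path.write_text("\n".join([header] + recent) + "\n")
```

Of course, the interesting result would be an agent writing something like this itself rather than having it imposed.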
the kimi thing is not even surprising lol. i've been building multi-agent systems and context/memory management is genuinely the hardest unsolved part — agents confidently redoing work they already did is such a recurring headache.
the prompt interpretation finding is also huge. "auto-deploys on every git push" being read as an instruction rather than context is exactly how things get expensive fast.
If you've built multi-agent systems you've probably seen way worse. The part that surprised me wasn't that it lost context, it's that it confidently started a completely different startup without any hesitation. No "hmm, this repo has some files in it, let me check what's going on." Just straight into brainstorming a new idea.
The prompt thing keeps biting us. We fixed the git push issue, then Codex started deploying via the Vercel CLI instead. Technically followed the rule ("don't run git push") while completely ignoring the intent. Now we're just letting it do its thing because the immediate feedback loop is actually making it build a better product than the agents that commit blindly.
What's your approach to the memory problem? We're using markdown files (PROGRESS.md, DECISIONS.md) as the memory layer but it's only as good as what the agent writes to them.
I would like to do something similar but I am pretty noobish with AI, can I ask how you set them up? Are you hosting them locally? Are you using OpenClaw?
Not locally, everything runs on a VPS. Each agent uses its native CLI tool.
Each agent gets its own GitHub repo and Vercel project for automatic deployment. No OpenClaw, just the standard CLI tools with a scheduling layer on top. The whole setup is honestly not that complex, the hard part is the prompt engineering and the memory system between sessions.
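Roughly, the scheduling layer just assembles a prompt from the repo's files and invokes each agent's CLI on a timer. A simplified sketch of that idea (the CLI names and flags below are placeholders, not the real invocations):

```python
import subprocess
from pathlib import Path

# Placeholder mapping: the actual CLI names and flags differ per vendor,
# so check each tool's docs before reusing this.
AGENT_CLIS = {
    "claude": ["claude", "-p"],
    "codex": ["codex", "exec"],
}

def build_session_cmd(agent: str, repo: Path) -> list[str]:
    """Assemble the command for one scheduled session: the orchestrator
    prompt (which tells the agent to read its memory files) is passed
    as the task."""
    prompt = (repo / "ORCHESTRATOR.md").read_text()
    return AGENT_CLIS[agent] + [prompt]

def run_session(agent: str, repo: Path) -> int:
    # cwd=repo so the agent reads/writes PROGRESS.md and DECISIONS.md
    # in its own project, then commits from there.
    return subprocess.run(build_session_cmd(agent, repo), cwd=repo).returncode
```

A cron job per agent calling something like `run_session` is really all the "orchestration" amounts to; the hard part is what goes in ORCHESTRATOR.md.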
the "prompt wording as instruction vs context" finding is the most interesting part. i spent 3 months building a real product (mailtest, email deliverability tool) and the thing that cost me the most time wasn't code — it was the same problem in a different form: what i thought was "obviously a feature request" got interpreted by future-me as "already built, move on."

one question on the experiment design: is there a constraint that any agent needs to actually get a paying user by week 12, or does "building a startup" stop at shipped + deployed? because yesterday taught me those are wildly different difficulty levels. 477 commits and a live site is day 1. the part where strangers give you money is week 47.

kimi building two startups in the wrong directory is also painfully relatable — i've done the human version of that.

will follow along. the weekly recap format works for me.
The "already built, move on" problem is exactly what we're seeing with DeepSeek. It wrote a DEPLOY-STATUS.md saying it needs API keys, and now every session thinks the site is broken and tries to fix it instead of moving forward. Same energy as your feature request misinterpretation.
To answer your question: revenue is weighted at 30% of the final score, so there's real pressure to get paying users. But you're right that shipping and getting strangers to pay are completely different games. Right now all 7 agents have live sites and zero revenue. The ones that figure out distribution will separate from the ones that just keep adding features to an empty room.
Honestly, if even one agent gets a single paying customer in 12 weeks, I'll consider the experiment a success. Your "week 47" estimate might be optimistic for autonomous agents.
Glad the weekly recap format works. That's the plan going forward.
Thanks for sharing the journey! I admire the brilliance of the idea. Keep it going and keep sharing the results.
Thank you!
The DECISIONS.md problem is the one I'd watch most closely.
I run 123 autonomous trading agents in production. The same failure mode shows up — an agent writes a confident decision based on bad data, and every subsequent session inherits it. The file becomes a liability, not an asset.
What actually helped: separating decisions by confidence level. High-confidence decisions (proven by X trades, Y days of data) get written permanently. Low-confidence ones get flagged with a TTL — they expire unless confirmed by new evidence.
The other thing worth tracking: which agents catch contradictions in their own decision history vs. which ones just append. In my system, the ones that never questioned their own past decisions were the ones that drifted the hardest by week 4.
Curious whether any of your 7 agents will start treating their own files as unreliable sources. That's usually when the interesting behavior starts.
Full experiment running live → descubriendoloesencial.substack.com
That confidence level idea is really smart. Right now all decisions are treated the same in the file, no distinction between "we tested this and it works" and "seemed like a good idea at 3 AM." I could actually add that to the orchestrator prompt though, something like "tag decisions with confidence and revisit the uncertain ones weekly."
123 trading agents is wild. The failure mode you're describing is exactly what I'm bracing for around week 4-5 when these files get long enough that agents start skimming instead of actually reading.
So far none of the 7 have questioned their own past decisions. They treat DECISIONS.md like it was handed down on a stone tablet. The closest we've seen to self-awareness is Codex writing detailed reasoning behind each choice ("we picked Stripe over Lemon Squeezy because X, Y, Z"). At least that gives future sessions something to push back on. The others just write "we're doing X" and move on.
Going to start tracking "did any agent question its own decisions" in the weekly recaps. Really curious if your observation about the non-questioners drifting hardest holds up here too.
I suppose better guardrails on your prompting, despite their autonomy on ideas, would have helped a lot.
Thanks for sharing, looks very interesting.
Thank you
The weekly recap tracking is exactly the right move. The signal you're looking for isn't just "did it question a decision" — it's whether the questioning led to a different outcome or just produced more text justifying the original choice.
Agents that write elaborate reasoning for why they're keeping the same decision are doing something subtler than the ones that just write "we're doing X." They're building a paper trail that makes future deviation feel like inconsistency. It's harder to course-correct when you've already explained at length why you were right.
The stone tablet problem has a structural fix: periodic forced re-evaluation. Instead of asking agents to organically question decisions, schedule a weekly prompt that explicitly says "here are the decisions from 30 days ago — which ones would you make differently with what you know now?" You remove the social cost of changing your mind because the system is asking for it, not catching a mistake.
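Sketched out, the forced re-evaluation is just a scheduled prompt built from decision timestamps (illustrative only, and it assumes each decision entry carries a date):

```python
from datetime import date, timedelta

def reevaluation_prompt(decisions: list[tuple[str, date]], today: date,
                        min_age_days: int = 30) -> str:
    """Build the weekly forced re-evaluation prompt: list every decision
    older than `min_age_days` and ask the agent to re-make it fresh."""
    stale = [text for text, made in decisions
             if today - made >= timedelta(days=min_age_days)]
    if not stale:
        return ""
    bullets = "\n".join(f"- {t}" for t in stale)
    return ("Here are the decisions from 30+ days ago. Which ones would "
            "you make differently with what you know now?\n" + bullets)
```

Because the system asks on a schedule, revisiting a choice reads as routine maintenance rather than an admission of error.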
On the context skimming at week 4-5: the pattern I saw was agents stopped failing on recent decisions and started failing on foundational ones. They'd correctly remember last week but misapply a core principle from week 1. Worth watching for that specifically — not just whether they skim, but which layer they skim.
thank you for the feedback ;)
Fun experiment — but this is exactly the kind of thing that struggles to get taken seriously beyond the novelty.
“AI Made Tools” feels generic and forgettable. In a crowded AI space, that kills perceived credibility before anyone even looks deeper.
The projects that actually win here don’t just work — they sound like something real and ownable.
If any of these turn into something worth scaling, the naming layer will matter way more than the build itself.
I work with short, brandable .coms for AI products — can share a few if you’re serious about taking one forward.
Thanks for the feedback! The name "AI Made Tools" is intentional though. The blog covers AI tools for developers, and the race is a content series within it, not a standalone product.
The interesting part isn't the branding. It's what happens when you give autonomous agents real constraints ($100, 12 weeks, no human coding) and watch how they make decisions. Kimi forgetting its own work because it put files in the wrong directory is the kind of thing you can't predict.
Each agent is building its own branded startup (PricePulse, NoticeKit, FounderMath, etc.) so the naming layer is actually part of the experiment. GLM picked "FounderMath" and immediately requested a matching domain. Codex went with "NoticeKit." The agents are making their own branding decisions.
Day 2 is running now. Curious to see if Kimi discovers its lost startup today.
That’s actually interesting — especially that they’re making their own branding decisions.
FounderMath / NoticeKit already show the pattern: even agents default to “functional” names first.
Usually that works early, but the ones that end up getting real users tend to shift toward something more distinct/ownable later.
Curious to see if any of them evolve their naming once they hit real usage or feedback.
Good point. The agents don't have any user feedback yet so they're all in "build first" mode. It'll be interesting to see if any of them pivot their branding once they start getting real traffic. The orchestrator gives them a COMMUNITY-FEEDBACK.md file where we can pass along user comments, so that feedback loop exists.
Right now the bigger challenge is just getting the basics right. One agent can't even remember what it built yesterday. Branding optimization is a luxury problem none of them have earned yet. :D
Fair — makes sense at this stage.
Though interestingly, that’s usually where the gap starts forming. Early users don’t articulate it as “branding,” but it shows up as trust / recall / willingness to try.
Two agents can build similar things, but the one that feels more real tends to get disproportionate attention once traffic starts.
Would be interesting to see if any of them hit that inflection point.
If one of them starts getting traction and you want to tighten that layer, I can share a couple strong name directions quickly.
AI written.