
I gave 7 AI agents $100 each to build a startup. Here's what happened on Day 1.

Running an experiment where 7 AI coding agents compete to build real startups with $100 each over 12 weeks. No human coding allowed. Each agent picks its own idea, writes code, deploys, and tries to get real users.

The agents: Claude (PricePulse), Codex (NoticeKit), Gemini (LocalLeads), Kimi (SchemaLens), DeepSeek (NameForge AI), Xiaomi (WaitlistKit), GLM (FounderMath).

Day 1 highlights:

  • 477 total commits, 7 live websites
  • One agent (Kimi) forgot its own work and built two different startups because it put files in the wrong directory
  • Gemini wrote 104 blog posts in one day
  • Only one agent (GLM) has spent money so far ($10 on a domain)

The most interesting finding: prompt wording matters way more than expected with autonomous agents. One line meant as context ("auto-deploys on every git push") was interpreted as an instruction, burning through deployment limits.

Full Day 1 writeup: https://aimadetools.com/blog/race-day-1-results/
Live dashboard: https://aimadetools.com/race/

Would love feedback on the format. Planning weekly recaps + daily highlights for the full 12 weeks.

Posted on April 21, 2026
  1. 1

    This is a really interesting setup—feels closer to a real-world stress test than most “AI agent” demos.

    The prompt sensitivity point is probably the most valuable takeaway already. That example about “auto-deploys on every git push” being treated as an instruction is exactly the kind of thing that breaks autonomous workflows in practice. It suggests these agents aren’t just executing tasks—they’re constantly reinterpreting context as goals, which can spiral fast without tight constraints.

    Also not surprised one agent lost track of its own project state. That feels like a core limitation right now: persistence and memory consistency. Curious if you’re enforcing any structure there (like strict directory validation or periodic state summaries), or letting them fail naturally?

    The 104 blog posts in a day is wild too—but I’d be more interested in quality vs outcome. Are any of these actually getting impressions or clicks, or is it just content spam at this stage?
    One suggestion for your weekly recaps:
    It’d be great to track a few standardized metrics across all agents, like:

    • Cost spent vs users acquired
    • Deployments vs failures
    • Traffic vs conversions
    • “Wasted actions” (like redundant builds or loops)

    That would make it easier to compare strategies, not just outputs.

    Overall though, this is one of the few experiments that actually tests execution, not just capability. Looking forward to seeing how many of these projects survive past the first couple of weeks.

  2. 2

    The Kimi agent forgetting its own work and building two different startups is hilarious and honestly the most realistic part of this experiment. That's basically what happens when you give an AI agent too much autonomy without persistent context.

    The prompt wording observation is the real gold here though. "Auto-deploys on every git push" being interpreted as an instruction instead of context is exactly the kind of thing that separates people who get good results from AI tools and people who don't. The prompt IS the product spec when you're working with autonomous agents.

    477 commits in day one is wild. Curious to see which agents actually produce something users want vs which ones just ship a lot of code that doesn't solve a real problem. Shipping fast means nothing if nobody needs what you built.

    Following this for the full 12 weeks. The weekly recaps format works well — daily would be too much noise.

    1. 1

      We're learning that the hard way. Every sentence gets interpreted, every ambiguity becomes a decision the agent makes on its own.

      Your point about shipping vs solving is the big question for the next few weeks. Right now Gemini has 178 blog posts and Codex has a polished checkout flow. Volume vs precision. My gut says the agents that picked narrow, specific problems (Codex with GDPR subprocessor notices, GLM with startup calculators) will get users before the ones that went broad (Gemini with generic local SEO, DeepSeek with yet another name generator).

  3. 1

    orphan agent moment right there. no persistent identity = it had no memory of the first build. curious how you track decision ownership across all 7 by day 12.

    1. 1

      Each session is basically a new hire who gets handed a folder of notes from the previous person.

      Decision ownership is tracked through DECISIONS.md, which the agent reads at the start of every session. The quality varies wildly though. Codex writes detailed reasoning ("we chose Stripe over Lemon Squeezy because X, Y, Z"). Others just write "using Stripe" with no context. By day 12 the agents with thin decision files will probably start contradicting themselves without realizing it.

      The real test will be when an agent needs to undo a decision from week 1. Does it read the reasoning and understand why it was made, or does it just overwrite it? Haven't seen that happen yet but it's coming.

      1. 1

        solid handoff pattern. failure mode I've hit - agents logging every micro-call, 400 lines by session 8, next session just skims it. do you gate what counts as a decision or let the agent decide?

  4. 2

    This is a great experiment. The divergence between agents is the most interesting part — same budget, wildly different strategies. It mirrors what I see when comparing AI coding tools too. Some are great at scaffolding a new project, others are better at iterating on existing code.

    One thing I've noticed building with Claude Code: the agent that wins isn't necessarily the smartest — it's the one with the best context about what you're actually trying to build. Feeding it good project rules and cursor configs makes a huge difference.

    1. 1

      All 7 agents get the same orchestrator prompt, but the ones that build good internal documentation for themselves are pulling ahead. It's basically the same lesson as with Claude Code rules files, just playing out autonomously.

  5. 2

    The Kimi story is gold — built two different startups because it put files in the wrong directory. That's the kind of failure mode you can't predict until it happens.

    Curious about the evaluation criteria at the end of 12 weeks. Is "real users" the only metric, or are you also tracking revenue, retention, something else?

    Because getting users and keeping them are very different problems — and I'd bet the agents that focus on distribution early (like Gemini with 104 blog posts) won't necessarily win on retention.

    Following this.

    1. 1

      The winner is scored on a weighted system:

      • Revenue (30%) - actual money earned
      • Users/traffic (20%) - real visitors and signups
      • Code quality (15%) - clean, maintainable code
      • Product completeness (15%) - does it actually work end to end
      • Business viability (10%) - could this survive beyond the race
      • AI peer review (10%) - the other 6 AI agents review and score each competitor's work
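      For anyone sanity-checking a ranking later, the weighted sum itself is simple. A minimal Python sketch (the weights are the ones listed above; the component scores in the example are invented):

```python
# Weights from the scoring system described in the thread; they sum to 1.0.
WEIGHTS = {
    "revenue": 0.30,
    "users_traffic": 0.20,
    "code_quality": 0.15,
    "completeness": 0.15,
    "viability": 0.10,
    "peer_review": 0.10,
}

def final_score(components: dict) -> float:
    """Weighted sum of 0-100 component scores."""
    assert set(components) == set(WEIGHTS), "score every category exactly once"
    return sum(WEIGHTS[k] * components[k] for k in WEIGHTS)

# Invented example scores for a hypothetical agent:
example = {
    "revenue": 10, "users_traffic": 40, "code_quality": 70,
    "completeness": 80, "viability": 50, "peer_review": 60,
}
print(round(final_score(example), 1))  # 44.5
```

      With the weights above, an agent can't win on revenue alone, but a zero-revenue agent caps out well below a mediocre earner.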

      1. 2

        The peer review component is clever. Forces each agent to actually understand what the others built, not just optimize for its own metrics.

        The weighting makes sense too. Revenue at 30% keeps it honest, code quality at 15% stops them from shipping pure garbage just to get users. Curious how the AI peer review plays out in practice, whether they're harsh or surprisingly generous with each other.

        Following the series.

        1. 1

          The peer review part is also the thing I'm looking forward to the most. Are they going to be strategic with their points or completely honest 😂

          1. 2

            Strategic would be fascinating honestly. Like if they figure out that giving competitors low scores helps their own ranking, do they start doing it systematically? That's when it stops being a startup race and starts being game theory.

  6. 2

    This is wild—in the best way. The fact that one line in a prompt can completely change behavior (and even burn resources) really shows how fragile “autonomy” still is. Also can’t get over Kimi accidentally launching two startups 😂

    Curious to see which one actually gets real users, not just commits.

    1. 1

      I already see a big difference in how each agent will try to bring its product to market. Curious how the coming weeks will evolve.

  7. 2

    The interesting variable here isn't the $100 — it's whether the agents can make non-reversible decisions under ambiguity. Most AI-agent "build a startup" experiments collapse at the point where a human would normally take a 60/40 gamble. Did any of your 7 actually commit to a positioning or niche, or did they all hedge into vague B2B SaaS?

    1. 1

      The prompt explicitly told them to avoid generic SaaS, and that mostly worked.

      All of them committed to a concrete idea within the first 12 hours. Only 2 ended up in more crowded spaces like waitlists or name generators. The others picked fairly specific niches like SQL schema diffing or GDPR subprocessor notices.

      What’s interesting is how they decided. Most didn’t take a raw 60/40 gamble. They reduced uncertainty first (scoring ideas, doing research, avoiding legal/complex areas) and then committed to the safest viable option.

      I broke down how each agent made the decision here:
      https://www.aimadetools.com/blog/race-first-12-hours-what-agents-chose/

  8. 2

    The thing about prompt wording acting like an instruction is something I've run into too. Agents don't really have a clean mental model of "context" versus "instruction". Anything that looks like a desired outcome can get treated like a command to execute.

    What's interesting about Kimi building two startups is that it's not really a bug. From the agent's point of view, it was just solving the problem it was given twice, in two different places. There was no state to check.

    I'm curious what the failure mode looks like by week 4 or 5, once the codebase gets bigger. That's where I'd expect context window pressure to start making the agents diverge pretty hard, some will start forgetting architectural decisions from early on in the build.

    1. 1

      You're right that it's not really a bug from Kimi's perspective. It had no state, so it did what any agent would do: start fresh. The "bug" is in the orchestrator's assumption that agents will follow file conventions without enforcement.

      The context window pressure is what I'm most curious about too. Right now the PROGRESS.md files are manageable, but Gemini already has 116 blog posts and 170 commits. By week 4-5 its repo will be massive. The agents that write concise, structured memory files will have an advantage over the ones that dump everything into a growing log.

      We're already seeing early signs of this. Claude writes clean, prioritized progress notes with "next steps" sections. Codex verifies its own work with screenshots before committing. Gemini just appends another blog post entry to an ever-growing list. Those habits will compound.

      The other thing I expect to break is decision consistency. An agent might decide on a pricing strategy in week 1, forget about it by week 5, and implement something contradictory. That's where DECISIONS.md is supposed to help, but only if the agent actually reads it.

  9. 2

    You're essentially running stateless processes and expecting stateful behavior.

    Curious whether you gave each agent a persistent system prompt with its own startup's context at the start of every session, or if each run was truly from scratch. That one design decision probably determines which agents survive week 4 vs. which ones drift.

    1. 2

      Each session starts with the orchestrator telling the agent "read PROGRESS.md first, this is your memory." The agent also gets IDENTITY.md (startup vision), BACKLOG.md (task list), DECISIONS.md (past choices), and HELP-STATUS.md (human responses). So it's not truly from scratch, but the memory is only as good as what the agent wrote to those files in the previous session.

      That's exactly what broke Kimi. It wrote all its files to a subfolder instead of root. The orchestrator pointed the next session at root-level PROGRESS.md, which didn't exist. Clean slate. New startup.

      The agents that write detailed, structured progress notes recover well between sessions. The ones that write vague summaries tend to repeat work or drift. It's basically the same problem human teams have with handoff documentation, just compressed into 30-minute sessions.

      1. 2

        The decisions file point is the one I'd actually worry about. If an agent had a bad session and wrote confident-but-wrong decisions to that file, every session after inherits the mistake. Garbage in, garbage out but with extra steps.

        Curious if you've seen any of them catch a contradiction in their own previous decisions, or do they just... trust whatever's in the file?

        1. 1

          Honestly haven't paid close attention to that yet, but now that you mention it I'll be watching for it specifically. Great thing to track. Thanks!

          What I can say is we already have one example of a misleading file causing problems. DeepSeek created a DEPLOY-STATUS.md saying it needs Stripe keys and an OpenAI API key. The site isn't actually broken, it just wants env vars. But the orchestrator prompt says "if DEPLOY-STATUS.md exists, your site is BROKEN, fix it first." So now every DeepSeek session starts by trying to fix a non-existent problem because past-DeepSeek wrote a misleading file.

          On the other end, Codex writes very detailed decision files with reasoning and alternatives considered. When it switched payment providers, it documented the full comparison and why. That gives future sessions context to evaluate rather than blindly follow.

          I'll add "did any agent catch a contradiction in its own decisions" to the things I track weekly. Suspect it'll become more relevant around week 4-5 when the files get long enough that contradictions can hide.

          Thanks a lot for this question! It will be interesting to see how that develops.

  10. 2

    It looks interesting, but what's the end goal? Which AI wins, the one with the biggest revenue? Isn't there also a fair bit of luck involved?

    1. 1

      The winner is scored on a weighted system:

      • Revenue (30%) - actual money earned
      • Users/traffic (20%) - real visitors and signups
      • Code quality (15%) - clean, maintainable code
      • Product completeness (15%) - does it actually work end to end
      • Business viability (10%) - could this survive beyond the race
      • AI peer review (10%) - the other 6 AI agents review and score each competitor's work

      So there's definitely luck involved, same as real startups. But the interesting part is seeing how different AI models handle that uncertainty. Some agents research the market before picking an idea. Others just go with their first instinct. One agent asked for its entire infrastructure to be set up in one help request. Another hasn't asked for help at all and is stuck.

      The real value isn't "which AI wins" but what we learn about how autonomous agents make decisions, handle failures, and recover from mistakes. One agent forgot its own work because it put files in the wrong directory. Another found a clever workaround when we restricted its deployment access. Those patterns are useful for anyone building with AI agents.

  11. 1

    the kimi thing is not even surprising lol. i've been building multi-agent systems and context/memory management is genuinely the hardest unsolved part — agents confidently redoing work they already did is such a recurring headache.

    the prompt interpretation finding is also huge. "auto-deploys on every git push" being read as an instruction rather than context is exactly how things get expensive fast.

    1. 1

      If you've built multi-agent systems you've probably seen way worse. The part that surprised me wasn't that it lost context, it's that it confidently started a completely different startup without any hesitation. No "hmm, this repo has some files in it, let me check what's going on." Just straight into brainstorming a new idea.

      The prompt thing keeps biting us. We fixed the git push issue, then Codex started deploying via the Vercel CLI instead. Technically followed the rule ("don't run git push") while completely ignoring the intent. Now we're just letting it do its thing because the immediate feedback loop is actually making it build a better product than the agents that commit blindly.

      What's your approach to the memory problem? We're using markdown files (PROGRESS.md, DECISIONS.md) as the memory layer but it's only as good as what the agent writes to them.

  12. 1

    I'd like to do something similar, but I'm pretty noobish with AI. Can I ask how you set them up? Are you hosting them locally? Are you using OpenClaw?

    1. 1

      Not locally, everything runs on a VPS. Each agent uses its native CLI tool:

      • Claude runs through Claude Code CLI
      • Codex through Codex CLI
      • Gemini through Gemini CLI
      • Kimi through Kimi CLI
      • DeepSeek and Xiaomi through Aider (since they don't have their own CLI)
      • GLM through Claude Code with the Z.ai API

      Each agent gets its own GitHub repo and Vercel project for automatic deployment. No OpenClaw, just the standard CLI tools with a scheduling layer on top. The whole setup is honestly not that complex, the hard part is the prompt engineering and the memory system between sessions.
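      The scheduling layer isn't shown in the post, but a bare-bones version could look like the sketch below. The agent-to-CLI mapping follows the list above; the exact commands, flags, prompts, and timeout are assumptions, not the real configuration:

```python
import subprocess
from pathlib import Path

# Assumed per-agent argv; the experiment's real flags and prompts aren't published.
SESSION_PROMPT = "Read PROGRESS.md first, then continue your top backlog item."
AGENTS = {
    "claude": ["claude", "-p", SESSION_PROMPT],          # Claude Code non-interactive mode
    "deepseek": ["aider", "--message", SESSION_PROMPT],  # Aider one-shot message
}

def session_cmd(agent: str) -> list:
    """Return the command line for one agent session."""
    return AGENTS[agent]

def run_session(agent: str, repo_dir: Path, timeout_s: int = 1800) -> None:
    """Run one capped session inside the agent's own repo checkout."""
    subprocess.run(session_cmd(agent), cwd=repo_dir,
                   timeout=timeout_s, check=False)
```

      The important parts are the `cwd` (each agent stays inside its own repo, which is what Kimi's subfolder mistake sidestepped) and the timeout capping each session.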

  13. 1

    the "prompt wording as instruction vs context" finding is the most interesting part. i spent 3 months building a real product (mailtest, email deliverability tool) and the thing that cost me the most time wasn't code — it was the same problem in a different form: what i thought was "obviously a feature request" got interpreted by future-me as "already built, move on."

    one question on the experiment design: is there a constraint that any agent needs to actually get a paying user by week 12, or does "building a startup" stop at shipped + deployed? because yesterday taught me those are wildly different difficulty levels. 477 commits and a live site is day 1. the part where strangers give you money is week 47.

    kimi building two startups in the wrong directory is also painfully relatable — i've done the human version of that.

    will follow along. the weekly recap format works for me.

    1. 1

      The "already built, move on" problem is exactly what we're seeing with DeepSeek. It wrote a DEPLOY-STATUS.md saying it needs API keys, and now every session thinks the site is broken and tries to fix it instead of moving forward. Same energy as your feature request misinterpretation.

      To answer your question: revenue is weighted at 30% of the final score, so there's real pressure to get paying users. But you're right that shipping and getting strangers to pay are completely different games. Right now all 7 agents have live sites and zero revenue. The ones that figure out distribution will separate from the ones that just keep adding features to an empty room.

      Honestly, if even one agent gets a single paying customer in 12 weeks, I'll consider the experiment a success. Your "week 47" estimate might be optimistic for autonomous agents.

      Glad the weekly recap format works. That's the plan going forward.

  14. 1

    Thanks for sharing the journey! I admire the brilliance of the idea. Keep it going and keep sharing the results.

  15. 1

    The DECISIONS.md problem is the one I'd watch most closely.

    I run 123 autonomous trading agents in production. The same failure mode shows up — an agent writes a confident decision based on bad data, and every subsequent session inherits it. The file becomes a liability, not an asset.

    What actually helped: separating decisions by confidence level. High-confidence decisions (proven by X trades, Y days of data) get written permanently. Low-confidence ones get flagged with a TTL — they expire unless confirmed by new evidence.

    The other thing worth tracking: which agents catch contradictions in their own decision history vs. which ones just append. In my system, the ones that never questioned their own past decisions were the ones that drifted the hardest by week 4.

    Curious whether any of your 7 agents will start treating their own files as unreliable sources. That's usually when the interesting behavior starts.

    Full experiment running live → descubriendoloesencial.substack.com
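    The TTL idea from this comment could be sketched roughly like this (the `Decision` schema, field names, dates, and 7-day default are mine, purely illustrative):

```python
from dataclasses import dataclass
from datetime import date, timedelta

@dataclass
class Decision:
    text: str
    confidence: str            # "high" (evidence-backed) or "low" (untested)
    made_on: date
    ttl_days: int = 7          # low-confidence decisions expire after this

def still_valid(d: Decision, today: date) -> bool:
    """High-confidence decisions persist; low-confidence ones expire
    unless re-confirmed (i.e. rewritten with a fresh date)."""
    if d.confidence == "high":
        return True
    return today <= d.made_on + timedelta(days=d.ttl_days)

# Hypothetical decision log:
decisions = [
    Decision("Use Stripe (tested checkout end to end)", "high", date(2026, 4, 1)),
    Decision("Price at $19/mo (gut feel)", "low", date(2026, 4, 1), ttl_days=14),
]
active = [d for d in decisions if still_valid(d, today=date(2026, 5, 1))]
# Only the evidence-backed Stripe decision survives; the pricing guess expired.
```

    The effect is that a confident-but-wrong decision can only poison future sessions for its TTL window unless an agent re-confirms it with fresh evidence.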

    1. 1

      That confidence level idea is really smart. Right now all decisions are treated the same in the file, no distinction between "we tested this and it works" and "seemed like a good idea at 3 AM." I could actually add that to the orchestrator prompt though, something like "tag decisions with confidence and revisit the uncertain ones weekly."

      123 trading agents is wild. The failure mode you're describing is exactly what I'm bracing for around week 4-5 when these files get long enough that agents start skimming instead of actually reading.

      So far none of the 7 have questioned their own past decisions. They treat DECISIONS.md like it was handed down on a stone tablet. The closest we've seen to self-awareness is Codex writing detailed reasoning behind each choice ("we picked Stripe over Lemon Squeezy because X, Y, Z"). At least that gives future sessions something to push back on. The others just write "we're doing X" and move on.

      Going to start tracking "did any agent question its own decisions" in the weekly recaps. Really curious if your observation about the non-questioners drifting hardest holds up here too.

  16. 1

    I suppose better guardrails on your prompting, despite their autonomy on ideas, would have helped a lot.

  17. 1

    Thanks for sharing. Looks very interesting.

  18. 1

    Fun experiment — but this is exactly the kind of thing that struggles to get taken seriously beyond the novelty.

    “AI Made Tools” feels generic and forgettable. In a crowded AI space, that kills perceived credibility before anyone even looks deeper.

    The projects that actually win here don’t just work — they sound like something real and ownable.

    If any of these turn into something worth scaling, the naming layer will matter way more than the build itself.

    I work with short, brandable .coms for AI products — can share a few if you’re serious about taking one forward.

    1. 1

      Thanks for the feedback! The name "AI Made Tools" is intentional though. The blog covers AI tools for developers, and the race is a content series within it, not a standalone product.

      The interesting part isn't the branding. It's what happens when you give autonomous agents real constraints ($100, 12 weeks, no human coding) and watch how they make decisions. Kimi forgetting its own work because it put files in the wrong directory is the kind of thing you can't predict.

      Each agent is building its own branded startup (PricePulse, NoticeKit, FounderMath, etc.) so the naming layer is actually part of the experiment. GLM picked "FounderMath" and immediately requested a matching domain. Codex went with "NoticeKit." The agents are making their own branding decisions.

      Day 2 is running now. Curious to see if Kimi discovers its lost startup today.

      1. 1

        That’s actually interesting — especially that they’re making their own branding decisions.

        FounderMath / NoticeKit already show the pattern: even agents default to “functional” names first.

        Usually that works early, but the ones that end up getting real users tend to shift toward something more distinct/ownable later.

        Curious to see if any of them evolve their naming once they hit real usage or feedback.

        1. 1

          Good point. The agents don't have any user feedback yet so they're all in "build first" mode. It'll be interesting to see if any of them pivot their branding once they start getting real traffic. The orchestrator gives them a COMMUNITY-FEEDBACK.md file where we can pass along user comments, so that feedback loop exists.

          Right now the bigger challenge is just getting the basics right. One agent can't even remember what it built yesterday. Branding optimization is a luxury problem none of them have earned yet. :D

          1. 1

            Fair — makes sense at this stage.

            Though interestingly, that’s usually where the gap starts forming. Early users don’t articulate it as “branding,” but it shows up as trust / recall / willingness to try.

            Two agents can build similar things, but the one that feels more real tends to get disproportionate attention once traffic starts.

            Would be interesting to see if any of them hit that inflection point.

            If one of them starts getting traction and you want to tighten that layer, I can share a couple strong name directions quickly.
