1
1 Comment

You build AI guards, everyone else tries to break them

The free AI hacking game where the players write the targets, and the attacks feed a real security detector.

The product

Bordair is a prompt-injection detection API. To improve it you need a constant stream of real, novel attacks, and you can't get that from an internal red team alone.

So the product is partly a game. Castle is a free environment where people try to trick AI guards into leaking a password. Every attempt teaches the detector. Colosseo is the new mode: instead of attacking our guards, you build your own. Name it, give it a personality, a secret, and rules. Publish it. Everyone else tries to crack it. You score every time someone fails.

It turns users into both the attackers and the level designers, which means the adversarial dataset grows itself.

The thing I didn't expect

The same attack does not work the same way on every guard.

There's a known soft technique where you stop arguing with a model and just narrate winning. You write the scene as if the rule already broke, and the model continues the story rather than defend the secret. Against a guard written as stoic and weary, it lands. Against one written as loud and combative, the identical message bounces off.

The persona you give a model is part of its attack surface, not just its tone. That's a genuinely useful finding for anyone shipping LLMs in production, and it came straight out of users playing.

Traction so far

  • 123 users, ~11.5K scans, ~3.1K attacks blocked
  • A single Reddit post (a player's injection involving a fictional "restriction-removing crab") did ~131K views and drove a real signup spike
  • Castle is free; revenue is the paid API tiers and a Squire subscription. Conversion from players to API trials is the current focus

What I'm still figuring out

  • Distribution is lumpy. One post can 10x traffic, the next does nothing. I haven't cracked repeatable acquisition.
  • Players teaching each other attacks through shared in-game fiction (one joke became community canon and is now used against guards) is fascinating but hard to model defensively.
  • Turning free players into paying API users without nerfing the free game.

It's live and free if you want to break something: castle.bordair.io (Colosseo is in the side menu). Happy to go deep on the detector architecture, the game-as-data-engine model, or the distribution problem if anyone's interested.

posted to Icon for group Building in Public
Building in Public
on May 18, 2026
  1. 1

    The “persona as attack surface” observation is genuinely important.

    A lot of teams still treat system personality as UX decoration, but narrative framing changes the model’s continuation priors in ways that absolutely affect security behavior.

    The interesting part is that your game structure is surfacing emergent social attack patterns too, not just technical prompt injections. Once players develop shared fictional language/canon, you essentially get memetic jailbreak evolution happening in public.

    That’s probably much closer to real-world adversarial pressure than isolated red-team testing.

    Also the “users become both attackers and level designers” loop is extremely smart. You’re not just collecting attacks — you’re collecting evolving defensive archetypes and behavioral data at the same time.

Trending on Indie Hackers
How I built an AI workflow with preview, approval, and monitoring User Avatar 64 comments Show IH: I'm building a lead gen + CRM tool for web designers targeting local businesses without websites — starting with Spain User Avatar 62 comments I built a URL indexing SaaS in 40 days — here's the honest story User Avatar 53 comments After 4 landing page rewrites, I finally figured out why my analytics SaaS wasn't converting User Avatar 21 comments We witnessed a sharp spike in our traffic. So much happiness after a long time. User Avatar 15 comments Creative Generator — create product-focused visuals and ad concepts faster User Avatar 10 comments