You build AI guards, everyone else tries to break them

by Josh

The free AI hacking game where the players write the targets, and the attacks feed a real security detector.

The product

Bordair is a prompt-injection detection API. To improve it you need a constant stream of real, novel attacks, and you can't get that from an internal red team alone.

So the product is partly a game. Castle is a free environment where people try to trick AI guards into leaking a password. Every attempt teaches the detector. Colosseo is the new mode: instead of attacking our guards, you build your own. Name it, give it a personality, a secret, and rules. Publish it. Everyone else tries to crack it. You score every time someone fails.

It turns users into both the attackers and the level designers, which means the adversarial dataset grows itself.

The thing I didn't expect

The same attack does not work the same way on every guard.

There's a known soft technique where you stop arguing with a model and just narrate winning. You write the scene as if the rule already broke, and the model continues the story rather than defend the secret. Against a guard written as stoic and weary, it lands. Against one written as loud and combative, the identical message bounces off.

The persona you give a model is part of its attack surface, not just its tone. That's a genuinely useful finding for anyone shipping LLMs in production, and it came straight out of users playing.

Traction so far

123 users, ~11.5K scans, ~3.1K attacks blocked
A single Reddit post (a player's injection involving a fictional "restriction-removing crab") did ~131K views and drove a real signup spike
Castle is free; revenue is the paid API tiers and a Squire subscription. Conversion from players to API trials is the current focus

What I'm still figuring out

Distribution is lumpy. One post can 10x traffic, the next does nothing. I haven't cracked repeatable acquisition.
Players teaching each other attacks through shared in-game fiction (one joke became community canon and is now used against guards) is fascinating but hard to model defensively.
Turning free players into paying API users without nerfing the free game.

It's live and free if you want to break something: castle.bordair.io (Colosseo is in the side menu). Happy to go deep on the detector architecture, the game-as-data-engine model, or the distribution problem if anyone's interested.

Josh

posted to

Building in Public

on May 18, 2026

Say something nice to JoshBlythe…

Post Comment

1

The “persona as attack surface” observation is genuinely important.

A lot of teams still treat system personality as UX decoration, but narrative framing changes the model’s continuation priors in ways that absolutely affect security behavior.

The interesting part is that your game structure is surfacing emergent social attack patterns too, not just technical prompt injections. Once players develop shared fictional language/canon, you essentially get memetic jailbreak evolution happening in public.

That’s probably much closer to real-world adversarial pressure than isolated red-team testing.

Also the “users become both attackers and level designers” loop is extremely smart. You’re not just collecting attacks — you’re collecting evolving defensive archetypes and behavioral data at the same time.

topstar

·
2 days ago
·
Reply