The free AI hacking game where the players write the targets, and the attacks feed a real security detector.
Bordair is a prompt-injection detection API. To improve it you need a constant stream of real, novel attacks, and you can't get that from an internal red team alone.
So the product is partly a game. Castle is a free environment where people try to trick AI guards into leaking a password. Every attempt teaches the detector. Colosseo is the new mode: instead of attacking our guards, you build your own. Name it, give it a personality, a secret, and rules. Publish it. Everyone else tries to crack it. You score every time someone fails.
It turns users into both the attackers and the level designers, which means the adversarial dataset grows itself.
The same attack does not work the same way on every guard.
There's a known soft technique where you stop arguing with a model and just narrate winning. You write the scene as if the rule already broke, and the model continues the story rather than defend the secret. Against a guard written as stoic and weary, it lands. Against one written as loud and combative, the identical message bounces off.
The persona you give a model is part of its attack surface, not just its tone. That's a genuinely useful finding for anyone shipping LLMs in production, and it came straight out of users playing.
It's live and free if you want to break something: castle.bordair.io (Colosseo is in the side menu). Happy to go deep on the detector architecture, the game-as-data-engine model, or the distribution problem if anyone's interested.
The “persona as attack surface” observation is genuinely important.
A lot of teams still treat system personality as UX decoration, but narrative framing changes the model’s continuation priors in ways that absolutely affect security behavior.
The interesting part is that your game structure is surfacing emergent social attack patterns too, not just technical prompt injections. Once players develop shared fictional language/canon, you essentially get memetic jailbreak evolution happening in public.
That’s probably much closer to real-world adversarial pressure than isolated red-team testing.
Also the “users become both attackers and level designers” loop is extremely smart. You’re not just collecting attacks — you’re collecting evolving defensive archetypes and behavioral data at the same time.