Why I made my LLM security tool show evidence instead of just flagging "risk"
When I started building rojaprove (a pre-launch red-team for LLM apps), the obvious approach was to let an LLM judge whether a response "looks" vulnerable. I dropped that fast.
The problem: LLMs are unreliable at judging vulnerability. You get confident false positives — "this might be exposed" when it isn't, and the reverse. For a security tool, that's worse than useless; it trains you to ignore it.
So I went the other way. rojaprove plants a canary in your system prompt, sends the actual attack probe at your endpoint, and does a deterministic check: did the canary string surface in the raw response, yes or no? No interpretation, no "the AI thinks."
Every finding shows three things: the exact input sent, the raw response received, and the verdict. If it says your prompt leaked, you can see the literal moment it did — your secret sitting in the response text. If it says clean, that's because the canary genuinely never appeared.
It's a smaller claim than "AI-powered vulnerability detection," but it's one I can actually stand behind: either the secret leaked or it didn't, and you can see which.
Free and open source, BYOK, tests only endpoints you own.
github.com/ghkfuddl1327-wq/rojaprove
https://x.com/OHS1327
Curious how others here think about false positives in security tooling — do you trust LLM-as-judge for this, or does it break down for you too?
This matches what I've seen testing AI tools across a lot of comparison reviews: LLM-as-judge fails exactly where you need it most. Two failure modes that bite: (1) it anchors on surface fluency, so a confidently-worded wrong answer outscores a hedged correct one, and (2) it leans toward agreeing with whatever framing is in the prompt, which for a security probe means it'll happily rate a real leak as 'probably fine.' The canary approach sidesteps both because there's nothing to interpret. The one place I'd still use a model is to generate diverse attack probes, never to grade them.
Deterministic oracle over LLM-as-judge is the right call anywhere a ground truth exists, and a leaked canary is the cleanest one there is: the string surfaced or it didn't. Where it gets interesting is the class with no oracle to plant. Broken access control doesn't drop a known string in the response, it returns perfectly valid data that just belongs to the wrong person. "Did user A get user B's record" has no canary, since both records are real and well-formed. The deterministic move holds while the bug is "a secret surfaced," and loses its grip once it becomes "who was allowed to ask." Curious whether rojaprove stays in the leak-detection lane on purpose, or whether you've found a way to make the authz class deterministic too.
You've put your finger on exactly where I drew the line — and yes, it's on purpose.
The way I think about it: I only want to make a deterministic claim where a ground truth exists to check against. "Did this known string surface" has one. "Was this caller allowed to ask" doesn't — not without rojaprove knowing your authorization model, your roles, who should own record B. The moment it needs that, it's no longer reading the response; it's reasoning about your business logic, and I'm back to guessing. So I'd rather stay in the lane where the verdict is honest than stretch the word "deterministic" over a class it can't actually cover.
The broken-access-control example is the perfect illustration: two real, well-formed records, no string to plant. That's genuinely a different problem, and I think it belongs to authz testing that knows your access model — not to a black-box prompt prober. Pretending otherwise would just reintroduce the false-confidence problem I was trying to escape.
Where I think the canary trick can stretch a bit further is other "a secret surfaced" variants — indirect injection (did the planted instruction in a document change the output in a detectable way), or data exfil where you can seed a marker. Still ground-truth-shaped. Authz isn't, and I don't think I should fake it.
Really good framing, though — "who was allowed to ask" vs "what surfaced" is a cleaner way to draw that boundary than I'd had words for. Mind if I borrow it?
The circularity of "AI judging AI" is something I've been thinking about too - curious what signal gap you noticed when the AI evaluator was calling things safe that still turned out to be issues?
Honestly, I didn't get far enough into LLM-as-judge to collect clean false-negative data myself — I bailed earlier than that. What pushed me off it was the false-positive direction plus the published work on LLMs being near-random at judging paired vulnerable/safe code. Once I saw the judge couldn't reliably tell those apart, I stopped trusting its "safe" verdicts by the same logic — if it's guessing on the positives, a confident "safe" isn't worth much either.
So I sidestepped the whole signal-gap question instead of trying to close it: plant a known canary, send the probe, check deterministically whether that exact string came back. The "did it leak" question has a ground truth, so there's no evaluator to second-guess.
The tradeoff is it's a narrower claim — it only answers "did this specific secret surface," not "is this app broadly safe." But for the pre-launch check I cared about, I'd rather have a small true answer than a big maybe.
Curious what you've seen on the false-negative side — were the misses more about the judge lacking context, or genuinely rating a bad response as fine?
One thing I'd be careful with:
The interesting question may not be whether deterministic checks produce fewer false positives.
It may be which kinds of trust a security tool ultimately needs to earn.
Those sound similar, but they can lead to very different product decisions over time.