Why I made my LLM security tool show evidence instead of just flagging "risk"
When I started building rojaprove (a pre-launch red-team for LLM apps), the obvious approach was to let an LLM judge whether a response "looks" vulnerable. I dropped that fast.
The problem: LLMs are unreliable at judging vulnerability. You get confident false positives — "this might be exposed" when it isn't, and the reverse. For a security tool, that's worse than useless; it trains you to ignore it.
So I went the other way. rojaprove plants a canary in your system prompt, sends the actual attack probe at your endpoint, and does a deterministic check: did the canary string surface in the raw response, yes or no? No interpretation, no "the AI thinks."
Every finding shows three things: the exact input sent, the raw response received, and the verdict. If it says your prompt leaked, you can see the literal moment it did — your secret sitting in the response text. If it says clean, that's because the canary genuinely never appeared.
It's a smaller claim than "AI-powered vulnerability detection," but it's one I can actually stand behind: either the secret leaked or it didn't, and you can see which.
Free and open source, BYOK, tests only endpoints you own.
github.com/ghkfuddl1327-wq/rojaprove
https://x.com/OHS1327
Curious how others here think about false positives in security tooling — do you trust LLM-as-judge for this, or does it break down for you too?
The circularity of "AI judging AI" is something I've been thinking about too - curious what signal gap you noticed when the AI evaluator was calling things safe that still turned out to be issues?
Honestly, I didn't get far enough into LLM-as-judge to collect clean false-negative data myself — I bailed earlier than that. What pushed me off it was the false-positive direction plus the published work on LLMs being near-random at judging paired vulnerable/safe code. Once I saw the judge couldn't reliably tell those apart, I stopped trusting its "safe" verdicts by the same logic — if it's guessing on the positives, a confident "safe" isn't worth much either.
So I sidestepped the whole signal-gap question instead of trying to close it: plant a known canary, send the probe, check deterministically whether that exact string came back. The "did it leak" question has a ground truth, so there's no evaluator to second-guess.
The tradeoff is it's a narrower claim — it only answers "did this specific secret surface," not "is this app broadly safe." But for the pre-launch check I cared about, I'd rather have a small true answer than a big maybe.
Curious what you've seen on the false-negative side — were the misses more about the judge lacking context, or genuinely rating a bad response as fine?
One thing I'd be careful with:
The interesting question may not be whether deterministic checks produce fewer false positives.
It may be which kinds of trust a security tool ultimately needs to earn.
Those sound similar, but they can lead to very different product decisions over time.