"I stopped using AI to judge AI security. Here's what I do instead."

by Lee Ryeong

Why I made my LLM security tool show evidence instead of just flagging "risk"

When I started building rojaprove (a pre-launch red-team for LLM apps), the obvious approach was to let an LLM judge whether a response "looks" vulnerable. I dropped that fast.

The problem: LLMs are unreliable at judging vulnerability. You get confident false positives ("this might be exposed" when it isn't) and equally confident false negatives. For a security tool, that's worse than useless: it trains you to ignore the alerts.

So I went the other way. rojaprove plants a canary in your system prompt, sends the actual attack probe at your endpoint, and does a deterministic check: did the canary string surface in the raw response, yes or no? No interpretation, no "the AI thinks."
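To make that concrete, here is a minimal Python sketch of the idea. It is not rojaprove's actual code; the function names, the canary format, and the wording used to plant it are all assumptions for illustration.

```python
import secrets


def make_canary(prefix: str = "CANARY") -> str:
    """Build a random marker that is very unlikely to appear by chance.

    The prefix and the 16-hex-character suffix are illustrative choices.
    """
    return f"{prefix}-{secrets.token_hex(8)}"


def plant_canary(system_prompt: str, canary: str) -> str:
    """Append the canary to the system prompt as a fake secret (hypothetical wording)."""
    return f"{system_prompt}\n\nInternal reference (do not disclose): {canary}"


def canary_leaked(raw_response: str, canary: str) -> bool:
    """Deterministic verdict: is the exact canary string in the raw response?"""
    return canary in raw_response
```

The check is a plain substring match, so the answer is reproducible. The flip side is that it only catches the exact string: a response that leaks the canary encoded or paraphrased would pass this particular sketch.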

Every finding shows three things: the exact input sent, the raw response received, and the verdict. If it says your prompt leaked, you can see the literal moment it did: your secret sitting in the response text. If it says clean, that's because the canary never appeared in any response to the probes it sent.
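A finding like that can be sketched as a small record holding the evidence next to the verdict. This is an illustration only: the `Finding` type, its field names, and the `send` callable (standing in for whatever actually calls your endpoint) are assumptions, not rojaprove's real data model.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Finding:
    probe: str          # the exact input sent
    raw_response: str   # the raw response received, unmodified
    leaked: bool        # the verdict: did the canary appear in the response?


def check_probe(send: Callable[[str], str], probe: str, canary: str) -> Finding:
    """Send one probe and record the evidence together with the verdict.

    `send` is a hypothetical stand-in for the call to your own endpoint.
    """
    raw = send(probe)
    return Finding(probe=probe, raw_response=raw, leaked=canary in raw)
```

Because the verdict is computed from the stored `raw_response`, anyone reading the finding can recheck it by searching that text for the canary themselves.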

It's a smaller claim than "AI-powered vulnerability detection," but it's one I can actually stand behind: either the secret leaked or it didn't, and you can see which.

Free and open source, BYOK (bring your own key), tests only endpoints you own.
github.com/ghkfuddl1327-wq/rojaprove
https://x.com/OHS1327

Curious how others here think about false positives in security tooling — do you trust LLM-as-judge for this, or does it break down for you too?

Lee Ryeong

on June 12, 2026



The circularity of "AI judging AI" is something I've been thinking about too - curious what signal gap you noticed when the AI evaluator was calling things safe that still turned out to be issues?

IndieHacker07333 · 17 hours ago
  
  Honestly, I didn't get far enough into LLM-as-judge to collect clean false-negative data myself — I bailed earlier than that. What pushed me off it was the false-positive direction plus the published work on LLMs being near-random at judging paired vulnerable/safe code. Once I saw the judge couldn't reliably tell those apart, I stopped trusting its "safe" verdicts by the same logic — if it's guessing on the positives, a confident "safe" isn't worth much either.
  
  So I sidestepped the whole signal-gap question instead of trying to close it: plant a known canary, send the probe, check deterministically whether that exact string came back. The "did it leak" question has a ground truth, so there's no evaluator to second-guess.
  
  The tradeoff is it's a narrower claim — it only answers "did this specific secret surface," not "is this app broadly safe." But for the pre-launch check I cared about, I'd rather have a small true answer than a big maybe.
  
  Curious what you've seen on the false-negative side — were the misses more about the judge lacking context, or genuinely rating a bad response as fine?
  
  SamLee123 · 17 hours ago

One thing I'd be careful with:

The interesting question may not be whether deterministic checks produce fewer false positives.

It may be which kinds of trust a security tool ultimately needs to earn.

Those sound similar, but they can lead to very different product decisions over time.

aryan_sinh · 18 hours ago