2
3 Comments

"I stopped using AI to judge AI security. Here's what I do instead."

Why I made my LLM security tool show evidence instead of just flagging "risk"

When I started building rojaprove (a pre-launch red-team for LLM apps), the obvious approach was to let an LLM judge whether a response "looks" vulnerable. I dropped that fast.

The problem: LLMs are unreliable at judging vulnerability. You get confident false positives — "this might be exposed" when it isn't, and the reverse. For a security tool, that's worse than useless; it trains you to ignore it.

So I went the other way. rojaprove plants a canary in your system prompt, sends the actual attack probe at your endpoint, and does a deterministic check: did the canary string surface in the raw response, yes or no? No interpretation, no "the AI thinks."

Every finding shows three things: the exact input sent, the raw response received, and the verdict. If it says your prompt leaked, you can see the literal moment it did — your secret sitting in the response text. If it says clean, that's because the canary genuinely never appeared.

It's a smaller claim than "AI-powered vulnerability detection," but it's one I can actually stand behind: either the secret leaked or it didn't, and you can see which.

Free and open source, BYOK, tests only endpoints you own.
github.com/ghkfuddl1327-wq/rojaprove
https://x.com/OHS1327

Curious how others here think about false positives in security tooling — do you trust LLM-as-judge for this, or does it break down for you too?

on June 12, 2026
  1. 1

    The circularity of "AI judging AI" is something I've been thinking about too - curious what signal gap you noticed when the AI evaluator was calling things safe that still turned out to be issues?

    1. 1

      Honestly, I didn't get far enough into LLM-as-judge to collect clean false-negative data myself — I bailed earlier than that. What pushed me off it was the false-positive direction plus the published work on LLMs being near-random at judging paired vulnerable/safe code. Once I saw the judge couldn't reliably tell those apart, I stopped trusting its "safe" verdicts by the same logic — if it's guessing on the positives, a confident "safe" isn't worth much either.

      So I sidestepped the whole signal-gap question instead of trying to close it: plant a known canary, send the probe, check deterministically whether that exact string came back. The "did it leak" question has a ground truth, so there's no evaluator to second-guess.

      The tradeoff is it's a narrower claim — it only answers "did this specific secret surface," not "is this app broadly safe." But for the pre-launch check I cared about, I'd rather have a small true answer than a big maybe.

      Curious what you've seen on the false-negative side — were the misses more about the judge lacking context, or genuinely rating a bad response as fine?

  2. 1

    One thing I'd be careful with:

    The interesting question may not be whether deterministic checks produce fewer false positives.

    It may be which kinds of trust a security tool ultimately needs to earn.

    Those sound similar, but they can lead to very different product decisions over time.

Trending on Indie Hackers
6 weeks solo, 2 rejections, finally live but nobody told me marketing would be this hard User Avatar 93 comments Building ExpenseSpy solo, no funding — launching June 17 on iOS & Android User Avatar 45 comments Hi IH — quick update. The MVP is live. User Avatar 34 comments I built a $5/1k-listing CRE data API because CoStar is overkill for first-pass scans User Avatar 18 comments Day 7: 51 people answered my question. I wasn't ready for what they said. User Avatar 18 comments Building LinkCover – Day 3: Payment is live. No more building, time to sell. User Avatar 15 comments