2
6 Comments

"I stopped using AI to judge AI security. Here's what I do instead."

Why I made my LLM security tool show evidence instead of just flagging "risk"

When I started building rojaprove (a pre-launch red-team for LLM apps), the obvious approach was to let an LLM judge whether a response "looks" vulnerable. I dropped that fast.

The problem: LLMs are unreliable at judging vulnerability. You get confident false positives — "this might be exposed" when it isn't, and the reverse. For a security tool, that's worse than useless; it trains you to ignore it.

So I went the other way. rojaprove plants a canary in your system prompt, sends the actual attack probe at your endpoint, and does a deterministic check: did the canary string surface in the raw response, yes or no? No interpretation, no "the AI thinks."

Every finding shows three things: the exact input sent, the raw response received, and the verdict. If it says your prompt leaked, you can see the literal moment it did — your secret sitting in the response text. If it says clean, that's because the canary genuinely never appeared.

It's a smaller claim than "AI-powered vulnerability detection," but it's one I can actually stand behind: either the secret leaked or it didn't, and you can see which.

Free and open source, BYOK, tests only endpoints you own.
github.com/ghkfuddl1327-wq/rojaprove
https://x.com/OHS1327

Curious how others here think about false positives in security tooling — do you trust LLM-as-judge for this, or does it break down for you too?

on June 12, 2026
  1. 1

    This matches what I've seen testing AI tools across a lot of comparison reviews: LLM-as-judge fails exactly where you need it most. Two failure modes that bite: (1) it anchors on surface fluency, so a confidently-worded wrong answer outscores a hedged correct one, and (2) it leans toward agreeing with whatever framing is in the prompt, which for a security probe means it'll happily rate a real leak as 'probably fine.' The canary approach sidesteps both because there's nothing to interpret. The one place I'd still use a model is to generate diverse attack probes, never to grade them.

  2. 1

    Deterministic oracle over LLM-as-judge is the right call anywhere a ground truth exists, and a leaked canary is the cleanest one there is: the string surfaced or it didn't. Where it gets interesting is the class with no oracle to plant. Broken access control doesn't drop a known string in the response, it returns perfectly valid data that just belongs to the wrong person. "Did user A get user B's record" has no canary, since both records are real and well-formed. The deterministic move holds while the bug is "a secret surfaced," and loses its grip once it becomes "who was allowed to ask." Curious whether rojaprove stays in the leak-detection lane on purpose, or whether you've found a way to make the authz class deterministic too.

    1. 1

      You've put your finger on exactly where I drew the line — and yes, it's on purpose.

      The way I think about it: I only want to make a deterministic claim where a ground truth exists to check against. "Did this known string surface" has one. "Was this caller allowed to ask" doesn't — not without rojaprove knowing your authorization model, your roles, who should own record B. The moment it needs that, it's no longer reading the response; it's reasoning about your business logic, and I'm back to guessing. So I'd rather stay in the lane where the verdict is honest than stretch the word "deterministic" over a class it can't actually cover.

      The broken-access-control example is the perfect illustration: two real, well-formed records, no string to plant. That's genuinely a different problem, and I think it belongs to authz testing that knows your access model — not to a black-box prompt prober. Pretending otherwise would just reintroduce the false-confidence problem I was trying to escape.

      Where I think the canary trick can stretch a bit further is other "a secret surfaced" variants — indirect injection (did the planted instruction in a document change the output in a detectable way), or data exfil where you can seed a marker. Still ground-truth-shaped. Authz isn't, and I don't think I should fake it.

      Really good framing, though — "who was allowed to ask" vs "what surfaced" is a cleaner way to draw that boundary than I'd had words for. Mind if I borrow it?

  3. 1

    The circularity of "AI judging AI" is something I've been thinking about too - curious what signal gap you noticed when the AI evaluator was calling things safe that still turned out to be issues?

    1. 1

      Honestly, I didn't get far enough into LLM-as-judge to collect clean false-negative data myself — I bailed earlier than that. What pushed me off it was the false-positive direction plus the published work on LLMs being near-random at judging paired vulnerable/safe code. Once I saw the judge couldn't reliably tell those apart, I stopped trusting its "safe" verdicts by the same logic — if it's guessing on the positives, a confident "safe" isn't worth much either.

      So I sidestepped the whole signal-gap question instead of trying to close it: plant a known canary, send the probe, check deterministically whether that exact string came back. The "did it leak" question has a ground truth, so there's no evaluator to second-guess.

      The tradeoff is it's a narrower claim — it only answers "did this specific secret surface," not "is this app broadly safe." But for the pre-launch check I cared about, I'd rather have a small true answer than a big maybe.

      Curious what you've seen on the false-negative side — were the misses more about the judge lacking context, or genuinely rating a bad response as fine?

  4. 1

    One thing I'd be careful with:

    The interesting question may not be whether deterministic checks produce fewer false positives.

    It may be which kinds of trust a security tool ultimately needs to earn.

    Those sound similar, but they can lead to very different product decisions over time.

Trending on Indie Hackers
6 weeks solo, 2 rejections, finally live but nobody told me marketing would be this hard User Avatar 118 comments Building ExpenseSpy solo, no funding — launching June 17 on iOS & Android User Avatar 46 comments I built a $5/1k-listing CRE data API because CoStar is overkill for first-pass scans User Avatar 18 comments Building LinkCover – Day 3: Payment is live. No more building, time to sell. User Avatar 15 comments I just wanted to taste AI coding tools. A week passed. User Avatar 14 comments I Was Bypassing Every App Blocker, So I Built One That Fights Back User Avatar 11 comments