"I stopped using AI to judge AI security. Here's what I do instead."

by Lee Ryeong

Why I made my LLM security tool show evidence instead of just flagging "risk"

When I started building rojaprove (a pre-launch red-team for LLM apps), the obvious approach was to let an LLM judge whether a response "looks" vulnerable. I dropped that fast.

The problem: LLMs are unreliable at judging vulnerability. You get confident false positives — "this might be exposed" when it isn't, and the reverse. For a security tool, that's worse than useless; it trains you to ignore it.

So I went the other way. rojaprove plants a canary in your system prompt, sends the actual attack probe at your endpoint, and does a deterministic check: did the canary string surface in the raw response, yes or no? No interpretation, no "the AI thinks."

Every finding shows three things: the exact input sent, the raw response received, and the verdict. If it says your prompt leaked, you can see the literal moment it did — your secret sitting in the response text. If it says clean, that's because the canary genuinely never appeared.

It's a smaller claim than "AI-powered vulnerability detection," but it's one I can actually stand behind: either the secret leaked or it didn't, and you can see which.

Free and open source, BYOK, tests only endpoints you own.
github.com/ghkfuddl1327-wq/rojaprove
https://x.com/OHS1327

Curious how others here think about false positives in security tooling — do you trust LLM-as-judge for this, or does it break down for you too?

Lee Ryeong

on June 12, 2026

Say something nice to SamLee123…

Post Comment

1

This matches what I've seen testing AI tools across a lot of comparison reviews: LLM-as-judge fails exactly where you need it most. Two failure modes that bite: (1) it anchors on surface fluency, so a confidently-worded wrong answer outscores a hedged correct one, and (2) it leans toward agreeing with whatever framing is in the prompt, which for a security probe means it'll happily rate a real leak as 'probably fine.' The canary approach sidesteps both because there's nothing to interpret. The one place I'd still use a model is to generate diverse attack probes, never to grade them.

AtlasHQ

·
2 days ago
·
Reply
1. 1
  
  Both failure modes ring true, and the first one is sneaky: a confident wrong answer outscoring a hedged correct one. A judge that rewards fluency over correctness is actively dangerous for security, because attackers' best payloads tend to be the fluent, confident-sounding ones.
  
  The second is the one that scares me more for this domain — a judge agreeing with the prompt's framing means it'll rate a real leak as 'probably fine' precisely when the response is crafted to look fine. The failure correlates with the attack succeeding. That's the worst possible time to be wrong.
  
  And I think your last line is exactly the right split: model for generating diverse probes, never for grading them. Generation is where variety and creativity actually help, and a wrong/weird probe costs you nothing — you just run it. Grading is where a wrong call costs you everything. Right now my probe corpus is hand-written from public patterns, but using a model to expand the generation side while keeping the verdict deterministic is a direction I keep circling back to. Best of both, without letting the model near the verdict.
  
  Really useful comment — the "generate, don't grade" framing is sticky.
  
  SamLee123
  
  ·
  a day ago
  ·
  Reply
1

Deterministic oracle over LLM-as-judge is the right call anywhere a ground truth exists, and a leaked canary is the cleanest one there is: the string surfaced or it didn't. Where it gets interesting is the class with no oracle to plant. Broken access control doesn't drop a known string in the response, it returns perfectly valid data that just belongs to the wrong person. "Did user A get user B's record" has no canary, since both records are real and well-formed. The deterministic move holds while the bug is "a secret surfaced," and loses its grip once it becomes "who was allowed to ask." Curious whether rojaprove stays in the leak-detection lane on purpose, or whether you've found a way to make the authz class deterministic too.

chalermpon

·
2 days ago
·
Reply
1. 1
  
  You've put your finger on exactly where I drew the line — and yes, it's on purpose.
  
  The way I think about it: I only want to make a deterministic claim where a ground truth exists to check against. "Did this known string surface" has one. "Was this caller allowed to ask" doesn't — not without rojaprove knowing your authorization model, your roles, who should own record B. The moment it needs that, it's no longer reading the response; it's reasoning about your business logic, and I'm back to guessing. So I'd rather stay in the lane where the verdict is honest than stretch the word "deterministic" over a class it can't actually cover.
  
  The broken-access-control example is the perfect illustration: two real, well-formed records, no string to plant. That's genuinely a different problem, and I think it belongs to authz testing that knows your access model — not to a black-box prompt prober. Pretending otherwise would just reintroduce the false-confidence problem I was trying to escape.
  
  Where I think the canary trick can stretch a bit further is other "a secret surfaced" variants — indirect injection (did the planted instruction in a document change the output in a detectable way), or data exfil where you can seed a marker. Still ground-truth-shaped. Authz isn't, and I don't think I should fake it.
  
  Really good framing, though — "who was allowed to ask" vs "what surfaced" is a cleaner way to draw that boundary than I'd had words for. Mind if I borrow it?
  
  SamLee123
  
  ·
  2 days ago
  ·
  Reply
  1. 1
    
    Borrow away, it's yours. And the boundary holds: the canary owns the "a secret surfaced" family, authz sits on the other side because the oracle isn't in the response, it's in your access model. The only way authz goes deterministic is if you feed the tool two real users and assert A can never read B's row, but the moment you do that you've left black-box probing and you're testing the app's own rules. Refusing to stretch "deterministic" over a class it can't cover is the whole reason rojaprove reads as honest.
    
    chalermpon
    
    ·
    20 hours ago
    ·
    Reply
    1. 1
      
      "The oracle isn't in the response, it's in your access model" — that's the cleanest statement of the boundary I've seen, and it's going in how I explain this from now on. A leaked canary is self-evident from the output alone; A-reads-B's-row is only wrong relative to a rule the output can't show you.
      
      And you've put your finger on the exact tradeoff: yes, you can make authz deterministic by seeding two real users and asserting A never reads B's row — but the moment you do, you've stepped out of black-box probing and you're testing the app's own authorization model with privileged setup. That's a legitimate and valuable test, it's just a different tool with a different contract: it needs to know the app's identity and data model, where rojaprove deliberately knows nothing but a URL and a canary. Stretching one tool across both contracts is how you end up with a checkbox that's deterministic in the demo and hand-wavy in production.
      
      So rojaprove stays black-box and leak-shaped on purpose. Honest about the slice it owns, silent about the slice it doesn't.
      
      SamLee123
      
      ·
      14 hours ago
      ·
      Reply
1

The circularity of "AI judging AI" is something I've been thinking about too - curious what signal gap you noticed when the AI evaluator was calling things safe that still turned out to be issues?

IndieHacker07333

·
3 days ago
·
Reply
1. 1
  
  Honestly, I didn't get far enough into LLM-as-judge to collect clean false-negative data myself — I bailed earlier than that. What pushed me off it was the false-positive direction plus the published work on LLMs being near-random at judging paired vulnerable/safe code. Once I saw the judge couldn't reliably tell those apart, I stopped trusting its "safe" verdicts by the same logic — if it's guessing on the positives, a confident "safe" isn't worth much either.
  
  So I sidestepped the whole signal-gap question instead of trying to close it: plant a known canary, send the probe, check deterministically whether that exact string came back. The "did it leak" question has a ground truth, so there's no evaluator to second-guess.
  
  The tradeoff is it's a narrower claim — it only answers "did this specific secret surface," not "is this app broadly safe." But for the pre-launch check I cared about, I'd rather have a small true answer than a big maybe.
  
  Curious what you've seen on the false-negative side — were the misses more about the judge lacking context, or genuinely rating a bad response as fine?
  
  SamLee123
  
  ·
  3 days ago
  ·
  Reply
1

One thing I'd be careful with:

The interesting question may not be whether deterministic checks produce fewer false positives.

It may be which kinds of trust a security tool ultimately needs to earn.

Those sound similar, but they can lead to very different product decisions over time.

aryan_sinh

·
3 days ago
·
Reply