we built a second AI to argue with the first one. here's why

early on building aisa (aisa.to) we noticed our assessor was scoring based on vibes. confident answer? scores drift up. someone hedges? drift down. same actual skill level, different scores.

fix wasn't better prompting. we added a second AI pass that reviews the entire transcript and challenges every score -- "show me the exact quote where they demonstrated this."
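
a rough sketch of the "show me the exact quote" check, in python. this is not aisa's actual code. the names (`ScoredSkill`, `unsupported_scores`), the 1-5 scale, and the verbatim-match rule are all assumptions for illustration. the idea: every score has to cite a quote, and the quote has to actually appear in the transcript.

```python
from dataclasses import dataclass


@dataclass
class ScoredSkill:
    skill: str
    score: int  # hypothetical 1-5 scale, not taken from the post
    quote: str  # evidence the assessor cites for this score


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so matching tolerates formatting."""
    return " ".join(text.lower().split())


def unsupported_scores(transcript: str, scores: list[ScoredSkill]) -> list[ScoredSkill]:
    """Return scores whose cited quote is empty or missing from the transcript."""
    haystack = normalize(transcript)
    flagged = []
    for item in scores:
        needle = normalize(item.quote)
        if not needle or needle not in haystack:
            flagged.append(item)
    return flagged
```

anything this returns goes back for a challenge. a real second pass would also have to judge whether the quote demonstrates the skill, which a string match can't do.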

what changed:

scores decoupled from confidence, tracked actual evidence instead
caught people who named every tool but couldn't describe using any of them
calibrator sometimes upgrades scores too -- finds things the assessor missed
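
one way to picture how those three outcomes fall out of a single rule, sketched in python. everything here is hypothetical (the `Review` fields, the 1-5 scale, the floor of 1), and it's a simplification: evidence decides the final score, and the calibrator can move it in either direction.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Review:
    skill: str
    assessor_score: int  # first-pass score, hypothetical 1-5 scale
    calibrator_score: int  # second-pass score after reading the full transcript
    evidence_quote: Optional[str]  # quote the calibrator found, if any


def final_score(review: Review, floor: int = 1) -> int:
    """Pick the final score based on evidence, not on how confident the answer sounded."""
    if not review.evidence_quote:
        # nothing in the transcript backs the skill,
        # e.g. tools were named but never described in use
        return floor
    # with evidence, the calibrator's score wins, whether lower or higher
    return review.calibrator_score
```

so a name-dropper with a first-pass 5 and no quote ends at 1, while a hedging answer with a first-pass 2 and a solid quote can end at 4.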

if you're building AI that judges anything, one model isn't enough. the first model produces the score. the second checks that the evidence actually supports it.

building at aisa.to -- AI that assesses how well people actually use AI through conversation.

Ozan Dagdeviren

on June 12, 2026


Interesting.

The thing I'd be careful with is that evidence quality and decision quality aren't always the same thing.

A system can become much better at justifying a score while still quietly optimizing for the wrong underlying signal.

That's one of those decisions that tends to matter more than it appears to at first.

aryan_sinh · a day ago