early on building aisa (aisa.to) we noticed our assessor was scoring based on vibes. confident answer? scores drift up. someone hedges? drift down. same actual skill level, different scores.
fix wasn't better prompting. we added a second AI pass that reviews the entire transcript and challenges every score -- "show me the exact quote where they demonstrated this."
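a rough sketch of the cheapest part of that second pass: checking that the quote behind each score actually appears in the transcript. this is not our production code. `Score`, `normalize`, and `review_scores` are made-up names for illustration, and the real reviewer is a model that also judges whether the quote demonstrates the skill, which a string match can't do.

```python
from dataclasses import dataclass


@dataclass
class Score:
    """One score from the first pass (hypothetical shape, for illustration)."""
    skill: str
    value: int      # assumed scale, e.g. 1-5
    evidence: str   # quote the assessor claims supports the score


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences don't matter."""
    return " ".join(text.lower().split())


def review_scores(transcript: str, scores: list[Score]) -> list[tuple[Score, bool]]:
    """Pair each score with True if its evidence quote appears verbatim in the transcript."""
    haystack = normalize(transcript)
    results = []
    for score in scores:
        quote = normalize(score.evidence)
        # An empty quote never counts as support.
        supported = bool(quote) and quote in haystack
        results.append((score, supported))
    return results
```

anything that comes back `False` has no quote behind it, so it goes back for rescoring instead of being trusted.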
what changed:
if you're building AI that judges anything, one model isn't enough. the first model produces the score. the second makes it back that score with evidence.
building at aisa.to -- AI that assesses how well people actually use AI through conversation.
Interesting.
The thing I'd be careful with is that evidence quality and decision quality aren't always the same thing.
A system can become much better at justifying a score while still quietly optimizing for the wrong underlying signal.
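One way to probe for that, sketched under assumptions: score pairs of answers that carry the same content in a confident and a hedged phrasing, then look at the average gap. `tone_gap` and `score_fn` are hypothetical names. `score_fn` stands in for whatever assessor is being tested, and the pairs would have to be written by hand.

```python
from statistics import mean
from typing import Callable


def tone_gap(score_fn: Callable[[str], float], pairs: list[tuple[str, str]]) -> float:
    """Mean score difference between confident and hedged phrasings of the same content.

    Each pair is (confident_text, hedged_text). A gap near zero suggests the
    scorer is not rewarding tone. Raises statistics.StatisticsError if pairs is empty.
    """
    return mean(score_fn(confident) - score_fn(hedged) for confident, hedged in pairs)
```

A gap that stays well above zero after the evidence pass is added would suggest the scorer is still tracking confidence and has only become better at citing quotes for it.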
That's one of those decisions that tends to matter more than it appears to at first.