
we built a second AI to argue with the first one. here's why

early on building aisa (aisa.to) we noticed our assessor was scoring based on vibes. confident answer? scores drift up. someone hedges? drift down. same actual skill level, different scores.

fix wasn't better prompting. we added a second AI pass that reviews the entire transcript and challenges every score -- "show me the exact quote where they demonstrated this."

what changed:

  • scores decoupled from confidence, tracked actual evidence instead
  • caught people who named every tool but couldn't describe using any of them
  • calibrator sometimes upgrades scores too -- finds things the assessor missed
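a rough sketch of the "show me the exact quote" check, in case it helps. this is not aisa's actual code: the names (`ScoredSkill`, `verified_quotes`, `calibrate`), the 1-5 scale, and the drop-to-floor rule are all assumptions made up for illustration. it only covers the mechanical part, where a score survives only if its cited quote really appears in what the person said. the upgrade path (finding evidence the assessor missed) needs the second model itself and isn't shown.

```python
from dataclasses import dataclass, field


@dataclass
class ScoredSkill:
    skill: str
    score: int  # assessor's score; the 1-5 scale is an assumption
    quotes: list[str] = field(default_factory=list)  # quotes cited as evidence


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so quote matching is less brittle."""
    return " ".join(text.lower().split())


def verified_quotes(item: ScoredSkill, candidate_turns: list[str]) -> list[str]:
    """Keep only the quotes that appear verbatim in something the candidate said."""
    turns = [normalize(t) for t in candidate_turns]
    return [q for q in item.quotes if any(normalize(q) in t for t in turns)]


def calibrate(item: ScoredSkill, candidate_turns: list[str], floor: int = 1) -> ScoredSkill:
    """Challenge a score: with no verifiable quote it drops to the floor (hypothetical rule)."""
    evidence = verified_quotes(item, candidate_turns)
    score = item.score if evidence else floor
    return ScoredSkill(item.skill, score, evidence)


# made-up transcript: one confident name-dropper turn, one hedged but concrete turn
turns = [
    "I've used LangChain, LlamaIndex, and pretty much every vector DB.",
    "I'm not sure it was the best way, but I split the prompt into two steps "
    "and compared outputs on 20 saved cases.",
]
confident = ScoredSkill("tool use", 5, ["I built production pipelines with LangChain"])
hedged = ScoredSkill("prompt iteration", 2, ["compared outputs on 20 saved cases"])

print(calibrate(confident, turns))  # cited quote is not in the transcript
print(calibrate(hedged, turns))     # cited quote is in the transcript
```

the point of the design: tone never enters the check. the confident score loses its footing because its quote can't be found, and the hedged one keeps its score because its quote can.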

if you're building AI that judges anything, one model isn't enough. the first model produces the score. the second checks that the score is backed by evidence.

building at aisa.to -- AI that assesses how well people actually use AI through conversation.

on June 12, 2026

    Interesting.

    The thing I'd be careful with is that evidence quality and decision quality aren't always the same thing.

    A system can become much better at justifying a score while still quietly optimizing for the wrong underlying signal.

    That's one of those decisions that tend to matter more than they appear to at first.
