1 Comment

Question for builders using AI in their products

How do you evaluate if an AI answer is actually good?

Right now most people rely on:

• "Looks good to me"
• Basic prompt tweaks
• Trial and error

I'm experimenting with a tool that scores AI decisions from 0–100 based on clarity, risk and reasoning.

Would love to hear how others are handling this.

on March 11, 2026

    Scoring the output is hard when the input is a freeform text blob. If the answer is wrong, you can't tell if it's the role, the missing constraints, or the output format that caused it. Everything is entangled.

    The thing that helped me most was structuring prompts into typed semantic blocks before evaluating. Role separate from objective, constraints separate from examples. When you get a bad score, you swap one block, re-run, and see if the score moves. That's a testable input unit, not a wall of text.
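    The swap-one-block idea above can be sketched as a small ablation loop. This is a minimal illustration, not any particular tool's API: the block names mirror the comment (role, objective, constraints, examples), and the `score` function is a stand-in for whatever 0–100 evaluator you plug in.

```python
from typing import Callable

# Hypothetical typed blocks; names and contents are illustrative.
prompt_blocks = {
    "role": "You are a contracts analyst.",
    "objective": "Summarize the key obligations in the clause.",
    "constraints": "Answer in at most 3 bullet points.",
    "examples": "Clause: ... -> Summary: ...",
}

def assemble(blocks: dict[str, str]) -> str:
    """Join typed blocks into a single prompt string."""
    return "\n\n".join(f"[{name}]\n{text}" for name, text in blocks.items())

def ablate(blocks: dict[str, str],
           score: Callable[[str], float]) -> dict[str, float]:
    """Re-score the prompt with each block replaced by a neutral
    placeholder; a large score drop points at the block that was
    doing the work."""
    baseline = score(assemble(blocks))
    deltas = {}
    for name in blocks:
        variant = dict(blocks, **{name: "(omitted)"})
        deltas[name] = baseline - score(assemble(variant))
    return deltas
```

    Because each variant differs from the baseline by exactly one block, the score delta is attributable to that block instead of being entangled with everything else in the prompt.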

    I built flompt (https://flompt.dev) around this idea: decompose a prompt into 12 typed blocks, then compile them to XML. Evaluation becomes much cleaner when the input has structure. Open-source: github.com/Nyrok/flompt
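    The "typed blocks compiled to XML" step could look roughly like this. To be clear, this is a guess at the shape of the output, not flompt's actual schema or block names:

```python
import xml.etree.ElementTree as ET

def compile_to_xml(blocks: dict[str, str]) -> str:
    """Compile named prompt blocks into one XML document,
    one child element per block (tag names are assumptions)."""
    root = ET.Element("prompt")
    for name, text in blocks.items():
        ET.SubElement(root, name).text = text
    return ET.tostring(root, encoding="unicode")
```

    The payoff is that the compiled prompt keeps explicit boundaries between blocks, so an evaluator (or a human) can see exactly which part of the input it is judging.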

    If you find it useful, a star on github.com/Nyrok/flompt would mean a lot. Solo open-source project, every star helps with visibility.
