3 Comments

Question for builders using AI in their products

How do you evaluate if an AI answer is actually good?

Right now most people rely on:

• "Looks good to me"
• Basic prompt tweaks
• Trial and error

I'm experimenting with a tool that scores AI decisions from 0–100 based on clarity, risk, and reasoning.

Would love to hear how others are handling this.

on March 11, 2026
  1.

    Pass/fail on outcomes beats scoring prose. Did it complete? Did it break anything? Did it cost what you expected? Those three tell you more than a quality score ever will.
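    The three outcome checks above can be sketched as a single pass/fail gate. A minimal sketch; the function and parameter names are illustrative, not from any eval framework:

```python
def evaluate_run(completed: bool, side_effects: list, cost_usd: float,
                 budget_usd: float) -> bool:
    """Pass only if the task completed, broke nothing, and stayed on budget."""
    return completed and not side_effects and cost_usd <= budget_usd

print(evaluate_run(True, [], 0.04, 0.05))                    # -> True
print(evaluate_run(True, ["dropped a table"], 0.04, 0.05))   # -> False
```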

  2.

    What worked for us was a simple 3-layer eval loop:

    1. Task success metric (binary): did it complete the user’s goal?
    2. Risk metric (0–5): hallucination / policy / financial risk
    3. Cost metric: tokens + latency per successful run

    Then optimize for success per dollar, not just answer quality.

    Most teams miss layer 3, so they ship "great" answers that are too expensive to scale. We only started catching this after tracking token spend in real time per workflow.
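    The three-layer loop above can be sketched in a few lines. A hedged sketch, assuming a per-run record of success, risk, and token spend; the field names, risk threshold, and pricing are illustrative:

```python
from dataclasses import dataclass

@dataclass
class RunResult:
    succeeded: bool    # layer 1: did it complete the user's goal?
    risk_score: int    # layer 2: 0-5 hallucination / policy / financial risk
    tokens: int        # layer 3: token spend for this run
    latency_s: float   # layer 3: wall-clock latency

def success_per_dollar(runs, price_per_1k_tokens=0.01, max_risk=2):
    """Successful, low-risk runs divided by total token cost in dollars."""
    cost = sum(r.tokens for r in runs) / 1000 * price_per_1k_tokens
    wins = sum(1 for r in runs if r.succeeded and r.risk_score <= max_risk)
    return wins / cost if cost else 0.0

runs = [
    RunResult(True, 1, 1200, 2.1),
    RunResult(True, 4, 900, 1.7),   # succeeded but too risky -> excluded
    RunResult(False, 0, 3000, 4.0),
]
print(success_per_dollar(runs))
```

    Optimizing this one number instead of answer quality is what surfaces the expensive-but-pretty failure mode.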

  3.

    Scoring the output is hard when the input is a freeform text blob. If the answer is wrong, you can't tell if it's the role, the missing constraints, or the output format that caused it. Everything is entangled.

    The thing that helped me most was structuring prompts into typed semantic blocks before evaluating. Role separate from objective, constraints separate from examples. When you get a bad score, you swap one block, re-run, and see if the score moves. That's a testable input unit, not a wall of text.
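    In code, the idea looks roughly like this. A minimal sketch of typed blocks compiled to XML; the block names and example content are hypothetical, not any tool's actual schema:

```python
# Hypothetical typed blocks: each prompt concern gets its own named slot.
BLOCKS = {
    "role": "You are a contract-review assistant.",
    "objective": "Flag clauses that shift liability to the client.",
    "constraints": "Cite the clause number for every flag. No legal advice.",
    "output_format": "JSON list of {clause, risk, rationale}.",
}

def compile_prompt(blocks: dict) -> str:
    """Compile named blocks into an XML-tagged prompt string."""
    return "\n".join(f"<{k}>{v}</{k}>" for k, v in blocks.items())

# To isolate one variable, swap a single block and re-run the eval:
variant = {**BLOCKS, "constraints": "Cite clause number and quote the text."}
print(compile_prompt(variant))
```

    A bad score now points at one block instead of the whole blob.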

    I built flompt (https://flompt.dev) around this idea: decompose a prompt into 12 typed blocks, compile to XML. Evaluation becomes much cleaner when the input has structure. Open-source: github.com/Nyrok/flompt

    If you find it useful, a star on github.com/Nyrok/flompt would mean a lot. Solo open-source project, every star helps with visibility.
