How do you evaluate if an AI answer is actually good?
Right now most people rely on:
• "Looks good to me"
• Basic prompt tweaks
• Trial and error
I'm experimenting with a tool that scores AI decisions from 0–100 based on clarity, risk, and reasoning.
Would love to hear how others are handling this.
Scoring the output is hard when the input is a freeform text blob. If the answer is wrong, you can't tell whether the role, the missing constraints, or the output format caused it. Everything is entangled.
The thing that helped me most was structuring prompts into typed semantic blocks before evaluating. Role separate from objective, constraints separate from examples. When you get a bad score, you swap one block, re-run, and see if the score moves. That's a testable input unit, not a wall of text.
I built flompt (https://flompt.dev) around this idea: it decomposes a prompt into 12 typed blocks and compiles them to XML. Evaluation gets much cleaner when the input has structure. Open source: github.com/Nyrok/flompt
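To make the idea concrete, here's a minimal sketch of "typed blocks compiled to XML" plus the swap-one-block test described above. The block names (`role`, `objective`, `constraints`) and the `compile_prompt` helper are illustrative assumptions, not flompt's actual API or block taxonomy.

```python
import xml.etree.ElementTree as ET

def compile_prompt(blocks: dict[str, str]) -> str:
    # Hypothetical compiler: each typed block becomes one XML element.
    # flompt's real block types and output format differ.
    root = ET.Element("prompt")
    for block_type, content in blocks.items():
        child = ET.SubElement(root, block_type)
        child.text = content
    return ET.tostring(root, encoding="unicode")

base = {
    "role": "You are a support agent.",
    "objective": "Resolve the ticket in one reply.",
    "constraints": "Cite the policy section you relied on.",
}

# Swap a single block, re-run the eval, and see if the score moves.
variant = {**base, "constraints": "Keep the reply under 100 words."}

print(compile_prompt(base))
print(compile_prompt(variant))
```

The point isn't the XML itself; it's that each block is now an isolated, testable input unit, so a score change between `base` and `variant` can be attributed to exactly one block.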
If you find it useful, a star on github.com/Nyrok/flompt would mean a lot. It's a solo open-source project, and every star helps with visibility.