I'm building PromptPerf to solve a massive problem most AI developers are just beginning to understand: when models get discontinued, your carefully crafted prompts become instantly obsolete.
Think about it - testing ONE prompt properly requires:
• 4 models × 4 temperatures × 10 runs = 160 API calls
• Manual analysis of each result
• Comparing consistency (same prompt: 60% success on Model A vs 80% on Model B)
For an app with dozens of prompts, that means thousands of test runs and hundreds of hours of manual review.
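To make the arithmetic concrete, here is a minimal sketch of that test matrix in Python, assuming the OpenAI SDK; the model list, temperature grid, prompt, and the `passes` check are placeholder assumptions for illustration, not PromptPerf's implementation:

```python
import itertools
from collections import defaultdict

from openai import OpenAI  # pip install openai

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Placeholder grid: 4 models x 4 temperatures x 10 runs = 160 API calls.
MODELS = ["gpt-4o", "gpt-4o-mini", "gpt-4.1", "gpt-4.1-mini"]
TEMPERATURES = [0.0, 0.3, 0.7, 1.0]
RUNS = 10

PROMPT = "Summarize the following support ticket in one sentence: ..."

def passes(output: str) -> bool:
    # Hypothetical success check -- swap in your own assertion
    # (a regex, JSON-schema validation, an LLM-as-judge call, etc.).
    return output.strip() != ""

successes = defaultdict(int)  # (model, temperature) -> pass count

for model, temp, _run in itertools.product(MODELS, TEMPERATURES, range(RUNS)):
    resp = client.chat.completions.create(
        model=model,
        temperature=temp,
        messages=[{"role": "user", "content": PROMPT}],
    )
    if passes(resp.choices[0].message.content or ""):
        successes[(model, temp)] += 1

for (model, temp), count in sorted(successes.items()):
    print(f"{model} @ temp={temp}: {count}/{RUNS} passed")
```

Even this toy version makes the scaling visible: the loop above is 160 sequential network calls for a single prompt, before any analysis of the outputs has happened.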
PromptPerf automates this entire process. Our MVP launches in 2 weeks with early access for waitlist members.
Many developers don't see this crisis coming yet - sign up at https://promptperf.dev to help shape the solution and give feedback.
Great! I see this as a very important problem.
When OpenAI launched o4-mini, they removed o3-mini from ChatGPT. According to the benchmarks, o4-mini is better at every task, but in reality, based on my personal experience and numerous Reddit reports since the second day of the launch, o4-mini has a significantly higher hallucination rate and is also much lazier than o3-mini on coding tasks involving several hundred to a thousand lines of code. That was detrimental enough to my workflow that I deliberately made sure not to switch. For me, o4-mini is not an upgrade but a downgrade.
I suspect this is highly task-dependent, though, since o4-mini apparently is better in many other respects. That is exactly why systematically cross-checking the models matters.
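One concrete way to cross-check before migrating is to treat each model's pass count as a proportion and ask whether an observed gap could just be noise. A minimal stdlib-only sketch (my illustration, using a pooled two-proportion z-test; the normal approximation is rough at small n):

```python
import math

def two_proportion_p(successes_a: int, n_a: int,
                     successes_b: int, n_b: int) -> float:
    """Two-sided p-value for the difference between two success rates
    (pooled two-proportion z-test, normal approximation)."""
    p_pooled = (successes_a + successes_b) / (n_a + n_b)
    se = math.sqrt(p_pooled * (1 - p_pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0
    z = (successes_a / n_a - successes_b / n_b) / se
    # Phi(z) = 0.5 * (1 + erf(z / sqrt(2)))
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

# The 60% vs 80% example from the post, at 10 runs per model:
print(two_proportion_p(6, 10, 8, 10))  # ~0.33: not distinguishable at n=10
```

At 10 runs each, a 60% vs 80% split is not statistically distinguishable from chance, which is itself an argument for tooling that makes hundreds of runs cheap.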