16 days ago, I started tracking whether AI models can beat real-money prediction market crowds.
Every morning at 7 am, three AI models (Claude, Grok, and Gemini) independently call live Polymarket markets before resolution. Every call is timestamped and locked. Nothing is edited. Nothing is deleted. Each call is scored by Brier at resolution. We pull 20+ data feeds for grounding before the models debate.
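For anyone unfamiliar with Brier scoring, here's a minimal sketch of how each resolved call is graded. The field names are illustrative rather than the production schema, but the math is the standard binary Brier score: the squared gap between the locked probability and the actual outcome.

```typescript
// Brier score for a single binary call: (forecast probability - outcome)^2.
// 0.0 is a perfect call, 0.25 matches an always-50% coin flip, 1.0 is maximally wrong.
type ResolvedCall = {
  model: "claude" | "grok" | "gemini";
  probability: number; // locked probability of YES, set before resolution
  outcome: 0 | 1;      // 1 if the market resolved YES, 0 otherwise
};

const brier = (call: ResolvedCall): number =>
  (call.probability - call.outcome) ** 2;

// Average Brier over all resolved calls; lower is better.
const meanBrier = (calls: ResolvedCall[]): number =>
  calls.reduce((sum, c) => sum + brier(c), 0) / calls.length;
```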
Current status: 95 resolved calls, crowd leading, Δ80 divergence active on the Iran conflict market.
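To make the Δ notation concrete, here is a simplified reading of the divergence number: the gap, in percentage points, between the models' consensus probability and the crowd's implied probability (an illustrative sketch, not the exact dashboard code).

```typescript
// Illustrative: divergence as the absolute gap, in percentage points, between
// the models' consensus probability and the crowd's implied probability.
// e.g. models at 0.90 YES against a crowd price of $0.10 -> a delta of 80.
const divergence = (modelProb: number, crowdPrice: number): number =>
  Math.round(Math.abs(modelProb - crowdPrice) * 100);
```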
What I've learned so far:
Building in public with real accountability mechanics is harder than it sounds. On Day 14, we hit an infrastructure crisis: NULL rows from model-call failures. Instead of hiding it, we posted a public correction notice and fixed the methodology. That decision felt right.
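Concretely, a failed model call leaves a row with no locked probability. Here's a minimal sketch of the kind of guard that keeps those rows out of the Brier averages (illustrative types, not the production schema):

```typescript
// Illustrative types; the production schema differs.
type CallRow = {
  model: string;
  marketId: string;
  probability: number | null; // null when the model call failed
  outcome: 0 | 1 | null;      // null until the market resolves
};

// Only score rows with both a locked probability and a resolved outcome.
// Letting a NULL probability fall through as 0 would read as a confident NO.
const scoreable = (rows: CallRow[]) =>
  rows.filter((r) => r.probability !== null && r.outcome !== null);
```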
The crowd is winning on Brier right now. That's expected at Day 16. The thesis gets tested at Day 90+.
Stack: Lovable + Supabase + Stripe. Built nights and weekends.
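One way to get the "timestamped and locked" guarantee on that stack is an append-only Supabase table: each call is inserted with a timestamp, and row-level security blocks updates and deletes. A rough sketch with supabase-js (hypothetical table and column names, not the actual schema):

```typescript
import { createClient } from "@supabase/supabase-js";

const supabase = createClient(
  process.env.SUPABASE_URL!,
  process.env.SUPABASE_SERVICE_KEY!
);

// Hypothetical "calls" table with RLS policies that allow INSERT only,
// so a row can never be edited or deleted after it is written.
async function lockCall(model: string, marketId: string, probability: number) {
  const { error } = await supabase.from("calls").insert({
    model,
    market_id: marketId,
    probability,
    called_at: new Date().toISOString(),
  });
  if (error) throw error;
}
```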
What's next: First API for agent operators, first partnership call next week.
Happy to answer questions about the build, the methodology, or prediction markets generally.
emberfyi.com
Love the nothing edited rule — that kind of transparency is rare.
This is a really cool experiment — especially the commitment to “nothing edited, nothing deleted.” That level of transparency is rare and probably the most valuable part of the whole project.
Also respect for calling out the Day 14 issue publicly. Most people would quietly fix it, but that kind of correction actually builds more trust long-term.
Curious — how are you thinking about the models improving over time?
Are you keeping the setup fixed to test the original hypothesis, or planning to iterate on prompts/data sources as you go?
Also, do you think the real signal will come from where the models disagree with the crowd rather than overall Brier score?
The NULL-rows incident on Day 14 is the part that would have broken me. Posting a public correction instead of quietly patching is the exact discipline that makes a 365-day project survivable: not as a vanity move, but because unresolved guilt is what makes solo builders ghost their own projects around day 30.

I've been doing a much smaller build-in-public cadence on my own tiny iOS app, and the unexpected mental benefit has been that bad weeks stop feeling dangerous once they're written down.

Thesis-wise: are you planning to publish a visible "methodology change log" on the site, so late arrivals can distinguish "we changed the process on Day 100" from "we moved the goalposts"? That would feel like a strong credibility moat at Day 90+.
the Day 90+ part is important. Short experiments can be noisy, but prediction quality needs enough resolved markets before the signal becomes meaningful.
The no-edit rule is the hard part. Most forecasting experiments fall apart because people quietly update the prediction after the fact, and there's no paper trail.
Curious if you're tracking variance between models or just overall prediction accuracy. From running multi-LLM scoring on Shopify stores, GPTBot, ClaudeBot, and PerplexityBot behave pretty differently on the same data set.
Wondering if your 365-day record surfaces that kind of model drift over time.
The product insight here might be bigger than the benchmark itself.
If you can keep the methodology trusted, the obvious wedge is not “AI predictions” but infrastructure for people who want auditable forecasting workflows. The public record is what makes that interesting.
Ember, building a transparent, 365-day accountability loop for AI forecasting is exactly the kind of "proof of work" that the prediction space currently lacks. By forcing models to call live Polymarket events and scoring them via Brier at resolution, you’re creating a definitive benchmark that moves past the hype and actually quantifies the gap between crowd intelligence and LLM grounding.
I’m currently running Tokyo Lore, a project that highlights high-utility logic and validation-focused tools like yours. Since you’re building the definitive record for AI forecasting accuracy, entering your project could be the perfect way to turn this 365-day validation journey into a winning case study while your odds are at their absolute peak.