I use GitHub Copilot daily as a senior full-stack developer, mostly on backend-heavy and legacy codebases.
At some point, I realized the problem wasn’t Copilot itself; it was how inconsistent the outputs were depending on context and instructions. For migrations, refactors, and production code, “just autocomplete” wasn’t enough.
During a recent AI hackathon, our team built an MVP in one day and won first place. The biggest difference wasn’t the model or framework, but having structured prompt and instruction patterns to guide Copilot step by step.
That experience pushed me to start building CopilotHub, a small public directory where I collect:
I’m currently posting and iterating publicly to understand:
Would love feedback from others building with Copilot:
What’s the most frustrating thing you’ve hit when using AI on non-trivial projects?
Your hackathon insight cuts right to it: the delta between 'AI that kinda works' and 'AI you can actually ship' isn't raw model capability, it's the structure around the model.
What I've found building on real legacy codebases: the inconsistency problem splits into two distinct layers. One is prompt quality, which is exactly what your directory solves (which instructions reliably produce the output you want). The other is execution consistency: even with a well-crafted prompt, the same instruction produces subtly different code across different context windows and different session states. Both are real, but they need different solutions.
I'm curious whether your collection is already surfacing patterns around which prompt types suffer most from 'execution drift' vs. which ones are just badly written. From what I've seen, refactoring and migration prompts are the worst for drift — they seem highly sensitive to how much prior context exists in the session when they run.
The biggest frustration I keep hitting: there's no clean way to know if a prompt 'failed' because it was a bad prompt or because execution conditions changed underneath it. A/B testing prompts feels almost meaningless when the baseline itself isn't stable.
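One way to make that point concrete: before comparing two prompt variants, measure how much the *same* prompt varies across repeated runs. This is a minimal sketch of that baseline-stability check; `generate` is a hypothetical stand-in for the real Copilot/LLM call, simulated here with seeded randomness to mimic session-state drift.

```python
import difflib
import random

def generate(prompt, seed):
    # Hypothetical stand-in for a real model call. Real outputs vary with
    # session state; we simulate that with a seed-dependent variation.
    random.seed(seed)
    base = "def add(a, b):\n    return a + b"
    return base + ("  # memoize?" if random.random() < 0.5 else "")

def baseline_stability(prompt, runs=10):
    """Run one prompt repeatedly and average pairwise output similarity.

    A score near 1.0 means the baseline is stable, so A/B differences
    between prompt variants are likely meaningful. A low score means
    variant comparisons are mostly measuring noise, not prompt quality.
    """
    outputs = [generate(prompt, seed=i) for i in range(runs)]
    sims = [
        difflib.SequenceMatcher(None, outputs[i], outputs[j]).ratio()
        for i in range(runs)
        for j in range(i + 1, runs)
    ]
    return sum(sims) / len(sims)
```

The design choice is just to treat same-prompt variance as the noise floor: only trust an A/B result when the gap between variants clearly exceeds that floor.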
What are you seeing so far? Are specific prompt categories consistently misbehaving in ways that aren't just 'needs better wording'?
That frustration is very real — prompts feel powerful in demos but messy once real constraints show up.
At this stage, the clearest signal has come from pinning down which problem the directory actually removes first: faster setup, fewer bad prompts, or more consistent outputs across projects.
Curious: what’s the one behavior you’re watching to decide whether this is worth doubling down on?