I audited an app built 100% with AI. Here's what I found.

by Redion Bufi

A solo founder reached out before their launch. They'd built their entire web and mobile app using Claude Code — fast, functional, clean UI at first glance. They wanted a QA audit before real users touched it.

I logged 40+ issues. 12 were critical.

Here's what broke and why it matters if you're building with AI tools:

Edge cases the AI never considered

Most critical issues were edge cases — empty field submissions, network drops mid-action, special characters in inputs. The AI built exactly what it was told to build. Nobody told it to ask "what if this goes wrong?" So it didn't.
The app either crashed silently, threw a generic unhandled error, or worse — appeared to succeed while doing nothing.
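To make the edge cases above concrete, here is a minimal sketch of a form handler that refuses to "succeed while doing nothing". Everything in it is hypothetical (the `submit_display_name` function, the `SubmitResult` type, the 50-character limit, the dict used as a store); it is not the audited app's code, just one way to cover empty submissions and odd characters explicitly.

```python
from dataclasses import dataclass


@dataclass
class SubmitResult:
    ok: bool
    message: str  # always set, so the UI never has to guess what happened


def submit_display_name(raw: str, store: dict) -> SubmitResult:
    """Hypothetical handler: validate, write, then confirm the write landed."""
    name = raw.strip()
    if not name:
        # Empty or whitespace-only input fails loudly instead of saving nothing.
        return SubmitResult(False, "Name is required.")
    if len(name) > 50:  # assumed limit, for illustration only
        return SubmitResult(False, "Name must be 50 characters or fewer.")
    if not name.isprintable():
        # Rejects control characters; accented letters and apostrophes pass.
        return SubmitResult(False, "Name contains unsupported characters.")
    store["display_name"] = name
    if store.get("display_name") != name:
        # Report success only if the stored value matches what was sent.
        return SubmitResult(False, "Could not save. Please try again.")
    return SubmitResult(True, "Saved.")
```

The point of the sketch is that every branch returns an explicit result with a message, so there is no path where the user sees nothing.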

Regression after fixes
The founder went back to Claude Code to fix the reported issues. Several fixes broke adjacent features. A login flow fix broke session handling downstream. A UI fix on one screen misaligned another.
AI fixes what you tell it to fix. Precisely that, nothing more. Without someone tracking the full scope of changes, regression stacks up fast.

No consistency across error states
The same class of error — a failed network request — was handled differently across features. Modal here, inline message there, silence somewhere else. Each individually defensible. Together, an unpredictable experience that erodes user trust.
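One way to get a single behaviour per error class is to route every failure through one mapping function. The sketch below is an assumption-heavy illustration: the `ErrorKind` values, the messages, the "inline" vs "modal" choices, and the use of `PermissionError` as a stand-in for an expired session are all invented for the example.

```python
from dataclasses import dataclass
from enum import Enum


class ErrorKind(Enum):
    NETWORK = "network"
    AUTH_EXPIRED = "auth_expired"
    UNKNOWN = "unknown"


@dataclass(frozen=True)
class UserFacingError:
    kind: ErrorKind
    display: str  # "inline" or "modal": one choice per kind, app-wide
    message: str
    can_retry: bool


# The single source of truth for how each class of failure is presented.
_POLICY = {
    ErrorKind.NETWORK: UserFacingError(
        ErrorKind.NETWORK, "inline", "Connection lost. Please try again.", True
    ),
    ErrorKind.AUTH_EXPIRED: UserFacingError(
        ErrorKind.AUTH_EXPIRED, "modal", "Your session expired. Please log in.", False
    ),
    ErrorKind.UNKNOWN: UserFacingError(
        ErrorKind.UNKNOWN, "inline", "Something went wrong. Please try again.", True
    ),
}


def to_user_error(exc: Exception) -> UserFacingError:
    """Map any exception to the one agreed presentation for its class."""
    if isinstance(exc, (ConnectionError, TimeoutError)):
        return _POLICY[ErrorKind.NETWORK]
    if isinstance(exc, PermissionError):  # assumed stand-in for expired auth
        return _POLICY[ErrorKind.AUTH_EXPIRED]
    return _POLICY[ErrorKind.UNKNOWN]
```

With something like this in place, a feature cannot choose its own modal or stay silent; it can only ask the shared function what to show.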

AI has no memory across prompts. Nobody was holding the whole product in their head. That's a human job.

I'm Redion, founder of QAura — I do QA audits for startups, especially ones building with AI tools. Happy to answer questions or take a look at what you're building.
Are you shipping something built with Cursor, Bolt, or Lovable? Drop it below — I'll give you honest feedback.

Redion Bufi posted to Startups on June 9, 2026


1

Great writeup. The silent-success failures (looks like it worked, actually did nothing) are the scariest because they pass every casual test. The other big one I keep seeing in AI-built apps is Supabase RLS left off, so any logged-in user can read everyone's data. If it's useful, I'm happy to do a quick free look and tell you what's actually exposed before you launch.

TatePrograms

·
3 days ago
·
Reply
1

"appeared to succeed while doing nothing" is the scariest line in here. a crash you can debug. a silent wrong outcome just quietly erodes trust until the user leaves and never explains why.

the regression thing isn't really a code quality problem — it's a "nobody holding the whole product in their head" problem. same thing happens with junior devs, just less consistent.

how many of the 12 criticals did the founder actually fix before launch?

ovidon83

·
a month ago
·
Reply
1. 1
  
  Honestly all 12 needed fixing before launch, and the founder did fix them all before going live. Took a few rounds since some fixes caused the regression issues I mentioned in the post. The crash vs silent wrong outcome point is exactly right too, at least a crash tells you something. Silent failures just lose you the user with no signal anything went wrong.
  
  redion
  
  ·
  a month ago
  ·
  Reply
1

This is one of the most accurate descriptions I've read of what actually breaks when AI generates code without structure behind it. The regression pattern is the one that costs teams the most time. The AI fixes exactly what you ask it to fix, and nothing adjacent. Because it has no model of how the system fits together, only of the code in front of it. The inconsistent error handling across features is the same root cause — no shared spec that says "this is how our system behaves when a network request fails." Each prompt is fresh context, so you get locally reasonable decisions that are globally incoherent. What you're describing as a QA problem is upstream of QA. It starts at requirements. If there's no structured spec that the AI is generating code against, there's no way to verify the output without reading every line. I've seen teams spend more time debugging AI-generated code than they would have spent just writing it. The discipline of spec-first development isn't slowing AI down, it's what makes AI output trustworthy at scale. Have you seen patterns in where the bugs concentrate relative to how the original requirements were written?

guy_powell

·
a month ago
·
Reply
1. 1
  
  Really well put, the inconsistency is upstream of QA, you're right. When there's no shared spec the AI is reasoning against, every prompt invents its own rules. QA ends up catching the symptoms of a missing architecture decision, not a bug in the traditional sense. To your question yes, there's a clear pattern. The bugs concentrate most in features that were described at a high level with no failure modes defined. The more vague the original requirement, the more the AI filled in the gaps with its own assumptions, and those assumptions are where the critical issues lived.
  
  redion
  
  ·
  a month ago
  ·
  Reply
1

The edge case nobody in this thread has named: email delivery itself.
AI builds the signup flow, generates the verification email, the happy path test passes — but did the email actually land in a real inbox? In CI most teams either mock it (lying to yourself) or use a shared Gmail (collision hell in parallel runs).
Built ZeroDrop for exactly this gap — disposable inboxes caught at Cloudflare's edge, OTP auto-extracted, works in Playwright without Docker. Free, no signup required.
The "appears to succeed while doing nothing" failure mode you described is precisely what happens when you fake email delivery instead of testing it for real.

zerodrop

·
a month ago
·
Reply
1. 2
  
  Good addition, email delivery is exactly the kind of thing that gets mocked in testing and forgotten until a real user never receives their verification email. Testing against a real inbox is the only honest check. Will keep ZeroDrop in mind, that "appears to succeed while doing nothing" failure mode is brutal when it's an email that never arrives.
  
  redion
  
  ·
  a month ago
  ·
  Reply
  1. 1
    
    Exactly — and the brutal part is it's invisible. The test is green, the CI passes, the PR merges. Nobody finds out until a real user complains they never got the email.
    Appreciate you flagging it. If you ever audit an app where email testing is part of the scope, happy to help set it up.
    
    zerodrop
    
    ·
    a month ago
    ·
    Reply
    1. 1
      
      That's the worst kind of bug, everything looks green until a real user is stuck on a broken signup. Appreciate the offer, will keep it in mind when email flows are in scope
      
      redion
      
      ·
      a month ago
      ·
      Reply
1

Exactly. And the only way I've found to actually catch it is to assert on the outcome, not the status: did the row really change, did the message really arrive, not just 'did the call return 200'. Status codes lie, state doesn't. That's the one check the AI never writes for itself unless you make it.
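A small sketch of the difference, using a hypothetical in-memory `FakeProfileApi` with a deliberate silent-success bug. The class, the method names, and the data are made up for illustration; the only claim is that the status-only check passes on the bug while the outcome check catches it.

```python
class FakeProfileApi:
    """Hypothetical in-memory backend with a silent-success bug."""

    def __init__(self) -> None:
        self.rows = {1: {"email": "old@example.com"}}

    def update_email(self, user_id: int, email: str) -> int:
        # Bug: unknown users are skipped, but the call still reports 200.
        if user_id in self.rows:
            self.rows[user_id]["email"] = email
        return 200


def check_status_only(api: FakeProfileApi, user_id: int, email: str) -> bool:
    """The check that tends to get written: did the call return 200?"""
    return api.update_email(user_id, email) == 200


def check_outcome(api: FakeProfileApi, user_id: int, email: str) -> bool:
    """The check that catches the bug: did the row really change?"""
    status = api.update_email(user_id, email)
    row = api.rows.get(user_id)
    return status == 200 and row is not None and row["email"] == email
```

For a user ID that does not exist, `check_status_only` returns True and `check_outcome` returns False, which is the whole argument in two lines.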

worvi26

·
a month ago
·
Reply
1. 1
  
  Assert on the outcome, not the status, that's the rule. The AI writes the happy path check because that's what it was asked to do. Nobody asked it to verify the actual state changed. That one habit closes more silent bugs than anything else.
  
  redion
  
  ·
  a month ago
  ·
  Reply
1

The pattern you're identifying - structural soundness vs. correctness-of-logic drift - shows up consistently in AI-assisted codebases.

For infrastructure-layer code (webhooks, event handling, retry logic), this drift tends to be more dangerous because the failures are silent or delayed. A component that "works" in testing might silently drop events under load because the AI optimized for the happy path.

For goffer.ai (legislative webhook delivery), the pieces that needed the most human review weren't the features - they were the delivery guarantees and idempotency logic. Those are exactly the parts where AI-generated code looks plausible but has subtle invariant violations.
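For readers who have not met idempotency logic before, a minimal sketch follows. It is not how goffer.ai works (that is not described here); `WebhookProcessor`, the `"id"` field, and the in-memory set are assumptions for illustration. A real system would persist the seen IDs so they survive restarts.

```python
class WebhookProcessor:
    """Hypothetical handler that applies each event at most once."""

    def __init__(self) -> None:
        self.seen_ids = set()  # in-memory only; persist this in a real system
        self.applied = []      # events whose side effects have been applied

    def handle(self, event: dict) -> str:
        event_id = event.get("id")
        if not event_id:
            # Without an ID a redelivery cannot be detected, so refuse the event.
            return "rejected"
        if event_id in self.seen_ids:
            # Redelivery of an event already applied: acknowledge, do nothing.
            return "duplicate"
        self.applied.append(event)
        self.seen_ids.add(event_id)
        return "processed"
```

Returning a distinct value for each case also means a dropped or repeated event leaves a trace, instead of looking identical to a normal delivery.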

3vo

·
a month ago
·
Reply
1. 1
  
  Webhook and delivery logic is exactly where things look fine in testing but quietly break under real traffic, events get dropped or processed twice and there's no error to tell you something went wrong. The AI writes code that looks right but 'looks right' is not enough when reliability actually matters. That's the kind of thing that needs a human who understands what the end result should be, not just what the code does
  
  redion
  
  ·
  a month ago
  ·
  Reply
  1. 1
    
    The "200 with empty body" case is the one that gets teams. Your system thinks it succeeded, the retry logic never fires, and you find out three hours later when someone checks the DB. The fix isn't smarter code. It's explicitly verifying the outcome, not just the response code. Log the full payload on every incoming event before touching anything. That one rule catches most of the silent failures we've hit.
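A sketch of what "verify the outcome, not just the response code" might look like for the empty-200 case. The `delivery_succeeded` helper and the `delivered_id` field are assumptions, not part of any real API mentioned in this thread; the logging call adapts the "log the full payload first" rule to the response side.

```python
import json
import logging

logger = logging.getLogger("delivery")


def delivery_succeeded(status: int, body: str) -> bool:
    """Return True only when the response proves the work happened."""
    # Log the raw response before interpreting it, so a record exists either way.
    logger.info("response status=%s body=%r", status, body)
    if status != 200:
        return False
    if not body.strip():
        # 200 with an empty body: treat as failure so the retry path can run.
        return False
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return False
    # Assumed contract: a real delivery echoes back a non-empty "delivered_id".
    return isinstance(payload, dict) and bool(payload.get("delivered_id"))
```

The caller would retry whenever this returns False, so an empty 200 no longer short-circuits the retry logic.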
    
    3vo
    
    ·
    a month ago
    ·
    Reply
1

This matches what I've seen building with AI too — not as a QA auditor, but as the person directing the build.

The "happy path looks done, edge cases don't exist until a human asks" pattern is real. I had the same class of surprise: output looked nearly production-ready on the main flow, but the gaps showed up where nobody had written the failure question yet — empty states, retries, "what if the API returns 200 with empty body."

Your regression point is the one I'd stress to founders. AI fixes the file you point at, not the system. I learned to treat every fix as a mini release: one flow, one verification pass, before touching the next complaint. Without that, adjacent screens drift fast — especially session/auth and layout shared components.

The inconsistent error handling across features also rang true. AI generates each screen in isolation unless someone holds a product-wide error contract (same copy pattern, same retry affordance, same "offline" behavior). That's less a model problem than a missing spec layer.

Genuine question for you: in the audits you've done, do you find founders catch more value from a written test checklist you leave behind, or from you walking through 5–10 critical flows live? Trying to figure out how much QA discipline to bake in before shipping vs after first users.

Appreciate you sharing the 12 severe / 40 total breakdown — that ratio alone is a useful benchmark.

eddwardpark

·
a month ago
·
Reply
1. 1
  
  Both have value but in different ways. The Launch Readiness Report we deliver gives founders something concrete to act on, weak spots, critical paths, what needs fixing before real users show up. The live walkthrough is something we do with some clients at the end, walking through the weak flows together so it clicks in real time. Depends on the client and what they need.
  If you're trying to figure out when to start, even a basic checklist pass before first users beats discovering the gaps from a bad review.
  
  redion
  
  ·
  a month ago
  ·
  Reply
  1. 1
    
    100% agree. The boring checklist pass is the one people skip — then a bad review does the audit for them.
    
    The walkthrough is great when someone needs to feel the weak flows, not just read them. Different jobs.
    
    Thanks for adding that — helpful framing.
    
    eddwardpark
    
    ·
    a month ago
    ·
    Reply
1

I agree with the core point here. A real product still needs human ownership. AI can speed up execution, but it shouldn’t be the thing defining product boundaries, safety decisions, interaction quality, or how edge cases are handled. If you rely on AI to figure out all of that on its own, you lose confidence in what will actually happen once real users start using the product. And in the end, that uncertainty shows up as broken trust and a worse user experience.

Lyxen

·
a month ago
·
Reply
1. 1
  
  Well said, AI handles the execution, humans own the decisions. The moment you hand over the decision making is the moment you lose control of what actually ships.
  
  redion
  
  ·
  a month ago
  ·
  Reply
1

Which part of this was not written with AI?

gillygangopulus

·
a month ago
·
Reply
1. 1
  
  The thoughts are mine. I use AI to turn them into structured text, the same way you use it to build. The difference is that I still do my testing manually, without prompts, and bring the human user's experience to AI builds.
  
  redion
  
  ·
  a month ago
  ·
  Reply
1

The regression point is the one nobody talks about.
You fix one thing, Claude breaks two others. Then you fix those and something else shifts. After a while you're not building anymore — you're just chasing your own tail.
I felt this building my first project. The AI has no idea what it built last session. You're the only one holding the full picture and if you lose track of it for a day, good luck.
How do you scope a QA audit for a solo founder with no budget — is there a minimum viable version of what you do?

isuki_raj

·
a month ago
·
Reply
1. 1
  
  The minimum viable version is a focused pass on the critical paths only like login, core user flow, payments if there are any. Not trying to find every bug, just the things that would hurt most on launch day. Happy to take a quick look at what you're building and give you an honest estimate of what needs attention, no commitment. Just drop the link.
  
  redion
  
  ·
  a month ago
  ·
  Reply
  1. 1
    
    I'm building a Notion workspace auditor: it connects to your workspace and tells you exactly what's cluttered, unused, and slowing you down. Still in early stages, but happy to share once it's testable. My landing page is live, but I can't share the link here because I'm new to Indie Hackers.
    
    isuki_raj
    
    ·
    a month ago
    ·
    Reply
    1. 1
      
      Sounds like a useful tool, Notion workspaces get messy fast. No worries on the link, just DM me when it's ready for testing and I'll take a look.
      
      redion
      
      ·
      a month ago
      ·
      Reply
      1. 1
        
        Sure, thanks. Right now I'm building an audience and I'm in the validation stage. If I see demand, I'll build the MVP. But getting an audience is so hard. https://notion-audit-ruddy.vercel.app/
        
        isuki_raj
        
        ·
        a month ago
        ·
        Reply
        
        1
        
        Took a quick look, clean and simple for a validation landing page. One small thing: the cursor shows as a big cross instead of the normal pointer. Worth fixing, feels off when you're hovering around the page.
        
        redion
        
        ·
        a month ago
        ·
        Reply
        
        1
        
        Thanks. I will surely update the cursor.
        
        isuki_raj
        
        ·
        a month ago
        ·
        Reply
1

Great audit — this mirrors what I've seen too. AI-built apps often nail the surface layer but struggle with error handling and edge cases. The 'happy path only' problem is real. The interesting shift happening now is that experienced devs using AI as a tool (not a replacement) produce much cleaner output than pure vibe-coding. The audit checklist approach you mentioned is exactly what teams need before shipping AI-generated code to production.

darshilwebix8

·
a month ago
·
Reply
1. 1
  
  Exactly, the tool vs replacement mindset makes all the difference. Experienced devs using AI still bring the judgment layer, they just move faster. Vibe coders get the speed without the experience to know what to question. That gap is where the bugs live.
  
  redion
  
  ·
  a month ago
  ·
  Reply
1

Can you audit mine?

Voidwatcher87

·
a month ago
·
Reply
1. 1
  
  Yes, sure. Drop the link and I'll take a quick look.
  
  redion
  
  ·
  a month ago
  ·
  Reply
1

The mobile-specific audit I’d add: can a real user understand the permission model before the first scary OS dialog? For Kinetic Override, the hard part is not the Android macro-recorder feature list — it is explaining “local profiles, no account, no ads” clearly enough that a power permission does not feel sketchy.

herold33

·
a month ago
·
Reply
1. 1
  
  That's a real UX gap that gets missed a lot. The permission dialog arrives before the user fully trusts the app, so the framing before it matters as much as the permission itself. 'Local profiles, no account, no ads' upfront does a lot of work to make a powerful permission feel safe rather than sketchy.
  It's less of a QA issue and more of a trust design issue, but it absolutely affects whether users complete onboarding.
  
  redion
  
  ·
  a month ago
  ·
  Reply
1

The consistency-across-error-states point is the one I'd underline. I run a review site solo and lean on AI for a lot of the build, and I hit the same thing — not in code, but in content. Each AI-generated page is individually fine, but pull 20 of them together and the voice drifts, the structure wanders, the internal logic stops matching. Same root cause you named: no memory across prompts, nobody holding the whole thing in their head. The fix that's worked for me is a written spec the AI has to conform to every single time — turns "build me a thing" into "build me this thing, these rules." Curious whether your audits ever surface that kind of systemic drift, or whether it's almost always discrete bugs?

smarttrendsai

·
a month ago
·
Reply
1. 1
  
  Both, honestly, and the systemic drift is harder to report because it's not a discrete bug you can point at. It's more like 'this feature feels different from that one': same app, different personality depending on which prompt built it.
  I do flag it in audits, but founders find it harder to prioritize than a broken button. The written spec as a conformance document is the right fix. The same principle applies to code as to content: give the AI the rules once and reference them every time.
  
  redion
  
  ·
  a month ago
  ·
  Reply
  1. 1
    
    Yeah, that tracks. A broken button is obvious and gets fixed; "this part feels off" sits in the backlog forever because nobody can put their finger on it. What's worked for me is reframing drift as a real cost — inconsistent voice or UX quietly erodes trust the same way a bug does, users just can't articulate why they bounced. Harder to sell to a founder than a crash, but it's the thing that decides whether the whole thing feels like one product or ten stitched together.
    
    smarttrendsai
    
    ·
    a month ago
    ·
    Reply
    1. 1
      
      Yes that's right, users won't say 'your error states are inconsistent,' they'll just say the app feels rough and move on. The cost is real, it's just invisible on a bug tracker. That's why I flag it in audits even when founders deprioritize it.
      
      redion
      
      ·
      a month ago
      ·
      Reply
1

The regression issue you described is the one that surprises founders the most. With a human codebase you build up an implicit mental model of what touches what. With AI-generated code that mental model doesn't exist — each fix is essentially context-free unless you explicitly scope it.

One thing that helped on a project I QA'd: a lightweight "change impact" prompt before any fix. Force Claude to list every component that shares state with the thing being changed before touching it. Slows down iteration but the regression rate dropped significantly.

The edge cases point is spot on too. AI optimizes for the happy path by default. You have to explicitly prompt for adversarial scenarios — empty inputs, network timeouts, race conditions — or they simply don't exist in the output.

Good writeup, would be curious what percentage of the 12 critical issues were detectable with basic automated tests vs required manual QA.

BrandPulseHQ

·
2 months ago
·
Reply
1. 1
  
  Roughly half could have been caught with automated tests if they existed, empty state failures, broken flows on specific conditions, a few consistency issues. The other half needed manual QA because they required judgment, knowing that a certain combination of inputs felt wrong even when the app didn't crash. The change impact prompt before every fix is a smart habit, forcing the AI to map what it's about to touch before touching it is basically the system model it doesn't have by default.
  
  redion
  
  ·
  a month ago
  ·
  Reply
1

The point about edge cases resonates. AI is surprisingly good at building the happy path, but users are experts at finding every path you did not think about.

One thing I've noticed is that most of these issues are symptoms of missing product thinking rather than bad code generation. If you do not define what should happen when a payment fails, a request times out, or a user refreshes mid workflow, the implementation becomes guesswork whether it's written by AI or a human.

The founders getting the most out of AI seem to treat it like a very fast engineer, not a product owner. The speed advantage is real, but someone still has to own the requirements, testing strategy, and system behavior across the whole product. That's usually where quality is won or lost.

muhammadtanveerabbas

·
2 months ago
·
Reply
1. 1
  
  Exactly, AI is a fast executor, not a decision maker. The quality gap isn't in the code, it's in the decisions nobody made before the code was written. Founders who get that and stay in the product owner seat get much better results than ones who hand the wheel over completely.
  
  redion
  
  ·
  2 months ago
  ·
  Reply
1
Strongly agree — and as someone who builds with AI but came from a dev background, you've nailed why it happens. Edge and error cases don't get handled because the person prompting has never been burned by them. If you've never watched a network call fail mid-write and corrupt state, it doesn't occur to you to guard against it. The root is the experience gap, not the tool.
Two things that have helped me, especially for non-dev vibe-coders:
1. Keep a standing "things that go wrong" checklist as a markdown file — empty inputs, network drops, special characters, double-submits, expired auth — and feed it to the AI as a rules file so it's in context every prompt, instead of relying on memory. It doesn't replace a human holding the whole product in their head, but it encodes the part that IS checklist-able.
2. Centralize error handling in one function instead of per-feature. That hits your third point directly — one source of truth for "what happens when something fails," so the AI follows one pattern instead of reinventing it per screen.
  Neither removes the need for an audit like yours — they just clear the noise so QA can focus on the genuinely hard stuff. Great write-up.
nocturne9no1

·
2 months ago
·
Reply
1. 1
  
  The standing 'things that go wrong' checklist as a rules file is a really practical idea, encodes the experience the AI doesn't have and keeps it consistent across sessions. And centralizing error handling is exactly right, one pattern the AI can follow beats six different ones it invented per screen.
  These don't replace a QA pass but they make the audit much cleaner when it happens. Thanks for sharing these.
  
  redion
  
  ·
  2 months ago
  ·
  Reply
1
Setting the foundation and architecture is just as important when building an app with coding agents. I've built an app this way too, and here is what I learned.
1. Before writing any code, work on user stories and define a framework that unifies how the app should behave, e.g. design language, input validation, security.
2. Give this input to the agent and ask for a plan. It could be a technical or functional document. This is the main artefact, as it sets the direction. Review it and lock it.
3. In every prompting session, give this as the reference document and instruct the agent to adhere to it without deviating. Prompt structuring or Skills in Claude can help achieve this.
4. After the implementation, ask the agent to scan for security vulnerabilities.
I wouldn't say this gives a 100% result, but it will definitely minimize the issues.

One thing to note: coding agents are capable of building an entire app in a matter of hours. But it's the constant review, and making sure the implementation follows the agreed direction, that minimizes the risk of regression issues after building a product.
NitinBuilds

·
2 months ago
·
Reply
1. 2
  
  The reference document idea is the key one, most people just start prompting and wonder why things drift. Having a locked spec that every session points back to basically gives the AI the memory it doesn't have by default. Simple but makes a big difference.
  
  redion
  
  ·
  2 months ago
  ·
  Reply
1

This matches my experience almost perfectly.
I've been building an AI companion for the last several months, and most of the difficult bugs weren't in the AI itself. They were in memory, edge cases, state management, and all the situations real users create without even trying.
One thing I've learned is that AI can generate code surprisingly well, but understanding how a product behaves after weeks of changes is still very much a human responsibility.

HCReal

·
2 months ago
·
Reply
1. 1
  
  Memory and state management across sessions is where AI-built products fall apart the most in my experience, the happy path works great on a fresh account, then real users with weeks of history start hitting all the weird states nobody thought to test. You're right that tracking how a product behaves over time is still fully a human job.
  
  redion
  
  ·
  2 months ago
  ·
  Reply
1

There is a 4th failure mode that does not show up in technical audits: the spec itself going stale.

Build a healthcare billing feature or restaurant tip-pooling tool with Claude Code, and the AI builds exactly what you specified. But if a federal bill advances and changes the underlying rules, the code passes QA and the product is functionally wrong. Nobody told the AI to ask "what if the law changes."

It is why I am building BillWatch (billwatch-landing.vercel.app) - plain-English federal bill alerts when something moves through Congress that affects your industry. If you are building in a regulated space, this is the layer to add before launch.

3vo

·
2 months ago
·
Reply
1. 1
  
  That's a real gap and an underrated one, technical QA catches broken code, not outdated logic. A feature can pass every test and still be wrong because the rules it was built on changed.
  Billing and tip pooling are exactly the kind of areas where that bites.
  
  Took a quick look at your website. Clean landing page, honestly. One small thing: the FAQ items need two clicks to expand on the first try. Minor but worth fixing, first impressions count on a landing page.
  
  redion
  
  ·
  2 months ago
  ·
  Reply
  1. 1
    
    That FAQ bug is on the list. Two clicks on first expand is the kind of thing that gets missed in testing because you already know what you're clicking. Good catch. The 'passes every test, still wrong' problem is the harder one -- you need tests that verify intent, not just behavior. That's the part AI-generated code gets wrong most consistently.
    
    3vo
    
    ·
    a month ago
    ·
    Reply
    1. 1
      
      Exactly, testing behavior is easy, testing intent is hard. A test can confirm the form submitted successfully but it can't confirm it should have been allowed to submit in the first place. That judgment still needs a human who understands what the product is actually supposed to do.
      
      redion
      
      ·
      a month ago
      ·
      Reply
      1. 1
        
        The intent gap is the one that's hardest to paper over with process. You can add more code review, more tests, more linting -- but if the thing being reviewed is already 10 layers of abstraction away from what the founder actually had in their head, the human reviewer is just confirming it's internally consistent, not that it's right. That's when the bugs get shipped with high confidence.
        
        3vo
        
        ·
        a month ago
        ·
        Reply
        
        1
        
        High confidence, wrong output, that's the scariest combination. Everything looks reviewed and solid, it just never matched the original intent. By the time someone catches it, it's already in production.
        
        redion
        
        ·
        a month ago
        ·
        Reply
1

The regression part is the one people underestimate. I build WhatsApp bots with AI assistance and the pattern is always the same: the fix works, but it quietly breaks something two features away because the model only holds the slice you showed it. The "appeared to succeed while doing nothing" failure mode is brutal with messaging APIs specifically - the API accepts your call, everything looks fine in the demo, and messages silently drop in production. Boring smoke tests after every change saved me more than anything clever. Curious: after your audits, do founders actually adopt a process, or do most just keep prompting until the symptoms disappear?
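A "boring smoke test" runner can be very small. The sketch below is hypothetical: `run_smoke_tests` and the flow names are invented, and each check is assumed to be a zero-argument function that returns True when its critical flow still works.

```python
from typing import Callable, Dict, List


def run_smoke_tests(checks: Dict[str, Callable[[], bool]]) -> List[str]:
    """Run every check and return the names of the flows that failed."""
    failed = []
    for name, check in checks.items():
        try:
            passed = check()
        except Exception:
            # A crash in a critical flow counts as a failure, not a skip.
            passed = False
        if not passed:
            failed.append(name)
    return failed
```

Running it after every change, and refusing to move on while the returned list is non-empty, is the whole process; the value comes from never skipping it.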

worvi26

·
2 months ago
·
Reply
1. 1
  
  Honestly mixed. Some founders take the audit as a wake up call and build real habits around it. Most go back to prompting until something breaks again.
  The ones who stick with it are usually the ones who got burned badly enough once: a bad launch, an angry user, lost data. Pain is a better teacher than a report.
  The boring smoke test after every change is underrated exactly because it's boring and easy to skip when you're moving fast.
  
  redion
  
  ·
  2 months ago
  ·
  Reply
1

This is an incredibly accurate breakdown of the "AI-generated debt" trap.
The regression issue is the absolute worst. Because AI tools currently lack a holistic mental model of the entire codebase, asking it to fix a bug in Component A often silently wrecks state management or routing downstream in Component B.

It feels like building with AI actually shifts the founder's primary role from writing code to writing highly robust unit and integration tests. If you don't have a solid testing suite before letting Claude or Cursor loose on your repo, you're essentially building a house of cards.

Great write-up, Redion. Extremely timely for the current bootstrap landscape.

luhosoulducvm

·
2 months ago
·
Reply
1. 1
  
  Thanks! And yes, house of cards is the perfect way to put it. The speed AI gives you is real, but without tests to back it up you're just stacking faster. The founder's job shifts from writing code to making sure what got written actually holds together.
  
  redion
  
  ·
  2 months ago
  ·
  Reply
1

The "AI fixes what you tell it to fix, precisely that" line is the real insight. The problem isn't the code quality — it's that there's no single brain holding the full product context across sessions. That's always been a human job, AI just makes it easier to forget that.

Sandy_0517

·
2 months ago
·
Reply
1. 1
  
  Exactly, AI didn't create the problem, it just made it easier to skip the person whose job it was to prevent it.
  
  redion
  
  ·
  2 months ago
  ·
  Reply
1

One practical pattern that helps is to make the AI produce a risk map with every change: touched flows, adjacent flows likely affected, and 3-5 weird inputs to retest. Then run that as a manual checklist before asking for more code. It forces the builder to keep product-level context instead of treating each prompt as an isolated task.
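If it helps to picture the risk map as data, here is one possible shape. The `RiskMap` class and its field names are assumptions made for this sketch; the idea is only that whatever the AI lists gets turned into a flat manual checklist.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class RiskMap:
    """Hypothetical record of what one change might affect."""

    change: str
    touched_flows: List[str]
    adjacent_flows: List[str]
    weird_inputs: List[str] = field(default_factory=list)

    def checklist(self) -> List[str]:
        """Flatten the map into manual retest steps, touched flows first."""
        items = [f"Retest touched flow: {f}" for f in self.touched_flows]
        items += [f"Retest adjacent flow: {f}" for f in self.adjacent_flows]
        items += [f"Try input: {i}" for i in self.weird_inputs]
        return items
```

For a login fix, the map might list "login" as touched and "session handling" as adjacent, which is exactly the pair that regressed in the original post.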

fredbuilds

·
2 months ago
·
Reply
1. 1
  
  that's a smart habit, forcing the AI to surface its own blast radius before you move on. Most people skip straight to the next prompt. Taking 2 minutes to get a risk map first changes the whole dynamic from reactive fixing to deliberate shipping.
  
  redion
  
  ·
  2 months ago
  ·
  Reply
  1. 1
    
    Exactly. The useful part is making the model slow down and name the neighboring assumptions. Once those are visible, the founder can choose what is actually worth retesting before the next change.
    
    fredbuilds
    
    ·
    2 months ago
    ·
    Reply
    1. 1
      
      Yes half the value is just making the risk visible before you move. Once you can see what might be affected you can make a real decision about what to retest instead of just hoping nothing broke.
      
      redion
      
      ·
      a month ago
      ·
      Reply
      1. 1
        
        Yep. A small risk map turns QA from “test everything again” into “test the few places this change could realistically touch.” That’s the difference between slowing down and staying in control.
        
        fredbuilds
        
        ·
        a month ago
        ·
        Reply
1

the regression point is the one that doesn't get enough attention. everyone talks about whether AI can build the thing, but the harder question is whether the person directing it can catch when a fix in one place quietly breaks something else. that's not a prompting skill, it's a systems thinking skill — and it's the part most "vibe coders" skip entirely

Ozzie

·
2 months ago
·
Reply
1. 1
  
  Exactly, prompting is learnable in a weekend, systems thinking takes years. Knowing that fixing the login flow might touch session handling downstream isn't something you can prompt your way into. That is the gap that gets products into trouble and it's the hardest one to close quickly.
  
  redion
  
  ·
  2 months ago
  ·
  Reply
1

The biggest gap I see in AI-built apps is around real-device edge cases, not the first version of the UI. For Kinetic Override, AI can help with copy or scaffolding, but Android 15 accessibility flows, permissions wording, and gesture timing all need boring manual testing on actual phones.

herold33

·
2 months ago
·
Reply
1. 1
  
  Real device testing is something you cannot skip, emulators miss too much. Gesture timing, permission flows, screen size differences, they all behave differently on actual hardware. At QAura we always test on real devices across different screen sizes for exactly this reason. Some bugs only show up when a real person is holding a real phone.
  
  redion
  
  ·
  2 months ago
  ·
  Reply
1

This is why I don't think AI will replace developers anytime soon; it mostly shifts where the work happens.
Building the first version is dramatically faster now. Maintaining quality is not.
The scary part is that many founders mistake "it works" for "it's production-ready." The happy path works, the demo looks great, and then real users start doing unexpected things.
I've seen AI generate features in minutes that would have taken days to build manually, but a single overlooked edge case can destroy user trust faster than any missing feature.
My takeaway: AI is becoming a force multiplier for development, but QA, product thinking, and system design are becoming even more valuable.
When you audit AI-built products, what's the most common issue you find that founders don't even realize is a problem?

Vu_Tram

·
2 months ago
·
Reply
1. 1
  
  I agree, AI will not replace developers, though it may affect junior devs. I have not seen fewer dev or QA jobs since the AI boom; if anything, more are showing up daily.
  
  The most common issue founders don't realize is a problem is silent success: the app appears to complete an action but nothing actually happened. No error, no feedback, just a spinner that stops and a user who doesn't know if their data was saved, their payment went through, or their form submitted. Founders test the happy path and it works, so they ship.
  
  redion
  
  ·
  2 months ago
  ·
  Reply
2. 1
  
  Good points here. Security issues seem to be quite common with AI-built products. And unlike an edge case where a flawed input breaks a single page, by the time founders realize there is a security issue, it is already a major vulnerability.
  
  padoru
  
  ·
  2 months ago
  ·
  Reply
1

The regression-stacking point is the one I felt most. Early on I'd send a fix back to Claude Code and it would do exactly that, fix the one thing. And quietly break something adjacent I didn't think to re-check. What helped was treating the AI as something I don't trust by default: I started freezing the core modules and running an audit pass after every substantial change rather than only at the end, so a fix couldn't silently move something downstream without me seeing it. The "nobody's holding the whole product in their head" line is exactly it. The model checks precisely what you point it at, so the value isn't the code, it's being the one person who's suspicious of all of it.
Curious where you draw the line between what you automate in a QA pass and what genuinely needs a human who understands the product's intent. Because some of the worst bugs I've hit weren't broken code, they were code working perfectly as told, but toward the wrong thing.

tiagooliveira

·
2 months ago
·
Reply
1. 1
  
  For automation I keep it to the critical paths only, the flows that would hurt most if they broke. Everything else I test manually and deliberately loose, no fixed inputs. Same form test might get a long string one day, empty fields another, special characters when I'm feeling creative.
  Exploratory testing with no script is where the interesting bugs live. Automation tells you what broke; loose manual testing finds what was never right to begin with.
  
  redion
  
  ·
  2 months ago
  ·
  Reply
1

Great write-up. This week I brought up the topic of vibe coding with one of my professors, and she told me that for another class they had been using an app built with vibe coding that had all those issues plus security issues. In my case, I'm building an app using Claude and I hadn't considered edge cases.

Josuefc

·
2 months ago
·
Reply
1. 1
  
  That's a really common pattern, the app looks and feels complete so nobody thinks to question it until something breaks in front of a real user.
  The good news is edge cases are the easiest thing to add once you're aware of them. Before your next release just ask yourself 'what happens if the user does something unexpected here' for each core flow, empty inputs, bad data, slow connections. That one habit catches a lot.
  Good luck with the build, and if you need some help, ping me and I can take a quick look.
  
  redion
  
  ·
  2 months ago
  ·
  Reply
1

This matches what I've seen while building with AI tools.

AI is incredibly good at getting you from idea to MVP fast, but it doesn't automatically replace engineering discipline. The first 80% of development becomes much faster, while the last 20% (testing, edge cases, error handling, and consistency) still requires careful human review.

One thing I'd add is that AI-generated code isn't inherently lower quality. The issue is that many founders now skip the QA and testing phases because AI makes shipping so fast. If you treat AI as a junior developer rather than a complete engineering team, the results are much better.

Speed is no longer the bottleneck. Reliability is.

Johin

·
2 months ago
·
Reply
1. 1
  
  Speed is no longer the bottleneck, reliability is, that's the whole post in one sentence honestly.
  And the junior developer framing is exactly right. You wouldn't ship code from a junior dev without review, the same discipline applies.
  The problem is AI makes it feel like you already have a senior engineer on the team, so the review step gets skipped. That's where things go wrong
  
  redion
  
  ·
  2 months ago
  ·
  Reply
1

Spot on write-up, Redion! The point about AI missing silent crashes or empty field submissions is incredibly accurate. It really requires a human eye to anchor the context and force error-handling variables.
I’ve been building out a lightweight developer/creator utility suite over at novacrypttcom using a mix of AI assistance and manual frontend refactoring. I'm trying to keep it completely zero-signup and client-side focused. I'd love it if you gave the invoice tool or text styler a quick run-through to see what edge cases I might have missed!

JDS_NOVACRYPTT_TOOLS

·
2 months ago
·
Reply
1. 1
  
  Took a quick look, nice concept with the zero-signup approach. A few things I spotted:
  
  The copy notification is rendering the HTML entity as raw text: it shows the entity code instead of the actual checkmark emoji (✅).
  There are empty boxes under the Font App and Instant Video Uploader sections — looks like content or embeds not loading.
  The Premium High Speed Server link is broken.
  Happy to do a more thorough pass if useful, this is exactly the kind of thing a quick QA sweep catches before real users hit it.
  
  redion
  
  ·
  2 months ago
  ·
  Reply
1

Very fascinating post about the 40 issues and the critical ones. Edge cases. Fascinating. Wow, so the fixes broke other features? That's terrifying. I really appreciate your post from yesterday, the 9th of June, and may be interested in chatting. Kevin

listen4wisdom

·
2 months ago
·
Reply
1. 1
  
  Thanks Kevin, glad it resonated! The regression pattern is one of the more surprising things to witness, a clean fix quietly breaking something adjacent with no errors or warnings. Happy to chat anytime, feel free to DM me here or you can reach me directly at qaura.io
  
  redion
  
  ·
  2 months ago
  ·
  Reply
1

As a developer building with similar tools, this resonates a lot.

AI helps with implementation speed, but architecture thinking, edge cases, and consistency across flows still require human ownership.

naomihub

·
2 months ago
·
Reply
1. 1
  
  Exactly, AI owns the how, humans still own the what and the why. Speed without that ownership layer is just shipping risk faster
  
  redion
  
  ·
  2 months ago
  ·
  Reply
1

The line that matters here is the last one: nobody was holding the whole product in their head. That's not an AI failure, it's a missing role. In every shop I ran, the person who asks "what happens when this breaks" was QA or ops, and that's the exact seat vibe-coders don't fill. The model ships the happy path because the happy path is all anyone described to it.

I'd push a little on "AI has no memory across prompts." That's getting less true every quarter with better context and agents. The durable problem isn't memory, it's ownership: someone has to hold the system as a whole and think in failure modes, and that judgment doesn't fall out of a prompt.

Business angle for you, since you do this for a living: the regression problem is your recurring revenue. A one-time audit is a project. "Re-test every release because AI fixes break things downstream" is a retainer. I'd package it exactly that way, a pre-launch audit plus a standing regression pass, because the pain you're describing repeats on every single ship.

GregoryScottHenson

·
2 months ago
·
Reply
1. 1
  
  The retainer framing is exactly right and honestly the model I'm already building toward at QAura, audit to find what's broken, then a standing regression pass so it stays fixed as you keep shipping. The pain repeats every sprint so a one-time check is never really enough.
  On the memory point and the reply below, agreed tooling is catching up, but the judgment layer is still the gap. A QA agent validating edge cases needs someone to define what the edge cases are first. That ownership problem doesn't get solved by better context windows
  
  redion
  
  ·
  2 months ago
  ·
  Reply
2. 1
  
  "AI has no memory across prompts." There are dozens of solutions for this, and this is where the mindless prompter who just wants an app and somebody who knows what they are doing work differently.
  
  I just can't believe that a properly set up QA agent would not validate edge cases.
  
  PatientConfidence69
  
  ·
  2 months ago
  ·
  Reply
1

This mirrors my experience almost exactly. I've been building two apps with Rork (similar to Cursor but mobile-first) and the edge case problem is real — the AI builds precisely what you describe, not what you didn't think to describe.
The regression issue caught me badly. A fix to my payment screen broke the IAP flow downstream, which I only discovered during App Store review — not ideal when you're on build 10 of 10 rejections.
Your point about "AI has no memory across prompts, nobody's holding the whole product in their head" — that's the one I'd add to. The solution I found was treating each session like a handoff: summarise the full state before every prompt, not just the problem you're fixing. Slows you down but kills regression.
Shipping on June 23rd — happy to share what else broke if useful.

metaljelly

·
2 months ago
·
Reply
1. 1
  
  App Store review finding your IAP regression on build 10 is painful, that's exactly the kind of adjacent breakage that a pre-submission QA pass would have caught.
  The session handoff approach is smart, essentially forcing the AI to hold the full context before touching anything. Slows things down but that's the cost of not having the mental model natively.
  Would genuinely be curious what else broke. Good luck with the June 23rd launch, and feel free to tag me.
  
  redion
  
  ·
  2 months ago
  ·
  Reply
  1. 1
    
    Thanks, appreciate that — and yeah, happy to share what else broke.
    Just today actually: PrOpinion's AI feature went completely silent — no crash, just a generic "couldn't analyse your situation" fallback. Spent ages checking the app code before realising the actual problem was server-side: the Anthropic API key stored in Supabase had gone bad, returning a 401 that the app quietly swallowed. Edge function logs gave it away in seconds once I thought to check there.
    Similar story on Wump — the paywall screen was just... unresponsive. Buttons did nothing. Turned out to be RevenueCat's getOfferings() returning null with zero error handling, so the UI just sat there looking broken with no indication why.
    Pattern I'm noticing: with AI-built apps, the failure mode is rarely a crash — it's a silent fallback or a dead UI, and the real error is sitting in a log you have to know to check.
    Will definitely tag you if more turns up before June 23rd. Good luck with your build too.
    
    metaljelly
    
    ·
    a month ago
    ·
    Reply
    1. 1
      
      The silent fallback pattern is so common and exactly why 'it works' isn't enough. Both of those bugs would have looked fine in testing, no crash, no obvious error, just a dead UI or a vague message. The logs had the answer but only if you know to check them. That's the kind of thing that needs to be in every pre-launch checklist: force errors to be loud, never swallow them silently. Good catches, and yes when all is ready and you need some extra eyes I am more than happy to help.
      
      redion
      
      ·
      a month ago
      ·
      Reply
1

The edge-case point hits hardest. I've shipped enough AI-built code to know the failure mode: "appears to succeed while doing nothing." That one is the silent killer because there is no failing test, no log, no error, just a happy user looking at stale state. I now treat "what does the failure path look like?" as a mandatory step before merging anything Claude wrote.
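A rough sketch of what asking about the failure path can mean in code: confirm the write instead of assuming it, and raise when it can't be confirmed. `MemoryStore`, `DroppingStore`, `SaveError` and `save_profile` are hypothetical names for illustration, standing in for a real backend and API:

```python
class SaveError(Exception):
    """Raised whenever a save cannot be confirmed."""


class MemoryStore:
    """Toy backend standing in for a real database (hypothetical)."""

    def __init__(self):
        self.data = {}

    def put(self, key, value):
        self.data[key] = value

    def get(self, key):
        return self.data.get(key)


class DroppingStore(MemoryStore):
    """Same interface, but silently ignores writes, like a swallowed error."""

    def put(self, key, value):
        pass


def save_profile(store, user_id, name):
    """Save a name and confirm it landed; raise instead of faking success."""
    if not name.strip():
        raise SaveError("name is empty")
    store.put(user_id, name)
    # Read back rather than assuming the write worked.
    if store.get(user_id) != name:
        raise SaveError("write was not persisted")


save_profile(MemoryStore(), "u1", "Ada")  # succeeds quietly
try:
    save_profile(DroppingStore(), "u1", "Ada")
except SaveError as exc:
    print("loud failure:", exc)
```

A read-back check is only one way to do this and costs an extra round trip; the part that matters is that the caller gets an exception rather than a success that did nothing.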

worvi26

·
2 months ago
·
Reply
1. 1
  
  Making failure paths mandatory before merge is the right instinct, it's the question the AI never asks itself. 'Appears to succeed while doing nothing' is the hardest class of bug to catch precisely because everything looks fine until someone notices the data is wrong
  
  redion
  
  ·
  2 months ago
  ·
  Reply
1

This is a really valuable audit. The regression point hits home — AI fixes exactly what you tell it to, and nothing more. I've been building a web scraping platform with AI assistance and ran into the same thing: a fix in one parser silently broke edge cases in another. The "appears to succeed while doing nothing" failure mode is the worst because you don't catch it until a user does. Good reminder to always test the full flow after any fix, not just the reported issue.

hollywoodoo

·
2 months ago
·
Reply
1. 1
  
  The silent success is the worst failure mode, no error, no signal, just wrong data quietly flowing through.
  Full flow testing after every fix is the habit that catches it, even when it feels like overkill at the time.
  
  redion
  
  ·
  2 months ago
  ·
  Reply
1

The regression point hit hard. AI fixes exactly what you point at and nothing else - no one is holding the full mental model of the product, so every fix is potentially introducing a blind spot somewhere else.
Built zer0email almost entirely with AI tools over the past 1.5 months and the edge case problem is real. The happy path gets built perfectly. It's the "what if the user does something unexpected" scenarios that slip through every time because you have to explicitly think of them first before the AI can help.
The consistency across error states one is underrated too. Each individual decision looks fine in isolation, but when a user hits 3 different error patterns in one session it quietly destroys trust without them even being able to articulate why.
Good reminder to actually QA properly before the next push.

nayan_joshi

·
2 months ago
·
Reply
1. 1
  
  The 'quietly destroys trust without them being able to articulate why' line is exactly it — users don't file bug reports about inconsistent error states, they just leave.
  Good luck with zer0email, 1.5 months of AI-assisted building is a lot of surface area to cover before launch.
  
  Also took a quick look at the website, and it looks great, especially the home page.
  A small suggestion: if you don't have a blog yet, it's better to hide the section or leave it empty rather than showing a test post.
  Right now, clicking on the post gives an error: Application error: a client-side exception has occurred while loading zer0email.com (see the browser console for more information).
  
  redion
  
  ·
  2 months ago
  ·
  Reply
1

Proper unit and end‑to‑end tests wouldn't magically give the AI a system model, but they do turn a lot of these “fixed locally, broke globally” failures you mentioned into cheap, deterministic catches... especially those around auth, session lifecycles, and network failure modes that look like they never crash but they silently corrupt the state.

In these AI-heavy codebases I've found the sweet spot is: architecture-first (clear boundaries), plus a small but ruthless E2E suite on critical paths, and property-style+unit tests like “this user can never see that user's data”...
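As a sketch of that kind of property-style check, here is a toy version using only the standard library. `RECORDS` and `visible_records` are hypothetical stand-ins for a real data-access layer; a real suite would point the same property at the actual query code:

```python
import random

# Hypothetical in-memory rows: (owner_id, payload).
RECORDS = [(owner, f"note-{n}") for n, owner in enumerate(["alice", "bob", "carol"] * 4)]


def visible_records(requesting_user: str) -> list[tuple[str, str]]:
    """Stand-in for the data-access function under test."""
    return [row for row in RECORDS if row[0] == requesting_user]


def check_isolation(trials: int = 200, seed: int = 0) -> None:
    """Property: no query ever returns a row owned by someone else."""
    rng = random.Random(seed)
    # Includes an unknown user and an empty id, not just valid accounts.
    users = ["alice", "bob", "carol", "mallory", ""]
    for _ in range(trials):
        user = rng.choice(users)
        for owner, _payload in visible_records(user):
            assert owner == user, f"{user!r} saw data owned by {owner!r}"


check_isolation()
```

The property is stated once and exercised with many requesters, which is what separates it from a single hand-picked unit test.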

As for the "memory across prompts" point, that is becoming more and more arguable; I'm not sure I fully agree with it anymore. That said, I think we'll still see a lot of the things you mentioned, though it's definitely getting better with time.

pmreis

·
2 months ago
·
Reply
1. 1
  
  Totally agree, for anything beyond a one-off audit I always push for a small E2E suite on the critical paths. Nothing complex, just the flows that would hurt most if they broke. Flaky tests are worse than no tests so keeping it tight and reliable matters more than coverage numbers.
  
  redion
  
  ·
  2 months ago
  ·
  Reply
1

The regression issue you described is the one that resonates most with me. "AI fixes what you tell it to fix. Precisely that, nothing more." — this is the core problem.

The reason it happens: LLMs have no persistent model of the system. Each prompt is stateless. So when you ask it to fix a login flow, it has no awareness that session handling downstream depends on the exact shape of what it just changed. A human engineer holds that dependency graph in their head (or in their docs). The LLM doesn't.

What's interesting is this isn't really a quality problem — it's an architecture problem. The output is only as coherent as the context window you feed it. Most people feed it feature requests, not system models.

The audit approach you're describing — catching these before real users hit them — is actually filling the gap that the missing system model creates. Good service.

anioko1

·
2 months ago
·
Reply
1. 1
  
  Honestly, you articulated it better than I did in the post.
  
  'The output is only as coherent as the context window you feed it' is the cleaner framing.
  Most people treat the AI as a feature factory and wonder why the system drifts: the input was never a system model to begin with. The audit fills the gap, but you're right, it's really compensating for a missing architecture layer.
  
  The interesting question is whether that layer ever gets built into the tooling or stays a human responsibility indefinitely. My bet is it stays human for a while yet.
  
  redion
  
  ·
  2 months ago
  ·
  Reply
1

May I just drop it here? I am building a frontend framework called rikka with vibe coding. You can find it on npm with @takanashi/rikka-signals or github:yw662/rikka

takanashi

·
2 months ago
·
Reply
1. 1
  Took a look and it's a clean project — the signals approach and LLM-friendly docs are a nice touch. Quick observations from the playground:
  
  Fullscreen button opens at the same size rather than going truly Fullscreen. Also the icon doesn't toggle state — it should switch to a compress icon when in Fullscreen mode.
  
  In light mode some buttons disappear — the minus button on the counter example blends into the background. Contrast issue worth a pass.
  
  Decimal inputs produce inconsistent precision in the preview output — some results show 2 decimals, others show 9+, and longer values overflow the preview area on smaller screens.
  Happy to dig deeper if useful.
  redion
  
  ·
  2 months ago
  ·
  Reply
1

It’s fascinating to see the strengths and pitfalls of AI in app development, especially with edge cases. How did the founder handle the feedback? I'm curious if they considered incorporating human oversight in the QA process post launch to mitigate these issues.

Kadiri

·
2 months ago
·
Reply
1. 1
  
  They took it well, honestly. The whole reason they hired QAura before launch was that they already had a gut feeling that something might be off.
  
  The go/no-go framing helped too, it wasn't a list of failures but a clear picture of what needed fixing before real users showed up.
  
  Post-launch they're more deliberate about testing adjacent features after any fix, which is the main habit shift.
  
  redion
  
  ·
  2 months ago
  ·
  Reply
1

This hits close to home. I built my app mostly with AI help and the edge cases were my biggest headache too — especially network timeouts for async AI calls and consistent error handling.
One thing that helped: treating each AI session like onboarding a new dev. Full context every time, even if it feels repetitive.
The regression issue is real and ongoing, honestly.
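For the timeout part, a minimal sketch with `asyncio.wait_for`, assuming the AI call is wrapped in a coroutine. `call_with_timeout`, `AICallFailed`, `slow_model` and `fast_model` are made-up names for illustration, not any provider's API:

```python
import asyncio


class AICallFailed(Exception):
    """Single error type so the UI can handle every AI failure the same way."""


async def call_with_timeout(make_call, timeout_s: float = 10.0, retries: int = 1):
    """Await make_call() with a deadline; retry on timeout, then fail loudly.

    make_call is any zero-argument coroutine function (hypothetical here).
    """
    last_error = None
    for _attempt in range(retries + 1):
        try:
            return await asyncio.wait_for(make_call(), timeout=timeout_s)
        except asyncio.TimeoutError as exc:
            last_error = exc
    raise AICallFailed(
        f"no response within {timeout_s}s after {retries + 1} attempts"
    ) from last_error


async def slow_model():
    """Simulates a model call that hangs longer than the deadline."""
    await asyncio.sleep(1)
    return "answer"


async def fast_model():
    """Simulates a model call that returns promptly."""
    return "answer"


print(asyncio.run(call_with_timeout(fast_model, timeout_s=0.5)))
```

Raising one explicit error type after the last retry is what gives the error-handling layer something consistent to show the user, instead of a spinner that never resolves.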

ikoft1

·
2 months ago
·
Reply
1. 1
  
  The 'onboard a new dev every session' framing is a good mental model, it sets the right expectations about what the AI does and doesn't carry forward. And async AI call timeouts are a particularly nasty edge case.
  
  Sounds like you've been through the full gauntlet building this. What's the app if you don't mind sharing?
  
  redion
  
  ·
  2 months ago
  ·
  Reply
1

of the 12 critical issues, curious how many were things the founder could have caught themselves with basic testing versus things that required domain knowledge about edge cases to even think to test. asking because there's a difference between 'this app had no QA process at all' and 'this app had thoughtful manual testing but still missed these specific failure modes.' the first is a process problem and the second is an expertise problem and they have different solutions

adin_builds

·
2 months ago
·
Reply
1. 1
  
  Honest breakdown: 4-5 were things basic testing would catch, like empty states, broken flows at specific screen sizes, and buttons not working under simple conditions.
  The other issues were more complex and needed someone with expertise.
  
  But the problem with solo devs is that after some time they know the product so well that they think only of the happy paths and miss the edge cases.
  
  redion
  
  ·
  2 months ago
  ·
  Reply
  1. 1
    
    the 4-5 basic ones being catchable with a simple checklist is actually useful information for founders reading this. a short 'test these things before launch' list covering empty states, form edge cases, and mobile breakpoints would catch almost half the critical issues without needing an external audit. the other half is where expertise matters. those are different problems and probably worth separating in how you position QAura
    
    adin_builds
    
    ·
    a month ago
    ·
    Reply
    1. 1
      
      Yes, sure, having a smoke checklist with edge cases helps founders a lot to catch issues before an external audit or real users do. The 'problem' is that after building for a while, founders (solo founders mostly) think only of the happy paths and forget the edge cases. That's where an external audit, especially before the first launch, is essential.
      
      redion
      
      ·
      a month ago
      ·
      Reply
1

One QA habit I like for AI-built products is a small failure ledger next to the prompt/spec: every bug gets logged as an invariant the product must now protect, not just as a one-off fix. Example: network failures never look successful, empty states are explicit, auth changes require session smoke tests.

That gives the AI something durable to reason against on the next pass, and it turns regression testing into a growing map of known weak spots. It is boring, but for solo builders it catches the exact class of fixed locally, broke globally issues you're describing.
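A minimal sketch of the ledger as data, with one line per invariant plus a check that can run against an observed state. The `app_state` keys are hypothetical and only there to make the three example rules above executable:

```python
from typing import Callable

# Each entry: a one-line product invariant and a check over observed state.
# All dictionary keys are illustrative, not from a real app.
LEDGER: list[tuple[str, Callable[[dict], bool]]] = [
    ("Network failures never look successful",
     lambda s: not (s["request_failed"] and s["ui_shows_success"])),
    ("Empty states are explicit",
     lambda s: s["item_count"] > 0 or s["empty_message_shown"]),
    ("Auth changes require session smoke tests",
     lambda s: not s["auth_changed"] or s["session_smoke_tested"]),
]


def violated(app_state: dict) -> list[str]:
    """Return the invariants that the observed state breaks."""
    return [rule for rule, holds in LEDGER if not holds(app_state)]


observed = {
    "request_failed": True,
    "ui_shows_success": True,
    "item_count": 0,
    "empty_message_shown": True,
    "auth_changed": False,
    "session_smoke_tested": False,
}
for rule in violated(observed):
    print("BROKEN:", rule)
```

The rule text alone can be pasted into the next prompt or spec; the checks are optional and only worth writing for invariants that keep getting broken.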

fredbuilds

·
2 months ago
·
Reply
1. 1
  
  That is really an elegant pattern, turning bugs into invariants rather than just closing them out. It solves the AI memory problem in a practical way by externalizing the product's 'rules' into something you can feed back into the next prompt.
  
  Going to steal this honestly. For solo builders this might be the highest-leverage QA habit that doesn't require any tooling but just discipline.
  
  redion
  
  ·
  2 months ago
  ·
  Reply
  1. 1
    
    Steal away. The nice thing about the ledger is that it forces the lesson to survive the bug report.
    
    The version that’s worked best for me is one line per rule, written as a product invariant instead of a task. Much easier to reuse during the next review pass.
    
    fredbuilds
    
    ·
    2 months ago
    ·
    Reply
    1. 1
      
      One line forces you to actually know what the rule is. If you can't write it in one line, you probably haven't fully understood the bug yet.
      
      redion
      
      ·
      2 months ago
      ·
      Reply
1

We hit the regression problem hard building TetherClaw — agents fix what you tell them to fix, nothing more. I've dealt with this many times through the building process and it always comes back up. Had four bugs surface on launch day, all same-day fixes, but two of them were exactly this: a change in one place broke behavior somewhere adjacent that wasn't obvious.
The "nobody holding the whole product in their head" line is the honest cost of building this way. You learn to compensate for it — more specs, more review, more verification steps — but it never fully goes away.

jonathanrodrigez

·
2 months ago
·
Reply
1. 1
  
  The compensation strategies you mention are real but you're right, they reduce the risk but they don't eliminate it.
  
  The best you can do is build habits that partially substitute for that missing context, specs, checklists, a human doing a full pass before shipping. How's TetherClaw holding up post-launch?
  
  redion
  
  ·
  2 months ago
  ·
  Reply
1

the consistency point is the one people miss most. AI treats each prompt as a fresh context, so it makes local decisions that are individually fine but globally incoherent. same error, three different UI patterns.

this is why the 'anyone can build with AI' framing is misleading. anyone can generate code. the skill that matters is being able to look at the output and know whether it's actually ready to ship — which requires the domain knowledge the AI doesn't have.
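on the consistency point, a rough sketch of one way to stop the "same error, three UI patterns" drift: a single table that decides how each class of error is shown, which every screen calls. `ErrorKind`, `PRESENTATION` and `present_error` are made-up names, and the pattern labels ("inline", "modal") are just placeholders:

```python
from enum import Enum


class ErrorKind(Enum):
    NETWORK = "network"
    VALIDATION = "validation"
    AUTH = "auth"


# One table decides presentation, so every feature shows the same pattern.
PRESENTATION = {
    ErrorKind.NETWORK: ("inline", "Connection problem. Check your network and retry."),
    ErrorKind.VALIDATION: ("inline", "Please fix the highlighted fields."),
    ErrorKind.AUTH: ("modal", "Your session expired. Please sign in again."),
}


def present_error(kind: ErrorKind) -> dict:
    """Every screen calls this instead of choosing its own error UI."""
    pattern, message = PRESENTATION[kind]
    return {"pattern": pattern, "message": message}


print(present_error(ErrorKind.NETWORK))
```

with something like this in the repo, the prompt can say "use present_error" instead of leaving each feature to make its own local decision.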

Ozzie

·
2 months ago
·
Reply
1. 1
  
  Exactly, generating code and shipping software are two different things. While AI can generate great code, it still needs a human touch to ship software.
  
  redion
  
  ·
  2 months ago
  ·
  Reply
1

This resonates a lot.

I’ve been building a mental performance app called VEXIS mostly through AI-assisted workflows, and the regression point is very real.

One thing I noticed quickly was that AI can build features fast, but it doesn’t naturally protect architecture boundaries unless you explicitly force those constraints into the process.

We started introducing isolated systems, migration readiness checklists, static preview boundaries, and stricter separation between production/session/audio logic because fixes in one area kept creating unexpected side effects elsewhere.

The speed is incredible, but I’m realising the real skill isn’t prompting anymore, it’s product thinking, systems thinking, and knowing what should not change.

Curious how you approach QA for AI-built products that are evolving rapidly.

vexis_tek93

·
2 months ago
·
Reply
1. 1
  
  The isolated systems and the separation between production/session/audio logic are exactly the kind of architecture that makes QA work at speed. Without those boundaries every audit starts from scratch.
  
  For rapidly evolving AI-built products I focus on locking down the critical paths first (auth, core user flows, data integrity) and treating those as the non-negotiable regression suite. Everything else can move fast as long as those hold.
  
  Happy to take a look at VEXIS if you ever want a fresh set of eyes on it.
  
  redion
  
  ·
  2 months ago
  ·
  Reply
1

My experience lines up with the regression point. What's helped most is spending extra time upfront on the architectural pieces, auth, agent workflows, the things everything else leans on. Get those solid early and adding features later does far less collateral damage to what already works.

The other habit that feels dumb but has been necessary for me: making the AI review its own work constantly. The catch is you can't do it blindly, or it starts "fixing" things into an over-engineered mess. Same with planning. I now spend real time reviewing the plan before letting it implement, because a bad plan caught late is where most of my regressions came from.

RaymondNL

·
2 months ago
·
Reply
1. 1
  
  Architecture first is so underrated, when auth or core flows are shaky, every new feature is just regression waiting to happen.
  The self-review habit is interesting though, it only works if you already know what good looks like. Otherwise you're just asking it to validate its own assumptions, which it will do confidently and wrong.
  And yeah, AI is extremely good at implementing a bad plan perfectly.
  
  redion
  
  ·
  2 months ago
  ·
  Reply
1

The edge case problem is the one I keep seeing too. I've built most of Genie 007 with AI assistance over 15 months, and the pattern is consistent: AI writes code that handles the happy path perfectly. The failure modes it misses are almost always the ones you listed. What helped me was writing test cases for each feature before asking AI to build it, then running those scenarios after. Doesn't eliminate regression but it cuts the critical issues significantly.

AmandaBrown

·
2 months ago
·
Reply
1. 1
  
  15 months of using AI to create Genie 007 is solid experience; you have probably seen every pattern I mentioned.
  The test cases before building approach is smart, essentially forcing the spec to include failure modes upfront rather than discovering them after.
  Curious how Genie 007 is holding up now at 15 months in, are you finding regression getting harder to manage as the codebase grows, or does the test-first habit keep it contained?
  
  redion
  
  ·
  2 months ago
  ·
  Reply
  1. 1
    
    Honest answer: yes, regression gets harder. But differently than I expected.
    
    The test-first habit keeps the core logic solid. What it doesn't protect against is the injection layer. Genie 007 injects into live web pages, and sites update their DOM constantly. A site redesign can break something no test anticipated because we never tested against that specific new layout.
    
    What's helped most: 4 or 5 real sites I manually check before every release. Takes about 10 minutes. Catches the integration failures that unit tests can't see.
    
    The codebase is actually more manageable now than at month 8 or 9, which surprised me. Consistent patterns started paying off around month 12.
    
    AmandaBrown
    
    ·
    a month ago
    ·
    Reply
    1. 1
      
      The DOM injection problem is a whole different class of regression: you can't test against a site redesign that hasn't happened yet. The manual spot check on real sites before every release is the right call; no test suite replaces that. And good to hear the consistency pays off around month 12. That's the kind of thing that's hard to see when you're in the middle of it.
      
      redion
      
      ·
      a month ago
      ·
      Reply
      1. 1
        
        Exactly this on the site redesign point. The DOM injection test is essentially a snapshot test, and snapshots go stale every time the target changes its structure. You can automate what you know, but you can't automate what you don't know is coming. The month 12 thing is real, by the way. That's when you've survived enough edge cases that you start seeing patterns instead of just fires.
        
        AmandaBrown
        
        ·
        a month ago
        ·
        Reply
1

The regression-after-fix pattern worries me most in AI-only codebases. AI fixes exactly what you tell it to fix and nothing else; you nailed it. One thing I've noticed when I dictate prompts into Claude Code with DictaFlow instead of typing them is that speaking forces me to be concise. Typing tends to produce over-specific prompts that still miss edge cases. But when you speak the intent naturally, the prompt gets shorter and the AI has to fill in more, which is better or more dangerous depending on who's reviewing the output. Either way, being the human who holds the whole product in their head is still the job that matters most.

ryanshrott

·
2 months ago
·
Reply
1. 1
  
  That's really interesting, never thought about it that way but it makes sense.
  Typed prompts tend to be over-engineered on the feature and under-specified on the failure modes, whereas spoken intent is higher level and forces the AI to interpret more. Both paths still land in the same place though: someone needs to verify what actually got built matches what was intended, and that the interpretation didn't quietly skip the hard parts.
  
  redion
  
  ·
  2 months ago
  ·
  Reply
1

Using AI tools requires developers to have strong logical thinking skills, and I've encountered many pitfalls along the way.

ahui112233

·
2 months ago
·
Reply
1. 1
  
  100% — AI amplifies what you already know. Strong fundamentals mean you catch the gaps before they ship. Without them, the tool just builds the wrong thing faster and more confidently.
  
  redion
  
  ·
  2 months ago
  ·
  Reply
1

That's the type of issue we, the programmers from before AI, used to fix on our own, because we broke it on our own.

I think that's why we know where to look and what to focus on, whereas people coming into the industry and shipping products fully with AI are not aware of the complexity of "simple" features like login, sessions, and authentication.

From their perspective, it just works. They've done it many times when browsing the web. But building something that doesn't break takes countless iterations.

No hating on vibe coders though, pure love for all. I am actually using AI more than ever now, and I know how hard it is to build something solid from scratch.

Peter_Karpinsky

·
2 months ago
·
Reply
1. 1
  
  Yes exactly, there is something about having personally caused a session bug at 2am that makes you never forget to test it again.
  
  You learn where to look by having broken it first. And you nailed it: good auth is invisible. Nobody ever says 'wow, logging in worked great today.' They only notice when it doesn't.
  
  No hate on vibe coders from me either; the speed they can ship is genuinely impressive. The experience gap is just real, and that's where a second pair of eyes helps.
  
  redion
  
  ·
  2 months ago
  ·
  Reply
1

solid breakdown, the regression one especially. there's a sibling to your edge cases a QA pass can miss: the security bugs that never crash or throw. the AI builds a route that returns the right data in the demo and never checks the caller owns it, so changing one id in the url hands back someone else's record. same root cause you named. the question nobody prompted was who this should refuse, not just what could break it. that login-and-session area you flagged is usually where those access bugs sit too.
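A rough sketch of the gap described above, using an in-memory dictionary in place of a real database. Every name here (`RECORDS`, `get_record_unsafe`, `get_record`, the user ids) is hypothetical. The first function is the happy-path route that works in the demo; the second adds the one question nobody prompted, which is whether the caller owns the record:

```python
# In-memory stand-in for a database table; all names are hypothetical.
RECORDS = {
    101: {"owner_id": "alice", "body": "alice's invoice"},
    102: {"owner_id": "bob", "body": "bob's invoice"},
}


def get_record_unsafe(caller_id, record_id):
    """Happy-path route: returns whatever id was asked for, ignoring the caller."""
    record = RECORDS.get(record_id)
    if record is None:
        return (404, None)
    return (200, record["body"])


def get_record(caller_id, record_id):
    """Same lookup, plus an ownership check before anything is returned."""
    record = RECORDS.get(record_id)
    # Answer 404 for both "missing" and "not yours" so ids cannot be probed.
    if record is None or record["owner_id"] != caller_id:
        return (404, None)
    return (200, record["body"])
```

With the unsafe version, `get_record_unsafe("alice", 102)` hands Alice Bob's invoice with a clean 200, which is why nothing crashes and no casual test notices. Returning 404 rather than 403 for records owned by someone else is a design choice in this sketch, not a rule.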

chalermpon

·
2 months ago
·
Reply
1. 1
  
  You are right, and those are the scariest part because they are silent issues: no crash, no errors, nothing to show that something is wrong.
  The authorization gaps are particularly common in AI-built apps because the prompt usually describes what the feature should do, not who it should refuse.
  I did flag access control issues in that audit but kept them out of the post to avoid getting too technical. The session/login area was exactly where they happened.
  
  redion
  
  ·
  2 months ago
  ·
  Reply
  1. 1
    
    That 'who it should refuse' line is the whole bug in one sentence. The prompt says build the feature, it never says deny everyone else, so the agent ships the happy path and stops. Login is just where it surfaces first, since that's the one place who-are-you gets decided.
    
    chalermpon
    
    ·
    2 months ago
    ·
    Reply
    1. 1
      
      Exactly, the agent optimizes for making it work, not for making it work only for the right person. Those are very different problems and only one of them gets prompted.
      
      redion
      
      ·
      2 months ago
      ·
      Reply
      1. 1
        
        Yeah. It is almost a separate pass from QA for that reason. Edge-case hunting asks what breaks, this one asks who you would let through. Same app, different muscle. Curious whether you fold that into the QAura pass or treat it as its own thing.
        
        chalermpon
        
        ·
        2 months ago
        ·
        Reply
        
        1
        
        Mostly fold it in, it's on my standard checklist. For apps handling payments or sensitive data I treat it as a separate pass though, since it needs a different mindset from edge-case testing.
        
        redion
        
        ·
        a month ago
        ·
        Reply
        
        1
        
        Yeah, payments and sensitive data is where I'd draw that line too. The separate pass is really a separate question. Edge cases ask what breaks, access asks who you'd let through, and an app can pass every crash test and still fail that one cold. Curious how QAura runs that pass. Is it more of a checklist, or does it replay a request as the wrong user and see what comes back?
        
        chalermpon
        
        ·
        a month ago
        ·
        Reply
        
        1
        
        The access control pass is mostly manual and exploratory: I replay requests as a different user, tamper with IDs in URLs, and test what happens when you access a route directly without going through the normal flow. More 'what can I get away with' than a fixed checklist.
        
        As for how QAura works overall: first I do a full exploratory audit with minimal context on purpose, that way I catch not just bugs but UX weak spots and things that feel off to a real user. Then up to two rounds of regression around the fixes. After that some clients keep us on retainer for weekly checks or a pass before each release.
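The replay-as-the-wrong-user check described above is manual in practice, but the core loop can be sketched in a few lines. This assumes a hypothetical `handler(user, record_id)` interface that returns an HTTP-like status code; it is not how QAura actually runs the pass. The two example handlers are stand-ins for a real app:

```python
def find_access_leaks(handler, owners, users):
    """Replay every record request as every non-owner; collect the ones that succeed.

    handler(user, record_id) returns an HTTP-like status code (hypothetical interface).
    owners maps record_id to the user who owns that record.
    """
    leaks = []
    for record_id, owner in sorted(owners.items()):
        for user in users:
            if user == owner:
                continue  # The owner is allowed; only wrong-user requests matter here.
            if handler(user, record_id) == 200:
                leaks.append((user, record_id))
    return leaks


OWNERS = {1: "alice", 2: "bob"}


def leaky_handler(user, record_id):
    # Bug under test: checks that the record exists, never who is asking.
    return 200 if record_id in OWNERS else 404


def strict_handler(user, record_id):
    # Only the owner gets a 200; everyone else sees 404.
    return 200 if OWNERS.get(record_id) == user else 404
```

Running `find_access_leaks(leaky_handler, OWNERS, ["alice", "bob"])` reports both cross-user reads, while the strict handler reports none. A loop like this only covers the id-tampering case; skipping the normal flow and hitting routes directly still needs a person poking at it.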
        
        redion
        
        ·
        a month ago
        ·
        Reply
        
        1
        
        That's the right call, and not everyone names that line so cleanly. Common access gaps you can cover cold, but the moment it's payments or health data the cost of a miss jumps and it turns into a different job.
        
        That handoff is basically the work I do: a dedicated security pass on AI-built apps, the who-can-you-let-through question as its own thing, replaying as the wrong user to see what comes back. If it's ever useful, I'm glad to be the person you point those clients to when you flag one. You keep the relationship and the QA brain, they just get a specialist read where it counts. It runs both ways too, a lot of my conversations could use a real QA pass before anything else.
        
        Either way this has been one of the better threads I've had on here.
        
        chalermpon
        
        ·
        a month ago
        ·
        Reply
        
        1
        
        That referral idea makes a lot of sense, honestly: clean split, same app, two different questions. I flag the security gaps, you go deep on them, client gets both covered. Happy to keep that in mind when those cases come up. And yes, one of the better threads on here for me too.
        
        redion
        
        ·
        a month ago
        
        1
        
        Funny thing about going in cold on purpose: it's the same reason the access holes survive in the first place. The more you absorb the intended flow, the less your brain volunteers to step outside it, so minimal context is what keeps the "what can I get away with" muscle honest. Replaying as the wrong user is the move a fixed checklist can never fake. Feels like QAura and a dedicated security pass end up reading the same app with two different questions. Do you tend to keep the payments and sensitive-data ones in-house, or is that a point where you'd rather hand off?
        
        chalermpon
        
        ·
        a month ago
        ·
        Reply
        
        1
        
        Mostly keep it in-house, the access control checks I described cover the common gaps that show up in most apps. For anything more serious, like a fintech product or something handling sensitive medical data, I'd flag it clearly in the report and recommend a dedicated security specialist. I know what I'm good at and a full penetration test is a different skill set.
        
        redion
        
        ·
        a month ago
1

Interesting findings. The "AI fixes exactly what you ask it to fix" point really stands out. Speed is valuable, but without QA and regression testing, it's easy to accumulate hidden issues. Curious which category of bugs showed up most often in AI-built products.

Autonomy09

·
2 months ago
·
Reply
1. 1
  
  Edge cases by far; they made up the bulk of the critical issues. The pattern is consistent: AI builds the happy path really well, but nobody ever asked 'what happens when this goes wrong?' Empty inputs, dropped connections, unexpected data formats: these were everywhere. The second most common category was consistency issues across features, which makes sense given that each AI prompt is essentially a fresh context with no memory of decisions made elsewhere in the product.
  
  redion
  
  ·
  2 months ago
  ·
  Reply
1

Interesting.

The thing I'd be careful with is treating this as an AI-coding problem only.

A lot of what you're describing sounds like missing product decisions showing up as bugs.

The useful question may not be what AI broke, but which decisions nobody made before the code existed.

aryan_sinh

·
2 months ago
·
Reply
1. 1
  
  That's a fair point and you are right that a lot of edge cases come down to decisions that were never made, not AI limitations.
  
  The difference I would add is that with traditional development those gaps are caught earlier, in the back and forth with the developer. With AI you can ship a complete-looking product so fast that nobody asks "what happens if we change this", so the missing decisions stay hidden longer.
  A QA audit helps catch those before clients do.
  
  redion
  
  ·
  2 months ago
  ·
  Reply