TL;DR: Four months. Six AI distribution stacks. $400/month at peak. Zero signups attributable to any of them. Then I opened a spreadsheet, listed every distribution action I take in a typical week (40 rows), and labeled each one "AI can do this" or "AI cannot do this without killing the channel." 28 rows went to AI. 12 stayed mine. Output doubled, cost dropped to $19/month, and the 12 I kept are the only reason any of this converts.
For four months I bought every "AI distribution" promise on the timeline. Apollo for lead enrichment. Lemlist for sequences. An autonomous agent platform that prospects while you sleep. An LLM lead-scorer. An AI scheduler. A separate AI for founder voice replication.
Six stacks. $400/month at peak. Zero measurable signup lift on flowly.run, the productivity tool I ship for freelancers and solo founders.
I knew exactly when to stop. I had paid more for AI growth tools that month than I had collected in MRR from them.
One evening I closed every dashboard and opened a spreadsheet. I listed every distribution action I had taken that week. The list ran 40 rows. Then I labeled each row with one of two tags: "AI can do this" or "AI cannot do this without killing the channel."
28 rows ended in the first column. 12 in the second. The 12 were the only rows that had ever produced a paying customer.
I had been paying $400/month to automate the 70% of distribution work that does not convert and ignoring the 30% that does.
These will look small. They are the entire engine.
Each of these failed when I delegated it. Three of them cost me actual customers.
The single most valuable file in my distribution setup is not a prompt. It is a one-page "never-write" list. New rules show up every week. Each rule came from reading a bad AI draft and tagging the exact line where my voice broke.
Sample lines:
Drafts produced with this list need a 30% edit. Drafts produced without it need a 70% edit, and at that point I am writing from scratch. The never-write list is the entire economic difference between "AI saves me time" and "AI costs me time."
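A never-write list can also be checked mechanically before a draft reaches the approval queue. The sketch below is only an illustration under assumptions: the function name `find_violations` and the sample phrases are hypothetical placeholders, not the contents of the actual list.

```python
# Hypothetical never-write rules, stored lowercase. Placeholders only.
NEVER_WRITE = ["now let's", "next we'll", "game-changer"]

def find_violations(draft: str, banned=NEVER_WRITE) -> list:
    """Return (line_number, phrase) for each banned phrase found in the draft."""
    hits = []
    for lineno, line in enumerate(draft.splitlines(), start=1):
        lowered = line.lower()
        for phrase in banned:
            if phrase in lowered:
                hits.append((lineno, phrase))
    return hits

draft = "Great question.\nNow let's look at the numbers.\nI tracked 40 rows."
violations = find_violations(draft)
print(violations)
```

A check like this only catches literal phrases. Rules about tone or structure would still need a human read.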
This is what the 2026 vocabulary now calls context engineering at indie scale. You are not prompt-engineering. You are designing the information environment around the model so it can do the mechanical 70% and stay out of the 30% that needs skin in the game.
Five scripts. About 200 lines of Python total. Cron-driven. Slack and email for human approvals.
Total API spend: about $19/month. Drifts to $24 in heavy weeks. Founder hours pulled back into product work: about 14 per week.
The pitch responder ran on auto-send for 8 days in March. I had set a low-confidence threshold and trusted the queue. A journalist's follow-up questions to my original pitch went unread for 5 days while the script regenerated polite boilerplate. She stopped replying around message 4. The story she eventually ran skipped Flowly.
The fix that night: every outbound message blocks on my one-click approval. No exceptions. No thresholds. No "low-risk" auto-send buckets.
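That rule amounts to a gate where nothing goes out unless a human has approved that exact message. A minimal sketch, assuming a hypothetical `Outbound` record and a `send` callable supplied by the caller (neither name comes from the actual scripts):

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Outbound:
    recipient: str
    body: str
    approved: bool = False  # flipped only by a human click, never by a score

class NotApproved(Exception):
    """Raised when something tries to send an unapproved message."""

def send_if_approved(msg: Outbound, send: Callable[[str, str], None]) -> None:
    # No confidence thresholds and no "low-risk" bucket: approval is the only path.
    if not msg.approved:
        raise NotApproved(f"message to {msg.recipient} is waiting for approval")
    send(msg.recipient, msg.body)

sent = []
msg = Outbound("journalist@example.com", "Thanks for the follow-up questions.")
try:
    send_if_approved(msg, lambda to, body: sent.append(to))
except NotApproved:
    pass  # the message stays in the queue
msg.approved = True  # the one-click approval
send_if_approved(msg, lambda to, body: sent.append(to))
print(sent)
```

The design choice is that the send function itself refuses, so a misconfigured threshold elsewhere cannot bypass the gate.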
The cost of that 8-day mistake was one piece of press. The cost of leaving the auto-send running would have been all of them.
This is the version of context engineering nobody publishes. The rule is not "what should the model see." The rule is "what does the model see that, if it gets one thing wrong, ends a relationship the model cannot rebuild."
The reason I noticed the 12-versus-28 split at all is that I had been timing every distribution action inside the product I ship. Flowly is a single workspace for tasks, timers, and analytics for freelancers and solo founders who are tired of running four separate apps to answer one question: where did my week actually go. I had been running my own distribution inside it, the same way a freelancer tracks billable client work. The spreadsheet that started this rebuild came out of Flowly's analytics, not my head.
The lesson generalizes past the tool. If you cannot see the line between what AI did and what you did, you cannot price either one honestly. AI runs distribution. Flowly tells me whether running it was worth the 95 minutes a day I still own.
Do the 40-action sort before you buy another AI tool. List every distribution action you take in a week. Label each row "AI can" or "AI cannot without killing the channel." Then make the second column the only thing you spend founder hours on.
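Once the rows are in a sheet, the split is easy to tally from a CSV export. A small sketch, assuming a hypothetical two-column layout (`action`, `label`) and made-up sample rows:

```python
import csv
import io
from collections import Counter

# Hypothetical export of the audit sheet; these rows are illustrative only.
SHEET = """action,label
scan threads for relevance,AI can
format weekly analytics digest,AI can
reply to a named journalist,AI cannot
"""

def tally(csv_text: str) -> Counter:
    """Count rows per label in the audit sheet."""
    rows = csv.DictReader(io.StringIO(csv_text))
    return Counter(row["label"] for row in rows)

counts = tally(SHEET)
total = sum(counts.values())
for label, n in counts.items():
    print(f"{label}: {n}/{total} ({n / total:.0%})")
```

For the split in this post, 28 of 40 rows is 70% and 12 of 40 is 30%.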
Mine is 28/12. Post yours in the comments. If anyone has a real 90/10 split working, with attribution that holds up, I will rebuild my stack to match. I want to be wrong about this. So far nobody has been.
Product: flowly.run. Free tier, 14-day reverse Pro trial, no card.
Most indie hackers who buy a $400 AI growth stack are paying to automate the 30% that converts and ignoring the 70% that bores them. They have it exactly backwards.
The 'AI can do this without killing the channel' framework is the real insight. Most founders fail at AI distribution because they let AI write the final draft and ship it. The leverage shows up when AI does retrieval, ranking, and first drafts, and you handle the final 20% that signals you actually wrote it. Running SocialPost.ai gave me the same lesson on the product side: customers will use AI for the 80%, but they want full control on the moments that touch their voice or their brand. Curious what happened the times you did let AI ship without edits.
The "AI can draft this, never send this" category feels like the missing piece in most AI workflow discussions. A lot of founders confuse speed with trust, but the trust layer is usually the actual business.
"Confusing speed with trust" is the cleanest summary of the auto-send failure mode I've read. The journalist story is exactly that mistake. The script was fast. The relationship was slow. I optimized for the wrong variable.
The draft-never-send bucket is also where I'd put anything going to someone with an audience larger than mine. The asymmetry is too high. One flat reply to the right person costs more than a month of volume.
I tried this for two months across a smaller stack — three tools instead of six — and burned roughly $140 before reaching the same audit you describe.
The two specific rows that wouldn't move from "AI cannot do this" for me, building a tiny iOS memo app solo: replies to Reddit comments where the OP is venting (the AI versions read flat even after voice cloning), and answering iPhone-specific questions in Apple subs where the second-best wording gets downvoted immediately. Everything I tried to push those into AI hands cost me karma faster than it bought reach.
What I'd add to your framework: a third bucket — "AI can draft this, never send this." That bucket quietly grew the longer I ran it. Curious whether the 12 you kept stayed stable, or whether some drifted into the AI column over time?
The third bucket is the right addition and I should have named it explicitly. "AI can draft, never send" is where my cold outreach to named journalists lives. The draft is useful as a structure check. It never ships as written. Calling it a two-column sort undersells the actual workflow.
The Reddit venting reply problem is one I recognize. The model produces the correct sentiment but misses the specific weight of the moment. It reads as someone who understood the complaint intellectually but wasn't in the room. That gap is unrecoverable with prompting in my experience — it's not a constraint problem, it's a presence problem.
On the 12 staying stable: some drifted, mostly in one direction. Two tasks that were firmly mine a year ago moved into the AI column after I got precise enough with constraints. None moved the other way. The tasks that stayed human got more entrenched over time, not less — because the cost of getting them wrong became clearer the longer I ran the stack.
This makes a lot of sense. I’ve seen good products fail just because they were too slow to reach users.
Automation definitely helps, but I feel the real challenge is keeping it personal.
How do you balance that?
The 12 tasks I kept are the entire answer to that. Personalization does not survive delegation — it just looks like personalization until the person on the other end clicks through and realizes nobody is home.
The balance I landed on: automate everything where the output is evaluated on accuracy. Keep everything where the output is evaluated on whether it sounds like a specific human who has read their work.
ran a similar audit, ended up with 9 tasks that had to stay mine. anything where the person could verify me in two clicks stayed human. the $400 for zero lift phase is almost universal.
The "verify me in two clicks" framing is sharper than how I had it. I was thinking about it as voice fidelity. You've named the actual risk: not that it sounds wrong, but that someone can check.
The $400 zero-lift phase being near-universal is the part I wish someone had told me before month one. Would not have stopped me but would have shortened it.
The inbox monitor row on my list looked similar. We built goffer.ai for newsletter writers and policy teams - it scans Congressional activity for keyword matches (bill introduced, committee vote, floor action) and sends alerts to Gmail or SMS.
The 28 part: scanning congress.gov, matching keywords, formatting the alert. Runs unattended.
The 12 part: deciding which keywords actually matter for your readers. We learned this early - users with 50 generic keywords got noise. Users with 5 precise ones got signals they wrote entire newsletters around.
The keyword selection cannot be delegated. It requires knowing your audience and your editorial angle. Same principle as your never-write list - the constraint lives upstream of the model, not inside it.
"The constraint lives upstream of the model, not inside it" is the cleaner formulation of what I was trying to say. Stealing that line.
The 50-versus-5 keywords finding is the exact failure mode I see when founders first build anything like this. More inputs feels like more coverage. It's just noise with extra steps. The model cannot tell you which keywords matter for your readers. It can only score against the ones you already chose correctly.
The dependency I'd add: keyword selection isn't a one-time decision. When the policy landscape shifts, a keyword that was low-signal for months becomes load-bearing overnight. That upstream call has to stay human — and it probably won't feel like a decision when it happens. It'll feel like "this alert seems more important this week." Which is exactly the judgment the model cannot replicate.
Read this with my coffee growing cold. The 28/12 ratio is almost exactly what I land on every time I do the same exercise, and the auto-send story is the specific bullet I dodged twice this year, both times by luck.
The row I'd add to "cannot delegate": choosing which old thread to revive vs let die. The model finds candidates fine. It's terrible at the "is this still relevant 8 days later" call. I lost a real conversation last month because a draft sat in my approval queue too long. By the time I sent the polished reply, it landed as a thread necromancer.
Your never-write list is the part of this post I keep re-reading. The rule I keep adding to mine: "never write a sentence that combines two strong claims into one." The model loves rhetorical stacking and you can smell it the second you read it back.
One real question: is the 30% edit measured in words changed, or in time spent vs writing from scratch? Those drift apart fast for me on long-form.
The thread necromancer problem is real and I don't have a clean fix for it. My queue has a 48-hour expiry now — anything older gets auto-archived and I re-evaluate from scratch rather than ship a stale draft. It creates some waste but it's better than the alternative you described.
"Never write a sentence that combines two strong claims into one" is going straight into my list. You're right that you can smell it immediately. The model stacks claims because stacking sounds authoritative. It reads as generated the second you say it out loud.
On the 30% question: time, not words. Words changed is a bad proxy because the most important edits are often one line — the opener or the closer — and those take 30 seconds to change but represent 80% of the value. When I tracked words changed I convinced myself drafts were good that weren't. Time spent relative to writing from scratch is the honest number. For long-form specifically, I've found the 30% estimate holds on replies and short posts and falls apart completely on anything over 600 words, where it drifts closer to 50%.
Matches what I hit doing long-form. The tough sections for me were the ones with code blocks — the model gets the snippets right but the prose between them sounds like a tutorial generator, not a story. I ended up rewriting that connective stuff almost every time. Those sections easily blew past 50%. Pure prose chapters were closer to 35-40%, much nearer your number. The 48-hour expiry is a discipline I should adopt. My queue right now is more "whenever I get to it" which is exactly the failure mode you described. Curious if 48 hours works across the board or if some channels need it shorter — feels like X reactions probably want a 4-6 hour window before they stop being relevant.
48 hours is not universal. X is closer to 4-6 hours for anything reply-shaped — after that the thread has moved and your comment lands in a graveyard. HN is more forgiving, sometimes 24 hours, depending on whether the thread is still active on /front. Long-form comment threads on IH or similar can survive 48 hours because the decay curve is slower.
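Those windows could be expressed as a per-channel lookup. A rough sketch, where the channel keys, the `is_stale` helper, and the choice of the upper end of each range are assumptions made for illustration:

```python
from datetime import datetime, timedelta

# Expiry windows taken from the numbers above; X uses the upper end of 4-6 hours.
EXPIRY = {
    "x": timedelta(hours=6),
    "hn": timedelta(hours=24),
    "ih": timedelta(hours=48),
}
DEFAULT_EXPIRY = timedelta(hours=48)  # fallback for channels not listed

def is_stale(channel: str, drafted_at: datetime, now: datetime) -> bool:
    """True when a queued draft has outlived its channel's window."""
    return now - drafted_at > EXPIRY.get(channel, DEFAULT_EXPIRY)

now = datetime(2026, 3, 10, 18, 0)
drafted = datetime(2026, 3, 10, 9, 0)  # nine hours earlier
print(is_stale("x", drafted, now))   # True: nine hours exceeds six
print(is_stale("hn", drafted, now))  # False: still inside 24 hours
```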
The code-block prose problem is one I haven't solved cleanly either. The connective tissue between technical sections is where the tutorial voice leaks in hardest. My current fix is a specific line in the never-write list: "never use 'now let's' or 'next we'll' as a transition." It catches the worst of it. The rest I still rewrite by hand.
The never-write list is the most underrated part of this. Everyone talks about prompts. Nobody talks about constraints. But constraints are what separate a draft that ships in 30% edit time versus one that has to be rebuilt from scratch.
The 28/12 split also maps to something I've noticed: AI excels at work where the output is reviewable in 10 seconds. If it takes longer to evaluate whether the AI did it right than to just do it yourself, you haven't gained time — you've just moved the bottleneck.
The March journalist story is the real lesson buried in here. Auto-send is never low-risk. One relationship lost to boilerplate is never recoverable. The human approval gate isn't friction — it's the product.
"If it takes longer to evaluate whether the AI did it right than to just do it yourself, you haven't gained time — you've just moved the bottleneck." That's the cleaner version of the test I was running implicitly and never wrote down. Adding it to the doc.
The 10-second reviewability threshold also explains why the never-write list matters more than the prompt. A good prompt makes the output better. A good constraint list makes the output faster to evaluate. Those are different problems and most people only solve the first one.
"The human approval gate isn't friction — it's the product" is exactly right and the part that took me the longest to internalize. I kept framing the approval step as overhead I'd eventually automate away. The journalist story is what made it permanent.
The never-write list is quietly the most important part of this post. Everyone obsesses over prompts and model selection but the constraint layer is where the actual time savings live. Without it you're just generating plausible-sounding text that still needs a full rewrite.
We hit the same wall building aisa.to (AI skills assessment through conversation). Early on we tried to let the model handle everything in the assessment flow. Turns out about 30% of the conversation requires judgment calls the model consistently gets wrong: when to push back on a vague answer, when someone is actually demonstrating skill vs just repeating something they read, when to change direction entirely. The rest is mechanical and AI handles it fine.
Your 28/12 split rings true. Most founders I talk to claim something closer to 90/10 but when you ask them to show attribution, the number falls apart fast. The honest split is always uglier than the vibes-based one.
One thing worth adding: the split isn't static. Tasks that were firmly in my "AI cannot" column six months ago have migrated over as I got better at writing constraints. The spreadsheet exercise is worth repeating quarterly.
The assessment case is a sharper version of the problem than distribution. In distribution a bad judgment call costs you a comment. In skills assessment a bad judgment call corrupts the actual output the product is selling. The 30% that stays human is load-bearing in a way mine isn't.
The 90/10 claim falling apart under attribution pressure is the most reliable pattern in these threads. The vibes-based number is always the marketing version. The spreadsheet number is always uglier and always more useful.
The quarterly repeat point is the one I'd underline. My split was 22/18 eight months ago. It's 28/12 now. Not because I automated more but because I got better at writing constraints that made previously-undelegatable tasks delegatable. The spreadsheet isn't a one-time audit, it's a calibration tool. Worth saying that more explicitly in the post.
Great breakdown. If you were starting over from scratch, what's the one thing you'd do earlier?
Written the never-write list. I had a voice doc for months that only said what to do. Drafts were 70% wrong. The day I added the never-write section, drafts became 70% right. That one page is worth more than any model upgrade. Start it on day one.
Can you make this concrete with one real example? I'd find it way more useful to see exactly what the AI does start to finish for a single HN comment that ships, including where you step in. The high-level pipeline makes sense, it's the actual handoffs I can't picture.
Cron at 09:00, 13:00, 17:00 pulls top 30 threads from HN /front and /newest. Haiku scores each thread 0-10 for relevance to my 7 stances and returns the top 5 with a one-line "why this thread" and a suggested reply angle. I read the 5 in 60 seconds and pick 1. Sonnet then generates 2 draft comments for that thread using my voice doc plus the thread context plus the chosen stance. I read both drafts, pick the better one, hand-edit the opener (always), hand-edit the closer (always), ship.
AI did about 4 minutes of work across the whole flow. I did about 6. The comment ships in 10 total minutes versus 25-30 if I were doing it fully manual. Multiply that 60% time saving across 5 channels and that is where the 14 hours per week of pulled-back founder time comes from.
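The triage step reduces to scoring each thread and keeping the best few. A minimal sketch with the model call stubbed out: `score_thread` is a hypothetical stand-in for the Haiku scoring request (here it just counts keyword hits), and the keywords and thread titles are invented.

```python
import heapq

STANCE_KEYWORDS = ["freelancer", "time tracking", "solo founder"]  # placeholders

def score_thread(title: str) -> int:
    """Stub for the LLM relevance score (0-10); counts keyword hits instead."""
    hits = sum(kw in title.lower() for kw in STANCE_KEYWORDS)
    return min(10, hits * 5)

def top_candidates(titles: list, k: int = 5) -> list:
    """Return the k highest-scoring (score, title) pairs, best first."""
    scored = [(score_thread(t), t) for t in titles]
    return heapq.nlargest(k, scored, key=lambda pair: pair[0])

threads = [
    "Ask HN: time tracking for a solo founder?",
    "Show HN: a new Rust web framework",
    "How one freelancer prices retainers",
]
for score, title in top_candidates(threads, k=2):
    print(score, title)
```

On the time arithmetic: 10 minutes against 25 to 30 minutes is a saving of roughly 60% to 67%.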
Fair pushback. Most distribution threads stop at impressions and engagement because signups are where attribution gets messy fast.
In our case, yes — we do track signups, but I’d be lying if I said organic attribution is perfectly clean. The honest version is usually a mix of:
branded search lift over time
What we’ve consistently seen is that distribution compounds when the content is tightly connected to a problem the product actually solves. Random viral reach rarely converts. Repeated credibility in the same niche does.
One example: a focused distribution loop around operational pain points produced lower engagement than broad thought-leadership posts, but converted materially better because the audience intent was higher.
So I’d separate “content performance” from “business performance.” High-output content can create awareness, but signups usually come from:
And honestly, there are still gaps. Dark social, team shares, screenshots, Slack forwards, AI summaries, and word-of-mouth make clean funnels almost impossible now.
The mistake is pretending attribution is precise. The useful question is whether distribution is creating measurable business lift over time, even if the exact path is fuzzy.
Genuine pushback here. Every distribution post I read measures output volume and engagement, then quietly assumes that becomes signups. Have you actually tied this stack to real signups, or is it outputs and vibes? If you have numbers I'd love to see how you attribute them, because organic attribution is notoriously messy and I'd rather hear the honest version with the gaps than a tidy funnel chart.
Both, honestly. Signups are the lagging metric I care about most and the one most resistant to clean attribution because organic distribution has a 30 to 90 day delay between first touch and conversion.
What I can measure: weekly output count, channel-level engagement (replies, upvotes, click-throughs to flowly.run), referrer reports from Umami, signup rate from each referrer over a 30-day window. The "hybrid-output conversion roughly 10x AI-only conversion" split I implied in the post comes from looking at click-to-signup on referrers I can tag and from comparing drafts shipped at 30% edit versus drafts shipped at 0% edit during one bad week.
What I cannot measure: the long-tail compound effect of consistent presence. A founder who saw 6 of my HN comments over 3 months and then signed up via a direct visit shows up as "organic, no referrer." That is the bulk of my signups. I assume the volume helps. The numbers agree but do not prove it.
If you build this stack, set up Umami or PostHog before you start, tag every link with UTM params, and accept that you will be flying half-blind on the long tail. That is the nature of organic distribution at this scale. Anyone selling you cleaner attribution is selling you fiction.
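Tagging links can be done with the standard library. A short sketch, assuming a hypothetical helper name `add_utm` and example values for source, medium, and campaign:

```python
from urllib.parse import parse_qsl, urlencode, urlparse, urlunparse

def add_utm(url: str, source: str, medium: str, campaign: str) -> str:
    """Append UTM parameters, preserving any existing query string."""
    parts = urlparse(url)
    query = dict(parse_qsl(parts.query))
    query.update({
        "utm_source": source,
        "utm_medium": medium,
        "utm_campaign": campaign,
    })
    return urlunparse(parts._replace(query=urlencode(query)))

# Example values are illustrative, not the naming scheme actually in use.
print(add_utm("https://flowly.run/", "hackernews", "comment", "weekly-replies"))
```

One consistent naming scheme per channel matters more than the exact names, since the referrer report groups on these strings.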
Really want to try this but I'm not a developer, no Python and definitely no Playwright. Is there a realistic version of this for non-technical founders, or is that a hard requirement? Would love to know where someone like me should even start.
Start with two scripts, not five. The two with the highest leverage are (1) a daily analytics digest that summarizes your traffic into one email and (2) a draft pack generator for one channel only. Pick the channel that costs you the most time per output.
You can build both in a weekend with Claude as your pair. The non-technical version is the same flow in n8n or Make.com. The pipeline matters more than the runtime. The hardest part is the voice doc, and that one you write by hand regardless.
The journalist failure made me wince, so thanks for writing it up instead of pretending the stack just works. That's the useful part of these posts. Has anything else broken in a similar way since you patched it? Trying to get a realistic sense of the failure surface before I build my own version.
Yes, smaller scars.
One: the thread scanner once recommended a Bluesky thread for engagement. I shipped a reply. The thread turned out to be a quote-post of a tragedy. I deleted within 4 minutes but a few people saw it. Now the scanner has a "sensitive content" pre-flag and the day's top 5 candidates skip anything tagged.
Two: the draft pack generator produced a reply that paraphrased a competitor's marketing copy almost word-for-word. I caught it because the closer was uncharacteristically smooth. The fix was a new line in the never-write list: never use any phrase that sounds like it was already used by a SaaS landing page.
Three: the inbox monitor once scored a podcast booking request as 2/10 because the founder voice doc did not include a "podcast guesting" stance. I missed the email. The fix was adding an eighth stance, then immediately collapsing it back to seven by merging two adjacent ones.
The pattern is the same: every failure produced one line of context engineering. The stack is mostly the accumulated failures of the founder, written down.
Has any platform actually flagged or deboosted your AI-assisted posts? That's honestly the one thing stopping me from setting this up.
Not that I can detect. The 2026 detection heuristics target unedited LLM output with characteristic structural tells (em dash density, three-bullet conclusions, "Here is the thing about X" openers). The 30% human edit removes those tells. The posts that get traction are always the ones where my edit pass adds a number, a specific personal example, or a contrarian line the model did not generate.
Maybe I missed it, but you reference these 7 stances over and over and never actually show what one looks like. That's genuinely the part I clicked through for. Can you break down the format of a single entry? And I'm curious why 7 specifically, since that feels oddly precise compared to just picking 5 or 10.
The doc is one page. Seven entries. Each entry has the same four fields.
Name. A short label of a few words. Mine include "single-tool stack undervalued," "AI removed design blocker not speed," "distribution is a feedback-signal problem," "founder voice is the asset."
First sentence template. The opener I have used enough times that I can identify the stance from the first 8 words.
Three bullets. The atomic claims this stance makes. Each bullet must be a complete idea, not a header.
One example reply that nailed it. A real comment or post of mine, copied verbatim. The example is the calibration target for the model.
The stance doc plus the never-write list is the entire context the LLM gets per task. Total prompt header: about 1,200 tokens. Per-request input adds another 300-800 depending on channel. Output 200-400. Cheap.
The reason 7 works and 15 does not: I cannot hold 15 stances in my head consistently. The model can. But the human approval step at the end will fail if I cannot recognize my own stance in the draft. Seven is the ceiling of what I can recognize at a glance.
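The four-field entry format maps naturally onto a small record type. A sketch under assumptions: the class name, field names, and `validate_doc` helper are hypothetical, and the angle-bracket strings are placeholders rather than real stance content.

```python
from dataclasses import dataclass
from typing import List

MAX_STANCES = 7  # the ceiling named above

@dataclass
class Stance:
    name: str             # short label
    opener_template: str  # first sentence template
    claims: List[str]     # three atomic claims, each a complete idea
    example_reply: str    # a real reply, copied verbatim

def validate_doc(stances: List[Stance]) -> None:
    """Raise ValueError if the doc breaks the format described above."""
    if len(stances) > MAX_STANCES:
        raise ValueError(f"{len(stances)} stances; the limit is {MAX_STANCES}")
    for s in stances:
        if len(s.claims) != 3:
            raise ValueError(f"'{s.name}' needs exactly three bullets")

example = Stance(
    name="founder voice is the asset",
    opener_template="<opener goes here>",
    claims=["<claim 1>", "<claim 2>", "<claim 3>"],
    example_reply="<verbatim reply goes here>",
)
validate_doc([example])  # passes silently
```

On the token figures: a 1,200-token header plus 300 to 800 tokens of per-request input comes to 1,500 to 2,000 input tokens per call.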
Solid post, but I'm stuck on this part. I've used Apollo and Lemlist and they already handle outreach fine, so why hand-roll Python scripts for it? Genuinely asking what they were missing, because maintaining your own stack sounds like real overhead for a solo founder.
I did. For 4 months. They failed for the same reason most "AI distribution" SaaS products fail at indie scale: they are optimized for outbound B2B SDR workflows where the slow 30% is "personalized intro line" and the bulk 70% is "send 100 emails per day." My slow 30% is "decide whether this is worth shipping at all," which no SaaS exposes as a step. Those tools assume you already decided. I had not.
The Python stack is about 200 lines total across the 5 scripts. It gives me a seam where the human picks live. SaaS tools paper over that seam and charge for it. The seam is the entire product.
Nice writeup. Curious about the model side, are you running one model across the whole pipeline or swapping per step? And if you mix them, what made you land on that split instead of just defaulting to one provider?
Haiku for the scanning and scoring steps where I am triaging 50-100 candidates per day. Throughput and cost matter more than voice there. GPT-4o for short-form draft generation (replies, posts, pitches) because it tracks instructions tighter on the 60-word constraint and stops adding extra paragraphs. Claude Sonnet for long-form blog drafts and journalist pitch responses where voice fidelity is the load-bearing metric.
I rotate when a model starts drifting from my voice doc, which happens about every 8 weeks. The assignment above is correct as of this week. By next quarter it will probably look different.
$19/mo total? I'm paying more than that for Lemlist alone. Where does that actually go?
Roughly Claude Haiku for monitoring and scoring ($6), GPT-4o for thread relevance ranking and short-form drafts ($7), Claude Sonnet for blog first-drafts and pitch responses ($4), and about $1-2 in Playwright cloud minutes for form filling. Some months it nudges $24. It has not crossed $30.
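As a quick check, the line items add up as follows. The dictionary keys are hypothetical labels; the dollar figures are the ones quoted above.

```python
# Monthly line items in USD as (low, high); figures are from the reply above.
COSTS = {
    "haiku_monitoring_scoring": (6, 6),
    "gpt4o_ranking_short_drafts": (7, 7),
    "sonnet_blog_pitch_drafts": (4, 4),
    "playwright_cloud_minutes": (1, 2),
}

low = sum(lo for lo, _ in COSTS.values())
high = sum(hi for _, hi in COSTS.values())
print(f"${low}-{high}/month")  # prints $18-19/month
```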
The multi-tool problem nobody talks about: using Claude Code + Cursor + Copilot on the same project.
Each one starts fresh. Each one has different defaults. Each one will make a different decision about the same architectural question — and none of them will tell you they disagree with the others.
Six months in, the codebase reflects three different opinions about how to structure the same thing.
The fix: a CLAUDE.md file (works as Cursor rules too) that defines the non-negotiables before any tool touches the code. Stack, patterns, what's forbidden. All tools read the same source of truth.
It's not about which tool is better. It's about making them agree with each other.
Anyone else running multiple AI tools on the same codebase? How do you keep them consistent?
The CLAUDE.md as shared source of truth is the code equivalent of my never-write list. Same principle: the constraint lives upstream of the model, not inside it. Without it each tool optimizes locally and the codebase accumulates three silent opinions about the same problem.
The distribution version of your problem is running the same pipeline across Claude, GPT-4o, and Haiku without a shared voice doc. Each model drifts toward its own defaults. The output sounds like three different founders. The fix is identical to yours — one document all models read before touching anything.