
AI runs 70% of my distribution. The exact stack.

TL;DR: Four months. Six AI distribution stacks. $400/month at peak. Zero signups attributable to any of them. Then I opened a spreadsheet, listed every distribution action I take in a typical week (40 rows), and labeled each one "AI can do this" or "AI cannot do this without killing the channel." 28 rows went to AI. 12 stayed mine. Output doubled, cost dropped to $19/month, and the 12 I kept are the only reason any of this converts.

The $1,600 mistake

For four months I bought every "AI distribution" promise on the timeline. Apollo for lead enrichment. Lemlist for sequences. An autonomous agent platform that prospects while you sleep. An LLM lead-scorer. An AI scheduler. A separate AI for founder voice replication.

Six stacks. $400/month at peak. Zero measurable signup lift on flowly.run, the productivity tool I ship for freelancers and solo founders.

I knew exactly when to stop. I had paid more for AI growth tools that month than I had collected in MRR from them.

The spreadsheet that fixed it

One evening I closed every dashboard and opened a spreadsheet. I listed every distribution action I had taken that week. The list ran 40 rows. Then I labeled each row with one of two tags.

  • AI can do this. Scanning, scoring, drafting, scheduling, summarizing.
  • AI cannot do this without killing the channel. Selection, tone, decision, customer reply, cold outreach.

28 rows ended up in the first column, 12 in the second. The 12 were the only rows that had ever produced a paying customer.

I had been paying $400/month to automate the 70% of distribution work that does not convert and ignoring the 30% that does.

The 12 I cannot delegate

These will look small. They are the entire engine.

  1. The final 30% edit of every shipped reply.
  2. Picking which 1 or 2 of the daily AI-generated drafts are actually worth shipping.
  3. Founder positioning calls on what stance to take this month.
  4. Replying to any customer DM or email. Every one. Always.
  5. The hand-edit of the opener and closer of any long-form post.
  6. Cold outreach to specific named humans (creators, journalists, podcasters).
  7. Any reply that names another person by handle.
  8. Comments under my own posts.
  9. Any time I am positioning against a named competitor.
  10. Any pitch that requires reading 3+ pieces of the recipient's prior work.
  11. The choice of which channel to retire when a metric drops.
  12. The closing line of every reply. AI cannot land a closer.

Each of these failed when I delegated it. Three of them cost me actual customers.

The list that makes the rest of the stack work

The single most valuable file in my distribution setup is not a prompt. It is a one-page "never-write" list. New rules show up every week. Each rule came from reading a bad AI draft and tagging the exact line where my voice broke.

Sample lines:

  • Never start a reply with "Great question."
  • Never use "leverage," "unlock," "synergy," "in the trenches."
  • Never end a comment with a question if the post is already over 60 words.
  • Never name the product in the first 80% of any reply.
  • Never quote a statistic without naming the source.
  • Never use an emoji unless quoting someone else.
  • Never write a sentence over 25 words in a draft for X or HN.
  • Never close with "Hope this helps."

Drafts produced with this list need a 30% edit. Drafts produced without it need a 70% edit, and at that point I am writing from scratch. The never-write list is the entire economic difference between "AI saves me time" and "AI costs me time."
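
A rough sketch of how a few of the sample rules could be checked mechanically before a draft reaches the edit pass. This is an illustration, not one of the five scripts: `violations`, `sentences`, and `BANNED_PHRASES` are hypothetical names, the sentence splitter is naive, and the 25-word rule is applied to every draft here rather than only to X or HN drafts.

```python
import re

# Phrases taken from the sample never-write lines in the post.
BANNED_PHRASES = ("leverage", "unlock", "synergy", "in the trenches")


def sentences(text: str) -> list[str]:
    """Naive sentence split on ., ! or ? followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def violations(draft: str) -> list[str]:
    """Return a label for every never-write rule the draft breaks."""
    found = []
    lowered = draft.lower()
    if lowered.lstrip().startswith("great question"):
        found.append("starts with 'Great question'")
    for phrase in BANNED_PHRASES:
        # Whole-word match, so "unlocked" does not trip the "unlock" rule.
        if re.search(rf"\b{re.escape(phrase)}\b", lowered):
            found.append(f"uses '{phrase}'")
    if lowered.rstrip().rstrip(".!").endswith("hope this helps"):
        found.append("closes with 'Hope this helps'")
    # Assumption: the 25-word limit is checked on every draft, not per channel.
    if any(len(s.split()) > 25 for s in sentences(draft)):
        found.append("sentence over 25 words")
    return found


if __name__ == "__main__":
    print(violations("Great question. You can leverage timers here. Hope this helps."))
```

A check like this only catches the mechanical rules. Rules such as "never name the product in the first 80% of any reply" need more context, and the voice edit itself still has to be done by hand.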

This is what the 2026 vocabulary now calls context engineering at indie scale. You are not prompt-engineering. You are designing the information environment around the model so it can do the mechanical 70% and stay out of the 30% that needs skin in the game.

The $19 stack

Five scripts. About 200 lines of Python total. Cron-driven. Slack and email for human approvals.

  1. Inbox monitor. Polls Gmail IMAP for journalist platform queries every 2 hours. Each query gets scored by Claude Haiku against my 7 stances. Ranked list lands in Slack. I read it in 90 seconds and pick 0 to 2.
  2. Thread scanner. Pulls HN /front and /newest, my X home, my Bluesky timeline 3x daily. Returns top 5 candidates with a one-line "why this thread" and a suggested reply angle.
  3. Draft pack generator. End-of-day script produces 5 ready-to-edit drafts across 5 channels using my founder voice doc plus the never-write list. I edit roughly 30% of every draft before shipping.
  4. Daily digest. Pulls Umami plus Flowly analytics. Emails a 9pm summary of what shipped, what landed, what converted. 2 minutes to read.
  5. Pitch responder. Drafts journalist replies, 60-word constraint, founder voice. Always blocks on my one-click approval. Always.

Total API spend: about $19/month. Drifts to $24 in heavy weeks. Founder hours pulled back into product work: about 14 per week.
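
The inbox monitor and the thread scanner share one shape: score every candidate, keep a short ranked list, hand it to a human. A minimal sketch of that shape follows. The model call is not shown; `keyword_score` is a stand-in scorer, and `top_candidates` and `STANCE_KEYWORDS` are hypothetical names, not the actual scripts.

```python
from typing import Callable


def top_candidates(
    items: list[str],
    score: Callable[[str], float],
    limit: int = 5,
    min_score: float = 0.0,
) -> list[tuple[float, str]]:
    """Score every item, drop those below min_score, return the best `limit`."""
    scored = [(score(item), item) for item in items]
    kept = [pair for pair in scored if pair[0] >= min_score]
    # Sort by score only, highest first. sorted() is stable, so ties keep input order.
    return sorted(kept, key=lambda pair: pair[0], reverse=True)[:limit]


# Stand-in scorer for illustration: counts stance keywords in the title.
# In a real stack this is where the model call would go.
STANCE_KEYWORDS = ("freelancer", "time tracking", "solo founder")


def keyword_score(title: str) -> float:
    lowered = title.lower()
    return float(sum(keyword in lowered for keyword in STANCE_KEYWORDS))


if __name__ == "__main__":
    titles = [
        "Show HN: time tracking for freelancers",
        "Rust 2.0 released",
        "Ask HN: solo founder burnout",
    ]
    for value, title in top_candidates(titles, keyword_score, limit=2):
        print(value, title)
```

Passing the scorer in as a function keeps the ranking logic testable without API spend, and makes it cheap to swap the stand-in for a model-backed scorer later.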

The day I almost killed a story

The pitch responder ran on auto-send for 8 days in March. I had set a low-confidence threshold and trusted the queue. A journalist's follow-up questions to my original pitch went unread for 5 days while the script regenerated polite boilerplate. She stopped replying around message 4. The story she eventually ran skipped Flowly.

The fix that night: every outbound message blocks on my one-click approval. No exceptions. No thresholds. No "low-risk" auto-send buckets.
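
One way to make that rule hard to undo is to put the check in the send path itself, with no threshold parameter to tune. A minimal sketch, assuming a hypothetical `Outbound` record and a `deliver` callable that wraps whatever actually sends the email or post; none of these names come from the original scripts.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Outbound:
    recipient: str
    body: str
    approved: bool = False  # only a human action should ever flip this


class NotApproved(Exception):
    """Raised when something tries to send a message nobody approved."""


def send(message: Outbound, deliver: Callable[[Outbound], None]) -> None:
    """Deliver only approved messages.

    There is deliberately no confidence threshold and no 'low-risk' bypass
    argument, so there is nothing to misconfigure.
    """
    if not message.approved:
        raise NotApproved(f"blocked: message to {message.recipient} awaits approval")
    deliver(message)


if __name__ == "__main__":
    outbox: list[Outbound] = []
    draft = Outbound("journalist@example.com", "Thanks for the follow-up.")
    try:
        send(draft, outbox.append)
    except NotApproved as err:
        print(err)
    draft.approved = True  # stands in for the one-click approval
    send(draft, outbox.append)
    print(len(outbox))
```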

The cost of that 8-day mistake was one piece of press. The cost of leaving the auto-send running would have been all of them.

This is the version of context engineering nobody publishes. The rule is not "what should the model see." The rule is "what does the model see that, if it gets one thing wrong, ends a relationship the model cannot rebuild."

Where Flowly fits

The reason I noticed the 12-versus-28 split at all is that I had been timing every distribution action inside the product I ship. Flowly is a single workspace for tasks, timers, and analytics for freelancers and solo founders who are tired of running four separate apps to answer one question: where did my week actually go. I had been running my own distribution inside it, the same way a freelancer tracks billable client work. The spreadsheet that started this rebuild came out of Flowly's analytics, not my head.

The lesson generalizes past the tool. If you cannot see the line between what AI did and what you did, you cannot price either one honestly. AI runs distribution. Flowly tells me whether running it was worth the 95 minutes a day I still own.

One ask, one bet

Do the 40-action sort before you buy another AI tool. List every distribution action you take in a week. Label each row "AI can" or "AI cannot without killing the channel." Then make the second column the only thing you spend founder hours on.
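
If the sort lives in a spreadsheet export, the split takes a few lines to compute. A small sketch, assuming each row carries exactly one of the two labels; the constant and function names are illustrative.

```python
from collections import Counter

AI_CAN = "AI can"
AI_CANNOT = "AI cannot without killing the channel"


def split(labels: list[str]) -> tuple[int, int, float]:
    """Return (ai_rows, human_rows, ai_share) for a list of row labels."""
    counts = Counter(labels)
    unknown = set(counts) - {AI_CAN, AI_CANNOT}
    if unknown:
        # Fail loudly rather than silently dropping mistyped rows.
        raise ValueError(f"unlabeled or mistyped rows: {sorted(unknown)}")
    ai, human = counts[AI_CAN], counts[AI_CANNOT]
    total = ai + human
    return ai, human, (ai / total if total else 0.0)


if __name__ == "__main__":
    rows = [AI_CAN] * 28 + [AI_CANNOT] * 12
    ai, human, share = split(rows)
    print(f"{ai}/{human}, AI share {share:.0%}")
```

With 28 and 12 rows the AI share is 28 / 40 = 70%, which is where the number in the title comes from.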

Mine is 28/12. Post yours in the comments. If anyone has a real 90/10 split working, with attribution that holds up, I will rebuild my stack to match. I want to be wrong about this. So far nobody has been.

Product: flowly.run. Free tier, 14-day reverse Pro trial, no card.

Most indie hackers who buy a $400 AI growth stack are paying to automate the 70% that does not convert and ignoring the 30% that does. They have it exactly backwards.

on May 20, 2026
  1. 2

    It's interesting that you found the tasks you kept in-house were the reason for conversions, which suggests that the human touch is still essential for driving meaningful engagement. I'd love to know more about the types of tasks that fell into the "AI cannot do this without killing the channel" category, as this could provide valuable insight into where AI can complement human effort without replacing it. What specific characteristics or requirements did these tasks have that made them unsuitable for automation?

    1. 1

      The common thread: the recipient could verify a real person was behind it in two clicks. Cold outreach to named journalists, replies mentioning someone by handle, comments under my own posts. All of them have a human on the other end who notices if nobody's home. That's the characteristic — not complexity, not length, but verifiability.

  2. 2

    the 28/12 split is the real lesson. we see the same pattern building agent systems for clients:

    • saas dashboard: 70% agent-supervised, 30% senior (architecture, hard logic)
    • bug fix sprint: 20% agent, 80% senior (reading existing code is judgment-heavy)
    • llm build: 50/50 (eval design = human, infra = agent)

    what stays human is consistent: judgment under partial info about a specific person. dm tone, the "is this worth shipping" call, opener of cold outreach. AI cold-starts on fresh context; senior judgment lives there.

    re-run the audit every 6 months tho, the line moves as AI gets stronger 🤷

    1. 1

      The task-type breakdown is the useful addition here — the split isn't a fixed number, it's a function of how judgment-heavy the work is. Bug fix sprint being 80% human because reading existing code requires knowing what the original author was thinking is exactly the pattern. Fresh context is the constraint; the model has none of it.

      The 6-month re-audit point matches what I said to someone else in this thread — my split was 22/18 eight months ago, now 28/12, not because I automated more but because better constraints made previously-undelegatable tasks delegatable. The line moves, but so far only in one direction.

  3. 2

    The 40-row sort is applied BI — you built an attribution model for your own distribution, and the signal is cleaner than most startup analytics dashboards I've seen.

    The daily digest piece is worth expanding. Pulling Umami + Flowly into a 9pm email is smart, but the next useful step is a 7-day rolling column alongside the daily number. Single-day snapshots often surface the wrong winner because distribution has a lag between action and conversion. A reply you sent Tuesday might show up as a signup Friday, and without the rolling window it looks like Friday's activity drove it.

    The 28/12 framing is the same principle I'd give any startup building their first data stack: instrument the inputs that have causal influence on outcomes, not just the ones that are easy to measure. You've been doing it instinctively — the spreadsheet just made it visible.

    1. 1

      The 7-day rolling column is the right fix and I haven't built it yet. You're describing exactly the attribution hole I paper over with "organic distribution has a 30-90 day lag" — which is true, but it's also a convenient excuse not to instrument it better. A rolling window would at least surface whether Tuesday's reply or Thursday's thread drove Friday's signup, even if it can't close the loop on the long-tail direct visits.

      The "instrument inputs with causal influence" framing is sharper than how I had it. I fell into the easy-to-measure trap for the first two months — tracking post count and engagement because Umami makes that effortless, ignoring the harder question of which specific action in which channel preceded a conversion. The spreadsheet made the input list visible. The rolling window would make the lag visible. Those are two different problems and I've only solved the first one.

      Adding it to the next digest iteration. Appreciate the specific build note rather than just "attribution is hard."
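
A minimal sketch of the rolling column discussed in this exchange, assuming daily signup counts are already available as a date-to-count mapping. `rolling_sum` and `digest_row` are hypothetical names, and the digest line format is invented for illustration.

```python
from datetime import date, timedelta


def rolling_sum(daily: dict[date, int], day: date, window: int = 7) -> int:
    """Sum the counts for `day` and the (window - 1) days before it.

    Days missing from `daily` count as zero.
    """
    return sum(daily.get(day - timedelta(days=offset), 0) for offset in range(window))


def digest_row(daily: dict[date, int], day: date) -> str:
    """One digest line showing the daily number next to the rolling one."""
    return (
        f"{day.isoformat()}  signups today: {daily.get(day, 0)}"
        f"  last 7 days: {rolling_sum(daily, day)}"
    )


if __name__ == "__main__":
    signups = {date(2026, 5, 12): 1, date(2026, 5, 15): 3}
    print(digest_row(signups, date(2026, 5, 15)))
```

The rolling number does not attribute a signup to a specific reply. It only keeps a quiet day from hiding a good week, which is the lag problem described above.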

  4. 2

    Right now distribution is harder than building the product. You can build the best tool, but if no one sees it, it becomes the world's best-kept secret. Very efficient distribution system.

    1. 1

      Fully agree — and I'd go further: distribution is now the harder engineering problem. Building is tractable. Getting seen by the right person at the right moment is not. The "world's best kept secret" failure mode has killed more good products than bad code ever has.

  5. 2

    The never-write list is the thing that clicked for me here. I've been building Markey (an AI launch tool) and ran into the exact same thing - the AI outputs that flopped were the ones where I skipped the constraint step. The $19 vs $400 story is also a good reminder that output volume isn't the metric. Thanks for sharing the actual scripts breakdown, that's rare.

    1. 1

      The constraint step being skippable is the exact failure mode — it feels optional until you've read enough flat drafts to price what skipping it actually costs. Output volume is the metric that feels like progress. Constraint quality is the metric that produces it. Glad the scripts breakdown was useful; most posts stop at the framework and leave out the part you can actually build from.

  6. 2

    The never-write list is the real differentiator here. Most AI distribution fails because of voice inconsistency, not technical execution. The constraint-driven approach you take keeps AI from drifting into generic marketing language. One addition: track which drafts pass your manual gate vs which get rejected and why. That rejection data is better training material than any fine-tuning dataset. Curious if you've built any feedback loop from those manual decisions back into the prompt templates.

    1. 1

      Haven't built a formal feedback loop yet, but the rejection tracking point is the right call. Right now new never-write rules show up when I read a bad draft and tag the exact line where the voice broke — which is manual and lossy. Turning rejections into structured data and routing the patterns back into the prompt header is the obvious next step I've been doing informally. The rejection log as training signal is a better frame than fine-tuning because it stays interpretable — you can read the list and know why each rule exists. Adding a rejection reason field to the approval queue this week.

  7. 2

    The 'AI cannot do this without killing the channel' framework is the real insight. Most founders fail at AI distribution because they let AI write the final draft and ship it. The leverage shows up when AI does retrieval, ranking, and first drafts, and you handle the final 20% that signals you actually wrote it. Running SocialPost.ai gave me the same lesson on the product side: customers will use AI for the 80%, but they want full control on the moments that touch their voice or their brand. Curious what happened the times you did let AI ship without edits.

    1. 1

      The unedited weeks are the clearest data I have. Engagement held roughly flat — likes, upvotes, replies looked normal. Signups dropped. The posts that convert aren't the ones that sound correct, they're the ones where a specific line lands as something only someone with skin in the game would write. Unedited drafts pass the skimming test and fail the "do I trust this person" test. The audience that converts is exactly the audience that can tell the difference.

  8. 2

    The "AI can draft this, never send this" category feels like the missing piece in most AI workflow discussions. A lot of founders confuse speed with trust, but the trust layer is usually the actual business.

    1. 1

      "Confusing speed with trust" is the cleanest summary of the auto-send failure mode I've read. The journalist story is exactly that mistake. The script was fast. The relationship was slow. I optimized for the wrong variable.

      The draft-never-send bucket is also where I'd put anything going to someone with an audience larger than mine. The asymmetry is too high. One flat reply to the right person costs more than a month of volume.

  9. 2

    I tried this for two months across a smaller stack — three tools instead of six — and burned roughly $140 before reaching the same audit you describe.

    The two specific rows that wouldn't move from "AI cannot do this" for me, building a tiny iOS memo app solo: replies to Reddit comments where the OP is venting (the AI versions read flat even after voice cloning), and answering iPhone-specific questions in Apple subs where the second-best wording gets downvoted immediately. Everything I tried to push those into AI hands cost me karma faster than it bought reach.

    What I'd add to your framework: a third bucket — "AI can draft this, never send this." That bucket quietly grew the longer I ran it. Curious whether the 12 you kept stayed stable, or whether some drifted into the AI column over time?

    1. 1

      The third bucket is the right addition and I should have named it explicitly. "AI can draft, never send" is where my cold outreach to named journalists lives. The draft is useful as a structure check. It never ships as written. Calling it a two-column sort undersells the actual workflow.

      The Reddit venting reply problem is one I recognize. The model produces the correct sentiment but misses the specific weight of the moment. It reads as someone who understood the complaint intellectually but wasn't in the room. That gap is unrecoverable with prompting in my experience — it's not a constraint problem, it's a presence problem.

      On the 12 staying stable: some drifted, mostly in one direction. Two tasks that were firmly mine a year ago moved into the AI column after I got precise enough with constraints. None moved the other way. The tasks that stayed human got more entrenched over time, not less — because the cost of getting them wrong became clearer the longer I ran the stack.

  10. 2

    This makes a lot of sense. I’ve seen good products fail just because they were too slow to reach users.
    Automation definitely helps, but I feel the real challenge is keeping it personal.
    How do you balance that?

    1. 1

      The 12 tasks I kept are the entire answer to that. Personalization does not survive delegation — it just looks like personalization until the person on the other end clicks through and realizes nobody is home.

      The balance I landed on: automate everything where the output is evaluated on accuracy. Keep everything where the output is evaluated on whether it sounds like a specific human who has read their work.

  11. 2

    ran a similar audit, ended up with 9 tasks that had to stay mine. anything where the person could verify me in two clicks stayed human. the $400 for zero lift phase is almost universal.

    1. 1

      The "verify me in two clicks" framing is sharper than how I had it. I was thinking about it as voice fidelity. You've named the actual risk: not that it sounds wrong, but that someone can check.

      The $400 zero-lift phase being near-universal is the part I wish someone had told me before month one. Would not have stopped me but would have shortened it.

      1. 1

        yeah, had them merged too. voice fidelity's upstream - the lookup is the gate. you can nail the tone and still lose someone who actually checks.

        1. 1

          Exactly right. Voice fidelity is table stakes — the lookup is the actual test. A journalist or podcaster who likes your reply and checks your profile in 10 seconds will see whether the last 20 posts sound like the same person. If they don't, the reply didn't matter.

  12. 2

    The inbox monitor row on my list looked similar. We built goffer.ai for newsletter writers and policy teams - it scans Congressional activity for keyword matches (bill introduced, committee vote, floor action) and sends alerts to Gmail or SMS.

    The 28 part: scanning congress.gov, matching keywords, formatting the alert. Runs unattended.

    The 12 part: deciding which keywords actually matter for your readers. We learned this early - users with 50 generic keywords got noise. Users with 5 precise ones got signals they wrote entire newsletters around.

    The keyword selection cannot be delegated. It requires knowing your audience and your editorial angle. Same principle as your never-write list - the constraint lives upstream of the model, not inside it.

    1. 1

      "The constraint lives upstream of the model, not inside it" is the cleaner formulation of what I was trying to say. Stealing that line.

      The 50-versus-5 keywords finding is the exact failure mode I see when founders first build anything like this. More inputs feels like more coverage. It's just noise with extra steps. The model cannot tell you which keywords matter for your readers. It can only score against the ones you already chose correctly.

      The dependency I'd add: keyword selection isn't a one-time decision. When the policy landscape shifts, a keyword that was low-signal for months becomes load-bearing overnight. That upstream call has to stay human — and it probably won't feel like a decision when it happens. It'll feel like "this alert seems more important this week." Which is exactly the judgment the model cannot replicate.

  13. 2

    Read this with my coffee growing cold. The 28/12 ratio is almost exactly what I land on every time I do the same exercise, and the auto-send story is the specific bullet I dodged twice this year, both times by luck.

    The row I'd add to "cannot delegate": choosing which old thread to revive vs let die. The model finds candidates fine. It's terrible at the "is this still relevant 8 days later" call. I lost a real conversation last month because a draft sat in my approval queue too long. By the time I sent the polished reply, it landed as a thread necromancer.

    Your never-write list is the part of this post I keep re-reading. The rule I keep adding to mine: "never write a sentence that combines two strong claims into one." The model loves rhetorical stacking and you can smell it the second you read it back.

    One real question: is the 30% edit measured in words changed, or in time spent vs writing from scratch? Those drift apart fast for me on long-form.

    1. 1

      The thread necromancer problem is real and I don't have a clean fix for it. My queue has a 48-hour expiry now — anything older gets auto-archived and I re-evaluate from scratch rather than ship a stale draft. It creates some waste but it's better than the alternative you described.

      "Never write a sentence that combines two strong claims into one" is going straight into my list. You're right that you can smell it immediately. The model stacks claims because stacking sounds authoritative. It reads as generated the second you say it out loud.

      On the 30% question: time, not words. Words changed is a bad proxy because the most important edits are often one line — the opener or the closer — and those take 30 seconds to change but represent 80% of the value. When I tracked words changed I convinced myself drafts were good that weren't. Time spent relative to writing from scratch is the honest number. For long-form specifically, I've found the 30% estimate holds on replies and short posts and falls apart completely on anything over 600 words, where it drifts closer to 50%.

      1. 1

        Matches what I hit doing long-form. The tough sections for me were the ones with code blocks — the model gets the snippets right but the prose between them sounds like a tutorial generator, not a story. I ended up rewriting that connective stuff almost every time. Those sections easily blew past 50%. Pure prose chapters were closer to 35-40%, much nearer your number. The 48-hour expiry is a discipline I should adopt. My queue right now is more "whenever I get to it" which is exactly the failure mode you described. Curious if 48 hours works across the board or if some channels need it shorter — feels like X reactions probably want a 4-6 hour window before they stop being relevant.

        1. 1

          48 hours is not universal. X is closer to 4-6 hours for anything reply-shaped — after that the thread has moved and your comment lands in a graveyard. HN is more forgiving, sometimes 24 hours, depending on whether the thread is still active on /front. Long-form comment threads on IH or similar can survive 48 hours because the decay curve is slower.

          The code-block prose problem is one I haven't solved cleanly either. The connective tissue between technical sections is where the tutorial voice leaks in hardest. My current fix is a specific line in the never-write list: "never use 'now let's' or 'next we'll' as a transition." It catches the worst of it. The rest I still rewrite by hand.

          1. 1

            "Never use now let's or next we'll" is going straight into my list. That's exactly the failure mode I was rewriting around without naming. Stealing it. The X 4-6h matches what I was suspecting. The lesson I'm taking: a late reply on a fast channel is worse than no reply at all. It signals you weren't paying attention.
            Appreciated this whole exchange.

            1. 1

              Same. The channel-specific decay curves are the part most queue systems ignore entirely — one expiry rule across all channels is almost as bad as no rule. The "late reply signals you weren't paying attention" framing is exactly right and I hadn't named it that cleanly before this thread.
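
A sketch of what a per-channel expiry rule could look like in an approval queue, using the windows mentioned in this thread. The channel keys, the choice of the short end of each range, and the names `EXPIRY_HOURS` and `is_stale` are assumptions for illustration.

```python
from datetime import datetime, timedelta

# Hours a queued draft stays shippable, per channel. Values follow the thread:
# X uses the short end of "4-6 hours", HN "sometimes 24", IH-style threads 48.
EXPIRY_HOURS = {"x": 4, "hn": 24, "ih": 48}
DEFAULT_EXPIRY_HOURS = 48  # the original single queue-wide rule


def is_stale(channel: str, queued_at: datetime, now: datetime) -> bool:
    """True when the draft has outlived its channel's window and should be archived."""
    limit = timedelta(hours=EXPIRY_HOURS.get(channel, DEFAULT_EXPIRY_HOURS))
    return now - queued_at > limit


if __name__ == "__main__":
    queued = datetime(2026, 5, 20, 9, 0)
    now = datetime(2026, 5, 20, 15, 0)  # six hours later
    for channel in ("x", "hn", "ih"):
        print(channel, is_stale(channel, queued, now))
```

Passing `now` in explicitly, instead of calling `datetime.now()` inside the function, keeps the rule easy to test against fixed timestamps.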

  14. 2

    The never-write list is the most underrated part of this. Everyone talks about prompts. Nobody talks about constraints. But constraints are what separate a draft that ships in 30% edit time versus one that has to be rebuilt from scratch.

    The 28/12 split also maps to something I've noticed: AI excels at work where the output is reviewable in 10 seconds. If it takes longer to evaluate whether the AI did it right than to just do it yourself, you haven't gained time — you've just moved the bottleneck.

    The March journalist story is the real lesson buried in here. Auto-send is never low-risk. One relationship lost to boilerplate is never recoverable. The human approval gate isn't friction — it's the product.

    1. 1

      "If it takes longer to evaluate whether the AI did it right than to just do it yourself, you haven't gained time — you've just moved the bottleneck." That's the cleaner version of the test I was running implicitly and never wrote down. Adding it to the doc.

      The 10-second reviewability threshold also explains why the never-write list matters more than the prompt. A good prompt makes the output better. A good constraint list makes the output faster to evaluate. Those are different problems and most people only solve the first one.

      "The human approval gate isn't friction — it's the product" is exactly right and the part that took me the longest to internalize. I kept framing the approval step as overhead I'd eventually automate away. The journalist story is what made it permanent.

  15. 2

    The never-write list is quietly the most important part of this post. Everyone obsesses over prompts and model selection but the constraint layer is where the actual time savings live. Without it you're just generating plausible-sounding text that still needs a full rewrite.

    We hit the same wall building aisa.to (AI skills assessment through conversation). Early on we tried to let the model handle everything in the assessment flow. Turns out about 30% of the conversation requires judgment calls the model consistently gets wrong: when to push back on a vague answer, when someone is actually demonstrating skill vs just repeating something they read, when to change direction entirely. The rest is mechanical and AI handles it fine.

    Your 28/12 split rings true. Most founders I talk to claim something closer to 90/10 but when you ask them to show attribution, the number falls apart fast. The honest split is always uglier than the vibes-based one.

    One thing worth adding: the split isn't static. Tasks that were firmly in my "AI cannot" column six months ago have migrated over as I got better at writing constraints. The spreadsheet exercise is worth repeating quarterly.

    1. 1

      The assessment case is a sharper version of the problem than distribution. In distribution a bad judgment call costs you a comment. In skills assessment a bad judgment call corrupts the actual output the product is selling. The 30% that stays human is load-bearing in a way mine isn't.

      The 90/10 claim falling apart under attribution pressure is the most reliable pattern in these threads. The vibes-based number is always the marketing version. The spreadsheet number is always uglier and always more useful.

      The quarterly repeat point is the one I'd underline. My split was 22/18 eight months ago. It's 28/12 now. Not because I automated more but because I got better at writing constraints that made previously-undelegatable tasks delegatable. The spreadsheet isn't a one-time audit, it's a calibration tool. Worth saying that more explicitly in the post.

  16. 2

    Great breakdown. If you were starting over from scratch, what's the one thing you'd do earlier?

    1. 1

      Written the never-write list. I had a voice doc for months that only said what to do. Drafts were 70% wrong. The day I added the never-write section, drafts became 70% right. That one page is worth more than any model upgrade. Start it on day one.

  17. 2

    Can you make this concrete with one real example? I'd find it way more useful to see exactly what the AI does start to finish for a single HN comment that ships, including where you step in. The high-level pipeline makes sense, it's the actual handoffs I can't picture.

    1. 1

      Cron at 09:00, 13:00, 17:00 pulls top 30 threads from HN /front and /newest. Haiku scores each thread 0-10 for relevance to my 7 stances and returns the top 5 with a one-line "why this thread" and a suggested reply angle. I read the 5 in 60 seconds and pick 1. Sonnet then generates 2 draft comments for that thread using my voice doc plus the thread context plus the chosen stance. I read both drafts, pick the better one, hand-edit the opener (always), hand-edit the closer (always), ship.

      AI did about 4 minutes of work across the whole flow. I did about 6. The comment ships in 10 total minutes versus 25-30 if I were doing it fully manual. Multiply that 60% time saving across 5 channels and that is where the 14 hours per week of pulled-back founder time comes from.

      1. 1

        Fair pushback. Most distribution threads stop at impressions and engagement because signups are where attribution gets messy fast.

        In our case, yes — we do track signups, but I’d be lying if I said organic attribution is perfectly clean. The honest version is usually a mix of:

        1. direct attribution (UTMs, landing pages, referral paths)
        2. assisted conversions (people see multiple posts before converting)
        3. branded search lift over time
        4. qualitative signals like inbound mentions and “saw your post” demos

        What we’ve consistently seen is that distribution compounds when the content is tightly connected to a problem the product actually solves. Random viral reach rarely converts. Repeated credibility in the same niche does.

        One example: a focused distribution loop around operational pain points produced lower engagement than broad thought-leadership posts, but converted materially better because the audience intent was higher.

        So I’d separate “content performance” from “business performance.” High-output content can create awareness, but signups usually come from:

        1. message-market fit
        2. repeated exposure
        3. clear next-step friction reduction

        And honestly, there are still gaps. Dark social, team shares, screenshots, Slack forwards, AI summaries, and word-of-mouth make clean funnels almost impossible now.

        The mistake is pretending attribution is precise. The useful question is whether distribution is creating measurable business lift over time, even if the exact path is fuzzy.

        1. 1

          Cron at 09:00, 13:00, 17:00 pulls top 30 threads from HN /front and /newest. Haiku scores each thread 0-10 for relevance to my 7 stances and returns the top 5 with a one-line "why this thread" and a suggested reply angle. I read the 5 in 60 seconds and pick 1. Sonnet then generates 2 draft comments for that thread using my voice doc plus the thread context plus the chosen stance. I read both drafts, pick the better one, hand-edit the opener (always), hand-edit the closer (always), ship.

          AI did about 4 minutes of work across the whole flow. I did about 6. The comment ships in 10 total minutes versus 25-30 if I were doing it fully manual. Multiply that 60% time saving across 5 channels and that is where the 14 hours per week of pulled-back founder time comes from.
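
          A minimal sketch of the triage step described above, not the actual script. It assumes the scorer is passed in as a plain function: `keyword_score` is a hypothetical stand-in for the Haiku call, and `Thread` and `top_candidates` are illustrative names.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Thread:
    title: str
    url: str


def top_candidates(
    threads: List[Thread],
    score: Callable[[Thread], int],
    k: int = 5,
) -> List[Thread]:
    """Return the k highest-scoring threads, best first."""
    scored = [(score(t), t) for t in threads]
    # list.sort is stable, so tied threads keep their original feed order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [t for _, t in scored[:k]]


def keyword_score(thread: Thread) -> int:
    """Hypothetical stand-in for the LLM scorer: keyword hits, capped at 10."""
    keywords = ("distribution", "founder", "indie")
    hits = sum(word in thread.title.lower() for word in keywords)
    return min(10, hits * 3)
```

          Keeping the scorer as an argument means the model behind it can be swapped without touching the selection logic. The human pick of 1 out of 5 stays outside the code.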

  18. 2

    Genuine pushback here. Every distribution post I read measures output volume and engagement, then quietly assumes that becomes signups. Have you actually tied this stack to real signups, or is it outputs and vibes? If you have numbers I'd love to see how you attribute them, because organic attribution is notoriously messy and I'd rather hear the honest version with the gaps than a tidy funnel chart.

    1. 1

      Both, honestly. Signups are the lagging metric I care about most and the one most resistant to clean attribution because organic distribution has a 30 to 90 day delay between first touch and conversion.

      What I can measure: weekly output count, channel-level engagement (replies, upvotes, click-throughs to flowly.run), referrer reports from Umami, signup rate from each referrer over a 30-day window. The "hybrid-output conversion roughly 10x AI-only conversion" split I implied in the post comes from looking at click-to-signup on referrers I can tag and from comparing drafts shipped at 30% edit versus drafts shipped at 0% edit during one bad week.

      What I cannot measure: the long-tail compound effect of consistent presence. A founder who saw 6 of my HN comments over 3 months and then signed up via a direct visit shows up as "organic, no referrer." That is the bulk of my signups. I assume the volume helps. The numbers agree but do not prove it.

      If you build this stack, set up Umami or PostHog before you start, tag every link with UTM params, and accept that you will be flying half-blind on the long tail. That is the nature of organic distribution at this scale. Anyone selling you cleaner attribution is selling you fiction.
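
      A minimal sketch of the link tagging and the per-referrer signup rate, using only the Python standard library. The function names, the default `utm_medium` and `utm_campaign` values, and the shape of the visit and signup counts are assumptions for illustration, not an Umami or PostHog export format.

```python
from typing import Dict
from urllib.parse import parse_qsl, urlencode, urlparse, urlunparse


def tag_link(
    url: str,
    source: str,
    medium: str = "social",
    campaign: str = "organic",
) -> str:
    """Add UTM parameters to a URL, keeping any query string it already has."""
    parts = urlparse(url)
    query = dict(parse_qsl(parts.query))
    query.update(
        {"utm_source": source, "utm_medium": medium, "utm_campaign": campaign}
    )
    return urlunparse(parts._replace(query=urlencode(query)))


def signup_rate_by_source(
    visits: Dict[str, int], signups: Dict[str, int]
) -> Dict[str, float]:
    """Signups divided by visits per source; sources with no visits are skipped."""
    return {
        source: signups.get(source, 0) / count
        for source, count in visits.items()
        if count > 0
    }
```

      Direct visits with no referrer never pass through `tag_link`, so the long tail described above stays outside this calculation.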

  19. 2

    Really want to try this but I'm not a developer, no Python and definitely no Playwright. Is there a realistic version of this for non-technical founders, or is that a hard requirement? Would love to know where someone like me should even start.

    1. 1

      Start with two scripts, not five. The two with the highest leverage are (1) a daily analytics digest that summarizes your traffic into one email and (2) a draft pack generator for one channel only. Pick the channel that costs you the most time per output.

      You can build both in a weekend with Claude as your pair. The non-technical version is the same flow in n8n or Make.com. The pipeline matters more than the runtime. The hardest part is the voice doc, and that one you write by hand regardless.

  20. 2

    The journalist failure made me wince, so thanks for writing it up instead of pretending the stack just works. That's the useful part of these posts. Has anything else broken in a similar way since you patched it? Trying to get a realistic sense of the failure surface before I build my own version.

    1. 1

      Yes, smaller scars.

      One: the thread scanner once recommended a Bluesky thread for engagement. I shipped a reply. The thread turned out to be a quote-post of a tragedy. I deleted within 4 minutes but a few people saw it. Now the scanner has a "sensitive content" pre-flag and the day's top 5 candidates skip anything tagged.

      Two: the draft pack generator produced a reply that paraphrased a competitor's marketing copy almost word-for-word. I caught it because the closer was uncharacteristically smooth. The fix was a new line in the never-write list: never use any phrase that sounds like it was already used by a SaaS landing page.

      Three: the inbox monitor once scored a podcast booking request as 2/10 because the founder voice doc did not include a "podcast guesting" stance. I missed the email. The fix was adding an eighth stance, then immediately collapsing it back to seven by merging two adjacent ones.

      The pattern is the same: every failure produced one line of context engineering. The stack is mostly the accumulated failures of the founder, written down.
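
      A minimal sketch of the "sensitive content" pre-flag from the first failure, assuming each candidate already carries a boolean flag by the time the shortlist is built. How the flag gets set is not described above, so `Candidate`, `sensitive`, and `shortlist` are hypothetical names.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    url: str
    score: int
    sensitive: bool = False  # assumed to be set by an earlier tagging step


def shortlist(candidates: List[Candidate], k: int = 5) -> List[Candidate]:
    """Drop flagged candidates first, then keep the k best of what remains."""
    safe = [c for c in candidates if not c.sensitive]
    return sorted(safe, key=lambda c: c.score, reverse=True)[:k]
```

      Filtering before ranking means a flagged thread can never reach the top 5, however high it scores.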

  21. 2

    Has any platform actually flagged or deboosted your AI-assisted posts? That's honestly the one thing stopping me from setting this up.

    1. 1

      Not that I can detect. The 2026 detection heuristics target unedited LLM output with characteristic structural tells (em dash density, three-bullet conclusions, "Here is the thing about X" openers). The 30% human edit removes those tells. The posts that get traction are always the ones where my edit pass adds a number, a specific personal example, or a contrarian line the model did not generate.

  22. 2

    Maybe I missed it, but you reference these 7 stances over and over and never actually show what one looks like. That's genuinely the part I clicked through for. Can you break down the format of a single entry? And I'm curious why 7 specifically, since that feels oddly precise compared to just picking 5 or 10.

    1. 1

      The doc is one page. Seven entries. Each entry has the same four fields.

      1. Name. Two to four words. Mine include "single-tool stack undervalued," "AI removed design blocker not speed," "distribution is a feedback-signal problem," "founder voice is the asset."
      2. First sentence template. The opener I have used enough times that I can identify the stance from the first 8 words.
      3. Three bullets. The atomic claims this stance makes. Each bullet must be a complete idea, not a header.
      4. One example reply that nailed it. A real comment or post of mine, copied verbatim. The example is the calibration target for the model.

      The stance doc plus the never-write list is the entire context the LLM gets per task. Total prompt header: about 1,200 tokens. Per-request input adds another 300-800 depending on channel. Output 200-400. Cheap.

      The reason 7 works and 15 does not: I cannot hold 15 stances in my head consistently. The model can. But the human approval step at the end will fail if I cannot recognize my own stance in the draft. Seven is the ceiling of what I can recognize at a glance.
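
      One way the four-field entry could be represented, as a sketch only. The field names, the header layout, and the placeholder values are assumptions; only the four fields, the three-claim rule, and the ceiling of seven come from the description above.

```python
from dataclasses import dataclass
from typing import List

MAX_STANCES = 7  # the ceiling described above


@dataclass
class Stance:
    name: str
    opener_template: str
    claims: List[str]
    example_reply: str

    def __post_init__(self) -> None:
        if len(self.claims) != 3:
            raise ValueError("each stance carries exactly three claims")


def render_header(stances: List[Stance], never_write: List[str]) -> str:
    """Join the stance doc and the never-write list into one prompt header."""
    if len(stances) > MAX_STANCES:
        raise ValueError(f"at most {MAX_STANCES} stances")
    lines: List[str] = []
    for stance in stances:
        lines.append(f"STANCE: {stance.name}")
        lines.append(f"OPENER: {stance.opener_template}")
        lines.extend(f"- {claim}" for claim in stance.claims)
        lines.append(f"EXAMPLE: {stance.example_reply}")
        lines.append("")
    lines.append("NEVER WRITE:")
    lines.extend(f"- {rule}" for rule in never_write)
    return "\n".join(lines)


# Placeholder values, not real entries from the doc.
example = Stance(
    name="founder voice is the asset",
    opener_template="<opener template>",
    claims=["<claim 1>", "<claim 2>", "<claim 3>"],
    example_reply="<verbatim example reply>",
)
header = render_header([example], ["<never-write rule>"])
```

      Validating the counts at construction time keeps the doc from quietly growing past what the approval step can recognize.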

  23. 2

    Solid post, but I'm stuck on this part. I've used Apollo and Lemlist and they already handle outreach fine, so why hand-roll Python scripts for it? Genuinely asking what they were missing, because maintaining your own stack sounds like real overhead for a solo founder.

    1. 1

      I used both. For 4 months. They failed for the same reason most "AI distribution" SaaS products fail at indie scale: they are optimized for outbound B2B SDR workflows where the slow 30% is "personalized intro line" and the bulk 70% is "send 100 emails per day." My slow 30% is "decide whether this is worth shipping at all," which no SaaS exposes as a step. Those tools assume you already decided. I had not.

      The Python stack is about 200 lines total across the 5 scripts. It gives me a seam where the human picks live. SaaS tools paper over that seam and charge for it. The seam is the entire product.

  24. 2

    Nice writeup. Curious about the model side, are you running one model across the whole pipeline or swapping per step? And if you mix them, what made you land on that split instead of just defaulting to one provider?

    1. 1

      Haiku for the scanning and scoring steps where I am triaging 50-100 candidates per day. Throughput and cost matter more than voice there. GPT-4o for short-form draft generation (replies, posts, pitches) because it tracks instructions tighter on the 60-word constraint and stops adding extra paragraphs. Claude Sonnet for long-form blog drafts and journalist pitch responses where voice fidelity is the load-bearing metric.

      I rotate when a model starts drifting from my voice doc, which happens about every 8 weeks. The assignment above is correct as of this week. By next quarter it will probably look different.
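
      The split above can be kept in one table so a rotation is a one-line change. This is a sketch: the task keys are hypothetical, and the model labels are shorthand for the models named above, not real API model identifiers.

```python
# Task -> model label. Labels are shorthand, not provider model IDs.
ROUTES = {
    "scan": "haiku",
    "score": "haiku",
    "short_draft": "gpt-4o",
    "blog_draft": "sonnet",
    "pitch_response": "sonnet",
}


def model_for(task: str) -> str:
    """Look up the model for a task; fail loudly on an unknown task."""
    try:
        return ROUTES[task]
    except KeyError:
        raise ValueError(f"no model assigned to task {task!r}") from None
```

      Raising on an unknown task avoids silently falling back to a default model when a new step is added.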

  25. 2

    $19/mo total? I'm paying more than that for Lemlist alone. Where does that actually go?

    1. 1

      Roughly Claude Haiku for monitoring and scoring ($6), GPT-4o for thread relevance ranking and short-form drafts ($7), Claude Sonnet for blog first-drafts and pitch responses ($4), and about $1-2 in Playwright cloud minutes for form filling. Some months it nudges $24. It has not crossed $30.
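
      The line items above add up as follows. The dictionary keys are illustrative labels; the dollar figures are the ones quoted in the reply.

```python
# Monthly spend in USD, as quoted above.
model_costs = {"haiku": 6, "gpt-4o": 7, "sonnet": 4}
playwright_range = (1, 2)  # cloud minutes, low and high estimate

low = sum(model_costs.values()) + playwright_range[0]
high = sum(model_costs.values()) + playwright_range[1]
print(f"${low}-{high}/month")  # prints "$18-19/month"
```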

  26. 1

    The multi-tool problem nobody talks about: using Claude Code + Cursor + Copilot on the same project.

    Each one starts fresh. Each one has different defaults. Each one will make a different decision about the same architectural question — and none of them will tell you they disagree with the others.

    Six months in, the codebase reflects three different opinions about how to structure the same thing.

    The fix: a CLAUDE.md file (works as Cursor rules too) that defines the non-negotiables before any tool touches the code. Stack, patterns, what's forbidden. All tools read the same source of truth.

    It's not about which tool is better. It's about making them agree with each other.

    Anyone else running multiple AI tools on the same codebase? How do you keep them consistent?

    1. 1

      The CLAUDE.md as shared source of truth is the code equivalent of my never-write list. Same principle: the constraint lives upstream of the model, not inside it. Without it each tool optimizes locally and the codebase accumulates three silent opinions about the same problem.

      The distribution version of your problem is running the same pipeline across Claude, GPT-4o, and Haiku without a shared voice doc. Each model drifts toward its own defaults. The output sounds like three different founders. The fix is identical to yours — one document all models read before touching anything.

