
Most AI agents fail because of "Dirty Data". Here’s how to fix the pipeline.

I’ve seen a lot of cool AI tools launching lately, but many struggle with hallucinations or poor accuracy. The reason? They are consuming messy HTML instead of structured data.

In my 5 years of Python automation, I’ve realized that building a 'Clean-Feed' pipeline is more important than the LLM prompt itself. I recently optimized a scraper for an AI research tool that:

Bypassed complex bot protections.

Stripped noise (JS/CSS) in real-time.

Delivered 100% RAG-ready markdown.

The result was a 40% increase in agent accuracy and much lower token costs. If you're building an AI agent and hitting data quality bottlenecks, I’d love to share how I structure these scraping engines.
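A minimal sketch of the "strip noise, deliver markdown" step described above, using only the Python standard library. This is an illustration under assumptions, not the actual scraper: the `MarkdownExtractor` class and `html_to_markdown` function are hypothetical names, and only headings, paragraphs, and list items are kept (text outside those tags is dropped).

```python
from html.parser import HTMLParser


class MarkdownExtractor(HTMLParser):
    """Drops <script>/<style>/<noscript> content and emits
    headings, paragraphs and list items as simple Markdown."""

    SKIP = {"script", "style", "noscript"}
    PREFIX = {"h1": "# ", "h2": "## ", "h3": "### ", "li": "- ", "p": ""}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.skip_depth = 0   # > 0 while inside a skipped tag
        self.blocks = []      # finished Markdown blocks
        self.current = None   # (prefix, [text parts]) while inside a block tag

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1
        elif tag in self.PREFIX and self.skip_depth == 0:
            self._flush()
            self.current = (self.PREFIX[tag], [])

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth > 0:
            self.skip_depth -= 1
        elif tag in self.PREFIX:
            self._flush()

    def handle_data(self, data):
        # Text is kept only inside a block tag and outside skipped tags.
        if self.skip_depth == 0 and self.current is not None:
            self.current[1].append(data)

    def _flush(self):
        if self.current is not None:
            prefix, parts = self.current
            text = " ".join("".join(parts).split())  # collapse whitespace
            if text:
                self.blocks.append(prefix + text)
            self.current = None


def html_to_markdown(html: str) -> str:
    parser = MarkdownExtractor()
    parser.feed(html)
    parser.close()
    parser._flush()
    return "\n\n".join(parser.blocks)
```

Calling `html_to_markdown(page_html)` returns blank-line-separated Markdown blocks. A production pipeline would likely need tables, links, and nested lists, which this sketch ignores.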

#Python #AI #Automation #WebScraping

Posted to AI Tools on May 15, 2026
  1. 1

    Garbage automation is the ultimate truth for agentic workflows right now. Saving this for the next time someone asks me why their expensive LLM agent keeps losing its mind over a basic spreadsheet.

  2. 1

    This resonates — data quality is genuinely the unglamorous work that determines whether an AI agent is useful or just impressive in a demo. We ran into exactly this building a coaching pipeline where the input data is behavioral and transactional rather than scraped HTML, but the same principle holds: garbage in, hallucination out regardless of how good your prompt is.
    The 40% accuracy improvement tracks with what we’ve seen too. Curious how you’re handling the real-time stripping at scale — are you doing that in the scraping layer itself or as a preprocessing step before it hits the context window? We ended up with a separate normalization stage and it made a meaningful difference in token efficiency.

    1. 1

      Spot on. The 'Ferrari with the handbrake on' problem usually boils down to feeding raw, messy spreadsheets straight into the LLM context window. Models are smart, but context noise kills them.

      I run an automation workflow where my agentic stack (using tools like Cline) focuses heavily on a dedicated preprocessing/data-cleaning layer before the LLM even touches the data. It brought down token costs and completely stopped the agent from losing its mind over tabular structure.

      If your agency is currently bottlenecked by cleaning up client spreadsheets and raw HTML data for your AI pipelines, I’d love to take that data-cleaning overhead off your plate on a contract basis. Let me know if you want to swap notes or need an extra hand with the pipeline infrastructure!

    2. 1

      Spot on, Mike! 'Garbage in, hallucination out' is the absolute truth when dealing with LLMs. To answer your question: I prefer handling the initial stripping and extraction right in the scraping layer using custom parsers (like BeautifulSoup/Cheerio) to filter out raw HTML noise early on.

      However, for large-scale operations, a dedicated preprocessing/normalization pipeline (just like you mentioned) is a game-changer. It not only saves massive token costs but also ensures the context window is strictly focused on high-density data. Glad to see someone else validating that 40% efficiency mark!

  3. 1

    Spot on. I see too many teams trying to fix garbage input with complex prompt engineering, which is a losing battle. Raw HTML is a massive token drain.

    You mentioned delivering 100% RAG-ready markdown. While Markdown is excellent for standard semantic search, I've found that when dealing with autonomous agents executing specific tool calls, enforcing a strict JSON schema extraction yields much lower hallucination rates than Markdown. Have you experimented with forcing the pipeline to output validated JSON instead, or does your specific agent architecture prefer the Markdown structure?

    1. 1

      Spot on, Adham! You hit the nail on the head. Garbage in, garbage out: no amount of prompt engineering saves you from raw HTML bloat.

      To answer your question: Yes, absolutely. For autonomous agents running function calling or multi-step tool execution, structured JSON is king. It completely eliminates parsing ambiguity.

      In our current architecture, we actually use a hybrid approach. We extract 100% clean Markdown as the base semantic layer for the broader knowledge retrieval (RAG), but when it comes to the agent's action-triggering layer, we enforce strict Pydantic/JSON schemas to lock down tool arguments.
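      A minimal sketch of what locking down tool arguments can look like. Assumptions: the `update_status` tool, its fields, and the allowed statuses are hypothetical, and plain standard-library validation is used here as a stand-in for the Pydantic models mentioned above.

```python
import json
from dataclasses import dataclass

# Hypothetical set of allowed values, for illustration only.
ALLOWED_STATUSES = {"open", "won", "lost"}


@dataclass(frozen=True)
class UpdateStatusArgs:
    """Arguments for a hypothetical `update_status` tool."""
    record_id: int
    status: str


def parse_tool_args(raw: str) -> UpdateStatusArgs:
    """Parse and strictly validate the JSON a model produced for a tool call.

    Raises ValueError on malformed JSON, missing or extra keys,
    wrong types, or a status outside the allowed set.
    """
    data = json.loads(raw)  # json.JSONDecodeError is a ValueError subclass
    if not isinstance(data, dict):
        raise ValueError("tool arguments must be a JSON object")
    expected = {"record_id", "status"}
    if set(data) != expected:
        raise ValueError(f"expected keys {sorted(expected)}, got {sorted(data)}")
    record_id, status = data["record_id"], data["status"]
    # bool is a subclass of int, so reject it explicitly.
    if not isinstance(record_id, int) or isinstance(record_id, bool):
        raise ValueError("record_id must be an integer")
    if not isinstance(status, str) or status not in ALLOWED_STATUSES:
        raise ValueError(f"status must be one of {sorted(ALLOWED_STATUSES)}")
    return UpdateStatusArgs(record_id=record_id, status=status)
```

      The tool only ever runs on an `UpdateStatusArgs` instance, so a malformed model output fails loudly at the boundary instead of reaching the tool.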

      It really depends on the workflow: Markdown gives the LLM great contextual layout for reading, but JSON is the glue for execution. Have you tried combining both layers, or do you strictly feed raw JSON into your agent's context?

      1. 1

        The hybrid approach makes sense, but there is a hard constraint to watch out for: feeding both layers into the same context window is a trap.

        Passing heavy Markdown alongside strict Pydantic schemas to a single agent invites context pollution. The larger the Markdown payload, the more likely the LLM degrades its JSON adherence or drops tool arguments due to attention drift.

        The most reliable way to scale this is strict isolation using a two-node setup:

        The Reader (Semantic): Ingests the Markdown from RAG and synthesizes the context into a structured JSON state.

        The Executor (Action): Ingests only that JSON state to fire the tool. Zero Markdown enters the execution context.

        Decoupling the semantic reading from the tool execution is the cleanest way to avoid that bottleneck. Glad to see the space is converging on these same architectural challenges.
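        The two-node isolation above can be sketched roughly as follows. Assumptions: `reader_node`, `executor_node`, `fake_summarize`, and `create_ticket` are hypothetical names, and `fake_summarize` is a deterministic stand-in for the LLM call that would turn Markdown into a JSON state in a real system.

```python
import json
from typing import Callable, Dict


def reader_node(markdown: str, summarize: Callable[[str], dict]) -> str:
    """Reader: sees the Markdown, returns only a compact JSON state string."""
    state = summarize(markdown)
    return json.dumps(state, sort_keys=True)


def executor_node(state_json: str, tools: Dict[str, Callable[..., object]]) -> object:
    """Executor: sees only the JSON state; Markdown never reaches this function."""
    state = json.loads(state_json)
    tool = tools[state["tool"]]
    return tool(**state["args"])


def fake_summarize(markdown: str) -> dict:
    """Stand-in for an LLM call: uses the first '# ' heading as a ticket title."""
    title = next(
        (line[2:].strip() for line in markdown.splitlines() if line.startswith("# ")),
        "",
    )
    return {"tool": "create_ticket", "args": {"title": title}}


def create_ticket(title: str) -> str:
    """Hypothetical tool the executor is allowed to call."""
    return f"ticket created: {title}"
```

        The only thing crossing the boundary is the JSON string returned by `reader_node`, so the executor's context stays small no matter how large the Markdown payload was.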

        1. 1

          Spot on, Adham! You hit the nail on the head. Forcing an LLM to balance heavy semantic interpretation (Markdown) and strict structural constraints (Pydantic/JSON) in a single pass is a recipe for attention drift.

          Decoupling into a two-node architecture, separating the 'Reader' from the 'Executor', is absolutely the cleanest way to scale. It completely protects tool adherence and keeps the execution context predictable. Love this architectural breakdown, thanks for adding such high-value insight to the thread!

  4. 1

    Solid point on data quality, though a "40% increase in accuracy" without a baseline or methodology is a number I'd take with a grain of salt. What were you measuring and how?

    1. 1

      You’re 100% right to take unbaselined numbers with a grain of salt; vanilla marketing metrics don’t work in engineering.

      To clarify, when we talk about a 40% efficiency or accuracy bump in agentic workflows after fixing data schemas, the baseline is typically measured against standard raw LLM parsing (e.g., feeding unstructured, un-normalized CSVs/Notion exports straight into a GPT-4o or Claude 3.5 Sonnet context window).

      The measurement methodology splits into three core vectors:

      Token Efficiency & Noise Reduction: Stripping structural anomalies reduces token bloat by up to 30-40%, meaning fewer context windows are wasted on error-handling or parsing syntax.

      Deterministic vs. Probabilistic Fields: When an agent looks for a "Status" field in a structured JSON schema versus an unpredictable text string, hallucination rates drop drastically. We measure the reduction in failed agent tool-calls (failed loops).

      F1-Score on Extraction: Running a custom Python evaluation script (using fuzzy matching or exact schema validation) to check how many scraped entities actually map correctly to the destination database on the first pass.

      Without a strict relational schema, an LLM agent is basically guessing the context. Clean architecture turns a 60% success rate into a predictable 95%+ execution.
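      As a rough illustration of the third vector, here is a minimal F1 computation over extracted entities using exact matching. Assumptions: `extraction_f1` is a hypothetical helper, entities are compared as plain hashable values, and fuzzy matching is left out for brevity.

```python
def extraction_f1(predicted: set, expected: set) -> dict:
    """Exact-match precision, recall and F1 between two sets of entities."""
    true_positives = len(predicted & expected)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(expected) if expected else 0.0
    if precision + recall == 0:
        f1 = 0.0
    else:
        f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}
```

      Running it on the same labeled sample before and after the cleaning step gives a baseline and a comparison point, which is what makes an accuracy claim checkable.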

      1. 1

        Looks great. Cheers.

  5. 1

    The dirty data problem hits solopreneurs hardest because they rarely have a defined data model before they start building.

    A solo founder's 'CRM' is often a mix: some leads in a spreadsheet, some in email threads, some in a Notion page that hasn't been updated since Q1. When an AI agent tries to work with that, it's not just dirty - it's structurally incoherent. The agent can't find the information it needs because the information was never organized to be found.

    The fix isn't just cleaning the data - it's defining the schema first. What is a 'contact'? What fields matter? How does a deal relate to a project? Once you have a consistent relational model (even in something as simple as linked Notion databases), AI tooling on top becomes dramatically more reliable. Garbage in, garbage out applies 10x harder to agentic systems because the failures compound across steps.
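    To make "defining the schema first" concrete, here is a minimal sketch using SQLite from the Python standard library. Assumptions: the `contact`, `project`, and `deal` tables, their columns, and the stage values are hypothetical examples, not a recommended CRM model.

```python
import sqlite3

# Hypothetical three-table model: a deal links one contact to one project.
SCHEMA = """
CREATE TABLE IF NOT EXISTS contact (
    id    INTEGER PRIMARY KEY,
    name  TEXT NOT NULL,
    email TEXT NOT NULL UNIQUE
);
CREATE TABLE IF NOT EXISTS project (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL
);
CREATE TABLE IF NOT EXISTS deal (
    id         INTEGER PRIMARY KEY,
    contact_id INTEGER NOT NULL REFERENCES contact(id),
    project_id INTEGER NOT NULL REFERENCES project(id),
    stage      TEXT NOT NULL CHECK (stage IN ('lead', 'proposal', 'won', 'lost'))
);
"""


def open_crm(path: str = ":memory:") -> sqlite3.Connection:
    """Open a database with the schema applied and foreign keys enforced."""
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves this off by default
    conn.executescript(SCHEMA)
    return conn
```

    With the constraints in place, a deal that points at a missing contact or uses an unknown stage is rejected at write time, so an agent reading the tables never sees that kind of inconsistency.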

    1. 1

      Spot on! "Structurally incoherent" is the perfect way to describe it. Most solopreneurs build their tech stack like a jigsaw puzzle with missing pieces, and then expect an AI agent to magically solve it.

      Working heavily with Python automation and data extraction, I see this failure loop daily. If your data doesn't have a clean, predictable schema, the LLM or agent spends 80% of its tokens just trying to parse the noise, leading to hallucinated context and compounded failures.

      The real shift happens when you move from "cleaning data" to "architecting pipelines." Even if it’s just a properly normalized Postgres instance or strictly linked Notion dbs, a rigid schema is what transforms an AI agent from a buggy chatbot into a hyper-reliable worker. Garbage structure = garbage automation.

    2. 1

      Spot on! You hit the exact root cause.

      Most solopreneurs think prompt engineering or buying a more expensive AI model will fix their agent's failures. But the reality is: Garbage Infrastructure = Garbage Output.

      When an AI agent runs into a structurally incoherent Notion page or an unformatted spreadsheet, its context window gets cluttered with noise, leading to massive hallucinations and compounded errors across steps.

      My approach when fixing this for founders:
      Before connecting any LLM or AI agent, I use Python scripts to parse their messy data sources (emails, spreadsheets, Notion) and map them into a strict, unified schema first. Even turning a flat spreadsheet into a clean relational structure with proper links makes the AI agent 10x more reliable and saves dollars on API token waste.
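      A minimal sketch of that mapping step, assuming each source has already been parsed into a list of dictionaries. The `Contact` fields, the key aliases in `NAME_KEYS` and `EMAIL_KEYS`, and the function names are hypothetical; real sources will need their own alias lists.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Contact:
    name: str
    email: str
    source: str


# Hypothetical column/property names seen across different sources.
NAME_KEYS = ("name", "Name", "full_name", "Contact")
EMAIL_KEYS = ("email", "Email", "E-mail", "email_address")


def _first(record: dict, keys: tuple) -> Optional[str]:
    """Return the first non-empty string value found under any of the keys."""
    for key in keys:
        value = record.get(key)
        if isinstance(value, str) and value.strip():
            return value.strip()
    return None


def to_contact(record: dict, source: str) -> Optional[Contact]:
    """Map one raw record to the unified schema, or None if it is unusable."""
    name = _first(record, NAME_KEYS)
    email = _first(record, EMAIL_KEYS)
    if name is None or email is None or "@" not in email:
        return None
    return Contact(name=" ".join(name.split()), email=email.lower(), source=source)


def unify(batches: dict) -> list:
    """Merge records from several sources, de-duplicating by email (first wins)."""
    seen = {}
    for source, records in batches.items():
        for record in records:
            contact = to_contact(record, source)
            if contact is not None and contact.email not in seen:
                seen[contact.email] = contact
    return list(seen.values())
```

      Records that cannot be mapped are dropped here for brevity. In practice it is probably worth logging them, since they show where the source data is weakest.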

      Defining the schema is the real "Human-in-the-Loop" job that unlocks proper automation. Love this insight!

  6. 1

    The "garbage in, garbage out" principle hits hard with AI agents. Cleaning HTML before feeding it to an LLM is something most people skip and then wonder why their outputs are unreliable. 40% accuracy increase just from better input is a strong number. How do you handle pages where the actual content is loaded dynamically via JavaScript — do you run a headless browser or have you found a lighter approach?

    1. 1

      Exactly! Clean data is the difference between a toy and a production grade AI agent.

      To answer your question: I avoid the 'Headless Browser' trap as much as possible because it's a resource hog. My first move is always to inspect the Network Tab and reverse engineer the internal API endpoints. Fetching raw JSON is 100x lighter than rendering a full DOM.

      If the site has heavy anti-bot protections or shadow DOMs, I skip Selenium and go for a Playwright + stealth plugin setup running in a container. It’s the only way to stay fast while bypassing those JS challenges.

      What's your stack for pre-processing? Are you using BeautifulSoup or a more custom regex-based cleaner?

      1. 1

        Smart approach with the Network Tab — reverse engineering internal APIs is way more efficient than rendering the full DOM. The Playwright + stealth plugin combo makes sense for heavy anti-bot sites. For pre-processing I've mostly seen custom pipelines work better than BeautifulSoup alone — regex for the initial strip, then a structured parser for the actual content extraction. The key is having a fallback chain so one method failing doesn't kill the whole pipeline.
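        A minimal sketch of such a fallback chain. Assumptions: `run_fallback_chain`, `article_only`, and `strip_all_tags` are hypothetical names, and the two regex-based extractors are deliberately simple placeholders for whatever parsers a real pipeline uses.

```python
import re
from typing import Callable, Iterable, Optional, Tuple

Extractor = Callable[[str], Optional[str]]


def run_fallback_chain(
    html: str, extractors: Iterable[Tuple[str, Extractor]]
) -> Tuple[str, str]:
    """Try each extractor in order; return (name, text) from the first that
    yields non-empty text. Raise RuntimeError if every extractor fails."""
    errors = []
    for name, extract in extractors:
        try:
            result = extract(html)
        except Exception as exc:  # one broken extractor must not kill the chain
            errors.append(f"{name}: {exc!r}")
            continue
        if result:
            return name, result
        errors.append(f"{name}: empty result")
    raise RuntimeError("all extractors failed: " + "; ".join(errors))


def article_only(html: str) -> Optional[str]:
    """Preferred path: keep only the text inside the first <article> element."""
    match = re.search(r"<article[^>]*>(.*?)</article>", html, re.S | re.I)
    if match is None:
        return None
    return " ".join(re.sub(r"<[^>]+>", " ", match.group(1)).split())


def strip_all_tags(html: str) -> str:
    """Last resort: drop script/style blocks, then every remaining tag."""
    text = re.sub(r"<(script|style)\b.*?</\1>", " ", html, flags=re.S | re.I)
    text = re.sub(r"<[^>]+>", " ", text)
    return " ".join(text.split())


CHAIN = [("article_only", article_only), ("strip_all_tags", strip_all_tags)]
```

        Returning the name of the extractor that succeeded makes it easy to track how often the pipeline is running on its fallback path.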

        1. 1

          Exactly! A robust fallback chain is what separates a script from a production-grade system. If the API endpoint changes, the Playwright stealth engine kicks in. That’s the only way to maintain 99% uptime in scraping.

          I actually use this exact stack to build Predictable B2B Pipelines. Most Lead Gen agencies fail because their scrapers break or get blocked, but with this technical approach, I can extract 'Intent-based' leads that others can't even see.

          Since you've got the tech stack figured out, are you using this for your own SaaS growth or providing it as a service? I’m currently scaling a few automated outreach engines and would love to swap notes on how you're handling the 'Signal-to-Noise' ratio once the data hits the LLM.

          1. 1

            Right now using it for my own SaaS growth — building an image processing API and the clean data pipeline mindset applies heavily to how I handle input validation and format detection. For signal-to-noise on the LLM side, the biggest win I've found is stripping everything before the data even touches the model — if you're feeding it raw scraped content you're burning tokens on noise. A pre-filter that scores content relevance before LLM processing cuts costs and improves output quality significantly. Would be interesting to compare approaches on the outreach engine side.
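            A minimal sketch of a relevance pre-filter of this kind. Assumptions: `relevance_score` and `prefilter` are hypothetical names, and the score is a plain keyword-overlap ratio chosen for illustration; the comment above does not say how its scoring works.

```python
import re
from typing import Iterable, List, Set


def relevance_score(text: str, keywords: Set[str]) -> float:
    """Fraction of the text's tokens that appear in the keyword set."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if token in keywords)
    return hits / len(tokens)


def prefilter(
    chunks: Iterable[str], keywords: Set[str], threshold: float = 0.1
) -> List[str]:
    """Keep only chunks whose score meets the threshold, before any LLM call."""
    return [chunk for chunk in chunks if relevance_score(chunk, keywords) >= threshold]
```

            The threshold of 0.1 is an arbitrary default. It would need tuning against real data, and an embedding-based score could replace the keyword ratio without changing the `prefilter` interface.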

            1. 1

              Valid points on the token burn. Raw scraped content is a goldmine for noise; feeding that directly to an LLM is a guaranteed way to exhaust API limits and tank accuracy. Love the pre-filter scoring approach; dropping low-relevance data upstream saves massive context window overhead.

              For my image processing pipelines, I find that enforcing strict structural validation right at the ingestion layer completely changes the game before any format detection even kicks in.

              On the outreach engine side, my approach is heavily data-driven. Instead of volume blasting, I build custom Python scrapers that monitor real-time triggers across X/Twitter and niche forums, auto-filter the noise using a strict criteria schema, and flag high-intent opportunities instantly. It keeps the signal high and the outreach hyper-personalized.

              I’d absolutely love to exchange notes and compare frameworks on this! Are you more active here, or can we connect on LinkedIn? Drop me your handle or feel free to DM me here!

              1. 1

                The real-time trigger monitoring across X and niche forums is a solid approach — way better than batch scraping. Most active here on IH for now. Feel free to DM me anytime, happy to swap notes on the filtering frameworks and signal scoring. Always good to connect with someone who actually thinks about data quality before throwing everything at an LLM.

                1. 1

                  Hey Mike, reaching out from the IH thread!

                  Really glad to connect with someone who understands the backend struggle of data density. I've been refining my ingestion pipeline to score signals in real-time, mainly because processing raw, unfiltered scraping dumps through LLMs is just burning API credits for zero ROI.

                  I’d love to hear how you guys are structuring your normalization stage. Are you running custom heuristics or using a lightweight model to filter out the noise before it hits the main LLM pipeline? Looking forward to swapping some notes!
