
Built OCR for PDF table extraction - need people to test on messy scanned documents

Hey IH 👋

Shipped OCR to Data Extractio after 17 hours of debugging coordinate systems.

The technical problem:
Had to map 4 different coordinate systems (UI, PDF, pdfplumber, OCR) to make two extraction engines work together. One wrong conversion = garbage data.
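
For anyone curious what that mapping involves, here is a minimal sketch, not the actual Data Extractio code. It assumes the usual conventions: PDF user space has its origin at the bottom-left with y pointing up (units are points, 1/72 inch), pdfplumber measures from the top-left in points, OCR boxes are pixels of an image rendered at a known DPI (top-left origin), and the UI shows the page scaled by a zoom factor. The `Box` type, the function names, and the `zoom` parameter are hypothetical, and page rotation and crop-box offsets are ignored.

```python
from dataclasses import dataclass

POINTS_PER_INCH = 72.0  # PDF points per inch


@dataclass(frozen=True)
class Box:
    """Axis-aligned rectangle as (x0, y0, x1, y1) in some coordinate system."""
    x0: float
    y0: float
    x1: float
    y1: float


def ui_to_plumber(box: Box, zoom: float) -> Box:
    """UI pixels (top-left origin) -> pdfplumber points (top-left origin).

    Assumes the viewer draws the page at `zoom` UI pixels per PDF point.
    """
    return Box(box.x0 / zoom, box.y0 / zoom, box.x1 / zoom, box.y1 / zoom)


def ocr_to_plumber(box: Box, dpi: float) -> Box:
    """OCR image pixels (top-left origin) -> pdfplumber points."""
    def to_points(v: float) -> float:
        return v * POINTS_PER_INCH / dpi
    return Box(to_points(box.x0), to_points(box.y0),
               to_points(box.x1), to_points(box.y1))


def plumber_to_pdf(box: Box, page_height: float) -> Box:
    """Top-left origin -> PDF bottom-left origin (same units, y flipped).

    The top and bottom edges swap roles so that y0 < y1 still holds.
    """
    return Box(box.x0, page_height - box.y1, box.x1, page_height - box.y0)


def pdf_to_plumber(box: Box, page_height: float) -> Box:
    """PDF bottom-left origin -> top-left origin (the flip is its own inverse)."""
    return Box(box.x0, page_height - box.y1, box.x1, page_height - box.y0)


if __name__ == "__main__":
    page_height = 792.0  # US Letter height in points
    selection_ui = Box(108, 216, 540, 432)      # dragged in the UI at 150% zoom
    word_ocr = Box(300, 600, 1500, 1200)        # OCR box on a 300 DPI render
    print(ui_to_plumber(selection_ui, zoom=1.5))
    print(ocr_to_plumber(word_ocr, dpi=300))
    print(plumber_to_pdf(ocr_to_plumber(word_ocr, dpi=300), page_height))
```

The point of funnelling everything through one shared system (here pdfplumber-style points) is that each new source only needs one conversion, and a forgotten y-flip or DPI scale shows up as a box that lands in the wrong place on the page.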

Tech decision:
Almost used Docling (95% accuracy) but it would've killed what I think is my competitive moat. Chose OCRmyPDF (85-90% accuracy) instead because it preserves:

  • Interactive table selection
  • Per-PDF pricing advantage. Most AI tools require you to process the whole file, with no way to pay only for the parts you actually need. That translates into higher processing costs.

Lesson: "Good enough + unique" > "Better + commodity"

What works now:
✅ Scanned invoices (clean scans)
✅ Bank statements
✅ Standard business documents
✅ PDF, JPG, PNG formats
✅ English, Spanish, French, German
✅ Multi-column header adjustment

Limitations (upfront honesty):
❌ No handwritten notes support
❌ Blurry scans will struggle

What I need:
I'd like to hear about your use case so I can improve the tool.

Try it: dataextractio.com (15 free credits)

Tell me if it works for your use case.
Honest metrics: 0 paying customers, 17 hours invested.

Who wants to test? 👇
X profile: @MartCervt

Posted to Show IH on May 7, 2026
  1. 2

    The moat is probably not the OCR accuracy layer.

    Most users cannot judge the difference between 88% and 95% extraction before trying it anyway.

    The moat is more likely the feeling of control during extraction.

    Almost every OCR tool feels like:
    “upload file → pray → export mess.”

    Your “interactive table selection” angle is the first thing here that actually sounds operational instead of magical.

    That is probably the layer worth pushing harder.

    Also, “Data Extractio” is costing trust right now.
    It sounds temporary / unfinished exactly when you need people trusting document accuracy.

    A cleaner infrastructure-style name would carry this much better if the product expands beyond OCR utilities.

    Exirra.com would fit this category very naturally.

    1. 1

      Thanks for the suggestions. Yes, I will definitely push beyond the current features of my app. The next step is to build a Zapier integration and add an email workflow.

      1. 1

        That makes sense.

        Zapier and email workflows are useful, but they also make the naming problem more important.

        Once this moves beyond OCR into document workflows, “Data Extractio” starts feeling even more temporary.

        The product is becoming:
        upload document
        select the right table
        extract clean data
        send it into workflows

        That is much more serious than a basic OCR utility.

        So I’d be careful not to let the product grow while the name still feels unfinished.

  2. 2

    "Good enough + unique > better + commodity" — that's the cleanest version of this rule I've seen written down. Made the same call going Supabase RLS over building custom auth, and it's the diff between shipping in week 3 vs week 13.

    One Q on pricing: per-PDF means your highest-value users are also your highest-cost users. Have you modeled unit economics at >50 PDFs/mo? Curious if it still pencils or if you cap.

    1. 1

      I ran the cost calculations, so I know exactly how much processing is going to cost me. I optimize the servers (on GCP), and only a few users on the Starter subscription, the ones who run a lot of extractions per PDF, will make me lose money.
