Built OCR for PDF table extraction - need people to test on messy scanned documents

Hey IH 👋

Shipped OCR to Data Extractio after 17 hours debugging coordinate systems.

The technical problem:
Had to map 4 different coordinate systems (UI, PDF, pdfplumber, OCR) to make two extraction engines work together. One wrong conversion = garbage data.
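
To make the conversions concrete, here is a minimal sketch, not the actual Data Extractio code. It relies on two things that are standard: PDF user space has its origin at the bottom-left with 72 points per inch, and pdfplumber reports boxes as (x0, top, x1, bottom) in points measured from the top-left. Everything else is an assumption for illustration: the `Box` type, the function names, the idea that the UI draws the page at `zoom` pixels per point, and that OCR boxes are pixels on an image rendered at a known DPI.

```python
from dataclasses import dataclass

POINTS_PER_INCH = 72.0  # PDF user space unit


@dataclass(frozen=True)
class Box:
    """Top-left-origin box, same field order pdfplumber uses."""
    x0: float
    top: float
    x1: float
    bottom: float


def ocr_pixels_to_points(box: Box, dpi: float) -> Box:
    """OCR pixel box (top-left origin, image rendered at `dpi`) -> points."""
    return Box(
        box.x0 * POINTS_PER_INCH / dpi,
        box.top * POINTS_PER_INCH / dpi,
        box.x1 * POINTS_PER_INCH / dpi,
        box.bottom * POINTS_PER_INCH / dpi,
    )


def ui_to_points(box: Box, zoom: float) -> Box:
    """UI selection box (assumed `zoom` screen pixels per point) -> points."""
    return Box(box.x0 / zoom, box.top / zoom, box.x1 / zoom, box.bottom / zoom)


def points_to_pdf(box: Box, page_height: float) -> tuple[float, float, float, float]:
    """Top-left-origin points -> PDF user space (bottom-left origin, y grows up).

    Returns (x0, y0, x1, y1) with y0 < y1.
    """
    return (box.x0, page_height - box.bottom, box.x1, page_height - box.top)


# A table selected on a 300 DPI scan of a US Letter page (792 pt tall).
ocr_box = Box(300, 600, 900, 1200)
as_points = ocr_pixels_to_points(ocr_box, dpi=300)
as_pdf = points_to_pdf(as_points, page_height=792)
```

The design choice in this sketch is to pick one canonical system (top-left points) and convert everything into it first, so each engine needs only one conversion in and one out. Forgetting the y-flip or the DPI scale is exactly the kind of single wrong conversion that produces garbage.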

Tech decision:
Almost used Docling (95% accuracy) but it would've killed what I think is my competitive moat. Chose OCRmyPDF (85-90% accuracy) instead because it preserves:

Interactive table selection
Per-PDF pricing advantage. Most AI tools require you to process the whole file, with no way to pay only for the parts you actually need. That translates into higher processing costs.

Lesson: "Good enough + unique" > "Better + commodity"

What works now:
✅ Scanned invoices (clean scans)
✅ Bank statements
✅ Standard business documents
✅ PDF, JPG, PNG formats
✅ English, Spanish, French, German
✅ Multi-column header adjustment

Limitations (upfront honesty):
❌ No handwritten notes support
❌ Struggles with blurry scans

What I need:
I would like to hear your use case to improve the tool.

Try it: dataextractio.com (15 free credits)

Tell me if it works for your use case.
Honest metrics: 0 paying customers, 17 hours invested.

Who wants to test? 👇
X profile: @MartCervt

Martin Cervantes

posted to

Show IH

on May 7, 2026


2

The moat is probably not the OCR accuracy layer.

Most users cannot judge the difference between 88% and 95% extraction before trying it anyway.

The moat is more likely the feeling of control during extraction.

Almost every OCR tool feels like:
“upload file → pray → export mess.”

Your “interactive table selection” angle is the first thing here that actually sounds operational instead of magical.

That is probably the layer worth pushing harder.

Also, “Data Extractio” is costing trust right now.
It sounds temporary / unfinished exactly when you need people trusting document accuracy.

A cleaner infrastructure-style name would carry this much better if the product expands beyond OCR utilities.

Exirra.com would fit this category very naturally.

aryan_sinh

·
8 days ago
1. 1
  
  Thanks for the suggestions. Yes, I will definitely push beyond the app's current features. The next step is to build a Zapier integration and add an email workflow.
  
  Martcervantes
  
  ·
  7 days ago
  1. 1
    
    That makes sense.
    
    Zapier and email workflows are useful, but they also make the naming problem more important.
    
    Once this moves beyond OCR into document workflows, “Data Extractio” starts feeling even more temporary.
    
    The product is becoming:
    upload document
    select the right table
    extract clean data
    send it into workflows
    
    That is much more serious than a basic OCR utility.
    
    So I’d be careful not to let the product grow while the name still feels unfinished.
    
    aryan_sinh
    
    ·
    7 days ago
2

"Good enough + unique > better + commodity" — that's the cleanest version of this rule I've seen written down. Made the same call going Supabase RLS over building custom auth, and it's the diff between shipping in week 3 vs week 13.

One Q on pricing: per-PDF means your highest-value users are also your highest-cost users. Have you modeled unit economics at >50 PDFs/mo? Curious if it still pencils or if you cap.

edifierxuhao

·
9 days ago
1. 1
  
  I ran the cost calculations, so I know exactly how much processing is going to cost me. I optimize the servers (on GCP), and only a few users on the Starter subscription, the ones who run a lot of extractions per PDF, will make me lose money.
  
  Martcervantes
  
  ·
  8 days ago