Hey Indie Hackers,
I'm a solo founder based in Europe, building tools at the intersection of computer vision and enterprise AI. After months of heads-down development, I'm excited to launch OCR Producer.
GitHub: https://github.com/alrowilde/ocr-producer
Check the GitHub link for side-by-side visual comparisons: You'll see how the engine handles the character-level degradation and complicated layouts that traditional augmentations simply can't mimic.
OCR Producer is a synthetic data engine designed to solve some of the toughest pain points in real-world Document AI. To be honest, I built it primarily for OCR tasks, but the modules are designed independently, so the architecture is easy to extend to VLM-specific tasks.
The Pain Point: Small Text Is Critical, But Always Under-Represented

In enterprise OCR (invoices, medical forms, logistics labels, financial reports), there's a nasty imbalance that quietly kills model performance:
Most fields are long text: Company names, addresses, descriptions – easy to spot, plenty of examples in every document.
But the critical fields are tiny: quantities (e.g., "1", "*", or "#"), unit prices, tax rates, counts in small fonts. Many critical categories consist of single characters or very short text, often printed in 6-12pt fonts, sometimes with noise or occlusion.
In real datasets, small text instances are ridiculously rare – maybe 5-10 per document vs 50+ long fields. When you train a model (whether traditional OCR like PaddleOCR/Tesseract or fine-tuning a VLM), the model learns to nail the easy stuff and consistently fails on the stuff that actually matters most.
How OCR Producer Solves It
I built OCR Producer as an on-premise synthetic data engine that lets you generate unlimited, high-fidelity training samples specifically targeted at these failures.
Key features that address the small text problem:
Realistic business layouts: Invoices, tables, medical forms, multi-column reports – with merged cells, noise, blur, and distorted characters.
On-premise & privacy-safe: Runs 100% locally. No uploading sensitive templates or data to the cloud.
Bonus: Strong CJK support: If you've ever tried OCR on dense East Asian forms, you know the pain. This engine was stress-tested there first.
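To give a sense of the control surface, here's a hypothetical generation config for the features above. The field names are purely illustrative, not OCR Producer's actual API:

```python
# Hypothetical generation config -- field names are illustrative,
# not the engine's real API.
config = {
    "layout": {
        "templates": ["invoice", "medical_form", "multi_column_report"],
        "merged_cells": True,
    },
    "fonts": {
        # Skew the size distribution toward tiny text (sizes in points)
        "size_weights": {6: 0.25, 8: 0.25, 10: 0.2, 12: 0.15, 14: 0.15},
        "families": ["Noto Sans CJK SC", "Arial", "Courier New"],
    },
    "degradations": {
        "gaussian_blur_sigma": (0.0, 1.5),   # range sampled per sample
        "contrast_range": (0.4, 1.0),
        "occlusion_prob": 0.1,
    },
}
```

The point of a config like this is letting a team match their own domain: crank `size_weights` toward 6-8pt for invoice quantity fields, or widen the blur range for scanned medical forms.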
The results
You can balance your dataset overnight: turn 100 real invoices into 10,000+ synthetic ones with perfect small-text representation, fully labeled.
Instead of waiting on data engineers to process data for every batch, data production becomes a fully parallel process.
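The parallelism falls out naturally because every synthetic sample is independent. A minimal sketch (the `make_sample` stub stands in for the real rendering engine):

```python
from concurrent.futures import ThreadPoolExecutor
import random

def make_sample(seed: int) -> dict:
    """Render one fully labeled synthetic sample (stub for the real engine)."""
    rng = random.Random(seed)
    return {
        "id": seed,
        "font_pt": rng.choice([6, 8, 10, 12]),  # deliberately oversample tiny fonts
        "label": str(rng.randint(0, 9)),        # e.g. a single-digit quantity field
    }

def generate(n: int, workers: int = 8) -> list:
    # Samples are independent, so generation parallelizes trivially;
    # swap in a process pool when rendering is CPU-bound.
    with ThreadPoolExecutor(max_workers=workers) as ex:
        return list(ex.map(make_sample, range(n)))
```

Labels come for free because the generator knows exactly what it drew, which is what makes "10,000 samples overnight" plausible.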
Quick Results I've Seen
In my own tests (using open-source PaddleOCR baselines):
Small text detection becomes far more stable. Overall detection of key business fields improves significantly, though limited layout diversity can still impose some constraints.
Overall document precision improved by 10-15% compared to the baseline PaddleOCR model in dense-text invoice scenarios.
If you're building in FinTech, Healthcare, Logistics, or any regulated space where VLM/OCR precision actually moves the bottom line, I'd love to chat.
I've built this to a point where the core engine is robust (40k+ lines of code, solo-developed) and the architecture allows for rapid debugging and error tracing.
Now, I’m looking to bring this technology into a larger ecosystem. Whether it’s a strategic partnership or joining a team to lead their Document AI efforts, I’m open to discussing how this engine can be integrated into your production pipeline.
Thanks for reading – excited to be part of this community!
If you have any questions, please reach out to me at [email protected]
40k lines to solve small-text VLM/OCR — that's the kind of deep technical moat that's actually defensible. Synthetic data generation for edge cases is genuinely hard and the kind of thing that takes years to get right.
One thing I'd ask: what does the prompt architecture look like for the VLM inference side? In my experience, OCR+VLM pipelines often have sloppy output format instructions, which causes downstream parsing failures. I built flompt to tackle this at the prompt layer — a visual builder with 12 semantic blocks, including a dedicated output_format block that makes the expected structure explicit and machine-parseable. Would be curious whether structured prompts reduce your post-processing error rate.

A ⭐ on github.com/Nyrok/flompt would mean a lot — solo open-source founder here 🙏
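To illustrate the underlying idea (this is not flompt's actual API, just a minimal version of an explicit output contract): the prompt states the exact JSON shape, and the parser fails loudly instead of letting malformed output leak downstream.

```python
import json

# Hypothetical output contract -- key names are made up for illustration.
OUTPUT_FORMAT = (
    "Return ONLY a JSON object with exactly these keys:\n"
    '{"quantity": "<string>", "unit_price": "<string>", "tax_rate": "<string>"}'
)

def build_prompt(ocr_text: str) -> str:
    return f"Extract the fields from this invoice text:\n{ocr_text}\n\n{OUTPUT_FORMAT}"

def parse_response(raw: str) -> dict:
    # Fail loudly on contract violations instead of silent downstream errors.
    data = json.loads(raw)
    missing = {"quantity", "unit_price", "tax_rate"} - data.keys()
    if missing:
        raise ValueError(f"missing keys: {missing}")
    return data
```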
Appreciate the thoughtful comment!
Just gave flompt a star ⭐ — the visual block builder looks clean, will add it to my next project.
Thanks again man, and good luck with flompt!
Impressive scope — synthetic data generation at that scale is genuinely hard, especially for small text where the rendering pipeline variations matter enormously.
One thing that might be useful as you look at next challenges: the fine-tuning instability problem is closely related to data quality at the tail. When we were running QLoRA training on Mistral-7B, the gradient norm spikes we kept hitting traced back partly to low-diversity edge cases in the data distribution, not just the clipping threshold. A robust synthetic data pipeline like yours could actually be a meaningful upstream fix to what people usually treat as a training stability problem.
Built a tool (CRMA Fine-Tuner) focused on the gradient stability side of this — happy to swap notes if you ever explore the fine-tuning angle.
Spot on. There are tons of open-source OCR models out there, but small text is still a beast—you tweak one config, and other things break.
I’ve been stuck in the same "labeling quality" rabbit hole for ages; it usually kills training stability in OCR.
CRMA Fine-Tuner looks like a very solid approach to the problem —the tool is great.
To be honest, I’ve just spent the last few days grinding on a PPTX 😂. I'm looking for some enterprise-scale scenarios to put my idea into practice too.
It’s tough to prove anything without serious hardware, though. It feels like we’re both in that same "cold start" boat.🥲🥲🥲
Very cool. The small‑text imbalance is real. One thing that helped us was eval on a tiny “hard” set (6–10pt, low‑contrast, skew) to track gains separately from overall OCR. Do you expose controls for font distributions / noise profiles so a team can match their domain? Also curious if you’ve tried pairing synthetic with a small real fine‑tune set to avoid synthetic overfitting.
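The separate hard-set tracking can be sketched like this (the `font_pt` / `contrast` tags are assumptions about how your labels are annotated):

```python
def bucketed_accuracy(samples):
    """Track the 'hard' subset (tiny or low-contrast text) separately from
    the easy subset, so overall gains can't hide small-text regressions.
    The font_pt / contrast tags are assumed to exist in your labels."""
    buckets = {"hard": [0, 0], "easy": [0, 0]}  # key -> [correct, total]
    for s in samples:
        hard = s["font_pt"] <= 10 or s["contrast"] < 0.3
        key = "hard" if hard else "easy"
        buckets[key][1] += 1
        buckets[key][0] += int(s["pred"] == s["gt"])
    return {k: (c / t if t else None) for k, (c, t) in buckets.items()}
```

Reporting both numbers per checkpoint makes it obvious when a model is coasting on long easy fields while tiny fields stall.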
Yes, small-text imbalance is brutal, and tracking a separate “hard” eval set (6–10pt, low contrast, skew etc.) is a great way to measure real progress. Font size / family are fully controllable as a dedicated module.
I’ve handled it with CJK and Arabic — those scripts really punish weak synthesis.
On pure-synthetic overfitting:
I’ve done the extreme version — train 100% synthetic, zero real data, then test on real Chinese invoices.
The results:
In-domain layouts & fields (including tiny text) -> often beats baseline PaddleOCR.
Completely out of domain images -> drops hard (classic domain gap).
Rare new patterns in edge positions that never appeared in synthetic images -> can fail worse than PaddleOCR.
My fix is just cranking diversity from the start: heavy layout randomization plus a wide range of degradations (low-contrast tiny fonts, occlusions, blur, color jitter, lighting, compression, etc.). So far this keeps small critical fields solid in target domains without much overfitting pain.
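A rough sketch of that degradation menu on a grayscale page image; the parameter ranges here are illustrative, not the engine's actual values:

```python
import numpy as np

def degrade(img: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Apply a random slice of the degradation menu to a grayscale page
    (uint8 values in [0, 255]). Ranges are illustrative, not the real config."""
    out = img.astype(np.float32)
    # Low contrast: squeeze pixel values toward the page mean
    c = rng.uniform(0.4, 1.0)
    out = (out - out.mean()) * c + out.mean()
    # Sensor / print noise with a randomly sampled strength
    out += rng.normal(0.0, rng.uniform(0.0, 12.0), out.shape)
    # Occasional rectangular occlusion (stamp, smudge, tape)
    if rng.random() < 0.2:
        h, w = out.shape
        y, x = rng.integers(0, h // 2), rng.integers(0, w // 2)
        out[y:y + h // 4, x:x + w // 4] = rng.uniform(0, 255)
    return np.clip(out, 0, 255).astype(np.uint8)
```

Sampling every knob per image is what keeps the distribution wide enough that the model can't memorize one rendering style.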
With PaddleOCR, the risk feels lower anyway, since training runs on 640x640 crops from the source image rather than the whole page.
Have you experienced any overfitting in your model training? How did you handle it?