Hey Indie Hackers,
I'm a solo founder based in Europe, building tools at the intersection of computer vision and enterprise AI. After months of heads-down development, I'm excited to launch OCR Producer.
GitHub: https://github.com/alrowilde/ocr-producer
Check the GitHub link for side-by-side visual comparisons: You'll see how the engine handles the character-level degradation and complicated layouts that traditional augmentations simply can't mimic.
OCR Producer is a synthetic data engine designed to solve some of the toughest pain points in real-world Document AI. To be honest, I built it primarily for OCR tasks, but the modules are designed independently, so the architecture is easy to extend to VLM-specific tasks.
The Pain Point: Small Text Is Critical, But Always Under-Represented

In enterprise OCR (invoices, medical forms, logistics labels, financial reports), there's a nasty imbalance that quietly kills model performance:
Most fields are long text: Company names, addresses, descriptions – easy to spot, plenty of examples in every document.
But the critical fields are tiny: quantities (e.g., "1", "*", or "#"), unit prices, tax rates, counts in small fonts. Many critical categories consist of single characters or very short text, often printed in 6-12pt fonts, sometimes with noise or occlusion.
In real datasets, small text instances are ridiculously rare – maybe 5-10 per document vs 50+ long fields. When you train a model (whether traditional OCR like PaddleOCR/Tesseract or fine-tuning a VLM), the model learns to nail the easy stuff and consistently fails on the stuff that actually matters most.
How OCR Producer Solves It
I built OCR Producer as an on-premise synthetic data engine that lets you generate unlimited, high-fidelity training samples specifically targeted at these failures.
Key features that address the small text problem:
Realistic business layouts: Invoices, tables, medical forms, multi-column reports – with merged cells, noise, blur, and distorted characters.
On-premise & privacy-safe: Runs 100% locally. No uploading sensitive templates or data to the cloud.
Bonus: Strong CJK support: If you've ever tried OCR on dense East Asian forms, you know the pain. This engine was stress-tested there first.
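To give a sense of the control surface, here's a hypothetical generation config for the features above. The field names are purely illustrative, not OCR Producer's actual API:

```python
# Hypothetical generation config -- field names are illustrative,
# not the engine's real API.
config = {
    "layout": {
        "templates": ["invoice", "medical_form", "multi_column_report"],
        "merged_cells": True,
    },
    "fonts": {
        # Skew the size distribution toward tiny text (sizes in points)
        "size_weights": {6: 0.25, 8: 0.25, 10: 0.2, 12: 0.15, 14: 0.15},
        "families": ["Noto Sans CJK SC", "Arial", "Courier New"],
    },
    "degradations": {
        "gaussian_blur_sigma": (0.0, 1.5),   # range sampled per sample
        "contrast_range": (0.4, 1.0),
        "occlusion_prob": 0.1,
    },
}
```

The point of a config like this is letting a team match their own domain: crank `size_weights` toward 6-8pt for invoice quantity fields, or widen the blur range for scanned medical forms.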
The results
You can balance your dataset overnight: turn 100 real invoices into 10,000+ synthetic ones with perfect small-text representation, fully labeled.
Instead of waiting on data engineers to process data for every batch, data production becomes a fully parallel process.
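The parallelism falls out naturally because every synthetic sample is independent. A minimal sketch (the `make_sample` stub stands in for the real rendering engine):

```python
from concurrent.futures import ThreadPoolExecutor
import random

def make_sample(seed: int) -> dict:
    """Render one fully labeled synthetic sample (stub for the real engine)."""
    rng = random.Random(seed)
    return {
        "id": seed,
        "font_pt": rng.choice([6, 8, 10, 12]),  # deliberately oversample tiny fonts
        "label": str(rng.randint(0, 9)),        # e.g. a single-digit quantity field
    }

def generate(n: int, workers: int = 8) -> list:
    # Samples are independent, so generation parallelizes trivially;
    # swap in a process pool when rendering is CPU-bound.
    with ThreadPoolExecutor(max_workers=workers) as ex:
        return list(ex.map(make_sample, range(n)))
```

Labels come for free because the generator knows exactly what it drew, which is what makes "10,000 samples overnight" plausible.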
Quick Results I've Seen
In my own tests (using open-source PaddleOCR baselines):
Small text detection becomes far more stable. Overall detection of key business fields improves significantly, though limited layout diversity can still impose some constraints.
Overall document precision improved by 10-15% compared to the baseline PaddleOCR model in dense-text invoice scenarios.
If you're building in FinTech, Healthcare, Logistics, or any regulated space where VLM/OCR precision actually moves the bottom line, I'd love to chat.
I've built this to a point where the core engine is robust (40k+ lines of code, solo-developed) and the architecture allows for rapid debugging and error tracing.
Now, I’m looking to bring this technology into a larger ecosystem. Whether it’s a strategic partnership or joining a team to lead their Document AI efforts, I’m open to discussing how this engine can be integrated into your production pipeline.
Thanks for reading – excited to be part of this community!
If you have any questions, please reach out to me at [email protected]
40k lines to solve small-text VLM/OCR — that's the kind of deep technical moat that's actually defensible. Synthetic data generation for edge cases is genuinely hard and the kind of thing that takes years to get right.
One thing I'd ask: what does the prompt architecture look like for the VLM inference side? In my experience, OCR+VLM pipelines often have sloppy output format instructions, which causes downstream parsing failures. I built flompt to tackle this at the prompt layer — a visual builder with 12 semantic blocks, including a dedicated output_format block that makes the expected structure explicit and machine-parseable. Would be curious whether structured prompts reduce your post-processing error rate.

A ⭐ on github.com/Nyrok/flompt would mean a lot — solo open-source founder here 🙏
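To illustrate the underlying idea (this is not flompt's actual API, just a minimal version of an explicit output contract): the prompt states the exact JSON shape, and the parser fails loudly instead of letting malformed output leak downstream.

```python
import json

# Hypothetical output contract -- key names are made up for illustration.
OUTPUT_FORMAT = (
    "Return ONLY a JSON object with exactly these keys:\n"
    '{"quantity": "<string>", "unit_price": "<string>", "tax_rate": "<string>"}'
)

def build_prompt(ocr_text: str) -> str:
    return f"Extract the fields from this invoice text:\n{ocr_text}\n\n{OUTPUT_FORMAT}"

def parse_response(raw: str) -> dict:
    # Fail loudly on contract violations instead of silent downstream errors.
    data = json.loads(raw)
    missing = {"quantity", "unit_price", "tax_rate"} - data.keys()
    if missing:
        raise ValueError(f"missing keys: {missing}")
    return data
```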
Appreciate the thoughtful comment!
Just gave flompt a star ⭐ — the visual block builder looks clean, will add it to my next project.
Thanks again man, and good luck with flompt!
Impressive scope — synthetic data generation at that scale is genuinely hard, especially for small text where the rendering pipeline variations matter enormously.
One thing that might be useful as you look at next challenges: the fine-tuning instability problem is closely related to data quality at the tail. When we were running QLoRA training on Mistral-7B, the gradient norm spikes we kept hitting traced back partly to low-diversity edge cases in the data distribution, not just the clipping threshold. A robust synthetic data pipeline like yours could actually be a meaningful upstream fix to what people usually treat as a training stability problem.
Built a tool (CRMA Fine-Tuner) focused on the gradient stability side of this — happy to swap notes if you ever explore the fine-tuning angle.
Spot on. There are tons of open-source OCR models out there, but small text is still a beast—you tweak one config, and other things break.
I’ve been stuck in the same "labeling quality" rabbit hole for ages; it usually kills training stability in OCR.
CRMA Fine-Tuner looks like a very solid approach to the problem —the tool is great.
To be honest, I’ve just spent the last few days grinding on a PPTX 😂. I'm looking for some enterprise-scale scenarios to put my idea into practice too.
It’s tough to prove anything without serious hardware, though. It feels like we’re both in that same "cold start" boat.🥲🥲🥲
Very cool. The small‑text imbalance is real. One thing that helped us was eval on a tiny “hard” set (6–10pt, low‑contrast, skew) to track gains separately from overall OCR. Do you expose controls for font distributions / noise profiles so a team can match their domain? Also curious if you’ve tried pairing synthetic with a small real fine‑tune set to avoid synthetic overfitting.
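The separate hard-set tracking can be sketched like this (the `font_pt` / `contrast` tags are assumptions about how your labels are annotated):

```python
def bucketed_accuracy(samples):
    """Track the 'hard' subset (tiny or low-contrast text) separately from
    the easy subset, so overall gains can't hide small-text regressions.
    The font_pt / contrast tags are assumed to exist in your labels."""
    buckets = {"hard": [0, 0], "easy": [0, 0]}  # key -> [correct, total]
    for s in samples:
        hard = s["font_pt"] <= 10 or s["contrast"] < 0.3
        key = "hard" if hard else "easy"
        buckets[key][1] += 1
        buckets[key][0] += int(s["pred"] == s["gt"])
    return {k: (c / t if t else None) for k, (c, t) in buckets.items()}
```

Reporting both numbers per checkpoint makes it obvious when a model is coasting on long easy fields while tiny fields stall.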
Yes, small-text imbalance is brutal, and tracking a separate “hard” eval set (6–10pt, low contrast, skew etc.) is a great way to measure real progress. Font size / family are fully controllable as a dedicated module.
I’ve handled it with CJK and Arabic — those scripts really punish weak synthesis.
On pure-synthetic overfitting:
I’ve done the extreme version — train 100% synthetic, zero real data, then test on real Chinese invoices.
The results:
In-domain layouts & fields (including tiny text) -> often beats baseline PaddleOCR.
Completely out of domain images -> drops hard (classic domain gap).
Rare new patterns in edge positions that never appeared in synthetic images -> can fail worse than PaddleOCR.
My fix is just cranking diversity from the start: heavy layout randomization plus a wide range of degradations (low-contrast tiny fonts, occlusions, blur, color jitter, lighting, compression, etc.). So far this keeps small critical fields solid in target domains without much overfitting pain.
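A rough sketch of that degradation menu on a grayscale page image; the parameter ranges here are illustrative, not the engine's actual values:

```python
import numpy as np

def degrade(img: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Apply a random slice of the degradation menu to a grayscale page
    (uint8 values in [0, 255]). Ranges are illustrative, not the real config."""
    out = img.astype(np.float32)
    # Low contrast: squeeze pixel values toward the page mean
    c = rng.uniform(0.4, 1.0)
    out = (out - out.mean()) * c + out.mean()
    # Sensor / print noise with a randomly sampled strength
    out += rng.normal(0.0, rng.uniform(0.0, 12.0), out.shape)
    # Occasional rectangular occlusion (stamp, smudge, tape)
    if rng.random() < 0.2:
        h, w = out.shape
        y, x = rng.integers(0, h // 2), rng.integers(0, w // 2)
        out[y:y + h // 4, x:x + w // 4] = rng.uniform(0, 255)
    return np.clip(out, 0, 255).astype(np.uint8)
```

Sampling every knob per image is what keeps the distribution wide enough that the model can't memorize one rendering style.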
With PaddleOCR, the risk feels lower anyway, since training runs on 640x640 crops from the source image rather than the whole page.
Have you experienced any overfitting in your model training? How did you handle it?