Building an AI manga translator: why "just OCR + translate" doesn’t work

I recently started building a manga translation tool, and I went in with a very naive assumption:

this should be straightforward.

Detect text → OCR → translate → put it back.

That’s it, right?

Turns out, almost every part of that assumption is wrong.

The real problem isn’t translation

What surprised me most is this:

translation is not the hard part.

Modern models can already do a decent job translating Japanese → English.

The real problem is everything around it.

Because manga is not text.

It’s text embedded inside a visual system:

  • speech bubbles
  • vertical layout
  • stylized fonts
  • background textures

If you get any of these wrong, the result might be technically correct…
but completely unreadable.

The pipeline (and why each step breaks)

I ended up with a pipeline that looks roughly like this:

  1. Upscaling (for low-res scans)
  2. Text detection
  3. OCR
  4. Textline merging
  5. Translation
  6. Inpainting
  7. Rendering

On paper, this looks clean.

In reality, each step introduces new failure modes.
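
To make the cascade concrete, here is a minimal sketch of the pipeline as an ordered list of stages sharing one page state. It is not the tool's actual code: `PageState`, `run_pipeline`, and the placeholder stage functions are hypothetical names made up for illustration. The only point is that a failure in an early stage (here, OCR) stops everything after it.

```python
from dataclasses import dataclass, field


@dataclass
class PageState:
    """Hypothetical container for everything the stages share for one page."""
    image: object = None                           # raw page pixels
    regions: list = field(default_factory=list)    # detected text polygons
    lines: list = field(default_factory=list)      # recognized text lines
    log: list = field(default_factory=list)        # (stage name, status) pairs


def run_pipeline(state, stages):
    """Run stages in order; stop at the first failure and record where it happened."""
    for name, stage in stages:
        try:
            state = stage(state)
            state.log.append((name, "ok"))
        except Exception as exc:
            state.log.append((name, f"failed: {exc}"))
            break
    return state


def passthrough(state):
    """Placeholder stage that does nothing (stands in for a real model call)."""
    return state


def broken_ocr(state):
    """Placeholder stage that simulates an OCR failure."""
    raise ValueError("no text recognized")


# Stage order follows the list above; every implementation here is a stub.
STAGES = [
    ("upscale", passthrough),
    ("detect", passthrough),
    ("ocr", broken_ocr),
    ("merge", passthrough),
    ("translate", passthrough),
    ("inpaint", passthrough),
    ("render", passthrough),
]

result = run_pipeline(PageState(), STAGES)
for entry in result.log:
    print(entry)
```

In this sketch a failed stage simply aborts the page. A real tool would more likely degrade gracefully, for example by skipping a single bubble instead of the whole page.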

1. Detection is not bounding boxes

Typical OCR assumes rectangular text regions.

Manga text rarely fits them.

Text can be rotated, curved, or squeezed into irregular speech bubbles.

So instead of boxes, you need polygon-level detection.
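
A small sketch of why boxes lose information, assuming a single tilted text line modeled as a rotated rectangle. The helper names (`polygon_area`, `bounding_box`, `rotated_rect`) and the sizes are illustrative assumptions, not part of any detection library. It compares the area of the text polygon with the area of the axis-aligned box around it.

```python
import math


def polygon_area(points):
    """Area of a simple polygon via the shoelace formula."""
    total = 0.0
    n = len(points)
    for i in range(n):
        x1, y1 = points[i]
        x2, y2 = points[(i + 1) % n]
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0


def bounding_box(points):
    """Axis-aligned box (x0, y0, x1, y1) that encloses all points."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return min(xs), min(ys), max(xs), max(ys)


def rotated_rect(cx, cy, w, h, angle_deg):
    """Corners of a w-by-h rectangle centered at (cx, cy), rotated by angle_deg."""
    a = math.radians(angle_deg)
    cos_a, sin_a = math.cos(a), math.sin(a)
    corners = [(-w / 2, -h / 2), (w / 2, -h / 2), (w / 2, h / 2), (-w / 2, h / 2)]
    return [(cx + x * cos_a - y * sin_a, cy + x * sin_a + y * cos_a) for x, y in corners]


# A 200x40 text line tilted by 30 degrees (made-up numbers).
quad = rotated_rect(100, 100, 200, 40, 30)
x0, y0, x1, y1 = bounding_box(quad)
box_area = (x1 - x0) * (y1 - y0)
fill_ratio = polygon_area(quad) / box_area
print(f"text covers {fill_ratio:.0%} of its bounding box")
```

With these numbers the text fills only about a third of its box. The rest is artwork or neighboring text that a box-based pipeline would send to OCR and later erase.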

2. Generic OCR fails badly

Standard OCR tools struggle with:

  • vertical text
  • stylized fonts
  • low-contrast backgrounds

Domain-specific models (like manga-trained OCR) perform much better,
but even then, errors cascade into later steps.

3. Text grouping is a graph problem

Detection gives you fragments.

But translation needs semantic units (a full speech bubble).

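Naively grouping by distance fails, because lines from two neighboring bubbles can sit closer together than lines inside the same bubble.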

What worked better for me was modeling text lines as a graph:

  • nodes = text segments
  • edges = spatial / alignment similarity

Then extracting connected components.
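
The idea can be sketched as follows, assuming vertical text where each detected line is a tall column box. The `Line` type, the edge rule in `connected`, and its thresholds (1.5x size ratio, one character width of gap, 50% vertical overlap) are illustrative assumptions, not the values used in the tool. Components are extracted with a small union-find.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Line:
    """One detected vertical text column as an axis-aligned box (hypothetical type)."""
    x: float
    y: float
    w: float
    h: float


def horizontal_gap(a, b):
    """Empty horizontal space between two boxes (0 if they overlap in x)."""
    return max(0.0, max(a.x, b.x) - min(a.x + a.w, b.x + b.w))


def vertical_overlap(a, b):
    """Shared vertical extent as a fraction of the shorter box."""
    top = max(a.y, b.y)
    bottom = min(a.y + a.h, b.y + b.h)
    return max(0.0, bottom - top) / min(a.h, b.h)


def connected(a, b):
    """Edge rule: similar character size, small gap, and aligned vertically."""
    size = min(a.w, b.w)  # column width approximates character size
    similar = max(a.w, b.w) / size <= 1.5
    return similar and horizontal_gap(a, b) <= size and vertical_overlap(a, b) >= 0.5


def group_lines(lines):
    """Return connected components as sorted lists of line indices."""
    parent = list(range(len(lines)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(lines)):
        for j in range(i + 1, len(lines)):
            if connected(lines[i], lines[j]):
                parent[find(i)] = find(j)

    groups = {}
    for i in range(len(lines)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Two bubbles: three columns on top, two columns lower down and just to the right.
page = [
    Line(100, 50, 20, 200), Line(130, 50, 20, 200), Line(160, 50, 20, 200),
    Line(190, 240, 20, 150), Line(220, 240, 20, 150),
]
print(group_lines(page))
```

Columns 2 and 3 are only 10 px apart, so a pure distance threshold would merge the two bubbles. The overlap check keeps them apart, and column 0 still joins column 2 through column 1 even though those two are not directly linked.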

4. Inpainting is underrated

Before rendering, you need to remove the original text.

This sounds simple, but it’s not.

You’re asking a model to reconstruct:

  • screentones
  • cross-hatching
  • background patterns

Bad inpainting is immediately noticeable.
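
One small piece of this step that is easy to sketch is the mask handed to the inpainting model. A common approach (an assumption here, not something confirmed for this tool) is to grow the text mask by a few pixels so anti-aliased glyph edges do not survive as ghosting. The `dilate` function below is a plain-Python illustration on a nested-list mask. A real pipeline would use an image library for this.

```python
def dilate(mask, radius=1):
    """Grow every set pixel of a binary mask into a (2*radius+1) square."""
    height, width = len(mask), len(mask[0])
    out = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            if not mask[y][x]:
                continue
            # Mark the neighborhood, clamped to the image borders.
            for ny in range(max(0, y - radius), min(height, y + radius + 1)):
                for nx in range(max(0, x - radius), min(width, x + radius + 1)):
                    out[ny][nx] = 1
    return out


# A 5x5 mask with a single "text" pixel in the middle.
text_mask = [[0] * 5 for _ in range(5)]
text_mask[2][2] = 1

grown = dilate(text_mask, radius=1)
for row in grown:
    print("".join("#" if v else "." for v in row))
```

The radius is a tradeoff: too small leaves glyph outlines behind, too large forces the model to reconstruct more screentone than necessary.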

5. Rendering is the hardest part

This is the part most people underestimate.

Putting translated text back into the image is not just "draw text".

You have to deal with:

  • length mismatch
    (10 Japanese characters → 40 English characters)

  • rotation
    (tilted dialogue in action scenes)

  • vertical typography
    (which is not just rotated horizontal text)

If this step is wrong, nothing else matters.
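
For the length mismatch specifically, a minimal sketch is to search for the largest font size at which the wrapped translation still fits the bubble's box. Everything here is an assumption for illustration: `wrap` and `fit_text` are made-up helpers, and glyph width is approximated as `0.6 * size` with a line height of `1.2 * size`, where a real renderer would measure with actual font metrics and fit against the bubble polygon instead of a rectangle.

```python
def wrap(words, max_chars):
    """Greedy word wrap; a word longer than max_chars gets a line of its own."""
    lines, current = [], ""
    for word in words:
        candidate = word if not current else current + " " + word
        if len(candidate) <= max_chars or not current:
            current = candidate
        else:
            lines.append(current)
            current = word
    if current:
        lines.append(current)
    return lines


def fit_text(text, box_w, box_h, max_size=32, min_size=8,
             char_ratio=0.6, line_ratio=1.2):
    """Return (font_size, lines) for the largest size that fits, or None."""
    words = text.split()
    for size in range(max_size, min_size - 1, -1):
        max_chars = int(box_w // (size * char_ratio))
        if max_chars < 1:
            continue
        lines = wrap(words, max_chars)
        fits_width = all(len(line) <= max_chars for line in lines)
        fits_height = len(lines) * size * line_ratio <= box_h
        if fits_width and fits_height:
            return size, lines
    return None


fitted = fit_text("I never thought it would end like this", box_w=120, box_h=100)
if fitted is not None:
    size, lines = fitted
    print(size)
    print("\n".join(lines))
```

When `fit_text` returns `None`, the text cannot fit at any readable size. That is the point where a shorter translation becomes the better fix than a smaller font.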

The biggest realization

I thought I was building a translation tool.

But I’m actually building a reading tool.

The real question is not:

"Is this sentence translated correctly?"

It’s:

"Does this still feel like a manga page?"

That changed how I approached everything.

Tradeoffs I didn’t expect

A few things that turned out more important than expected:

  • Accuracy vs readability
    A slightly imperfect translation that fits the bubble beats a perfect translation that breaks the layout

  • Cost vs UX
    Multimodal pipelines are expensive
    Every extra step (upscale, inpaint, render) adds to the per-page cost

  • Latency vs quality
    Users don’t want to wait 30 seconds per page

Where I ended up

I built a small browser-based tool to experiment with these ideas:
https://mangatranslator.me

Still early, but it’s been a good way to explore what actually matters in this problem space.

Curious how others would approach this

If you’ve worked on:

  • OCR
  • layout reconstruction
  • multimodal pipelines

I’d be really interested in how you’d approach the rendering problem.

Feels like that’s where most tools still fall apart.

Posted to AI Tools on March 24, 2026