I recently started building a manga translation tool, and I went in with a very naive assumption:
this should be straightforward.
Detect text → OCR → translate → put it back.
That’s it, right?
Turns out, almost every part of that assumption is wrong.
What surprised me most is this:
translation is not the hard part.
Modern models can already do a decent job translating Japanese → English.
The real problem is everything around it.
Because manga is not text.
It’s text embedded inside a visual system: irregular speech bubbles, page layout, reading order, typography.
If you get any of these wrong, the result might be technically correct…
but completely unreadable.
I ended up with a pipeline that looks roughly like this:
detect text regions → OCR → group lines into bubbles → translate → inpaint the original text → render the translation back.
On paper, this looks clean.
In reality, each step introduces new failure modes.
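To make those failure modes easier to isolate, it helps to treat each stage as a swappable function. This is only an illustrative sketch, not the tool’s actual API; the `Page` container and stage signatures are assumptions:

```python
from dataclasses import dataclass, field

@dataclass
class Page:
    """Accumulates the intermediate artifacts of one manga page."""
    image: object                                  # raw page image (placeholder type)
    regions: list = field(default_factory=list)    # detected text polygons
    lines: list = field(default_factory=list)      # OCR output per region
    bubbles: list = field(default_factory=list)    # grouped semantic units
    translations: list = field(default_factory=list)

def run_pipeline(page, detect, ocr, group, translate, inpaint, render):
    """Each stage is injected so it can be swapped or mocked independently,
    which makes it possible to test where errors actually cascade from."""
    page.regions = detect(page.image)
    page.lines = [ocr(page.image, r) for r in page.regions]
    page.bubbles = group(page.lines)
    page.translations = [translate(b) for b in page.bubbles]
    clean = inpaint(page.image, page.regions)
    return render(clean, page.bubbles, page.translations)
```

Structuring it this way also makes the cost discussion later concrete: each injected stage is a separate model call you can measure or skip.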
Typical OCR assumes rectangular text regions.
Manga doesn’t.
Text can be rotated, curved, or squeezed into irregular speech bubbles.
So instead of boxes, you need polygon-level detection.
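As a toy illustration of what polygon-level handling buys you: given a rotated text polygon, you can deskew it so its longest edge is horizontal before cropping for OCR. Real pipelines would use something like OpenCV’s `minAreaRect`; this pure-Python version is just a sketch of the idea:

```python
import math

def deskew_polygon(pts):
    """Rotate a text polygon about its centroid so that its longest edge
    becomes horizontal -- a cheap way to hand rotated manga text to an
    OCR model that expects upright rectangles."""
    # find the longest edge and the angle it makes with the x-axis
    edges = [(pts[i], pts[(i + 1) % len(pts)]) for i in range(len(pts))]
    (x0, y0), (x1, y1) = max(edges, key=lambda e: math.dist(e[0], e[1]))
    angle = math.atan2(y1 - y0, x1 - x0)
    # rotate every vertex by -angle around the centroid
    cx = sum(x for x, _ in pts) / len(pts)
    cy = sum(y for _, y in pts) / len(pts)
    cos, sin = math.cos(-angle), math.sin(-angle)
    return [
        (cx + (x - cx) * cos - (y - cy) * sin,
         cy + (x - cx) * sin + (y - cy) * cos)
        for x, y in pts
    ]
```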
Standard OCR tools struggle with vertical Japanese text and rotated or curved baselines.
Domain-specific models (like manga-trained OCR) perform much better,
but even then, errors cascade into later steps.
Detection gives you fragments.
But translation needs semantic units (a full speech bubble).
Naively grouping by distance fails.
What worked better for me was modeling text lines as a graph, with edges between lines that plausibly belong to the same bubble,
then extracting bubbles as connected components.
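A minimal version of that grouping, assuming axis-aligned line boxes `(x, y, w, h)`. The edge rule here (vertical neighbors that overlap horizontally) is an illustrative assumption, not a tuned criterion:

```python
from itertools import combinations

def group_lines(boxes, gap_factor=0.8):
    """Group OCR line boxes into bubbles: build a graph whose edges link
    lines that are vertically adjacent AND overlap horizontally, then
    take connected components.  gap_factor is an illustrative threshold."""
    n = len(boxes)
    adj = {i: [] for i in range(n)}
    for i, j in combinations(range(n), 2):
        (xi, yi, wi, hi), (xj, yj, wj, hj) = boxes[i], boxes[j]
        overlap = min(xi + wi, xj + wj) - max(xi, xj)   # horizontal overlap
        gap = max(yi, yj) - min(yi + hi, yj + hj)       # vertical gap between boxes
        if overlap > 0 and gap < gap_factor * min(hi, hj):
            adj[i].append(j)
            adj[j].append(i)
    # depth-first traversal -> connected components = bubbles
    seen, bubbles = set(), []
    for start in range(n):
        if start in seen:
            continue
        comp, stack = [], [start]
        seen.add(start)
        while stack:
            cur = stack.pop()
            comp.append(cur)
            for nb in adj[cur]:
                if nb not in seen:
                    seen.add(nb)
                    stack.append(nb)
        bubbles.append(sorted(comp))
    return bubbles
```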
Before rendering, you need to remove original text.
This sounds simple, but it’s not.
You’re asking a model to reconstruct the artwork behind the text: backgrounds, screentones, line work.
Bad inpainting is immediately noticeable.
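One small detail that helps regardless of the inpainting model: grow the text mask by a few pixels first, so anti-aliased stroke edges don’t survive as faint outlines. A pure-Python sketch of that dilation on a binary grid:

```python
def dilate_mask(mask, r=1):
    """Grow a binary text mask by r pixels (Chebyshev distance) before
    handing it to an inpainting model.  A pixel becomes 1 if any pixel
    in its (2r+1) x (2r+1) neighborhood was 1."""
    h, w = len(mask), len(mask[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if any(
                mask[yy][xx]
                for yy in range(max(0, y - r), min(h, y + r + 1))
                for xx in range(max(0, x - r), min(w, x + r + 1))
            ):
                out[y][x] = 1
    return out
```

In practice you’d do this with OpenCV or NumPy, but the operation itself is this simple.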
This is the part most people underestimate.
Putting translated text back into the image is not just "draw text".
You have to deal with:
length mismatch
(10 Japanese characters → 40 English characters)
rotation
(tilted dialogue in action scenes)
vertical typography
(which is not just rotated horizontal text)
If this step is wrong, everything else doesn’t matter.
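To make the length-mismatch problem concrete: one naive strategy is to shrink the point size until greedily wrapped text fits the bubble. The width model here (`char_aspect` × point size) is a crude stand-in for real font metrics, and all the parameter values are illustrative:

```python
import textwrap

def fit_text(text, bubble_w, bubble_h, max_pt=24, min_pt=8,
             char_aspect=0.55, leading=1.2):
    """Return (point_size, wrapped_lines) such that the block fits the
    bubble, trying the largest size first.  Glyph width is approximated
    as char_aspect * point size; leading is line height in em."""
    for pt in range(max_pt, min_pt - 1, -1):
        cols = max(1, int(bubble_w / (char_aspect * pt)))
        lines = textwrap.wrap(text, width=cols)
        if len(lines) * pt * leading <= bubble_h:
            return pt, lines
    # nothing fits: fall back to the minimum size and let it overflow
    return min_pt, textwrap.wrap(text, width=max(1, int(bubble_w / (char_aspect * min_pt))))
```

A real typesetter also has to handle hyphenation, rotation, and vertical layout, but even this toy version shows why ten Japanese characters becoming forty English ones is a layout problem, not a translation problem.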
I thought I was building a translation tool.
But I’m actually building a reading tool.
The real question is not:
"Is this sentence translated correctly?"
It’s:
"Does this still feel like a manga page?"
That changed how I approached everything.
A few things that turned out more important than expected:
Accuracy vs readability
A slightly imperfect translation that fits the bubble > perfect translation that breaks layout
Cost vs UX
Multimodal pipelines are expensive
Every extra step (upscale, inpaint, render) has real cost implications
Latency vs quality
Users don’t want to wait 30 seconds per page
I built a small browser-based tool to experiment with these ideas:
https://mangatranslator.me
Still early, but it’s been a good way to explore what actually matters in this problem space.
If you’ve worked on OCR, typesetting, or document layout,
I’d be really interested in how you’d approach the rendering problem.
Feels like that’s where most tools still fall apart.