I recently started building a manga translation tool, and I went in with a very naive assumption:
this should be straightforward.
Detect text → OCR → translate → put it back.
That’s it, right?
Turns out, almost every part of that assumption is wrong.
What surprised me most is this:
translation is not the hard part.
Modern models can already do a decent job translating Japanese → English.
The real problem is everything around it.
Because manga is not text.
It’s text embedded inside a visual system: irregular speech bubbles, page layout, reading order, typography.
If you get any of these wrong, the result might be technically correct…
but completely unreadable.
I ended up with a pipeline that looks roughly like this:
detect text regions → OCR → group lines into bubbles → translate → inpaint the original text → render the translation back.
On paper, this looks clean.
In reality, each step introduces new failure modes.
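To make those failure modes easier to isolate, it helps to treat each stage as a swappable function. This is only an illustrative sketch, not the tool’s actual API; the `Page` container and stage signatures are assumptions:

```python
from dataclasses import dataclass, field

@dataclass
class Page:
    """Accumulates the intermediate artifacts of one manga page."""
    image: object                                  # raw page image (placeholder type)
    regions: list = field(default_factory=list)    # detected text polygons
    lines: list = field(default_factory=list)      # OCR output per region
    bubbles: list = field(default_factory=list)    # grouped semantic units
    translations: list = field(default_factory=list)

def run_pipeline(page, detect, ocr, group, translate, inpaint, render):
    """Each stage is injected so it can be swapped or mocked independently,
    which makes it possible to test where errors actually cascade from."""
    page.regions = detect(page.image)
    page.lines = [ocr(page.image, r) for r in page.regions]
    page.bubbles = group(page.lines)
    page.translations = [translate(b) for b in page.bubbles]
    clean = inpaint(page.image, page.regions)
    return render(clean, page.bubbles, page.translations)
```

Structuring it this way also makes the cost discussion later concrete: each injected stage is a separate model call you can measure or skip.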
Typical OCR assumes rectangular text regions.
Manga doesn’t.
Text can be rotated, curved, or squeezed into irregular speech bubbles.
So instead of boxes, you need polygon-level detection.
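As a toy illustration of what polygon-level handling buys you: given a rotated text polygon, you can deskew it so its longest edge is horizontal before cropping for OCR. Real pipelines would use something like OpenCV’s `minAreaRect`; this pure-Python version is just a sketch of the idea:

```python
import math

def deskew_polygon(pts):
    """Rotate a text polygon about its centroid so that its longest edge
    becomes horizontal -- a cheap way to hand rotated manga text to an
    OCR model that expects upright rectangles."""
    # find the longest edge and the angle it makes with the x-axis
    edges = [(pts[i], pts[(i + 1) % len(pts)]) for i in range(len(pts))]
    (x0, y0), (x1, y1) = max(edges, key=lambda e: math.dist(e[0], e[1]))
    angle = math.atan2(y1 - y0, x1 - x0)
    # rotate every vertex by -angle around the centroid
    cx = sum(x for x, _ in pts) / len(pts)
    cy = sum(y for _, y in pts) / len(pts)
    cos, sin = math.cos(-angle), math.sin(-angle)
    return [
        (cx + (x - cx) * cos - (y - cy) * sin,
         cy + (x - cx) * sin + (y - cy) * cos)
        for x, y in pts
    ]
```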
Standard OCR tools struggle with vertical Japanese text and rotated or curved baselines.
Domain-specific models (like manga-trained OCR) perform much better,
but even then, errors cascade into later steps.
Detection gives you fragments.
But translation needs semantic units (a full speech bubble).
Naively grouping by distance fails.
What worked better for me was modeling text lines as a graph, with edges between lines that plausibly belong to the same bubble,
then extracting bubbles as connected components.
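A minimal version of that grouping, assuming axis-aligned line boxes `(x, y, w, h)`. The edge rule here (vertical neighbors that overlap horizontally) is an illustrative assumption, not a tuned criterion:

```python
from itertools import combinations

def group_lines(boxes, gap_factor=0.8):
    """Group OCR line boxes into bubbles: build a graph whose edges link
    lines that are vertically adjacent AND overlap horizontally, then
    take connected components.  gap_factor is an illustrative threshold."""
    n = len(boxes)
    adj = {i: [] for i in range(n)}
    for i, j in combinations(range(n), 2):
        (xi, yi, wi, hi), (xj, yj, wj, hj) = boxes[i], boxes[j]
        overlap = min(xi + wi, xj + wj) - max(xi, xj)   # horizontal overlap
        gap = max(yi, yj) - min(yi + hi, yj + hj)       # vertical gap between boxes
        if overlap > 0 and gap < gap_factor * min(hi, hj):
            adj[i].append(j)
            adj[j].append(i)
    # depth-first traversal -> connected components = bubbles
    seen, bubbles = set(), []
    for start in range(n):
        if start in seen:
            continue
        comp, stack = [], [start]
        seen.add(start)
        while stack:
            cur = stack.pop()
            comp.append(cur)
            for nb in adj[cur]:
                if nb not in seen:
                    seen.add(nb)
                    stack.append(nb)
        bubbles.append(sorted(comp))
    return bubbles
```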
Before rendering, you need to remove original text.
This sounds simple, but it’s not.
You’re asking a model to reconstruct the artwork behind the text: backgrounds, screentones, line work.
Bad inpainting is immediately noticeable.
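One small detail that helps regardless of the inpainting model: grow the text mask by a few pixels first, so anti-aliased stroke edges don’t survive as faint outlines. A pure-Python sketch of that dilation on a binary grid:

```python
def dilate_mask(mask, r=1):
    """Grow a binary text mask by r pixels (Chebyshev distance) before
    handing it to an inpainting model.  A pixel becomes 1 if any pixel
    in its (2r+1) x (2r+1) neighborhood was 1."""
    h, w = len(mask), len(mask[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if any(
                mask[yy][xx]
                for yy in range(max(0, y - r), min(h, y + r + 1))
                for xx in range(max(0, x - r), min(w, x + r + 1))
            ):
                out[y][x] = 1
    return out
```

In practice you’d do this with OpenCV or NumPy, but the operation itself is this simple.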
This is the part most people underestimate.
Putting translated text back into the image is not just "draw text".
You have to deal with:
length mismatch
(10 Japanese characters → 40 English characters)
rotation
(tilted dialogue in action scenes)
vertical typography
(which is not just rotated horizontal text)
If this step is wrong, everything else doesn’t matter.
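To make the length-mismatch problem concrete: one naive strategy is to shrink the point size until greedily wrapped text fits the bubble. The width model here (`char_aspect` × point size) is a crude stand-in for real font metrics, and all the parameter values are illustrative:

```python
import textwrap

def fit_text(text, bubble_w, bubble_h, max_pt=24, min_pt=8,
             char_aspect=0.55, leading=1.2):
    """Return (point_size, wrapped_lines) such that the block fits the
    bubble, trying the largest size first.  Glyph width is approximated
    as char_aspect * point size; leading is line height in em."""
    for pt in range(max_pt, min_pt - 1, -1):
        cols = max(1, int(bubble_w / (char_aspect * pt)))
        lines = textwrap.wrap(text, width=cols)
        if len(lines) * pt * leading <= bubble_h:
            return pt, lines
    # nothing fits: fall back to the minimum size and let it overflow
    return min_pt, textwrap.wrap(text, width=max(1, int(bubble_w / (char_aspect * min_pt))))
```

A real typesetter also has to handle hyphenation, rotation, and vertical layout, but even this toy version shows why ten Japanese characters becoming forty English ones is a layout problem, not a translation problem.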
I thought I was building a translation tool.
But I’m actually building a reading tool.
The real question is not:
"Is this sentence translated correctly?"
It’s:
"Does this still feel like a manga page?"
That changed how I approached everything.
A few things that turned out more important than expected:
Accuracy vs readability
A slightly imperfect translation that fits the bubble > perfect translation that breaks layout
Cost vs UX
Multimodal pipelines are expensive
Every extra step (upscale, inpaint, render) has real cost implications
Latency vs quality
Users don’t want to wait 30 seconds per page
I built a small browser-based tool to experiment with these ideas:
https://mangatranslator.me
Still early, but it’s been a good way to explore what actually matters in this problem space.
If you’ve worked on OCR, typesetting, or document layout,
I’d be really interested in how you’d approach the rendering problem.
Feels like that’s where most tools still fall apart.