
Building an AI manga translator: why "just OCR + translate" doesn’t work

I recently started building a manga translation tool, and I went in with a very naive assumption:

this should be straightforward.

Detect text → OCR → translate → put it back.

That’s it, right?

Turns out, almost every part of that assumption is wrong.

The real problem isn’t translation

What surprised me most is this:

translation is not the hard part.

Modern models can already do a decent job translating Japanese → English.

The real problem is everything around it.

Because manga is not text.

It’s text embedded inside a visual system:

  • speech bubbles
  • vertical layout
  • stylized fonts
  • background textures

If you get any of these wrong, the result might be technically correct…
but completely unreadable.

The pipeline (and why each step breaks)

I ended up with a pipeline that looks roughly like this:

  1. Upscaling (for low-res scans)
  2. Text detection
  3. OCR
  4. Textline merging
  5. Translation
  6. Inpainting
  7. Rendering

On paper, this looks clean.

In reality, each step introduces new failure modes.
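To make the shape of this concrete, here's a minimal sketch of the pipeline as a function composition. The stage names mirror the list above, but everything else (the trace, the stub lambdas) is illustrative; in practice each stage wraps a model and the intermediate value grows from a raw image into image + regions + text.

```python
def run_pipeline(page, stages):
    """Thread a page through each stage in order, recording a trace.

    Each stage is a callable taking the current pipeline state and
    returning the next one. Identity stubs keep this sketch runnable;
    a real tool swaps in an upscaler, a detector, an OCR engine, etc.
    """
    trace = []
    for name, stage in stages:
        page = stage(page)
        trace.append(name)
    return page, trace

# Hypothetical stage list matching the seven steps above.
STAGES = [
    ("upscale",   lambda p: p),
    ("detect",    lambda p: p),
    ("ocr",       lambda p: p),
    ("merge",     lambda p: p),
    ("translate", lambda p: p),
    ("inpaint",   lambda p: p),
    ("render",    lambda p: p),
]
```

The useful part isn't the composition itself, it's the trace: when a page comes out wrong, you need to know which stage to blame.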

1. Detection is not bounding boxes

Typical OCR assumes rectangular text regions.

Manga text doesn't come in rectangles.

Text can be rotated, curved, or squeezed into irregular speech bubbles.

So instead of boxes, you need polygon-level detection.
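One way to see why boxes aren't enough: measure how much of an axis-aligned bounding box a detected text region actually covers. This is a small self-contained sketch (the shoelace formula plus a fit ratio, both standard geometry, not code from my tool); for rotated text the ratio collapses, which is exactly the signal that you need polygons.

```python
def polygon_area(pts):
    """Shoelace formula for a simple polygon given as [(x, y), ...]."""
    total = 0.0
    for i in range(len(pts)):
        x1, y1 = pts[i]
        x2, y2 = pts[(i + 1) % len(pts)]
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0

def bbox_fit(pts):
    """Fraction of the axis-aligned bounding box covered by the polygon.

    1.0 means a rectangle is a perfect stand-in; low values mean a box
    would drag in lots of background art around the text.
    """
    xs = [p[0] for p in pts]
    ys = [p[1] for p in pts]
    box_area = (max(xs) - min(xs)) * (max(ys) - min(ys))
    return polygon_area(pts) / box_area if box_area else 0.0

# A 45-degree-tilted square covers only half of its bounding box.
tilted = [(0, 5), (5, 0), (10, 5), (5, 10)]
# bbox_fit(tilted) -> 0.5
```

For tilted dialogue, half the "text region" would be screentone, and that garbage flows straight into OCR and inpainting.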

2. Generic OCR fails badly

Standard OCR tools struggle with:

  • vertical text
  • stylized fonts
  • low contrast backgrounds

Domain-specific models (like manga-trained OCR) perform much better,
but even then, errors cascade into later steps.

3. Text grouping is a graph problem

Detection gives you fragments.

But translation needs semantic units (a full speech bubble).

Naively grouping by distance fails.

What worked better for me was modeling text lines as a graph:

  • nodes = text segments
  • edges = spatial / alignment similarity

Then extracting connected components.
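Here's a minimal version of that idea. The linking rule and the thresholds are illustrative (real tuning depends on the detector's output scale), but the structure is the actual technique: an implicit graph over segments, with connected components extracted by BFS. Segments are assumed to be (x, y, w, h) boxes, and the alignment check uses horizontal centers, which suits vertical columns of Japanese text.

```python
from collections import deque

def grouped_lines(segments, max_gap=20, max_offset=10):
    """Group text-line fragments into bubble-level units.

    segments: list of (x, y, w, h) boxes.
    Two segments are linked (an edge) when their horizontal centers
    align and the vertical gap between them is small. Thresholds are
    illustrative, not tuned values.
    """
    def linked(a, b):
        ax, ay, aw, ah = a
        bx, by, bw, bh = b
        center_offset = abs((ax + aw / 2) - (bx + bw / 2))
        gap = max(ay - (by + bh), by - (ay + ah), 0)
        return center_offset <= max_offset and gap <= max_gap

    # Connected components over the implicit graph, via BFS.
    unseen = set(range(len(segments)))
    groups = []
    while unseen:
        queue = deque([unseen.pop()])
        component = []
        while queue:
            i = queue.popleft()
            component.append(i)
            for j in list(unseen):
                if linked(segments[i], segments[j]):
                    unseen.remove(j)
                    queue.append(j)
        groups.append(sorted(component))
    return groups
```

The win over naive distance clustering is that the edge predicate can encode alignment, reading direction, or font-size similarity, not just proximity.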

4. Inpainting is underrated

Before rendering, you need to remove original text.

This sounds simple, but it’s not.

You’re asking a model to reconstruct:

  • screentones
  • cross-hatching
  • background patterns

Bad inpainting is immediately noticeable.

5. Rendering is the hardest part

This is the part most people underestimate.

Putting translated text back into the image is not just "draw text".

You have to deal with:

  • length mismatch
    (10 Japanese characters → 40 English characters)

  • rotation
    (tilted dialogue in action scenes)

  • vertical typography
    (which is not just rotated horizontal text)

If this step is wrong, nothing else matters.
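The length-mismatch part of this can be framed as a search: find the largest font size whose wrapped text still fits the bubble. The sketch below assumes a crude width model (glyph width ≈ 0.55 × size, line height ≈ 1.2 × size); a real renderer measures the actual font, and vertical typography needs its own layout pass, but the search structure is the same.

```python
def wrap(text, max_chars):
    """Greedy word wrap to at most max_chars per line.

    An over-long single word gets its own (over-long) line,
    which the fitter below treats as a failure.
    """
    lines, line = [], ""
    for word in text.split():
        candidate = (line + " " + word).strip()
        if len(candidate) <= max_chars or not line:
            line = candidate
        else:
            lines.append(line)
            line = word
    if line:
        lines.append(line)
    return lines

def fit_text(text, box_w, box_h, char_aspect=0.55, min_size=8, max_size=40):
    """Largest font size whose wrapped text fits a box_w x box_h bubble.

    char_aspect and the 1.2 line-height factor are assumed constants
    standing in for real font metrics.
    """
    for size in range(max_size, min_size - 1, -1):
        max_chars = max(1, int(box_w / (char_aspect * size)))
        lines = wrap(text, max_chars)
        fits_width = all(len(l) <= max_chars for l in lines)
        fits_height = len(lines) * 1.2 * size <= box_h
        if fits_width and fits_height:
            return size, lines
    # Nothing fits: return the smallest size and let the caller decide
    # (shrink the translation, abbreviate, or overflow the bubble).
    return min_size, wrap(text, max(1, int(box_w / (char_aspect * min_size))))
```

The interesting failures live in the fallback branch: when no size fits, you're back to the accuracy-vs-readability tradeoff below, and sometimes the right fix is a shorter translation, not smaller text.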

The biggest realization

I thought I was building a translation tool.

But I’m actually building a reading tool.

The real question is not:

"Is this sentence translated correctly?"

It’s:

"Does this still feel like a manga page?"

That changed how I approached everything.

Tradeoffs I didn’t expect

A few things that turned out more important than expected:

  • Accuracy vs readability
    A slightly imperfect translation that fits the bubble > perfect translation that breaks layout

  • Cost vs UX
    Multimodal pipelines are expensive
    Every extra step (upscale, inpaint, render) has real cost implications

  • Latency vs quality
    Users don’t want to wait 30 seconds per page

Where I ended up

I built a small browser-based tool to experiment with these ideas:
https://mangatranslator.me

Still early, but it’s been a good way to explore what actually matters in this problem space.

Curious how others would approach this

If you’ve worked on:

  • OCR
  • layout reconstruction
  • multimodal pipelines

I’d be really interested in how you’d approach the rendering problem.

Feels like that’s where most tools still fall apart.

Posted to AI Tools on March 24, 2026