How I did backlink gap analysis for free with Common Crawl + DuckDB

I've been doing SEO work on lodd.dev, the analytics tool I'm building. Trying to be a good little side-project-er and do the marketing work, not just fiddle with the code.

One task I've been thinking about, but haven't spent the time or money on, is backlink analysis. The usual tools for this, Ahrefs and Majestic, are $100+/mo, which isn't worth it for an early project like mine.

While poking around I found a cheaper one that pointed me in the right direction: "built on the public Common Crawl release." Turns out Common Crawl is the free, open crawl of the web, and they also publish a webgraph: a domain-to-domain link map covering much of the web, with an authority score for every domain. All free to download.

So instead of paying anyone I grabbed it myself. It's three files: every domain with an ID, the edges (who links to whom), and the authority ranks. The edges file is about 16GB. I filtered it down to the links pointing at the handful of competitors I care about, joined it back to domain names in DuckDB, and pulled out every domain that links to them but not to me, ranked by authority.

It's not a free Ahrefs though. The big limitation is that the graph is domain-level. You get "domain A links to domain B" and nothing else: no link to the actual page, no anchor text, no dofollow or nofollow. So you can't tell a real editorial mention from a "powered by" badge sitting in someone's footer. You just know a link exists somewhere on the domain.

That also means a lot of noise. My first pass had about 4,600 domains, with GitHub, Vercel and Netlify right at the top of the list. They're not writing about analytics tools. They just host thousands of customer sites that happen to embed one, and every one of those counts as the platform linking out. I had to strip the big hosting domains out before the list was any use.

It's also out of date. Common Crawl publishes the webgraph every quarter or so, so it's no good for "did my campaign land this week": it's months old by the time you query it. And it only covers what Common Crawl crawls, which is a big slice of the web but nowhere near all of it, and skewed towards established sites. A tool running its own crawler sees more, and sees it sooner. That's probably why Ahrefs is so expensive.

But for a quick, free "who should I be reaching out to" pass, it's probably worth the effort.

How to do it yourself

You need two free tools: ripgrep and DuckDB (brew install ripgrep duckdb).

Get the data. Common Crawl's domain-level webgraph, latest release, lives at data.commoncrawl.org under projects/hyperlinkgraph. Three gzipped files: the vertices (every domain + a numeric ID), the edges (ID links to ID, ~16GB), and the ranks (authority per domain).

Mind the reversed domains. In the files, plausible.io is written io.plausible. Reverse yours and your competitors' before you search.
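If you have more than a couple of domains, reversing them by hand gets error-prone. Here's a minimal Python sketch of the reversal. The function name is mine, not something from Common Crawl, and it does nothing clever: it just flips the dot-separated labels.

```python
def reverse_domain(domain: str) -> str:
    """Flip the dot-separated labels, e.g. 'plausible.io' -> 'io.plausible'."""
    labels = domain.strip().lower().split(".")
    return ".".join(reversed(labels))


if __name__ == "__main__":
    # Your own domain plus the competitors you care about.
    for d in ["plausible.io", "lodd.dev"]:
        print(d, "->", reverse_domain(d))
```

Running the same function on a reversed name gives you the normal form back, which is handy when you turn the final list into something readable.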

Look up the IDs. grep your competitors' reversed domains in the vertices file to get their numeric IDs.
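One thing to watch with a plain grep is substring matches: io.plausible also matches any longer name that starts the same way. Below is a rough Python equivalent that matches the whole domain column instead. I'm assuming each vertices row is tab-separated with the numeric ID first and the reversed domain second; check that against the file you actually download. The function names and the path argument are placeholders.

```python
import gzip
from typing import Iterable


def lookup_ids(lines: Iterable[str], wanted: set[str]) -> dict[str, int]:
    """Map each wanted reversed domain to its numeric ID.

    Assumes rows look like: <id>\t<reversed domain>[\t...]
    """
    found: dict[str, int] = {}
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        # Exact match on the domain column, not a substring match.
        if len(parts) >= 2 and parts[1] in wanted:
            found[parts[1]] = int(parts[0])
            if len(found) == len(wanted):
                break  # stop early once every domain is resolved
    return found


def lookup_ids_in_file(path: str, wanted: set[str]) -> dict[str, int]:
    """Stream a gzipped vertices file without unpacking it to disk."""
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        return lookup_ids(fh, wanted)
```

Look up your own domain's ID in the same pass. You'll need it later for the "links to them but not to me" check.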

Filter the edges. Stream the 16GB edges file through ripgrep, keeping only rows whose target ID is one of your competitors. Use ripgrep rather than awk: awk parses every line and was about 30x slower for me. You're left with a few hundred thousand rows.
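To make "target ID is one of your competitors" concrete, here is the same filter sketched in Python instead of ripgrep. I'm assuming each edges row is from_id, a tab, then to_id, so the target is the second column. This is meant to show the logic on a sample; on the full file it will be slower than the ripgrep approach described above.

```python
from typing import Iterable, Iterator


def filter_edges(lines: Iterable[str], target_ids: set[str]) -> Iterator[str]:
    """Yield only the rows whose target (second column) is in target_ids.

    IDs are compared as strings, so nothing needs to be parsed to int.
    Assumes rows look like: <from_id>\t<to_id>
    """
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        # Whole-column comparison: "42" must not match "142",
        # and a matching ID in the source column must not count.
        if len(parts) >= 2 and parts[1] in target_ids:
            yield line
```

Whatever tool you use, the two traps are the same: a partial match on the ID, and a match on the source column instead of the target.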

Join it in DuckDB. Read the filtered edges, the vertices and the ranks. Map the IDs back to names, attach the authority score, group by referring domain, keep the ones that link to your competitors but not you, drop the big hosting platforms, and sort by authority. Export to CSV.
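Here's a sketch of that join as plain SQL. To keep it runnable with nothing but the Python standard library, it uses sqlite3 on a few toy rows as a stand-in for DuckDB. The query should carry over with little change, but I haven't shown the DuckDB file-loading part. Table names, column names, the made-up domains and the "higher is better" authority column are all assumptions for illustration. Note that the "not me" check only works if the filtered edges include links to your own ID as well.

```python
import sqlite3


def build_toy_db() -> sqlite3.Connection:
    """Tiny in-memory stand-in for the vertices, filtered edges and ranks."""
    con = sqlite3.connect(":memory:")
    con.executescript("""
        CREATE TABLE vertices (id INTEGER, rev_domain TEXT);
        CREATE TABLE edges (src INTEGER, dst INTEGER);
        CREATE TABLE ranks (rev_domain TEXT, authority REAL);
    """)
    con.executemany("INSERT INTO vertices VALUES (?, ?)", [
        (1, "io.plausible"), (2, "com.example-rival"), (3, "dev.lodd"),
        (10, "com.blog-a"), (11, "com.news-b"),
        (12, "com.github"), (13, "org.wiki-c"),
    ])
    con.executemany("INSERT INTO edges VALUES (?, ?)", [
        (10, 1), (10, 2), (11, 1), (11, 3), (12, 1), (12, 2), (13, 2),
    ])
    con.executemany("INSERT INTO ranks VALUES (?, ?)", [
        ("com.blog-a", 0.5), ("com.news-b", 0.9),
        ("com.github", 1.0), ("org.wiki-c", 0.7),
    ])
    return con


def gap_report(con, competitor_ids, my_id, hosting_domains):
    """Domains linking to a competitor but not to me, best authority first."""
    comp = ",".join("?" * len(competitor_ids))
    host = ",".join("?" * len(hosting_domains))
    sql = f"""
        SELECT v.rev_domain,
               COALESCE(r.authority, 0) AS score,
               COUNT(DISTINCT e.dst)    AS competitors_linked
        FROM edges e
        JOIN vertices v ON v.id = e.src
        LEFT JOIN ranks r ON r.rev_domain = v.rev_domain
        WHERE e.dst IN ({comp})                                  -- links to a competitor
          AND e.src NOT IN (SELECT src FROM edges WHERE dst = ?) -- but not to me
          AND v.rev_domain NOT IN ({host})                       -- drop hosting platforms
        GROUP BY v.rev_domain, r.authority
        ORDER BY score DESC, v.rev_domain
    """
    params = [*competitor_ids, my_id, *hosting_domains]
    return con.execute(sql, params).fetchall()


if __name__ == "__main__":
    rows = gap_report(build_toy_db(), [1, 2], 3, ["com.github"])
    for rev_domain, score, n in rows:
        print(rev_domain, score, n)
```

In the toy data, com.news-b drops out because it already links to me, and com.github drops out as a hosting platform, which leaves the two domains worth a look.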

That's the whole thing. I wrapped it in a small shell script so I can rerun it for any product in one command.

Holen Ventures

on June 2, 2026


The domain-level limitation is the key tradeoff here. You get scale and cost efficiency, but lose semantic clarity (page-level intent, anchor context, and link quality signals). That’s usually where most “false positives” come from — especially with platforms like GitHub, Vercel, Netlify, etc. Filtering those out is basically required if you want actionable outreach lists rather than raw graphs.

Also interesting point on freshness — that lag alone is often what separates “strategy research tools” from “campaign tracking tools.”

I’m actually working with a small team building systems around data pipelines + AI-assisted workflows for turning raw web signals into actionable outreach and automation flows. This kind of pipeline (Common Crawl → processing → prioritization → action layer) is exactly the space we’re exploring.

If you’re open to it, I’d be happy to exchange ideas or explore overlap in what you’re building.

You can reach me here:
WhatsApp: +1 (361) 332-6512

LucasAlvarez

·
4 days ago
  
  Good points, this is definitely just a spot check, not ongoing tracking. It's just a little thing I discovered while working on my web-analytics-for-agents tool, so not something I'm focusing on, but I appreciate the offer!
  
  Hventures
  
  ·
  4 days ago