How I Run a 1.7M Product Search Engine at 66ms on a $0 Hosting Budget

by frankzero

I'm a solo founder bootstrapping from Cameroon. No VC. No team. No AWS bill I can expense.

Just me, a 6GB ARM box on Oracle Cloud Free Tier, and a product safety database that grew
to 1.7 million rows before I had any real infrastructure to handle it.

The platform is called LemonKnows — it tracks government recalls, appliance defects,
and ingredient safety data across 41,000+ global recall records. Think "can I trust this
product before I buy it?" but actually fast and actually free for users.

Here are the three architectural decisions that kept the whole thing from falling over.

1. I Killed COUNT(*) Entirely

Early on, every category page triggered this:

SELECT COUNT(*) FROM products WHERE category = 'automotive' AND status = 'active';

On 1.7M rows, PostgreSQL's MVCC means full sequential scans. Under any real load, this
stalled the server for up to 45 seconds.

The fix was brutal in its simplicity — I stopped asking the database to count things at
query time. I built a stats_cache table: a flat key-value store of pre-aggregated
counters. Every time a background ingestion job runs or a product status changes, it
increments or decrements the relevant counter.

Front-end requests now do a 2ms primary key lookup instead of a 45-second scan:

SELECT total_count FROM stats_cache WHERE key = 'products_automotive_active' LIMIT 1;

The tradeoff: counts are eventually consistent, not real-time. For a consumer safety
product, that's completely fine.

2. Composite Indexes — Not Single-Column Ones

I was indexing category, status, and score separately. Completely useless when your
actual query filters by all three simultaneously and then orders by score.

Dropped the single-column noise. Deployed one targeted composite index:

CREATE INDEX idx_products_cat_status_score
ON products (category, status, score DESC);

Column order matters. It mirrors the exact execution sequence of the WHERE and
ORDER BY clauses, which lets the query planner skip sequential scans entirely and go
straight to index-only lookups.

3. A Zero-Cost Multi-Language AI Translation Layer

LemonKnows serves 7 languages. Dynamically calling an LLM API on every international
request would have destroyed both latency and budget immediately.

Instead, I built a translation proxy that treats localization as a content-addressed cache.
The flow:

Incoming request for German text → generate SHA-256 hash of the English source string
Look up ugc_translations where hash = SHA256 and lang = 'de'
Cache hit → return local text instantly. API cost: $0
Cache miss → background worker calls Gemini 2.5 Flash Lite, writes result to DB,
returns to user

Every string gets translated exactly once. After that, it's free to serve to any
international user or search engine crawler forever.

The Result

1.7M products. 41,000 recalls. 7 languages. One free VM. 66ms load times.

No magic. Just aggressive caching, the right indexes, and refusing to pay for things that
don't need to cost money.

We are officially launching on Product Hunt this Tuesday — but more than the launch,
I'm genuinely curious how other people are solving these problems on constrained
infrastructure.

My Actual Questions for IH

How are you handling heavy aggregation queries without letting your DB become the
bottleneck? Async cache invalidation like this, or something smarter?
What's your translation strategy for multi-language products? LLM-backed, static files,
or a paid service?
Anyone else running production workloads on Oracle Free Tier? What's been your breaking
point?

Brutal architectural roasts welcome. I've made plenty of mistakes and there's still a
lot I'd love to pressure-test.

→ lemonknows.com

frankzero

on June 24, 2026

Say something nice to frankzero…

Post Comment

1

The $0 budget isn't a limitation — it's a forcing function. Most founders over-provision because AWS credits or VC money make it easy. The real skill is knowing what your system actually needs vs what's comfortable.

The COUNT(*) -> stats_cache pattern is the right call. I'd add: same logic applies to the translation cache. Most teams would build this with a Redis cluster and a translation service. You did it with a hash lookup in Postgres. Same result, zero complexity.

Curious how you handle cache invalidation for the stats_cache when a bulk ingestion runs. Do you batch update or recalculate on a schedule?

paradox07

·
2 days ago
·
Reply
1. 1
  
  You nailed the forcing function framing exactly. Constraints don't just save money; they force you to understand what your system actually needs versus what feels safe to provision. For example, I have found that a 1 GB instance using Cloudflare for proxy and caching, plus a bit of Redis, can actually outperform a 6 GB instance without it in most cases, as it is not your system RAM being used but rather that of Cloudflare. Another thing I have noticed is that most Redis clusters I've seen in small projects exist because someone was comfortable with Redis, not because the use case demanded it.
  On the stats_cache invalidation during bulk ingestion — I batch update. When a pipeline run completes, a single reconciliation pass computes the delta from that run and applies it in one transaction. I'm not recalculating from source truth on every run, which would defeat the purpose, but I do have a scheduled full recompute job that runs nightly as a sanity check to catch any drift from missed decrements.
  The honest answer is the nightly reconciliation exists because I don't fully trust the incremental path yet. There are edge cases in the ingestion pipeline — duplicate source records, status rollbacks, partial failures — where a counter can silently drift by one or two. Not catastrophic, but on a safety platform even stale aggregate counts feel wrong. The scheduled recompute is cheap insurance until I'm confident the delta logic is airtight.
  
  frankzero
  
  ·
  a day ago
  ·
  Reply
1

That's seriously impressive.

I'm curious—what ended up being the hardest scaling problem that nobody talks about?

FounderFlow_57

·
2 days ago
·
Reply
1. 1
  
  Honestly? Content fragmentation at scale.
  The database was fast. The data was a mess.
  Ingesting from wildly inconsistent government sources — CPSC, EU RAPEX, Health Canada — each with completely different schemas, different naming conventions for the same product, different recall severity classifications. Normalizing that into something a user can actually trust is the unglamorous work nobody writes architecture posts about.
  Infrastructure scaling is a solved problem. Data quality scaling at 1.7M rows is mostly just you, alone, making judgment calls at 2am.
  
  frankzero
  
  ·
  2 days ago
  ·
  Reply
  1. 1
    
    That really resonates.I think data normalization is one of those invisible problems that only becomes obvious once you're operating at scale. People celebrate fast infrastructure because it's easy to measure, but they rarely see the thousands of small judgment calls that make the data trustworthy in the first place.
    Your point about "the database was fast, the data was a mess" really stood out to me. I'd much rather have slower systems with reliable data than lightning-fast systems built on inconsistent information.
    Was there a particular rule or framework you developed for making those judgment calls, or was it mostly experience accumulating over time?
    
    FounderFlow_57
    
    ·
    a day ago
    ·
    Reply
    1. 1
      
      Mostly experience accumulating over time — but a few rough rules did emerge.
      The most useful one: when two sources conflict, always defer to the more conservative safety classification. If CPSC marks a product as a minor hazard and RAPEX marks the same product as a serious risk, LemonKnows shows serious risk. I'd rather over-warn a user than under-warn them. That's a judgment call that became a hard rule early.
      The second one: treat source identity as a first-class data field. Every record carries a reference to where it came from and when it was ingested. That sounds obvious but most pipelines I've seen strip provenance out early for the sake of clean schemas. Keeping it meant I could always trace a weird data anomaly back to its origin and fix it at the source rather than patching symptoms downstream.
      Beyond that — honestly, it was mostly pattern recognition built from staring at enough malformed records to develop instincts. There's no framework that tells you whether "Acme Pro 2000" and "ACME Professional Series II" are the same product. You just eventually get a feel for when to merge and when to keep them separate.
      The humbling part is that framework is entirely in my head right now. That's fine when it's just me. It becomes a real problem the moment anyone else touches the pipeline.
      
      frankzero
      
      ·
      a day ago
      ·
      Reply
      1. 1
        
        I really like those rules, especially keeping provenance attached to every record.
        It reminds me that a lot of the hardest decisions in growing products start as tacit knowledge. They work because one person has developed good instincts over thousands of small cases, but eventually those instincts become a bottleneck if they never get turned into something explicit.I wonder if that's the next scaling challenge. Not processing more data, but capturing enough of your reasoning that someone else can make the same judgment without losing quality. That's a fascinating problem in itself.
        
        FounderFlow_57
        
        ·
        18 hours ago
        ·
        Reply
        
        1
        
        You've named the next problem more precisely than I had.
        I've been thinking of it vaguely as "documentation" but that's not really it. Documentation captures what the rules are. What's harder to capture is the reasoning behind why a specific edge case got resolved the way it did — and that's exactly what someone else would need to make the same call without coming to me.
        What I probably need is closer to a decision log than a documentation page. Every time I make a non-obvious merge or separation call, record the input, the reasoning, and the outcome. Over enough cases that log becomes a training set — either for a future team member or, eventually, for a model that can handle first-pass normalization automatically.
        The irony is I'm building a platform that helps consumers trust data, and the least trustworthy part of the whole system right now is the normalization layer that only works because I personally understand it. That tension is going to force the issue sooner or later.
        
        frankzero
        
        ·
        17 hours ago
        ·
        Reply
        
        1
        
        I think that's a really interesting distinction.
        A documentation page tells people what happened. A decision log captures how someone thought when there wasn't an obvious answer.That's the part that usually disappears as products mature. Six months later everyone can see the final rule, but nobody remembers the trade-offs that led to it, so the same edge cases keep getting debated.It also feels like that decision log becomes valuable far beyond onboarding. Over time it's building a dataset of expert judgment rather than just a knowledge base, which seems like a much stronger foundation if you ever want AI to assist with first-pass normalization.I'm curious whether you've thought about capturing a confidence score alongside those decisions. It seems like separating "I'm certain these should merge" from "this was a judgment call" could be just as valuable as the reasoning itself.
        
        FounderFlow_57
        
        ·
        16 hours ago
        ·
        Reply
        
        1
        
        The confidence score idea is genuinely good and I hadn't thought about it that way.
        I was imagining the decision log as binary — merge or separate, with reasoning attached. But confidence is actually the most valuable signal. A certain decision and an uncertain one can produce the same outcome while carrying completely different amounts of risk. It also changes how you audit — low-confidence calls from six months ago are exactly the ones worth revisiting when you have more data.
        The harder part is honesty. There's a natural pressure as a solo founder to project certainty because uncertainty feels like a weakness in your own system. Building a process that treats "I'm not sure" as a first-class data point rather than something to resolve before logging it is a shift — even when the culture is just one person.
        
        frankzero
        
        ·
        16 hours ago
        ·
        Reply
        
        1
        
        I think confidence also changes what kind of review a decision needs.
        A low-confidence merge shouldn't necessarily be treated as probably wrong. It should be treated as worth revisiting when new evidence appears.That turns uncertainty from something to eliminate into something the system can actively learn from over time. Feels like a much healthier model than forcing every decision into a binary right or wrong bucket.
        
        FounderFlow_57
        
        ·
        2 hours ago
        ·
        Reply
1

Impressive

waldo_van_der_code

·
2 days ago
·
Reply
1. 1
  
  Thanks
  
  frankzero
  
  ·
  2 days ago
  ·
  Reply
1

What stood out wasn't the architecture.

It was the tradeoffs you were willing to make.

Eventually consistent counts, cached translations, precomputed data... all of them sacrifice something in exchange for scale.

I'd be curious which compromise ended up feeling uncomfortable at first but later turned out to be the right decision. Those are usually the choices that shape the product more than the infrastructure itself.

aryan_sinh

·
3 days ago
·
Reply
1. 1
  
  Honestly, the COUNT(*) one was the hardest to make peace with.
  
  There's something psychologically uncomfortable about showing a user a number you know is slightly stale. It felt like lying, even when the delta was maybe 3-4
  products off on a 1.7M row dataset. Statistically irrelevant. Emotionally? It bugged
  me for weeks.
  
  What flipped it for me was reframing the question. I stopped asking "is this number
  accurate?" and started asking "does this number need to be accurate for the user to
  make a safe decision?"
  
  On a consumer safety platform, someone checking whether a car seat has recalls doesn't need a live count of how many automotive products exist in the database. They need the recall data to be accurate. That's where real-time precision actually matters — and that's exactly where I didn't cut corners.
  
  So the compromise ended up clarifying the product's actual contract with its users.
  Not "everything here is live and precise" but "the safety data is precise, the
  surrounding metadata is fast." That distinction shaped how I think about every
  subsequent tradeoff.
  
  You're right that it's rarely the infrastructure choices that define a product. It's
  what you decide is allowed to be imperfect.
  
  frankzero
  
  ·
  3 days ago
  ·
  Reply
  1. 1
    
    That's actually close to what caught my attention.
    
    I don't think it's really an infrastructure question anymore.
    
    Happy to share the fuller thought if useful. What's the best email to reach you on?
    
    aryan_sinh
    
    ·
    3 days ago
    ·
    Reply
    1. 1
      
      Thanks, I really appreciate it. You can reach me at [email protected].
      
      frankzero
      
      ·
      3 days ago
      ·
      Reply
      1. 1
        
        Sent over the fuller context by email.
        
        aryan_sinh
        
        ·
        2 days ago
        ·
        Reply
1

I am really looking forward to your feedback on how I can make this system even better.

frankzero

·
3 days ago
·
Reply