August 14, 2020

(Part #3) ML – why is product matching difficult?

Misha Krunic @mkrunic

In this chapter, I'm getting a bit technical, explaining the difficulties Price2Spy’s team had to overcome when building the ML model for product matching.

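  • Computation size – every product in Set A has to be compared with every product in Set B, so the number of candidate pairs grows with the product of the two set sizes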

  • Diverse training data sources (websites in different languages, from different industries, with different product assortments and product naming conventions)

  • Hugely unbalanced positive and negative labels in the training set

  • After matches get scored, complex post-processing is needed to determine the best matching candidates

  • Label noise – matches supplied in the training set are not 100% accurate

  • Data duplication in the training set – websites can list the same product in multiple categories, under multiple product URLs. Suppose a product appears twice in Set A and three times in Set B: this leads to potentially 6 identical matches (2 × 3) in the training set, which is very misleading for an ML algorithm

  • Difficult to evaluate – we used precision-recall curves to evaluate model performance. However, due to label noise, we had to evaluate the results manually

  • Python vs Java – while our external consultant was working in Python, we had to translate all the code into Java (Price2Spy’s standard technology)
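
To make the duplication point from the list above concrete, here is a minimal Python sketch. It is only an illustration, not Price2Spy's actual code: the URLs, the `product_key` field and the `dedupe` helper are all made up for this example, and it assumes duplicates can already be recognized by a shared key (in practice, detecting them is a problem of its own).

```python
from itertools import product

# Hypothetical listings as (url, product_key) tuples. The same product
# shows up under several URLs because it sits in several categories.
set_a = [
    ("a.example/shoes/runner-x", "runner-x"),
    ("a.example/sale/runner-x", "runner-x"),
]
set_b = [
    ("b.example/men/runner-x", "runner-x"),
    ("b.example/sport/runner-x", "runner-x"),
    ("b.example/new/runner-x", "runner-x"),
]


def candidate_pairs(a, b):
    """Pair every listing in a with every listing in b."""
    return list(product(a, b))


def dedupe(listings):
    """Keep one listing per product key (the first URL seen wins)."""
    seen = {}
    for url, key in listings:
        seen.setdefault(key, (url, key))
    return list(seen.values())


raw_pairs = candidate_pairs(set_a, set_b)
clean_pairs = candidate_pairs(dedupe(set_a), dedupe(set_b))

print(len(raw_pairs))    # 6 -> the same match, six times over
print(len(clean_pairs))  # 1 -> one match once duplicates are collapsed
```

The same `candidate_pairs` call also shows the computation size problem: the number of pairs is always the size of Set A times the size of Set B.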

