
(Part #3) ML – why is product matching difficult?

In this chapter, I'm getting a bit technical, explaining the difficulties Price2Spy’s team had to overcome when building the ML model for product matching.

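  • Computation size - we’re talking about comparing every product in Set A to every product in Set B, so the number of candidate pairs grows as the product of the two set sizes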

  • Diverse training data sources (websites in different languages, from different industries, with different product assortments and product naming conventions)

  • Hugely unbalanced positive and negative labels in the training set

  • Once matches are scored, complex post-processing is needed to determine the best matching candidates

  • Label noise – matches supplied in the training set are not 100% accurate

  • Data duplication in the training set – websites can list the same product in multiple categories, under multiple product URLs. Suppose one product appears twice in Set A (a duplicate) and three times in Set B (a triplicate): a single real match then turns into up to 2 × 3 = 6 identical matches in the training set, which is very misleading for an ML algorithm

  • Difficult to evaluate – we used precision-recall curves to evaluate the model's performance. However, due to label noise, we also had to evaluate the results manually

  • Python vs Java – while our external consultant was working in Python, we had to translate all the code into Java (Price2Spy’s standard technology)
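To make the computation-size and duplication points from the list above a bit more concrete, here is a minimal Python sketch. It is not Price2Spy's code: the function names, the example URLs, and the `product_id` mapping are all made up for illustration, and the sketch simply assumes that some canonical product key per URL is available.

```python
from itertools import product


def candidate_pair_count(size_a: int, size_b: int) -> int:
    """Comparisons needed when every product in A is scored against every product in B."""
    return size_a * size_b


def match_rows(urls_a, urls_b):
    """Every (URL in A, URL in B) row that a single real match can generate."""
    return list(product(urls_a, urls_b))


def deduplicate(rows, product_id):
    """Keep one row per (product in A, product in B) pair.

    product_id maps a URL to a canonical product key (an assumption of this
    sketch; building such a mapping is a problem of its own).
    """
    seen = set()
    unique = []
    for url_a, url_b in rows:
        key = (product_id[url_a], product_id[url_b])
        if key not in seen:
            seen.add(key)
            unique.append((url_a, url_b))
    return unique


# Hypothetical listings: the same product under 2 URLs in Set A and 3 URLs in Set B.
urls_a = ["a.example/shoes/runner-x", "a.example/sale/runner-x"]
urls_b = [
    "b.example/men/runner-x",
    "b.example/sport/runner-x",
    "b.example/new/runner-x",
]
product_id = {url: "runner-x" for url in urls_a + urls_b}

rows = match_rows(urls_a, urls_b)
print(len(rows))                             # 6
print(len(deduplicate(rows, product_id)))    # 1
print(candidate_pair_count(10_000, 50_000))  # 500000000
```

The set sizes in the last line are arbitrary; the point is only that the pair count is a product, so it gets large quickly, and that one real match can show up several times unless duplicates are collapsed first.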
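The unbalanced labels and the precision-recall evaluation mentioned above can be illustrated the same way. This is only a sketch on toy numbers, not the evaluation we actually ran: the scores, labels, and function names are invented, and the convention of returning a precision of 1.0 when nothing is predicted positive is an assumption of this example.

```python
def precision_recall(scores, labels, threshold):
    """Precision and recall when every pair scoring >= threshold is called a match."""
    predicted = [s >= threshold for s in scores]
    tp = sum(1 for p, y in zip(predicted, labels) if p and y == 1)
    fp = sum(1 for p, y in zip(predicted, labels) if p and y == 0)
    fn = sum(1 for p, y in zip(predicted, labels) if not p and y == 1)
    precision = tp / (tp + fp) if tp + fp else 1.0  # convention: no predictions, no errors
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def pr_curve(scores, labels):
    """One (threshold, precision, recall) point per distinct score, highest threshold first."""
    thresholds = sorted(set(scores), reverse=True)
    return [(t, *precision_recall(scores, labels, t)) for t in thresholds]


# Toy data: 2 true matches among 10 candidate pairs (heavily unbalanced).
labels = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]
scores = [0.95, 0.85, 0.70, 0.40, 0.35, 0.30, 0.20, 0.15, 0.10, 0.05]

# A model that never predicts a match still looks good on plain accuracy.
always_negative_accuracy = labels.count(0) / len(labels)
print(always_negative_accuracy)               # 0.8

print(precision_recall(scores, labels, 0.9))  # (1.0, 0.5)
print(pr_curve(scores, labels)[0])            # (0.95, 1.0, 0.5)
```

Precision and recall ignore the huge pile of true negatives, which is why they suit unbalanced data better than accuracy. They are still only as trustworthy as the labels they are computed against, so with noisy labels a manual check of the results remains necessary.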

  1. 2

    This has been a great read. Thank you!

    1. 1

      I'm happy to hear that you enjoyed it :) Stay tuned, more is to come!

  2. 1

    Great info! It explains the topic in a plain language. Thanks!

  3. 1

    Ohh cool info. I try to do it with a help of ai
