In this chapter, I'm getting a bit technical, explaining the difficulties Price2Spy’s team had to overcome when building the ML model for product matching:
Computation size – we’re talking about comparing Set A to Set B
Diverse training data sources (websites in different languages, from different industries, with different product assortments and product naming conventions)
Hugely unbalanced positive and negative labels in the training set
Post-processing – once matches are scored, complex post-processing is needed to determine the best matching candidates
Label noise – matches supplied in the training set are not 100% accurate:
Data duplication in the training set – websites can list the same product in multiple categories, under multiple product URLs. Let’s suppose that Set A lists a product twice and Set B lists the same product three times – this leads to potentially 2 × 3 = 6 identical matches in the training set, which would be very misleading for an ML algorithm
Difficult to evaluate – we used precision-recall curves to evaluate the model’s performance. However, due to label noise, we also had to evaluate the results manually
Python vs Java – while our external consultant was working in Python, we had to translate all the code into Java (Price2Spy’s standard technology)
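To make the data duplication point concrete, here is a minimal Python sketch of how one product listed under two URLs in Set A and three URLs in Set B turns into six training matches. The URLs, the product IDs, and the idea that each site exposes an internal product ID are all assumptions made for illustration, not a description of Price2Spy's actual pipeline.

```python
# Hypothetical URL -> site-internal product ID maps (all values are invented).
set_a_ids = {
    "a.example/garden/widget": "A-17",
    "a.example/sale/widget": "A-17",
}
set_b_ids = {
    "b.example/tools/widget": "B-42",
    "b.example/new/widget": "B-42",
    "b.example/deals/widget": "B-42",
}

# Every URL in Set A is a correct match for every URL in Set B,
# so the training set can contain all 2 x 3 URL pairs.
labelled_matches = [(url_a, url_b) for url_a in set_a_ids for url_b in set_b_ids]

# Collapsing URLs to product IDs leaves a single distinct match.
unique_matches = {(set_a_ids[url_a], set_b_ids[url_b]) for url_a, url_b in labelled_matches}

print(len(labelled_matches), len(unique_matches))
```

Collapsing on a product ID only works when such an ID is available. Where it is not, duplicates would have to be detected some other way before training.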
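For readers who have not met precision-recall curves before, the sketch below computes the points of such a curve by hand in Python. The scores and labels are toy values and `precision_recall_points` is a hypothetical helper written for this illustration. Libraries such as scikit-learn ship a ready-made `precision_recall_curve` function that would normally be used instead.

```python
def precision_recall_points(scores, labels):
    """Return (threshold, precision, recall) for each distinct score, highest first."""
    total_positives = sum(labels)
    points = []
    for threshold in sorted(set(scores), reverse=True):
        # A candidate pair is predicted as a match when its score reaches the threshold.
        predicted = [score >= threshold for score in scores]
        true_pos = sum(1 for p, y in zip(predicted, labels) if p and y == 1)
        false_pos = sum(1 for p, y in zip(predicted, labels) if p and y == 0)
        precision = true_pos / (true_pos + false_pos)
        recall = true_pos / total_positives if total_positives else 0.0
        points.append((threshold, precision, recall))
    return points


# Toy data: six scored candidate pairs, only two of them labelled as true matches.
scores = [0.95, 0.80, 0.60, 0.40, 0.20, 0.10]
labels = [1, 0, 1, 0, 0, 0]

points = precision_recall_points(scores, labels)
for threshold, precision, recall in points:
    print(f"threshold={threshold:.2f} precision={precision:.2f} recall={recall:.2f}")
```

Note that the curve is only as trustworthy as the labels behind it: a wrongly labelled pair shifts precision and recall at every threshold below its score, which is why a manual check of the results can still be necessary.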