Python/ML: Which methods to use for Multiclass Classification for Product Categorization?

Problem description

In a pickle...

I have a dataset with >100,000 observations; the dataset's columns include CustomerID, VendorID, ProductID and CatNMap. Here is what it looks like:

[image: sample of the dataset]

As you can see, the values in the first 3 columns (CustomerID, VendorID, ProductID) are unique numerical IDs and would make no sense if plotted on an x,y plane (which rules out a lot of classification methods); the last column has strings with categories assigned by customers. Now, here is the part that I do not understand and am not sure how to approach...

Goal: predict CatNMap values for customers in the future. However, as I see it, the features I have here are not useful. Is that true? If they are useful, what method can I use, given that the CatNMap column has >7,000 unique values? Also, how would any method deal with categorizing future items if, let's say, the same product has 2 or more different categories assigned by different customers? Do I need to implement an NN for this one?

All answers are appreciated!

Tags: python, machine-learning, neural-network, classification, multiclass-classification

Solution


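As I understand it, your goal is to predict CatNMap (your output data) from the first 3 columns (your input data, used as features).

As you said, CustomerID, VendorID and ProductID are 3 categorical variables, which means the values they take refer to categories rather than quantities. So two consecutive values may have nothing to do with each other in terms of what they actually mean. As far as I can see, the same applies to your output, CatNMap.

That said, there are several ways to handle categorical variables. From my experience, for your problem I would try one-hot encoding for all of your data (CustomerID, VendorID, ProductID, CatNMap). Moreover, if you find it feasible, it may be worth trying embeddings for ProductID and CatNMap instead of one-hot encoding.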
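To make the encoding step concrete, here is a minimal sketch assuming pandas, NumPy and scikit-learn are available. The toy rows and the embedding size are made up for illustration, and the embedding table is random rather than trained; it only shows that an embedding is a row lookup into a small dense matrix instead of a wide one-hot vector. Note also that scikit-learn classifiers accept integer or string class labels directly, so the target is label-encoded here rather than one-hot encoded.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Hypothetical toy rows; the real dataset has >100,000 observations.
df = pd.DataFrame({
    "CustomerID": [1, 1, 2, 3],
    "VendorID": [10, 11, 10, 12],
    "ProductID": [100, 101, 100, 102],
    "CatNMap": ["tools", "paint", "hardware", "paint"],
})
feature_cols = ["CustomerID", "VendorID", "ProductID"]

# One column per distinct ID value. handle_unknown="ignore" encodes IDs
# that were not seen during fit as all zeros instead of raising an error.
encoder = OneHotEncoder(handle_unknown="ignore")
X = encoder.fit_transform(df[feature_cols])  # sparse matrix

# Map each category string to an integer class index.
label_encoder = LabelEncoder()
y = label_encoder.fit_transform(df["CatNMap"])

print(X.shape)
print(label_encoder.classes_)

# Embedding idea: each product code selects one row of a dense table.
# In a real model the table would be learned; here it is random.
rng = np.random.default_rng(0)
product_codes = df["ProductID"].astype("category").cat.codes.to_numpy()
embedding_dim = 2  # assumed size, purely illustrative
embedding_table = rng.normal(size=(product_codes.max() + 1, embedding_dim))
product_vectors = embedding_table[product_codes]
print(product_vectors.shape)
```

With many thousands of distinct IDs the one-hot matrix gets very wide, which is why it is kept sparse here and why embeddings may be worth a try for the high-cardinality columns.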

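As for which algorithm to use, it is definitely worth trying to train a Random Forest and a Multilayer Perceptron model and comparing them after tuning.

I found this guide useful, where you can see some examples, but there are many other resources that cover this topic. You should also have a look at this.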
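A rough sketch of such a comparison with scikit-learn could look like the following. The data is synthetic and the rule linking products to categories is invented purely so the script runs end to end; the hyperparameters are untuned placeholders, not recommendations.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Synthetic stand-in for the real table (all values are made up).
rng = np.random.default_rng(0)
n_rows = 600
customer = rng.integers(0, 20, size=n_rows)
vendor = rng.integers(0, 5, size=n_rows)
product = rng.integers(0, 30, size=n_rows)

# Invented rule: the category mostly follows the product, but about 10%
# of rows get a random category, mimicking customers who disagree.
category = product % 4
noisy = rng.random(n_rows) < 0.1
category = np.where(noisy, rng.integers(0, 4, size=n_rows), category)

X = np.column_stack([customer, vendor, product])
y = np.array([f"cat_{c}" for c in category])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Both models get the same one-hot encoded features.
models = {
    "random_forest": make_pipeline(
        OneHotEncoder(handle_unknown="ignore"),
        RandomForestClassifier(n_estimators=100, random_state=0),
    ),
    "mlp": make_pipeline(
        OneHotEncoder(handle_unknown="ignore"),
        MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, random_state=0),
    ),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)  # accuracy on held-out rows
    print(name, round(scores[name], 3))
```

On the real data, tuning could be added by wrapping each pipeline in GridSearchCV, and with >7,000 classes a metric such as top-k accuracy may be more informative than plain accuracy.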

