Converting categorical features with and without unique delimiters using pd.get_dummies in pandas

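Problem description

Details about the goal

I am trying to use pd.get_dummies in pandas to convert categorical features into a dataframe of dummy/indicator variables, separately for three columns: genre, demographics, and price.

Additional details

Two of the columns are delimited, one with "," and the other with "|". The third holds a single option per row; it contains a comma, but that comma is part of the price, not a delimiter.

Overall goal - beyond this fix

Once that is done, I want to run a scaling function that returns a numpy array containing the features, fit a KNN model from scikit-learn to the data, and compute the nearest neighbors and their distances.

Imports and loading the dataset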

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import re
import seaborn as sns
%matplotlib inline

import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)

artist = pd.read_csv("artist.csv")

artist.head()

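This is the current dataframe

I have simplified it; the real dataframe contains thousands of names, genres, price points, and demographics.

Dataframe: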

id |            name              |       genre                              |     price         |     demo                 |      songs     |         bio      |
---+------------------------------+------------------------------------------+-------------------+--------------------------+----------------+------------------+
 1 |           Ace Frehley         |    Classic Rock,Rock Music               |   Call For Fee    |  25-35,35-50,50 +        |
---+------------------------------+------------------------------------------+-------------------+--------------------------+----------------+------------------+
 2 |           Air Supply         | Adult Contemporary, Pop Music            |   Call For Fee    |  35-50, 50 +             |
---+------------------------------+------------------------------------------+-------------------+--------------------------+----------------+------------------+
 3 |           Bebe Rexha         |  Country Music, Hip Hop & Rap, Pop Music |   Call For Fee    |  Undefined               |
---+------------------------------+------------------------------------------+-------------------+--------------------------+----------------+------------------+
 4 |           Blanco Brown       |           Hip Hop & Rap, R&B             |   Call For Fee    |  Undefined               |
---+------------------------------+------------------------------------------+-------------------+--------------------------+----------------+------------------+
 5 |           Cautious Clay      |           Hip Hop & Rap, R&B             |   Call For Fee    |  Undefined               |
---+------------------------------+------------------------------------------+-------------------+--------------------------+----------------+------------------+
 6 |           Andy Samberg       |           Standup Comedy                 |   Call For Fee    |  18-25,25-35,35-50       | 
---+------------------------------+------------------------------------------+-------------------+--------------------------+----------------+------------------+
 7 |           Afrojack           |              DJ's                        |  Under $200,000   |  Under 18,18-25,25-35    |                    
---+------------------------------+------------------------------------------+-------------------+--------------------------+----------------+------------------+
 8 |           Billy Idol         |              Classic Rock                |  Under $200,000   |  25-35,35-50,50 +        |
---+------------------------------+------------------------------------------+-------------------+--------------------------+----------------+------------------+


artist.isnull().sum()

I read about pandas.get_dummies here and tried a few different approaches, without success.

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.get_dummies.html

What I tried

artist1["genre"] = artist1["genre"].astype(float)


artist1_features = pd.concat([artist1['genre'].str.get_dummies(sep="| "),
                              pd.concat([artist1['demo'].str.get_dummies(sep=","),
                              pd.get_dummies(artist1[['price']]),axis=1)
artist1["name"] = artist1["name"].map(lambda name:re.sub(''[^A-Za0-9]+', " ", name))                              
artist1_features.head()

I also tried this

artist1["genre"] = artist1["genre"].astype(float)
artist1_processed = pd.get_dummies(metadata['genre']).str.get_dummies(sep="| ")
artist1_concat = pd.concat([artist1_processed, metadata], axis=1)
pd.get_dummies(artists1[["genre"]]).head()

The error I got

[image: screenshot of the error]

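Goal

Ideally, I would like to use pd.get_dummies, the pandas method for converting categorical features into a dataframe of dummy/indicator variables, separately for genre, demographics, and price.

Genre values are separated by "|", for example: Country Music| Hip Hop & Rap| Pop Music

Demographic values are separated by ",", for example: Under 18,18-25,25-35

Price needs no delimiter but contains a comma, for example: Under $200,000

I am applying ideas from a few different movie-database recommender system tutorials to a real project.

When finished, it should look like the following.

Expected results

What I am trying to produce:

Genre: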

id |            name              | Adult Contemporary | Classic Rock | Country Music | DJ's | Standup Comedy | Pop Music | Rock Music | Hip Hop & Rap | R&B |
---+------------------------------+--------------------+--------------+---------------+------+----------------+-----------+------------+---------------+-----+
 1 |           Ace Frehley         |           0        |        1     |       0       |  0   |       0        |        0  |      1     |         0     |  0  | 
---+------------------------------+--------------------+--------------+---------------+------+----------------+-----------+------------+---------------+-----+
 2 |           Air Supply         |           1        |        0     |       0       |  0   |       0        |        1  |      0     |         0     |  0  | 
---+------------------------------+--------------------+--------------+---------------+------+----------------+-----------+------------+---------------+-----+
 3 |           Bebe Rexha         |           0        |        0     |       1       |  0   |       0        |        1  |      0     |         1     |  0  | 
---+------------------------------+--------------------+--------------+---------------+------+----------------+-----------+------------+---------------+-----+
 4 |           Blanco Brown       |           1        |        0     |       1       |  0   |       0        |        0  |      0     |         1     |  0  | 
---+------------------------------+--------------------+--------------+---------------+------+----------------+-----------+------------+---------------+-----+
 5 |           Cautious Clay      |           0        |        0     |       0       |  0   |       0        |        0  |      0     |         1     |  1  | 
---+------------------------------+--------------------+--------------+---------------+------+----------------+-----------+------------+---------------+-----+
 6 |           Andy Samberg       |           0        |        0     |       1       |  0   |       1        |        0  |      0     |         1     |  0  | 
---+------------------------------+--------------------+--------------+---------------+------+----------------+-----------+------------+---------------+-----+
 7 |           Afrojack           |           0        |        0     |       0       |  1   |       0        |        0  |      0     |         0     |  0  | 
---+------------------------------+--------------------+--------------+---------------+------+----------------+-----------+------------+---------------+-----+
 8 |           Billy Idol         |           0        |        1     |       0       |  0   |       0        |        0  |      0     |         0     |  0  | 
---+------------------------------+--------------------+--------------+---------------+------+----------------+-----------+------------+---------------+-----+

Demographics:


id |            name              | Under 18 | 18-25 | 25-35 | 35-50 | 50 + | Undefined |
---+------------------------------+----------+-------+-------+-------+------+-----------+
 1 |        Ace Frehley           |     0    |   0   |   1   |   1   |   1  |    0      |
---+------------------------------+----------+-------+-------+-------+------+-----------+
 2 |        Air Supply            |     0    |   0   |   0   |   1   |   1  |    0      |
---+------------------------------+----------+-------+-------+-------+------+-----------+
 3 |        Bebe Rexha            |     0    |   0   |   0   |   0   |   0  |    1      |    
---+------------------------------+----------+-------+-------+-------+------+-----------+
 4 |            Blanco Brown      |     0    |   0   |   0   |   0   |   0  |    1      | 
---+------------------------------+----------+-------+-------+-------+------+-----------+
 5 |            Cautious Clay     |     0    |   0   |   1   |   1   |   1  |    1      | 
---+------------------------------+----------+-------+-------+-------+------+-----------+
 6 |            Andy Samberg      |     0    |   1   |   1   |   1   |   0  |    0      | 
---+------------------------------+----------+-------+-------+-------+------+-----------+
 7 |            Afrojack          |     1    |   1   |   1   |   0   |   0  |    0      |  
---+------------------------------+----------+-------+-------+-------+------+-----------+
 8 |            Billy Idol        |     0    |   0   |   1   |   1   |   1  |    0      | 
---+------------------------------+----------+-------+-------+-------+------+-----------+

Price:

id |            name              | Call For Fee | Under $15,000 | Under $25,000 | Under $50,000 | Under $75,000 | Under $100,000 | Under $150,000 | Under $200,000 |
---+------------------------------+--------------+---------------+---------------+---------------+---------------+----------------+----------------+----------------+
 1 |        Ace Frehley           |       1      |       0       |       0       |       0       |       0       |        0       |        0       |        0       |
---+------------------------------+--------------+---------------+---------------+---------------+---------------+----------------+----------------+----------------+
 2 |        Air Supply            |       0      |       0       |       0       |       0       |       0       |        1       |        0       |        0       |
---+------------------------------+--------------+---------------+---------------+---------------+---------------+----------------+----------------+----------------+
 3 |        Bebe Rexha            |       1      |       0       |       0       |       0       |       0       |        0       |        0       |        0       |
---+------------------------------+--------------+---------------+---------------+---------------+---------------+----------------+----------------+----------------+
 4 |            Blanco Brown      |       1      |       0       |       0       |       0       |       0       |        0       |        0       |        0       | 
---+------------------------------+--------------+---------------+---------------+---------------+---------------+----------------+----------------+----------------+
 5 |            Cautious Clay     |       1      |       0       |       0       |       0       |       0       |        0       |        0       |        0       |
---+------------------------------+--------------+---------------+---------------+---------------+---------------+----------------+----------------+----------------+
 6 |            Andy Samberg      |       1      |       0       |       0       |       0       |       0       |        0       |        0       |        0       |
---+------------------------------+--------------+---------------+---------------+---------------+---------------+----------------+----------------+----------------+
 7 |            Afrojack          |       0      |       0       |       0       |       0       |       0       |        0       |        0       |        1       | 
---+------------------------------+--------------+---------------+---------------+---------------+---------------+----------------+----------------+----------------+
 8 |            Billy Idol        |       0      |       0       |       0       |       0       |       0       |        0       |        0       |        1       | 
---+------------------------------+--------------+---------------+---------------+---------------+---------------+----------------+----------------+----------------+

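Once that is done, I want to run a scaling function that returns a numpy array containing the features, fit a KNN model from scikit-learn to the data, and compute the nearest neighbors and their distances.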
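
The follow-up step described above could look roughly like the sketch below. The question does not say which scaler or neighbor model is intended, so StandardScaler and NearestNeighbors are assumptions, and the small 0/1 matrix is made-up stand-in data rather than the real encoded features.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

# Hypothetical 0/1 feature matrix (rows = artists, columns = dummy features).
features = np.array([
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
], dtype=float)

# fit_transform returns a plain numpy array of scaled features.
scaled = StandardScaler().fit_transform(features)

# n_neighbors=2: each row's closest match is itself (distance 0),
# so the second column holds the nearest *other* row.
knn = NearestNeighbors(n_neighbors=2, metric="euclidean").fit(scaled)
distances, indices = knn.kneighbors(scaled)
```

Here distances and indices each have one row per artist. With purely binary features, scaling may matter less than the choice of metric, so this is a starting point rather than a recommendation.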

Tags: python, pandas, dataframe, nlp, data-manipulation

Solution

Fixing your output

# genre and demo are split on their delimiters; price is one-hot encoded whole
artist1_features = pd.concat([artist1['genre'].str.get_dummies(sep="| "),
                              artist1['demo'].str.get_dummies(sep=","),
                              pd.crosstab(artist1.index, artist1['price'])], axis=1)
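
As a self-contained illustration of the same idea, here is a minimal sketch on toy rows copied from the question's table. The sample table shows commas (sometimes followed by a space) in both genre and demo, while the prose mentions "|", so pass whichever separator your real data uses. The helper name split_dummies is hypothetical and not part of pandas; it only strips stray whitespace around tokens before calling Series.str.get_dummies.

```python
import pandas as pd

# Toy rows modeled on the question's table; the real artist.csv is not available here.
artist = pd.DataFrame({
    "name": ["Ace Frehley", "Air Supply", "Afrojack"],
    "genre": ["Classic Rock,Rock Music", "Adult Contemporary, Pop Music", "DJ's"],
    "demo": ["25-35,35-50,50 +", "35-50, 50 +", "Under 18,18-25,25-35"],
    "price": ["Call For Fee", "Call For Fee", "Under $200,000"],
})

def split_dummies(col: pd.Series, sep: str) -> pd.DataFrame:
    """Hypothetical helper: one 0/1 column per token in a delimited column."""
    # Strip whitespace around each token so " Pop Music" and "Pop Music"
    # do not end up as two different columns.
    cleaned = col.fillna("").map(
        lambda s: sep.join(t.strip() for t in s.split(sep) if t.strip())
    )
    return cleaned.str.get_dummies(sep=sep)

genre = split_dummies(artist["genre"], ",")
demo = split_dummies(artist["demo"], ",")

# price is never split, so the comma in "Under $200,000" stays intact.
price = pd.get_dummies(artist["price"], dtype=int)

features = pd.concat([artist[["name"]], genre, demo, price], axis=1)
```

If genre and demo could ever share a token, adding a prefix to each block (for example with DataFrame.add_prefix) would keep the column names unique.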
