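python - Using pd.get_dummies in pandas to convert categorical features with and without a unique separator
Problem description
Details about the goal
I am trying to use pd.get_dummies in pandas to convert categorical features into data frames of dummy/indicator variables, separately for three different columns: genre, demographics, and price.
Additional details
Two of the columns use a separator: one uses "," and the other uses "|". The third column holds only a single option per row; it contains a comma, but that comma is part of the price, not a separator.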
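To make the distinction concrete, here is a minimal sketch of how the two kinds of column behave. The sample values are hypothetical, modelled on the columns described above, and are not taken from the real dataset.

```python
import pandas as pd

# Hypothetical sample values modelled on the three columns described above.
genre = pd.Series(["Classic Rock|Rock Music", "Pop Music"])
demo = pd.Series(["25-35,35-50", "Under 18,18-25"])
price = pd.Series(["Call For Fee", "Under $200,000"])

# Series.str.get_dummies splits each string on `sep` and emits one
# indicator column per distinct token.
genre_dummies = genre.str.get_dummies(sep="|")
demo_dummies = demo.str.get_dummies(sep=",")

# pd.get_dummies treats each whole string as a single category, so the
# comma inside "Under $200,000" is never interpreted as a separator.
price_dummies = pd.get_dummies(price)

print(genre_dummies.columns.tolist())
print(demo_dummies.columns.tolist())
print(price_dummies.columns.tolist())
```

In other words, the multi-valued columns need the string accessor with the right `sep`, while the price column can go through the top-level function unchanged.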
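Overall goal - beyond this fix
Once this is done, I want to run a scaling function that returns a NumPy array containing the features, fit a KNN model from scikit-learn to the data, and calculate the nearest neighbors and their distances.
Imports and loading the dataset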
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import re
import seaborn as sns
%matplotlib inline
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)
artist = pd.read_csv("artist.csv")
artist.head()
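Here is the current data frame
I have simplified it; the real data frame contains thousands of names, genres, price points, and demographics.
Data frame: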
id | name | genre | price | demo | songs | bio |
---+------------------------------+------------------------------------------+-------------------+--------------------------+----------------+------------------+
1 | Ace Frehley | Classic Rock,Rock Music | Call For Fee | 25-35,35-50,50 + |
---+------------------------------+------------------------------------------+-------------------+--------------------------+----------------+------------------+
2 | Air Supply | Adult Contemporary, Pop Music | Call For Fee | 35-50, 50 + |
---+------------------------------+------------------------------------------+-------------------+--------------------------+----------------+------------------+
3 | Bebe Rexha | Country Music, Hip Hop & Rap, Pop Music | Call For Fee | Undefined |
---+------------------------------+------------------------------------------+-------------------+--------------------------+----------------+------------------+
4 | Blanco Brown | Hip Hop & Rap, R&B | Call For Fee | Undefined |
---+------------------------------+------------------------------------------+-------------------+--------------------------+----------------+------------------+
5 | Cautious Clay | Hip Hop & Rap, R&B | Call For Fee | Undefined |
---+------------------------------+------------------------------------------+-------------------+--------------------------+----------------+------------------+
6 | Andy Samberg | Standup Comedy | Call For Fee | 18-25,25-35,35-50 |
---+------------------------------+------------------------------------------+-------------------+--------------------------+----------------+------------------+
7 | Afrojack | DJ's | Under $200,000 | Under 18,18-25,25-35 |
---+------------------------------+------------------------------------------+-------------------+--------------------------+----------------+------------------+
8 | Billy Idol | Classic Rock | Under $200,000 | 25-35,35-50,50 + |
---+------------------------------+------------------------------------------+-------------------+--------------------------+----------------+------------------+
artist.isnull().sum()
I read about pandas.get_dummies here and tried a few different approaches, without success.
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.get_dummies.html
What I tried
artist1["genre"] = artist1["genre"].astype(float)
artist1_features = pd.concat([artist1['genre'].str.get_dummies(sep="| "),
pd.concat([artist1['demo'].str.get_dummies(sep=","),
pd.get_dummies(artist1[['price']]),axis=1)
artist1["name"] = artist1["name"].map(lambda name:re.sub(''[^A-Za0-9]+', " ", name))
artist1_features.head()
I also tried this
artist1["genre"] = artist1["genre"].astype(float)
artist1_processed = pd.get_dummies(metadata['genre']).str.get_dummies(sep="| ")
artist1_concat = pd.concat([artist1_processed, metadata], axis=1)
pd.get_dummies(artists1[["genre"]]).head()
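The error I get
Goal
Ideally, I would like to use pd.get_dummies, the pandas method for converting categorical features into a data frame of dummy/indicator variables, separately for each of genre, demographics, and price.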
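Genre: basically has a separator like "|" - for example: Country Music| Hip Hop & Rap| Pop Music
Demographics: basically has a separator like "," - for example: Under 18,18-25,25-35
Price: needs no separator, but contains a comma - for example: Under $200,000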
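I am applying ideas from a few different movie-database recommender system tutorials to a real project.
When finished, it should look like the following.
Expected results
What I am trying to do:
Genre: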
id | name | Adult Contemporary | Classic Rock | Country Music | DJ's | Standup Comedy | Pop Music | Rock Music | Hip Hop & Rap | R&B |
---+------------------------------+--------------------+--------------+---------------+------+----------------+-----------+------------+---------------+-----+
1 | Ace Frehley | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
---+------------------------------+--------------------+--------------+---------------+------+----------------+-----------+------------+---------------+-----+
2 | Air Supply | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
---+------------------------------+--------------------+--------------+---------------+------+----------------+-----------+------------+---------------+-----+
3 | Bebe Rexha | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 |
---+------------------------------+--------------------+--------------+---------------+------+----------------+-----------+------------+---------------+-----+
4 | Blanco Brown | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
---+------------------------------+--------------------+--------------+---------------+------+----------------+-----------+------------+---------------+-----+
5 | Cautious Clay | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 |
---+------------------------------+--------------------+--------------+---------------+------+----------------+-----------+------------+---------------+-----+
6 | Andy Samberg | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 |
---+------------------------------+--------------------+--------------+---------------+------+----------------+-----------+------------+---------------+-----+
7 | Afrojack | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
---+------------------------------+--------------------+--------------+---------------+------+----------------+-----------+------------+---------------+-----+
8 | Billy Idol | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
---+------------------------------+--------------------+--------------+---------------+------+----------------+-----------+------------+---------------+-----+
Demographics:
id | name | Under 18 | 18-25 | 25-35 | 35-50 | 50 + | Undefined |
---+------------------------------+----------+-------+-------+-------+------+-----------+
1 | Ace Frehley | 0 | 0 | 1 | 1 | 1 | 0 |
---+------------------------------+----------+-------+-------+-------+------+-----------+
2 | Air Supply | 0 | 0 | 0 | 1 | 1 | 0 |
---+------------------------------+----------+-------+-------+-------+------+-----------+
3 | Bebe Rexha | 0 | 0 | 0 | 0 | 0 | 1 |
---+------------------------------+----------+-------+-------+-------+------+-----------+
4 | Blanco Brown | 0 | 0 | 0 | 0 | 0 | 1 |
---+------------------------------+----------+-------+-------+-------+------+-----------+
5 | Cautious Clay | 0 | 0 | 1 | 1 | 1 | 1 |
---+------------------------------+----------+-------+-------+-------+------+-----------+
6 | Andy Samberg | 0 | 1 | 1 | 1 | 0 | 0 |
---+------------------------------+----------+-------+-------+-------+------+-----------+
7 | Afrojack | 1 | 1 | 1 | 0 | 0 | 0 |
---+------------------------------+----------+-------+-------+-------+------+-----------+
8 | Billy Idol | 0 | 0 | 1 | 1 | 1 | 0 |
---+------------------------------+----------+-------+-------+-------+------+-----------+
Price:
id | name | Call For Fee | Under $15,000 | Under $25,000 | Under $50,000 | Under $75,000 | Under $100,000 | Under $150,000 | Under $200,000 |
---+------------------------------+--------------+---------------+---------------+---------------+---------------+----------------+----------------+----------------+
1 | Ace Frehley | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
---+------------------------------+--------------+---------------+---------------+---------------+---------------+----------------+----------------+----------------+
2 | Air Supply | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
---+------------------------------+--------------+---------------+---------------+---------------+---------------+----------------+----------------+----------------+
3 | Bebe Rexha | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
---+------------------------------+--------------+---------------+---------------+---------------+---------------+----------------+----------------+----------------+
4 | Blanco Brown | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
---+------------------------------+--------------+---------------+---------------+---------------+---------------+----------------+----------------+----------------+
5 | Cautious Clay | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
---+------------------------------+--------------+---------------+---------------+---------------+---------------+----------------+----------------+----------------+
6 | Andy Samberg | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
---+------------------------------+--------------+---------------+---------------+---------------+---------------+----------------+----------------+----------------+
7 | Afrojack | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
---+------------------------------+--------------+---------------+---------------+---------------+---------------+----------------+----------------+----------------+
8 | Billy Idol | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
---+------------------------------+--------------+---------------+---------------+---------------+---------------+----------------+----------------+----------------+
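Once this is done, I want to run a scaling function that returns a NumPy array containing the features, fit a KNN model from scikit-learn to the data, and calculate the nearest neighbors and their distances.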
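For orientation, a minimal sketch of what that follow-up step could look like. The question does not say which scaler or distance metric is intended, so `MinMaxScaler` and the Euclidean metric are assumptions here, and the small `features` frame is a hypothetical stand-in for the encoded artist data.

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import MinMaxScaler

# Hypothetical indicator matrix standing in for the encoded artist features.
features = pd.DataFrame(
    {"Classic Rock": [1, 0, 1], "Pop Music": [0, 1, 0], "25-35": [1, 0, 0]},
    index=["Ace Frehley", "Air Supply", "Billy Idol"],
)

# fit_transform returns a NumPy array with every column scaled to [0, 1].
# MinMaxScaler is an assumption; the question does not name a scaler.
scaled = MinMaxScaler().fit_transform(features)

# Fit an unsupervised nearest-neighbour index and query it with the same rows.
knn = NearestNeighbors(n_neighbors=2, metric="euclidean").fit(scaled)
distances, indices = knn.kneighbors(scaled)

# Because these rows are all distinct, column 0 is each row itself
# (distance 0) and column 1 is its closest other row.
for row, (dist, idx) in enumerate(zip(distances[:, 1], indices[:, 1])):
    print(features.index[row], "->", features.index[idx], round(float(dist), 3))
```

Indicator columns already lie in the range 0 to 1, so scaling mostly matters once numeric columns on other scales are added alongside them.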
Solution
Fixing your output
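# Split the multi-valued columns on their separators, one-hot encode price,
# and join the three indicator frames side by side.
artist1_features = pd.concat([artist1['genre'].str.get_dummies(sep="| "),
                              artist1['demo'].str.get_dummies(sep=","),
                              pd.crosstab(artist1.index, artist1['price'])], axis=1)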
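A self-contained sketch of the same idea, run on three rows copied from the sample table in the question. Two things here are assumptions rather than part of the answer: the helper `split_dummies` is a hypothetical name that normalises both "," and "|" (plus stray spaces) to one separator before encoding, and `pd.get_dummies` is swapped in for `pd.crosstab` on the price column because the question asked for it.

```python
import pandas as pd

# Three rows copied from the sample table in the question.
artist1 = pd.DataFrame(
    {
        "name": ["Ace Frehley", "Air Supply", "Afrojack"],
        "genre": ["Classic Rock,Rock Music", "Adult Contemporary, Pop Music", "DJ's"],
        "price": ["Call For Fee", "Call For Fee", "Under $200,000"],
        "demo": ["25-35,35-50,50 +", "35-50, 50 +", "Under 18,18-25,25-35"],
    }
)


def split_dummies(column: pd.Series) -> pd.DataFrame:
    """Hypothetical helper: one indicator column per token.

    Accepts ',' or '|' as the separator and drops whitespace around it, so
    "Adult Contemporary, Pop Music" yields "Pop Music", not " Pop Music".
    """
    normalised = column.str.strip().str.replace(r"\s*[,|]\s*", "|", regex=True)
    return normalised.str.get_dummies(sep="|")


artist1_features = pd.concat(
    [
        split_dummies(artist1["genre"]),
        split_dummies(artist1["demo"]),
        # Whole-string categories: the comma in "Under $200,000" is preserved.
        pd.get_dummies(artist1["price"], dtype=int),
    ],
    axis=1,
)

# Put the artist name back in front so each row stays identifiable.
artist1_features.insert(0, "name", artist1["name"])
print(artist1_features)
```

Normalising the separator first matters for data like the sample table, where some rows have a space after the comma and would otherwise produce duplicate columns that differ only by leading whitespace.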