Pandas: creating dummy variables from values within a column

Question

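I have a dataframe with a column called Actors, where each cell contains a string like "Abigail Breslin, Greg Kinnear, Paul Dano, Alan Arkin". I would like to split this string on "," so that each cell contains a list of actors, i.e. ["Abigail Breslin", "Greg Kinnear", "Paul Dano", "Alan Arkin"], and I can then create a dummy variable for each unique actor. I have not yet found a solution that actually splits the string and sends each actor name to its own new column.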

Any help would be greatly appreciated :)

My dataframe (df) looks like this:

Title (Object)| Actors (Object)                                              |  Year (Object)    
Pulp Fiction  | Bruce Willis, Amanda Plummer, Laura Lovelace, John Travolta  |  1994
Fight Club    | Edward Norton, Brad Pitt, Helena Bonham Carter, Meat Loaf    |  1999

My goal is for the dataframe to look like this:

Title (Object)| Bruce Willis | Amanda Plummer | Laura Lovelace | John Travolta | Edward Norton | Year   
Pulp Fiction  |       1      |        1       |       1        |      1        |       0       | 1994
Fight Club    |       0      |        0       |       0        |      0        |       1       | 1999

I tried:

import pandas as pd

data = 'Imdb_datajson(Cleaned).csv'

df = pd.read_csv(data)
list_of_unique_actors = df.Actors.unique().tolist()
list_of_unique_actors

newlist = []
for actor in list_of_unique_actors:
    actor = actor.split(",")
    newlist.extend(actor)

and received this error:

AttributeError                            Traceback (most recent call last)
<ipython-input-48-ae50a804fe05> in <module>
      5 newlist = []
      6 for word in list_of_unique_actors:
----> 7     word = word.split(",")
      8     newlist.extend(word)
      9 return newlist

AttributeError: 'float' object has no attribute 'split'

Tags: python, pandas, dataframe, split, dummy-variable

Answer


Use pd.get_dummies():

from io import StringIO

import pandas as pd

# sample data
s = """Title (Object)|Actors (Object)|Year (Object)
Pulp Fiction|Bruce Willis, Amanda Plummer, Laura Lovelace, John Travolta|1994
Fight Club|Edward Norton, Brad Pitt, Helena Bonham Carter, Meat Loaf|1999"""
# read csv
df = pd.read_csv(StringIO(s), sep='|')

# split each string of actors into a list
df['Actors (Object)'] = df['Actors (Object)'].str.split(', ')
# set the title and year as index
df = df.set_index(['Title (Object)', 'Year (Object)'])
# expand to one row per (title, year, actor), then create one indicator column per actor;
# sum(level=...) was removed in pandas 2.0, so group by the two index levels instead
dummy_df = pd.get_dummies(df['Actors (Object)'].apply(pd.Series).stack()).groupby(level=[0, 1]).sum()


                               Edward Norton  Amanda Plummer  Brad Pitt  \
Title (Object) Year (Object)                                              
Pulp Fiction   1994                        0               1          0   
Fight Club     1999                        1               0          1   

                              Bruce Willis  Helena Bonham Carter  \
Title (Object) Year (Object)                                       
Pulp Fiction   1994                      1                     0   
Fight Club     1999                      0                     1   

                              John Travolta  Laura Lovelace  Meat Loaf  
Title (Object) Year (Object)                                            
Pulp Fiction   1994                       1               1          0  
Fight Club     1999                       0               0          1  
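
A note on the AttributeError in the question: pandas represents missing values as NaN, which is a float, so the message 'float' object has no attribute 'split' suggests that at least one cell in the Actors column is empty. The sketch below is an alternative, not the approach used above: it relies on Series.str.get_dummies, which splits on a separator and leaves missing values as all-zero rows. The column names (Title, Actors, Year) and the third row with the missing value are assumptions made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; the third row has a missing Actors value (NaN),
# the kind of cell on which a plain Python .split() call fails.
df = pd.DataFrame({
    "Title": ["Pulp Fiction", "Fight Club", "Untitled"],
    "Actors": [
        "Bruce Willis, Amanda Plummer, Laura Lovelace, John Travolta",
        "Edward Norton, Brad Pitt, Helena Bonham Carter, Meat Loaf",
        np.nan,
    ],
    "Year": [1994, 1999, 2000],
})

# Split each string on ", " and build one indicator column per actor.
# Rows where Actors is NaN get 0 in every actor column.
dummies = df["Actors"].str.get_dummies(sep=", ")

# Place the indicator columns between Title and Year, as in the target layout.
result = pd.concat([df[["Title"]], dummies, df[["Year"]]], axis=1)
print(result)
```

If rows without any actors should not appear in the result at all, calling df.dropna(subset=["Actors"]) before creating the dummies is one option.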
