首页 > 解决方案 > 使用新列更新熊猫数据框

问题描述

我想创建一个包含行中所有不同值的新列。一行中的每个值都是一个字符串(不是列表)。

这是数据框的样子:

+-----------------------------+-------------------------+---------------------------------------------+
|         first               |            second       |           third                             |  
+-----------------------------+-------------------------+---------------------------------------------+
|['able', 'shovel', 'door']   |['shovel raised']        |['shovel raised', 'raised', 'door', 'shovel']|
|['grade control']            |['grade']                |['grade']                                    |
|['light telling', 'love']    |['would love', 'closed'] |['closed', 'light']                          |
+-----------------------------+-------------------------+---------------------------------------------+

这是创建具有不同值的新列后数据框的外观。

df = pd.DataFrame({'first': "['able', 'shovel', 'door']" , 'second': "['shovel raised']", 'third': "['shovel raised', 'raised', 'door', 'shovel']", "Distinct_set": "['able', 'shovel', 'door', 'shovel raised', 'raised']" }, index = [0])

我该怎么做?

标签: pythonpandas

解决方案


这个怎么样:

import pandas as pd
import numpy as np

df = pd.DataFrame([[['able', 'shovel', 'door'], ['shovel raised'], ['shovel raised', 'raised', 'door', 'shovel']], [['grade control'], ['grade'], ['grade']], [['light telling', 'love'], ['would love', 'closed'], ['closed', 'light']]], columns=['first', 'second', 'third'])

df.apply(lambda row: [np.unique(np.hstack(row))], raw=True, axis=1)

最后一个命令产生:

0        [[able, door, raised, shovel, shovel raised]]
1                             [[grade, grade control]]
2    [[closed, light, light telling, love, would lo...

可以保存在数据框的新列中:

df['Distinct_set'] = df.apply(lambda row: [np.unique(np.hstack(row))], raw=True, axis=1) 

推荐阅读