python - Split a list column into one-hot encoded features in PySpark

Problem description

I have a PySpark DataFrame that looks like this:

ID	tmp_list	other features
1	['Spain', 'Italy']	xxx
2	['Spain', 'France', 'USA', 'India']	yyy
3	['Spain', 'Germany']	zzz

and a list of countries as follows:

EU_countries = ['Spain', 'Italy', 'France', 'Germany']

I want to do the following:

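Extract all unique values from the column tmp_list.
Create a new column for every value that is present in EU_countries. For values that are not present in EU_countries, create a single column named other_countries. In essence, create one column per entry in the EU_countries list plus one column named other_countries.
If an ID contains any of the countries in the EU_countries list, the new column for that country (for example Spain) should be 1, else 0. The same applies to the other countries in the EU_countries list.
If an ID contains any country that is not in the EU_countries list, the other_countries column should be 1, else 0.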
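The rules above can be pinned down on plain Python lists before moving to Spark. This is only a reference sketch of the intended semantics, not Spark code; the helper name encode_row is hypothetical and the sample rows are copied from the table in the question.

```python
EU_countries = ['Spain', 'Italy', 'France', 'Germany']


def encode_row(countries, eu_countries):
    """Return 0/1 flags for one row (encode_row is a hypothetical helper)."""
    present = set(countries)
    # One flag per known country: 1 if the row lists it, else 0.
    encoded = {c: int(c in present) for c in eu_countries}
    # A single flag for everything that is not in eu_countries.
    encoded['other_countries'] = int(bool(present - set(eu_countries)))
    return encoded


# Sample rows from the question, keyed by ID.
rows = {
    1: ['Spain', 'Italy'],
    2: ['Spain', 'France', 'USA', 'India'],
    3: ['Spain', 'Germany'],
}

encoded = {row_id: encode_row(c, EU_countries) for row_id, c in rows.items()}
for row_id, flags in encoded.items():
    print(row_id, flags)
```

Any Spark solution should produce the same flags per ID as this sketch does.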

This is the final output I am looking for:

ID	Spain	Italy	France	Germany	other_countries	other features
1	1	1	0	0	0	xxx
2	1	0	1	0	1	yyy
3	1	0	0	1	0	zzz

I have been banging my head against this. Can someone help me?

Any help is greatly appreciated. Thanks a lot!

Tags: python, pyspark

I would reason about this and work through it the same way as in pandas.

Explode the list column.
Create a category column that maps every country not in EU_countries to other_countries.
Get dummies. For this step, credit goes to this post.

The code is as follows:

from pyspark.sql import functions as F
df = df.withColumn('tmp_list1', F.explode(F.col('tmp_list')))  # One row per country in the list
df = df.withColumn('Cat', F.when(F.col('tmp_list1').isin(EU_countries), F.col('tmp_list1')).otherwise('other_countries'))  # Bucket non-EU countries into 'other_countries'
df.groupBy('tmp_list', 'other features').pivot('Cat').agg(F.lit(1)).na.fill(0).show()  # Get dummies: pivot categories into 0/1 columns


+---------------------------+--------------+------+-------+-----+-----+---------------+
|tmp_list                   |other features|France|Germany|Italy|Spain|other_countries|
+---------------------------+--------------+------+-------+-----+-----+---------------+
|[Spain, Germany]           |zzz           |0     |1      |0    |1    |0              |
|[Spain, Italy]             |xxx           |0     |0      |1    |1    |0              |
|[Spain, France, USA, India]|yyy           |1     |0      |0    |1    |1              |
+---------------------------+--------------+------+-------+-----+-----+---------------+
