How do I split a column containing a list of dictionaries into two columns in a PySpark DataFrame?

Question

I want to split the filteredaddress column of the Spark DataFrame below into two new columns, Flag and Address:

customer_id|pincode|filteredaddress|                                                              Flag| Address
1000045801 |121005 |[{'flag':'0', 'address':'House number 172, Parvatiya Colony Part-2 , N.I.T'}]
1000045801 |121005 |[{'flag':'1', 'address':'House number 172, Parvatiya Colony Part-2 , N.I.T'}]
1000045801 |121005 |[{'flag':'1', 'address':'House number 172, Parvatiya Colony Part-2 , N.I.T'}]

Can anyone tell me how to do this?

Tags: python, apache-spark, pyspark, apache-spark-sql

Solution


You can get the values from the map column filteredaddress by key:

df2 = df.selectExpr(
    'customer_id', 'pincode',
    "filteredaddress['flag'] as flag", "filteredaddress['address'] as address"
)

Other ways to access the map values are:

import pyspark.sql.functions as F

df.select(
    'customer_id', 'pincode',
    F.col('filteredaddress')['flag'],
    F.col('filteredaddress')['address']
)

# or, more simply

df.select(
    'customer_id', 'pincode',
    'filteredaddress.flag',
    'filteredaddress.address'
)
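
The snippets above assume filteredaddress is already a map column. If it is instead stored as a plain string that merely looks like a list of dictionaries (as the sample rows might suggest), it would need to be parsed first. Here is a minimal sketch of that parsing step in plain Python. The helper name split_filtered_address is hypothetical, and the code assumes each value holds at most one dictionary with 'flag' and 'address' keys:

```python
import ast
from typing import Optional, Tuple


def split_filtered_address(raw: Optional[str]) -> Tuple[Optional[str], Optional[str]]:
    """Return (flag, address) parsed from a string such as
    "[{'flag': '0', 'address': '...'}]"; (None, None) if it cannot be parsed."""
    if not raw:
        return (None, None)
    try:
        # literal_eval only evaluates Python literals, so it is safer than eval
        parsed = ast.literal_eval(raw)
    except (ValueError, SyntaxError):
        return (None, None)
    # Accept a list of dicts (use the first element) or a single dict
    if isinstance(parsed, list):
        parsed = parsed[0] if parsed else None
    if not isinstance(parsed, dict):
        return (None, None)
    return (parsed.get("flag"), parsed.get("address"))


sample = "[{'flag':'0', 'address':'House number 172, Parvatiya Colony Part-2 , N.I.T'}]"
flag, address = split_filtered_address(sample)
```

In PySpark, a function like this could be wrapped with pyspark.sql.functions.udf and a struct return type, though built-in column functions are generally faster than Python UDFs when the column is already a map or struct.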
