Convert a PySpark Databricks DataFrame with array-shaped strings into standard columns

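Problem description

I have a Databricks DataFrame containing a large number of questionnaire results. The questionnaires in the df vary in length, and the questions are not always the same.

How can I extract the questions and answers from the string field Responses, so that I end up with a three-column list: "CustomerID, Questions, Answers"?

So from this: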

CustomerID     Responses
1              [{"question1":"answer 1"},{"question 2":"answer2"}]
2              [{"question1":"answer 1a"},{"question 2":"answer2b"}]
3              [{"question1":"answer 1b"},{"question 3":"answer3"}]

to this:

CustomerID   Questions  Answers
1            question1  answer1
1            question2  answer2
2            question1  answer1a
2            question2  answer2b 
3            question1  answer1b 
3            question3  answer3 

Tags: python, apache-spark, pyspark

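Solution

Because the Responses column holds a plain string rather than a parsed JSON structure, you first have to parse it with a schema; only then can you explode it into one row per question.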

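from pyspark.sql import functions as F
from pyspark.sql import types as T

# Schema for the Responses string: a JSON array of string-to-string maps
schema = T.ArrayType(T.MapType(T.StringType(), T.StringType()))

(df
    # Parse the string into array<map<string,string>>
    .withColumn('Responses', F.from_json('Responses', schema))
    # Produce one row per map in the array
    .withColumn('Response', F.explode('Responses'))
    # In the sample data each map holds one pair, so take the first key and value
    .withColumn('Question', F.map_keys('Response')[0])
    .withColumn('Answer', F.map_values('Response')[0])
    .drop('Responses', 'Response')
    .show(10, False)
)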

# Output
# +----------+----------+---------+
# |CustomerID|Question  |Answer   |
# +----------+----------+---------+
# |1         |question1 |answer 1 |
# |1         |question 2|answer2  |
# |2         |question1 |answer 1a|
# |2         |question 2|answer2b |
# |3         |question1 |answer 1b|
# |3         |question 3|answer3  |
# +----------+----------+---------+
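
To see what the parse-then-explode steps do row by row, here is a minimal plain-Python sketch that needs no Spark session. It only illustrates the logic and is not a replacement for the PySpark code above. The `rows` list copies the sample data from the question, and `flatten_responses` is a hypothetical helper name introduced for this example. Unlike the `[0]` indexing in the answer, it walks every key in each map, so it would also handle a map holding several question/answer pairs.

```python
import json

# Sample rows copied from the question: (CustomerID, Responses as a string).
rows = [
    (1, '[{"question1":"answer 1"},{"question 2":"answer2"}]'),
    (2, '[{"question1":"answer 1a"},{"question 2":"answer2b"}]'),
    (3, '[{"question1":"answer 1b"},{"question 3":"answer3"}]'),
]


def flatten_responses(rows):
    """Hypothetical helper: return one (CustomerID, Question, Answer) tuple per pair."""
    flat = []
    for customer_id, responses in rows:
        # Step 1: parse the string into a list of dicts
        # (the role from_json plays in the Spark version).
        for response in json.loads(responses):
            # Step 2: emit one output row per list element
            # (the role explode plays in the Spark version).
            for question, answer in response.items():
                flat.append((customer_id, question, answer))
    return flat


flat_rows = flatten_responses(rows)
for row in flat_rows:
    print(row)
```

The sketch keeps the keys and values exactly as they appear in the string, including the spaces in "question 2" and "answer 1", which is also what the Spark output shows.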
