json - Convert multiple list columns into a JSON array column in a PySpark dataframe
Question
I have a dataframe with several list columns that I want to combine into a single JSON array column.
I tried the logic below, but it does not work. Any ideas?
def test(test1, test2):
    d = {'data': [{'marks': a, 'grades': t} for a, t in zip(test1, test2)]}
    return d
The UDF is defined with an array return type as below, and I call it on the columns, but that does not work either. Any ideas?
arrayToMapUDF = udf(test, ArrayType(StringType()))
df.withColumn("jsonarraycolumn", arrayToMapUDF(col("col"), col("col2")))
marks | grades
---|---
[100, 150, 200, 300, 400] | [0.01, 0.02, 0.03, 0.04, 0.05]
It needs to be converted as shown below.
marks | grades | JSON array column
---|---|---
[100, 150, 200, 300, 400] | [0.01, 0.02, 0.03, 0.04, 0.05] | {data: [{marks: 1000, grades: 0.01}, {marks: 15000, grades: 0.02}, {marks: 2000, grades: 0.03}]}
Solution
You can use StringType instead, because the UDF returns a JSON string rather than an array of strings. You can also use json.dumps to convert the dictionary to a JSON string.
import pyspark.sql.functions as F
from pyspark.sql.types import StringType
import json

def test(test1, test2):
    d = [{'amount': a, 'discount': t} for a, t in zip(test1, test2)]
    return json.dumps(d)

arrayToMapUDF = F.udf(test, StringType())

df2 = df.withColumn("jsonarraycolumn", arrayToMapUDF(F.col("amount"), F.col("discount")))
df2.show(truncate=False)
+-------------------------------+------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|amount |discount |jsonarraycolumn |
+-------------------------------+------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[1000, 15000, 2000, 3000, 4000]|[0.01, 0.02, 0.03, 0.04, 0.05]|[{"amount": 1000, "discount": 0.01}, {"amount": 15000, "discount": 0.02}, {"amount": 2000, "discount": 0.03}, {"amount": 3000, "discount": 0.04}, {"amount": 4000, "discount": 0.05}]|
+-------------------------------+------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
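The serialization logic of the UDF can be checked outside Spark on plain Python lists, which is a quick way to confirm that json.dumps produces a valid JSON string before wiring the function into withColumn. A minimal sketch (the helper name to_json_array is just for this illustration):

```python
import json

def to_json_array(amounts, discounts):
    # Pair the two lists element-wise and serialize the result as a
    # JSON string, mirroring what the UDF does for each row.
    return json.dumps([{'amount': a, 'discount': d}
                       for a, d in zip(amounts, discounts)])

s = to_json_array([1000, 15000], [0.01, 0.02])
# The string round-trips through json.loads, confirming it is valid JSON.
parsed = json.loads(s)
```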
If you don't want the quotes:
import pyspark.sql.functions as F
from pyspark.sql.types import StringType
import json

def test(test1, test2):
    d = [{'amount': a, 'discount': t} for a, t in zip(test1, test2)]
    return json.dumps(d).replace('"', '')

arrayToMapUDF = F.udf(test, StringType())

df2 = df.withColumn("jsonarraycolumn", arrayToMapUDF(F.col("amount"), F.col("discount")))
df2.show(truncate=False)
+-------------------------------+------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------+
|amount |discount |jsonarraycolumn |
+-------------------------------+------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[1000, 15000, 2000, 3000, 4000]|[0.01, 0.02, 0.03, 0.04, 0.05]|[{amount: 1000, discount: 0.01}, {amount: 15000, discount: 0.02}, {amount: 2000, discount: 0.03}, {amount: 3000, discount: 0.04}, {amount: 4000, discount: 0.05}]|
+-------------------------------+------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------+
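One caveat: after replace('"', '') the value is no longer valid JSON, so this variant is only suitable for display, not for downstream parsing. A quick check of that claim:

```python
import json

d = [{'amount': 1000, 'discount': 0.01}]
with_quotes = json.dumps(d)
no_quotes = with_quotes.replace('"', '')  # e.g. '[{amount: 1000, discount: 0.01}]'

# The quoted version parses back; the unquoted one is rejected,
# because JSON requires object keys to be double-quoted strings.
json.loads(with_quotes)
try:
    json.loads(no_quotes)
    parseable = True
except json.JSONDecodeError:
    parseable = False
```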
If you want a true JSON-typed column (an array of structs):
from pyspark.sql.types import ArrayType, StructType, StructField, StringType

def test(test1, test2):
    d = [{'amount': a, 'discount': t} for a, t in zip(test1, test2)]
    return d

arrayToMapUDF = F.udf(test,
    ArrayType(
        StructType([
            StructField('amount', StringType()),
            StructField('discount', StringType())
        ])
    )
)
df2 = df.withColumn("jsonarraycolumn", arrayToMapUDF(F.col("amount"), F.col("discount")))
df2.show(truncate=False)
+-------------------------------+------------------------------+-----------------------------------------------------------------------+
|amount |discount |jsonarraycolumn |
+-------------------------------+------------------------------+-----------------------------------------------------------------------+
|[1000, 15000, 2000, 3000, 4000]|[0.01, 0.02, 0.03, 0.04, 0.05]|[[1000, 0.01], [15000, 0.02], [2000, 0.03], [3000, 0.04], [4000, 0.05]]|
+-------------------------------+------------------------------+-----------------------------------------------------------------------+
df2.printSchema()
root
|-- amount: array (nullable = false)
| |-- element: integer (containsNull = false)
|-- discount: array (nullable = false)
| |-- element: double (containsNull = false)
|-- jsonarraycolumn: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- amount: string (nullable = true)
| | |-- discount: string (nullable = true)