UDF on array elements in PySpark

Problem description

I have a DataFrame like the following:

col1
------
[{"a":"1","b":"2"},{"a":"11","b":"22"}]

Now I want to add a new struct {"cc": "1"} to each array element, built from an existing value (here the 1 comes from "a": "1"):

col1
------
[{"a":"1","b":"2", {"cc": "1" }},{"a":"11","b":"22",{"cc": "11" } }]

Please suggest a UDF in PySpark that does this.

Tags: apache-spark, pyspark

Solution


You can use the `transform` higher-order function (available since Spark 2.4) to get the desired result without writing a UDF.

from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, schema_of_json

spark = SparkSession.builder.master('local[*]').getOrCreate()

# Sample data: a single string column holding a JSON array
df = spark.createDataFrame([('[{"a":"1","b":"2"},{"a":"11","b":"22"}]',)], "col1 string")

# Infer the array-of-structs schema from the first row, parse the string,
# then rebuild each element with an extra nested struct cc = {cc: x.a}
df.withColumn("col1", from_json("col1", schema_of_json(df.select("col1").first()[0]))).\
    selectExpr("to_json(transform(col1, x -> "
               "struct(x.a as a, x.b as b, struct(x.a as cc) as cc))) as col1").\
    show(truncate=False)

    +------------------------------------------------------------------------+
    |col1                                                                    |
    +------------------------------------------------------------------------+
    |[{"a":"1","b":"2","cc":{"cc":"1"}},{"a":"11","b":"22","cc":{"cc":"11"}}]|
    +------------------------------------------------------------------------+
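
Since the question asked specifically for a UDF, here is a minimal sketch of the per-row logic such a UDF could wrap, assuming col1 is a JSON string as in the sample above. The function name `add_cc` is hypothetical and not part of the original answer. The built-in `transform` approach is usually preferable, because a Python UDF has to serialize every row between the JVM and the Python worker.

```python
import json


def add_cc(json_array):
    """Add a nested {"cc": <value of "a">} struct to each element of a JSON array string."""
    if json_array is None:  # Spark passes null column values to a UDF as None
        return None
    elements = json.loads(json_array)
    # Copy each element and append the new nested struct built from its "a" field
    enriched = [{**e, "cc": {"cc": e["a"]}} for e in elements]
    # Compact separators match the style of Spark's to_json output
    return json.dumps(enriched, separators=(",", ":"))


# Plain-Python check of the logic, no Spark session needed
print(add_cc('[{"a":"1","b":"2"},{"a":"11","b":"22"}]'))

# Assumed usage in Spark (requires an active session and the df defined earlier):
#   from pyspark.sql.functions import udf
#   from pyspark.sql.types import StringType
#   add_cc_udf = udf(add_cc, StringType())
#   df.withColumn("col1", add_cc_udf("col1")).show(truncate=False)
```

The function stays free of Spark imports so it can be unit-tested on its own; wrapping it with `udf(...)` as in the trailing comment is the only Spark-specific step.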
