首页 > 解决方案 > 连接两个数据框 A 和 B 并将非关键列从数据框 B 转换为 Json 字符串

问题描述

我有两个 DataFrames 如下,

数据框 A:

DEPT_ID DEPT_NAME
10      Finance
20      Marketing

数据框 B:

EMP_ID DEPT_ID EMP_NAME EMP_SALARY
101    10      AAAA     1000
102    20      BBBB     2000
103    10      CCCC     1500
104    20      DDDD     3000

预期结果:在 Pyspark 中,我需要加入数据框 A 和 BDEPT_ID并将数据框 B 中的非关键列转换为_json 字符串并将其存储Json_Data在数据框 C 的列中

数据帧 C:

DEPT_ID DEPT_NAME Json_Data
10      Finance   [{"_status": "normal","EmpDetails":{"EMP_ID":"101","EMP_NAME":"AAAA","EMP_SALARY":"1000"},{"EMP_ID":"103","EMP_NAME":"CCCC","EMP_SALARY":"1500" }]
20      Marketing [{"_status": "normal","EmpDetails":{"EMP_ID":"102","EMP_NAME":"BBBB","EMP_SALARY":"2000"},{"EMP_ID":"104","EMP_NAME":"DDDD","EMP_SALARY":"3000" }]

标签: apache-sparkpysparkapache-spark-sql

解决方案


您可以加入和分组,DEPT_ID并将DEPT_NAME员工详细信息列表收集到结构中。并用于to_json获取 json 字符串:

from pyspark.sql import functions as F

df_c = df_a.join(df_b, ["DEPT_ID"]).groupBy("DEPT_ID", "DEPT_NAME").agg(
    F.to_json(
        F.struct(
            F.lit("normal").alias("_status"),
            F.collect_list(
                F.struct(
                    F.col("EMP_ID"),
                    F.col("EMP_NAME"),
                    F.col("EMP_SALARY")
                )
            ).alias("EmpDetails")
        )
    ).alias("Json_Data")
)

df_c.show(truncate=False)
#+-------+---------+-----------------------------------------------------------------------------------------------------------------------------------------+
#|DEPT_ID|DEPT_NAME|Json_Data                                                                                                                                |
#+-------+---------+-----------------------------------------------------------------------------------------------------------------------------------------+
#|10     |Finance  |{"_status":"normal","EmpDetails":[{"EMP_ID":101,"EMP_NAME":"AAAA","EMP_SALARY":1000},{"EMP_ID":103,"EMP_NAME":"CCCC","EMP_SALARY":1500}]}|
#|20     |Marketing|{"_status":"normal","EmpDetails":[{"EMP_ID":102,"EMP_NAME":"BBBB","EMP_SALARY":2000},{"EMP_ID":104,"EMP_NAME":"DDDD","EMP_SALARY":3000}]}|
#+-------+---------+-----------------------------------------------------------------------------------------------------------------------------------------+

推荐阅读