首页 > 解决方案 > 使用另一个列值提取嵌套列

问题描述

我有一个 df 行:

 account_number          | null                 
 additional_info         | null                 
 data                    | [,,,,,,,,,,,,,,,,... 
 dealer_id               | null                 
 device_id               | 0                    
 device_name             | null                 
 device_type             | panel                
 event                   | setting              
 event_class             | panel_settings_alarm 
 event_desc              | siren_disable is ... 
 event_timestamp         | 1621341705282        
 event_type              | siren_disable        
 imei                    | 99000000000000      
 is_event_desc_available | true                 
 message_uuid            | b32e6a3f-bced-4ba... 
 operation               | report               
 partition_id            | 0                    
 priority                | 2                    
 software_version        | 2.4.2                
 source                  | panel                
 setting_name            | siren_disable   

嵌套列数据有多个键值对。使用 Key 作为 setting_name 列,我需要创建另一个值为 data[setting_name] 的列,例如: data['siren_disable'] 设置名称中的名称不断变化。我试过了:

df2 = df2.withColumn("setting_value",df2["data"].getItem(df2['setting_name']))

但出现错误:

AnalysisException: Field name should be String Literal, but it's setting_name#3864;

有什么建议么?

标签: pythonpysparknested

解决方案


一种可能的方法是将结构类型列转换为映射类型:

import pyspark.sql.functions as F

df2 = df.withColumn(
    'setting_value', 
    F.from_json(F.to_json('data'), 'map<string,string>')[F.col('setting_name')]
)

推荐阅读