首页 > 解决方案 > PySpark,如何解析格式化为 dict 的字符串并将一些键附加为新列

问题描述

我在读取带有好奇列的 csv 文件时遇到问题。

架构

root
 |-- Id: integer (nullable = true)
 |-- Lon_tower: double (nullable = true)
 |-- Lat_tower: double (nullable = true)
 |-- Compagny: string (nullable = true)
 |-- Address_tower: string (nullable = true)
 |-- Assigned_band_1: string (nullable = true)
 |-- Assigned_band_2: string (nullable = true)
 |-- Assigned_band_3: string (nullable = true)
 |-- Assigned_band_4: string (nullable = true)
 |-- Assigned_band_5: string (nullable = true)
 |-- raw_geocode: string (nullable = true)

raw_geocode 示例

[{'road': 'Calle el Topo', 'residential': 'Los Sauces', 'hamlet': 'El Cardal', 'village': 'Los Sauces', 'city': 'San Andrés y Sauces', 'county': 'Santa Cruz de Tenerife', 'archipelago': 'Canarias', 'postcode': '38720', 'country': 'España', 'country_code': 'es'}]

如果该行不存在该键,我想将键作为标题并用值或 Null 填充 sparkdataframe。我不想要所有的密钥,而只想要列表中的一些。我删除了 [']

一个更好理解的例子:

myList = ['road', 'tourism', 'country_code']
|Id   |...|raw_geocode |
|1    |...|{road: Calle el Topo, archipelago: Canarias, postcode: 38720, country_code: es}
|2    |...|{tourism: Mirador Montaña El Molino, road: Mirador Montaña El Molino, village: Barlovento, country_code: es} 

期望的结果

|ID  |...|road          |tourism                   |country_code|
|1   |...|Calle el Topo |NULL                      |es
|2   |...|Null          |Mirador Montaña El Molino |es

标签: dataframeapache-sparkpysparkapache-spark-sql

解决方案


您可以使用regexp_extract提取所需的值:

myList = ['road', 'tourism', 'country_code']

for i in myList:
    df = df.withColumn(
        i, 
        F.when(
            F.regexp_extract('raw_geocode', i+': ([^,}]+)', 1) != "",
            F.regexp_extract('raw_geocode', i+': ([^,}]+)', 1)
        )
    )

df.show(truncate=False)
+---+------------------------------------------------------------------------------------------------------------+-------------------------+-------------------------+------------+
|Id |raw_geocode                                                                                                 |road                     |tourism                  |country_code|
+---+------------------------------------------------------------------------------------------------------------+-------------------------+-------------------------+------------+
|1  |{road: Calle el Topo, archipelago: Canarias, postcode: 38720, country_code: es}                             |Calle el Topo            |null                     |es          |
|2  |{tourism: Mirador Montaña El Molino, road: Mirador Montaña El Molino, village: Barlovento, country_code: es}|Mirador Montaña El Molino|Mirador Montaña El Molino|es          |
+---+------------------------------------------------------------------------------------------------------------+-------------------------+-------------------------+------------+

推荐阅读