首页 > 解决方案 > 如何从pyspark中具有json类型字符串的列中提取数据?

问题描述

嗨,我有一个数据框

client_id| event_metadata                                      |
+---------+-----------------------------------------------------
| 18890  |{Scripname:"DELL", Exchange: "NSE", Segment: "EQ" }  |
| 10531  |{Scripname:"NAUKRI", Exchange: "NSE", Segment: "EQ" }|

我想提取 event_metadata 并仅将 ScripName 和 client_id 存储为数据框。

event_metadata 是字符串而不是 json。

我努力了

from pyspark.sql import functions as F

df1.select('client_id', F.json_tuple('event_metadata', 'Scripname', 
 'Exchange','Segment').alias('Scripname',
  'Exchange','Segment')).show()

它返回 Null 值

我也尝试过使用正则表达式但显示错误

from pyspark.sql.functions import regexp_extract

df1.withColumn("event_metadata", regexp_extract("event_metadata", "(? 
<=Scripname: )\w+(?=(,|}))", 0))\
 .show(truncate=False)

期望的输出:

client_id| Scripname|
+--------+-----------
| 18890  |  DELL    |
| 10531  |  NAUKRI  |

标签: pyspark

解决方案


尝试这个-

regexp_extract

df2.withColumn("Scripname",
      regexp_extract($"event_metadata", "^\\{\\s*Scripname\\s*:\\s*\"(\\w+)\"", 1)
    )
      .show(false)

    df2.withColumn("Scripname",
      expr("""regexp_extract(event_metadata, '^\\{\\s*Scripname\\s*:\\s*"(\\w+)"', 1)""")
    )
      .show(false)


    /**
      * +---------+-----------------------------------------------------+---------+
      * |client_id|event_metadata                                       |Scripname|
      * +---------+-----------------------------------------------------+---------+
      * |18890    |{Scripname:"DELL", Exchange: "NSE", Segment: "EQ" }  |DELL     |
      * |10531    |{Scripname:"NAUKRI", Exchange: "NSE", Segment: "EQ" }|NAUKRI   |
      * +---------+-----------------------------------------------------+---------+
      */

推荐阅读