python - Parsing a list of JSON strings with a PySpark DataFrame
Question
I am trying to read a list of JSON strings into a PySpark DataFrame. My input data is shown below. My goal is a DataFrame with two columns: user (string) and ips (array of strings).
sampleJson = [ ('{"user":100, "ips" : ["191.168.192.101", "191.168.192.103", "191.168.192.96", "191.168.192.99"]}',), ('{"user":101, "ips" : ["191.168.192.102", "191.168.192.105", "191.168.192.103", "191.168.192.107"]}',), ('{"user":102, "ips" : ["191.168.192.105", "191.168.192.101", "191.168.192.105", "191.168.192.107"]}',), ('{"user":103, "ips" : ["191.168.192.96", "191.168.192.100", "191.168.192.107", "191.168.192.101"]}',), ('{"user":104, "ips" : ["191.168.192.99", "191.168.192.99", "191.168.192.102", "191.168.192.99"]}',), ('{"user":105, "ips" : ["191.168.192.99", "191.168.192.99", "191.168.192.100", "191.168.192.96"]}',), ]
Thanks for your help.
Solution
Use the from_json function with an explicitly defined schema.
Example:
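from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import ArrayType, StringType, StructField, StructType
# Reuse the active session (it already exists as `spark` in the pyspark shell or a notebook)
spark = SparkSession.builder.getOrCreate()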
sampleJson = [ ('{"user":100, "ips" : ["191.168.192.101", "191.168.192.103", "191.168.192.96", "191.168.192.99"]}',), ('{"user":101, "ips" : ["191.168.192.102", "191.168.192.105", "191.168.192.103", "191.168.192.107"]}',), ('{"user":102, "ips" : ["191.168.192.105", "191.168.192.101", "191.168.192.105", "191.168.192.107"]}',), ('{"user":103, "ips" : ["191.168.192.96", "191.168.192.100", "191.168.192.107", "191.168.192.101"]}',), ('{"user":104, "ips" : ["191.168.192.99", "191.168.192.99", "191.168.192.102", "191.168.192.99"]}',), ('{"user":105, "ips" : ["191.168.192.99", "191.168.192.99", "191.168.192.100", "191.168.192.96"]}',), ]
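# Each tuple becomes a row with a single string column, named "_1" by default
df1 = spark.createDataFrame(sampleJson)
# Target schema; note that the parsed output below still reports user as nullable = true despite nullable=False here
sch = StructType([StructField('user', StringType(), False), StructField('ips', ArrayType(StringType()))])
# Parse the JSON string into a struct column "n", then expand the struct into top-level columns
df1.withColumn("n", from_json(col("_1"), sch)).select("n.*").show(10, False)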
#+----+--------------------------------------------------------------------+
#|user|ips                                                                 |
#+----+--------------------------------------------------------------------+
#|100 |[191.168.192.101, 191.168.192.103, 191.168.192.96, 191.168.192.99]  |
#|101 |[191.168.192.102, 191.168.192.105, 191.168.192.103, 191.168.192.107]|
#|102 |[191.168.192.105, 191.168.192.101, 191.168.192.105, 191.168.192.107]|
#|103 |[191.168.192.96, 191.168.192.100, 191.168.192.107, 191.168.192.101] |
#|104 |[191.168.192.99, 191.168.192.99, 191.168.192.102, 191.168.192.99]   |
#|105 |[191.168.192.99, 191.168.192.99, 191.168.192.100, 191.168.192.96]   |
#+----+--------------------------------------------------------------------+
#schema
df1.withColumn("n",from_json(col("_1"),sch)).select("n.*").printSchema()
#root
# |-- user: string (nullable = true)
# |-- ips: array (nullable = true)
# |    |-- element: string (containsNull = true)