PySpark - Creating a DataFrame from multiple nested JSON files

Problem description

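I am trying to create a DataFrame from multiple nested JSON files. Some of the files contain certain columns and others do not. The code below works for one optional column, but I need to generalize it to other columns as well. Can you help me?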

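from pyspark.sql import functions as F
from pyspark.sql.types import StringType

# Columns that are present in every file
df1 = multiline_df.select(
    ["productType.mainProductTypeName", "commercialClass.commercialClassNo"]
)
if "strategicPricing" in multiline_df.columns:
    # The optional struct exists, so select its nested field as well
    df1 = multiline_df.select(
        [
            "productType.mainProductTypeName",
            "commercialClass.commercialClassNo",
            "strategicPricing.strategicPricingNameEn",
        ]
    )
else:
    # The struct is missing in this file: add a null string column instead,
    # named like the field selected above so both branches share one schema
    df1 = df1.withColumn("strategicPricingNameEn", F.lit(None).cast(StringType()))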
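  1. How can I generalize the above to multiple columns?
  2. How can I add a condition that keeps only the dictionary containing the most recent information, given data like the following?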
[
  {
    "updateDate": "2021-01-04T11:24:37Z",
    "deleteDate": null,
    "validFrom": "2008-09-01",
    "validTo": "2012-08-31",
    "paNo": "0131",
    "paName": "Layer glued armchairs",
    "praNo": "013",
    "praName": "Armchairs",
    "hfbNo": "01",
    "hfbName": "Living room seating"
  },
  {
    "updateDate": "2019-07-05T16:01:10Z",
    "deleteDate": null,
    "validFrom": "2012-09-01",
    "validTo": "2015-08-31",
    "paNo": "0114",
    "paName": "Armchairs..",
    "praNo": "011",
    "praName": "Sofas",
    "hfbNo": "01",
    "hfbName": "Living room seating"
  }
]

Tags: json, pyspark, nested

Solution

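For the second question, a minimal sketch of picking the most recent dictionary follows. It assumes that "most recent" means the largest `updateDate` and that all timestamps use the `YYYY-MM-DDTHH:MM:SSZ` form seen in the sample. The function name `latest_record` is hypothetical, and the records are trimmed copies of the sample in the question. It runs in plain Python on the parsed JSON list, before the data reaches Spark.

```python
import json
from datetime import datetime


def latest_record(records, date_key="updateDate"):
    """Return the dict whose timestamp under date_key is the latest.

    Assumes every record has date_key in the form 2021-01-04T11:24:37Z.
    """
    def parse(record):
        return datetime.strptime(record[date_key], "%Y-%m-%dT%H:%M:%SZ")

    return max(records, key=parse)


# Trimmed copies of the two sample records from the question
sample = json.loads("""
[
  {"updateDate": "2021-01-04T11:24:37Z", "validTo": "2012-08-31", "paNo": "0131"},
  {"updateDate": "2019-07-05T16:01:10Z", "validTo": "2015-08-31", "paNo": "0114"}
]
""")

latest = latest_record(sample)
print(latest["paNo"], latest["updateDate"])  # prints: 0131 2021-01-04T11:24:37Z
```

Note that `max` raises `ValueError` on an empty list, so callers may want to guard against files with no entries. If "most recent" should instead be judged by `validTo`, both the key and the format string would need to change, since that field carries a date without a time.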
