
Problem description

I am trying to load a dataframe into NoSQL. The input is a file in CSV format, with data:

Input:

+--------+---+---+---+---+---+----+----+
|DATE    |VAL|100|200|300|400|101 |201 |
+--------+---+---+---+---+---+----+----+
|20200701|A  |1  |2  |3  |4  |1.1 |2.1 |
|20201001|B  |10 |20 |30 |40 |10.1|20.1|
+--------+---+---+---+---+---+----+----+

val_1=[100,200,300,400]

The columns listed in val_1 need to be dumped into a JSON struct "val_1", and the remaining columns into "val_2". Desired output:

Output:

{
  "DATE": "20200701",
  "VAL": "A",
  "val_1": {
    "100": "1",
    "200": "2",
    "300": "3",
    "400": "4"
  },
  "val_2": {
    "101": "1.1",
    "201": "2.1"
  }
},
{
  "DATE": "20201001",
  "VAL": "B",
  "val_1": {
    "100": "10",
    "200": "20",
    "300": "30",
    "400": "40"
  },
  "val_2": {
    "101": "10.1",
    "201": "20.1"
  }
}

Tags: apache-spark, pyspark, amazon-dynamodb

Solution

This sounds like an XY problem. There is probably a better way to accomplish the overall task with an existing connector, but here is nonetheless a solution to the question as asked:

from pyspark.sql import functions as F

# Group the flat columns into two nested structs, then serialize each row to JSON.
df.withColumn("val_1", F.struct(["100", "200", "300", "400"])).withColumn(
    "val_2", F.struct(["101", "201"])
).select("DATE", "VAL", "val_1", "val_2").toJSON().collect()

['{"DATE":"20200701","VAL":"A","val_1":{"100":1,"200":2,"300":3,"400":4},"val_2":{"101":1.1,"201":2.1}}',
 '{"DATE":"20201001","VAL":"B","val_1":{"100":10,"200":20,"300":30,"400":40},"val_2":{"101":10.1,"201":20.1}}']
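Note that `toJSON()` keeps the numeric types Spark inferred from the CSV, whereas the desired output quotes every value as a string. For illustration, the same reshaping can be sketched without Spark, assuming the rows arrive as plain string-valued dictionaries (the helper name `nest_row` and the hard-coded column list are illustrative, not part of any library):

```python
import json

# Columns to fold into val_1, mirroring the question's list;
# everything else besides DATE and VAL goes into val_2.
VAL_1_COLS = ["100", "200", "300", "400"]

def nest_row(row):
    """Fold a flat CSV row (dict of strings) into the nested
    val_1 / val_2 layout expected by the NoSQL store."""
    return {
        "DATE": row["DATE"],
        "VAL": row["VAL"],
        "val_1": {c: row[c] for c in VAL_1_COLS},
        "val_2": {c: v for c, v in row.items()
                  if c not in VAL_1_COLS and c not in ("DATE", "VAL")},
    }

row = {"DATE": "20200701", "VAL": "A", "100": "1", "200": "2",
       "300": "3", "400": "4", "101": "1.1", "201": "2.1"}
print(json.dumps(nest_row(row)))
```

Because the input values stay strings, this variant also matches the quoted values in the desired output exactly.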
