Dataframe columns not maintaining order and columns with null values are excluded when writing to a Cosmos DB collection

Problem Description

I am trying to copy data from a DataFrame in Spark into a Cosmos DB collection. The data is being written to Cosmos DB, but there are two problems:

  1. The column order of the DataFrame is not maintained in Cosmos DB.
  2. Columns with null values are not written to Cosmos DB; they are excluded entirely.

Here is the data available in the DataFrame:

+-------+------+--------+---------+---------+-------+
| NUM_ID|  TIME|    SIG1|     SIG2|     SIG3|   SIG4|
+-------+------+--------+---------+---------+-------+
|X00030 | 13000|35.79893| 139.9061| 48.32786|   null|
|X00095 | 75000|    null|     null|     null|5860505|
|X00074 | 43000|    null|  8.75037|  98.9562|8014505|
+-------+------+--------+---------+---------+-------+

Below is the Spark code used to copy the DataFrame into Cosmos DB:

import org.apache.spark.sql.functions.{col, round, trim}

val finalSignals = spark.sql("""SELECT * FROM db.tableName""")
val toCosmosDF = finalSignals
  .withColumn("NUM_ID", trim(col("NUM_ID")))
  .withColumn("SIG1", round(col("SIG1"), 5))
  .select("NUM_ID", "TIME", "SIG1", "SIG2", "SIG3", "SIG4")

// write the DataFrame into Cosmos DB

import com.microsoft.azure.cosmosdb.spark.config.Config
import org.apache.spark.sql.SaveMode
import com.microsoft.azure.cosmosdb.spark.schema._
import com.microsoft.azure.cosmosdb.spark._

val writeConfig = Config(Map(
      "Endpoint" -> "xxxxxxxx",
      "Masterkey" -> "xxxxxxxxxxx",
      "Database" -> "xxxxxxxxx",
      "Collection" -> "xxxxxxxxx",
      "preferredRegions" -> "xxxxxxxxx",
      "Upsert" -> "true"
    ))

toCosmosDF.write.mode(SaveMode.Append).cosmosDB(writeConfig)

Below is the data as written to Cosmos DB:

{
    "SIG3": 48.32786,
    "SIG2": 139.9061,
    "TIME": 13000,
    "NUM_ID": "X00030",
    "id": "xxxxxxxxxxxx2a",
    "SIG1": 35.79893,
    "_rid": "xxxxxxxxxxxx",
    "_self": "xxxxxxxxxxxxxxxxxx",
    "_etag": "\"xxxxxxxxxxxxxxxx\"",
    "_attachments": "attachments/",
    "_ts": 1571390120
}

{
    "TIME": 75000,
    "NUM_ID": "X00095",
    "id": "xxxxxxxxxxxx2a",
    "_rid": "xxxxxxxxxxxx",
    "SIG4": 5860505,
    "_self": "xxxxxxxxxxxxxxxxxx",
    "_etag": "\"xxxxxxxxxxxxxxxx\"",
    "_attachments": "attachments/",
    "_ts": 1571390120
}

{
    "SIG3": 98.9562,
    "SIG2": 8.75037,
    "TIME": 43000,
    "NUM_ID": "X00074",
    "id": "xxxxxxxxxxxx2a",
    "SIG4": 8014505,
    "_rid": "xxxxxxxxxxxx",
    "_self": "xxxxxxxxxxxxxxxxxx",
    "_etag": "\"xxxxxxxxxxxxxxxx\"",
    "_attachments": "attachments/",
    "_ts": 1571390120
}

  1. Entries for columns that hold null in the DataFrame are missing from the Cosmos DB documents.
  2. The data written to Cosmos DB does not follow the DataFrame's column order.

How can these two issues be resolved?

Tags: scala, azure, apache-spark, apache-spark-sql, azure-cosmosdb

Solution
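Two observations bear on the issues above. First, Cosmos DB stores documents as JSON, and per the JSON specification an object is an unordered collection of name/value pairs, so the order of fields inside a stored document carries no meaning and cannot be relied on; the practical fix is to re-impose column order with a `select` when reading the collection back. Second, the output shows that null-valued fields are dropped during serialization, so one workaround is to substitute explicit sentinel values before the write using Spark's `DataFrame.na.fill`. The sketch below reuses `toCosmosDF` and `writeConfig` from the question; the sentinel values (0.0 and 0L) are illustrative assumptions, not part of the original code:

```scala
import org.apache.spark.sql.SaveMode
import com.microsoft.azure.cosmosdb.spark.schema._

// Substitute sentinel values for nulls so every column appears in the
// stored JSON document. The sentinels (0.0 / 0L) are placeholders chosen
// for illustration -- pick values that cannot occur in the real signal data.
val withExplicitNulls = toCosmosDF
  .na.fill(0.0, Seq("SIG1", "SIG2", "SIG3"))
  .na.fill(0L, Seq("SIG4"))

withExplicitNulls.write.mode(SaveMode.Append).cosmosDB(writeConfig)

// Field order inside a JSON document is not significant; when reading the
// collection back, re-impose the desired column order explicitly:
// val ordered = readBack.select("NUM_ID", "TIME", "SIG1", "SIG2", "SIG3", "SIG4")
```

If a sentinel would collide with legitimate data, an alternative is to keep the nulls and treat an absent field as null on the consuming side, since the two are equivalent for this schema.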
