java - Apache Spark to_json options parameter
Question
I either don't know what I'm looking for or the documentation is lacking. The latter seems to be the case, given this:
"options - options to control how the struct column is converted into a json string. accepts the same options and the json data source."
Great! So, what are my options?
I'm doing something like this:
Dataset<Row> formattedReader = reader
    .withColumn("id", lit(id))
    .withColumn("timestamp", lit(timestamp))
    .withColumn("data", to_json(struct("record_count")));
...and I get this result:
{
    "id": "ABC123",
    "timestamp": "2018-11-16 20:40:26.108",
    "data": "{\"record_count\": 989}"
}
I'd like this instead (no backslashes, and no quotes around the value of "data"):
{
    "id": "ABC123",
    "timestamp": "2018-11-16 20:40:26.108",
    "data": {"record_count": 989}
}
Is this one of the options, by chance? Is there a better guide out there for Spark? The most frustrating part about Spark hasn't been getting it to do what I want; it's been the lack of good information on what it can do.
Answer
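You are JSON-encoding the record_count field twice. to_json turns the struct into a string column, and that string is then escaped again when the row itself is written out as JSON. Remove to_json; struct alone should be sufficient.
Change your code to something like this: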
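import static org.apache.spark.sql.functions.lit;
import static org.apache.spark.sql.functions.struct;

// Keep "data" as a struct column so it is written as a nested JSON object.
Dataset<Row> formattedReader = reader
    .withColumn("id", lit(id))
    .withColumn("timestamp", lit(timestamp))
    .withColumn("data", struct("record_count"));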
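To see why the backslashes appear, here is a minimal plain-Java sketch of the double encoding. It does not use Spark at all; the class DoubleEncodingDemo and its helper methods are hypothetical names made up for this illustration, and the hand-rolled escaping only handles quotes and backslashes. It is meant to mimic what happens when a column that already holds a JSON string is serialized again, not to reproduce Spark's actual writer.

```java
public class DoubleEncodingDemo {

    // Wrap a value in quotes and escape it so it can sit inside a JSON
    // string literal (only backslashes and double quotes are handled here).
    public static String quote(String s) {
        return "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
    }

    // "data" was already serialized to a string, which is the situation
    // to_json(struct(...)) creates: the outer encoder must escape it.
    public static String withStringField(String dataJson) {
        return "{\"id\":" + quote("ABC123") + ",\"data\":" + quote(dataJson) + "}";
    }

    // "data" is still structured, so it is embedded as a nested object,
    // which is the situation struct(...) alone creates.
    public static String withNestedField(String dataJson) {
        return "{\"id\":" + quote("ABC123") + ",\"data\":" + dataJson + "}";
    }

    public static void main(String[] args) {
        String data = "{\"record_count\":989}";
        System.out.println(withStringField(data));
        System.out.println(withNestedField(data));
    }
}
```

Running main prints the escaped form first ({"id":"ABC123","data":"{\"record_count\":989}"}) and the nested form second ({"id":"ABC123","data":{"record_count":989}}), which matches the two shapes shown in the question.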