scala - Parsing Spark CSV with different files and options using the univocity CSV parser in Scala
Problem description
I am trying to parse this CSV file with the settings below.
ArrayType
"[""a"",""ab"",""avc""]"
"[1,23,33]"
"[""1"",""22""]"
"[""1"",""22"",""12222222.32342342314"",123412423523.3414]"
"[a,c,s,a,d,a,q,s]"
"["""","""","""",""""]"
"["","","",""]"
"[""abcgdjasc"",""jachdac"",""''""]"
"[""a"",""ab"",""avc""]"
val df = spark.read.format("csv").option("header","true").option("escape","\"").option("quote","\"").load("/home/ArrayType.csv")
Output:
scala> df.show()
+--------------------+
|           ArrayType|
+--------------------+
|    ["a","ab","avc"]|
|           [1,23,33]|
|          ["1","22"]|
|["1","22","122222...|
|   [a,c,s,a,d,a,q,s]|
|       ["","","",""]|
|           [",",","]|
|["abcgdjasc","jac...|
|    ["a","ab","avc"]|
+--------------------+
Because the escape character here is "\"", I can read this file as a single column. However, if the input file looks like this instead:
ArrayType
"["a","ab","avc"]"
"[1,23,33]"
"["1","22"]"
"["1","22","12222222.32342342314",123412423523.3414]"
"[a,c,s,a,d,a,q,s]"
"["","","",""]"
"[",",","]"
"["abcgdjasc","jachdac","''"]"
"["a","ab","avc"]"
it gives me the following output, whereas I need it to be parsed the same way as before.
scala> df.show()
+-----------------+-------+--------------------+-------------------+
|              _c0|    _c1|                 _c2|                _c3|
+-----------------+-------+--------------------+-------------------+
|            "["a"|     ab|             "avc"]"|                   |
|        [1,23,33]|       |                    |                   |
|            "["1"| "22"]"|                    |                   |
|            "["1"|     22|12222222.32342342314|123412423523.3414]"|
|[a,c,s,a,d,a,q,s]|       |                    |                   |
|        [",",","]|       |                    |                   |
|                [|      ,|                   ]|                   |
|    "["abcgdjasc"|jachdac|              "''"]"|                   |
|            "["a"|     ab|             "avc"]"|                   |
|            "["a"|     ab|             "avc"]"|                   |
+-----------------+-------+--------------------+-------------------+
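So even when the inner quotes are not escaped, I still want the same output as before, with no splitting on the commas.
How can I read the second CSV file as a single column in a DataFrame?
How can I support parsing both kinds of file into a single column?
I am using the univocity CSV parser for parsing.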
Solution
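The page records no answer, so what follows is only a minimal sketch of one possible direction, not a confirmed fix. It assumes each record occupies exactly one physical line, so the CSV quoting rules can be bypassed by treating every line as raw text. The names `SingleColumnLines`, `stripOuterQuotes`, `unescapeDoubledQuotes`, `normalize`, and the `innerQuotesAreDoubled` flag are hypothetical and do not come from Spark or univocity.

```scala
object SingleColumnLines {

  // Hypothetical helper: remove one pair of enclosing double quotes, if present.
  def stripOuterQuotes(line: String): String =
    if (line.length >= 2 && line.startsWith("\"") && line.endsWith("\""))
      line.substring(1, line.length - 1)
    else line

  // Only meaningful for the first file layout, where inner quotes are doubled:
  // collapse every "" into a single ".
  def unescapeDoubledQuotes(s: String): String =
    s.replace("\"\"", "\"")

  // The caller states which layout the file uses. The text alone cannot decide it,
  // so this flag is an assumption of the sketch.
  def normalize(line: String, innerQuotesAreDoubled: Boolean): String = {
    val body = stripOuterQuotes(line)
    if (innerQuotesAreDoubled) unescapeDoubledQuotes(body) else body
  }

  def main(args: Array[String]): Unit = {
    // One line from each of the two file layouts in the question.
    println(normalize("\"[\"\"a\"\",\"\"ab\"\",\"\"avc\"\"]\"", innerQuotesAreDoubled = true))
    println(normalize("\"[\"a\",\"ab\",\"avc\"]\"", innerQuotesAreDoubled = false))
  }
}
```

In Spark, such a function could be applied to the `value` column produced by `spark.read.text(...)`, for example through a UDF, after dropping the `ArrayType` header line. The layout flag has to be supplied per file because the two layouts are ambiguous: the line `"["","","",""]"` appears in both files, meaning `[",",","]` in the first and presumably `["","","",""]` in the second.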