csv - PySpark read csv does not read the whole column when the escape character is a quote (")
Problem Description
If the quote character (") is also used as the escape character and the file is read with the PySpark read API, the columns are not mapped correctly.
Here is how to reproduce it.
tx = 'id,name,address,city,country\n"1","",", 1ST, ""Round Street""","",UK'
with open('temp.csv', 'wt') as file:
    file.writelines(tx)
df = spark.read.csv('temp.csv', header=True, escape='"')
df.show(1, False)
+---+----+---------------------+----+-------+
|id |name|address |city|country|
+---+----+---------------------+----+-------+
|1 |null|, 1ST, "Round Street"|null|UK |
+---+----+---------------------+----+-------+
df.select('address').show(1, False)
+-------+
|address|
+-------+
| 1ST |
+-------+
Am I missing something here, since I am not getting the correct column values?
Solution
It works with this setup:
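For reference, this is what a correct parse of the problem row should look like. The sketch below uses Python's stdlib csv module, whose default doublequote=True treats a doubled quote ("") inside a quoted field as a literal quote, matching this file's convention:

```python
import csv
import io

# The same line the Spark reader struggles with.
line = '"1","",", 1ST, ""Round Street""","",UK'

# Default dialect: quotechar='"', doublequote=True,
# so "" inside a quoted field becomes a single literal quote.
row = next(csv.reader(io.StringIO(line)))
print(row)  # ['1', '', ', 1ST, "Round Street"', '', 'UK']
```

The expected address value is `, 1ST, "Round Street"`, not the truncated ` 1ST ` that the original Spark read returns.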
from pyspark.sql import SparkSession

tx = '''id,name,address,city,country
"1","",", 1ST, ""Round Street""","",UK
"1"," ",", 1ST, ""Round Street""","",UK
"id-1","name-1",", 1ST, ""Round Street""","city-1",UK'''

with open('temp.csv', 'wt') as file:
    file.writelines(tx)

spark = SparkSession \
    .builder \
    .appName("Python Spark SQL basic example") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()

df = spark.read.format("csv") \
    .option("sep", ",") \
    .option("quote", '"') \
    .option("escape", '"') \
    .option("inferSchema", "true") \
    .option("header", "true") \
    .load('temp.csv').rdd.toDF()

df.select('address').show()
df.show()
Running it:
>>> from pyspark.sql import SparkSession
>>>
>>> tx = '''id,name,address,city,country
... "1","",", 1ST, ""Round Street""","",UK
... "1"," ",", 1ST, ""Round Street""","",UK
... "id-1","name-1",", 1ST, ""Round Street""","city-1",UK'''
>>>
>>> with open('temp.csv','wt') as file:
... file.writelines(tx)
...
>>> spark = SparkSession \
... .builder \
... .appName("Python Spark SQL basic example") \
... .config("spark.some.config.option", "some-value") \
... .getOrCreate()
>>>
>>> df = spark.read.format("csv")\
... .option("sep", ",")\
... .option("quote", '"')\
... .option("escape", '"')\
... .option("inferSchema", "true")\
... .option("header", "true")\
... .load('temp.csv').rdd.toDF()
>>>
>>>
>>> df.show()
+----+------+--------------------+------+-------+
| id| name| address| city|country|
+----+------+--------------------+------+-------+
| 1| null|, 1ST, "Round Str...| null| UK|
| 1| |, 1ST, "Round Str...| null| UK|
|id-1|name-1|, 1ST, "Round Str...|city-1| UK|
+----+------+--------------------+------+-------+
>>> df.select('address').show()
+--------------------+
| address|
+--------------------+
|, 1ST, "Round Str...|
|, 1ST, "Round Str...|
|, 1ST, "Round Str...|
+--------------------+
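Note that the address values above are only truncated by show() (Spark abbreviates cell contents to 20 characters by default); df.show(truncate=False) would display them in full. As a sanity check, a stdlib-only sketch parsing the same three-row file confirms every address field carries the full value:

```python
import csv
import io

# Same three-row file as in the Spark example above.
tx = '''id,name,address,city,country
"1","",", 1ST, ""Round Street""","",UK
"1"," ",", 1ST, ""Round Street""","",UK
"id-1","name-1",", 1ST, ""Round Street""","city-1",UK'''

rows = list(csv.DictReader(io.StringIO(tx)))
for r in rows:
    print(r['address'])  # ', 1ST, "Round Street"' on every row
```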
Tested versions
Python 3.6.12
pyspark==3.0.1
spark==0.2.1