Pyspark read csv does not read the whole column when the escape character is a quote (")

Problem description

If I read data with the pyspark read API using a quote (") as the escape character, the columns are not mapped correctly.

Here is how to reproduce it.

# Write the sample file, then read it back with escape='"' but
# without setting the quote option explicitly.
tx = 'id,name,address,city,country\n"1","",", 1ST, ""Round Street""","",UK'
with open('temp.csv', 'wt') as file:
    file.writelines(tx)
df = spark.read.csv('temp.csv', header=True, escape='"')

df.show(1,False)
+---+----+---------------------+----+-------+
|id |name|address              |city|country|
+---+----+---------------------+----+-------+
|1  |null|, 1ST, "Round Street"|null|UK     |
+---+----+---------------------+----+-------+

df.select('address').show(1, False)
+-------+
|address|
+-------+
| 1ST   |
+-------+

Am I missing something here, since I am not getting the correct column values?
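For reference, the file above is standard RFC 4180 style CSV, where a literal quote inside a quoted field is represented by doubling it. As a sketch unrelated to Spark (shown only to illustrate what the expected parse of the address field looks like), Python's built-in csv module reads the field whole:

```python
import csv
import io

# Same row as in the question: the address field contains commas
# and doubled quotes ("" stands for a literal ").
tx = 'id,name,address,city,country\n"1","",", 1ST, ""Round Street""","",UK'

rows = list(csv.reader(io.StringIO(tx)))
print(rows[1][2])  # → , 1ST, "Round Street"
```

So the expected value of the address column is the full string `, 1ST, "Round Street"`, commas included.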

Tags: csv, apache-spark, pyspark

Solution


It works with this setup:

from pyspark.sql import SparkSession

tx = '''id,name,address,city,country
"1","",", 1ST, ""Round Street""","",UK
"1"," ",", 1ST, ""Round Street""","",UK
"id-1","name-1",", 1ST, ""Round Street""","city-1",UK'''

with open('temp.csv','wt') as file:
    file.writelines(tx)

spark = SparkSession \
    .builder \
    .appName("Python Spark SQL basic example") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()

# Setting quote and escape to the same character ('"') tells Spark
# that a doubled quote inside a quoted field is a literal quote.
df = spark.read.format("csv")\
  .option("sep", ",")\
  .option("quote", '"')\
  .option("escape", '"')\
  .option("inferSchema", "true")\
  .option("header", "true")\
  .load('temp.csv').rdd.toDF()

df.show()
+----+------+--------------------+------+-------+
|  id|  name|             address|  city|country|
+----+------+--------------------+------+-------+
|   1|  null|, 1ST, "Round Str...|  null|     UK|
|   1|      |, 1ST, "Round Str...|  null|     UK|
|id-1|name-1|, 1ST, "Round Str...|city-1|     UK|
+----+------+--------------------+------+-------+

df.select('address').show()
+--------------------+
|             address|
+--------------------+
|, 1ST, "Round Str...|
|, 1ST, "Round Str...|
|, 1ST, "Round Str...|
+--------------------+
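One way to see why quote and escape should both be `"` for this file: writing the same record with Python's csv module reproduces exactly this on-disk format, because its writer escapes a quote by doubling it (doublequote=True by default). This is only an illustrative sketch, not part of the original answer:

```python
import csv
import io

# csv.writer doubles the quote character inside quoted fields by
# default, i.e. the escape character *is* the quote character.
# QUOTE_MINIMAL quotes only the fields that need it.
buf = io.StringIO()
csv.writer(buf).writerow(['1', '', ', 1ST, "Round Street"', '', 'UK'])
line = buf.getvalue().strip()
print(line)  # → 1,,", 1ST, ""Round Street""",,UK
```

A reader configured with an escape character other than the quote character has no way to know that `""` means a literal quote, which is why Spark needs both options set to `"` here.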

Tested versions

Python 3.6.12
pyspark==3.0.1
spark==0.2.1
