PySpark escapeQuotes=False still escapes quotes

Problem description

Question: When writing a DataFrame out as CSV, I don't want the quotes in my data to be escaped. However, setting escapeQuotes=False doesn't seem to have any effect.

Below is a sample case:

Data preparation:

from pyspark.sql.types import StructType, StructField, StringType, IntegerType
from pyspark.sql import SparkSession, functions as func

spark = SparkSession.builder.appName("test").getOrCreate()

data = [("James", "Smith"),
    ("Michael", "Rose"),
  ]

schema = StructType([
    StructField("firstname", StringType(), True),
    StructField("lastname", StringType(), True)
])

df = spark.createDataFrame(data=data, schema=schema)
df.show(truncate=False)

Output:

+---------+--------+
|firstname|lastname|
+---------+--------+
|James    |Smith   |
|Michael  |Rose    |
+---------+--------+

Add a column containing a newline

def create_column_with_newline(elem):
    return f'"{elem["firstname"]}\n{elem["lastname"]}"'


columnWithNewlineUDF = func.udf(create_column_with_newline)

df = df.withColumn('newline_col', columnWithNewlineUDF(func.struct('firstname', 'lastname')))
df.show()

Output:

+---------+--------+-----------------+
|firstname|lastname|      newline_col|
+---------+--------+-----------------+
|    James|   Smith|    "James
Smith"|
|  Michael|    Rose|   "Michael
Rose"|
+---------+--------+-----------------+

Write the CSV with escapeQuotes=False

df.coalesce(1).write.csv('test.tsv', mode='overwrite', sep='\t', header=True, encoding='UTF-8', escapeQuotes=False)

Output:

firstname   lastname    newline_col
James   Smith   "\"James
Smith\""
Michael Rose    "\"Michael
Rose\""

As you can see, newline_col is written with escaped quotes :-(

Expected output:

firstname   lastname    newline_col
James   Smith   "James
Smith"
Michael Rose    "Michael
Rose"
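This behaviour is not specific to Spark. Any CSV writer that honours quoting has to wrap a field containing a newline in quotes, and must then escape any quote characters that already sit inside the data (Python's stdlib csv module doubles them, while Spark's writer uses a backslash by default). A minimal stdlib sketch of the same two cases:

```python
import csv
import io

# Case 1: a field containing a newline, with no quotes in the data.
# The writer quotes the field on its own so the row survives a round trip.
buf = io.StringIO()
csv.writer(buf, delimiter="\t", lineterminator="\n").writerow(
    ["James", "Smith", "James\nSmith"])
print(repr(buf.getvalue()))  # 'James\tSmith\t"James\nSmith"\n'

# Case 2: the data itself already contains quote characters (as the UDF
# above produces). The writer must escape them -- here by doubling --
# so they are not mistaken for the field's enclosing quotes.
buf2 = io.StringIO()
csv.writer(buf2, delimiter="\t", lineterminator="\n").writerow(
    ["James", "Smith", '"James\nSmith"'])
print(repr(buf2.getvalue()))  # 'James\tSmith\t"""James\nSmith"""\n'
```

In other words, the extra escaping appears exactly because the quote characters were placed into the data by hand.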

Tags: apache-spark, pyspark, apache-spark-sql

Solution


Simply remove the quotes from the UDF. The escapeQuotes option only controls whether a value that contains quote characters gets enclosed in quotes; because newline_col also contains a newline, Spark's CSV writer has to quote the field anyway, and the quote characters embedded in the data are then escaped (with the default escape character \). Writing the quotes yourself is what produces the \" in the file:

def create_column_with_newline(elem):
    #      f'"{elem["firstname"]}\n{elem["lastname"]}"'
    return f'{elem["firstname"]}\n{elem["lastname"]}'

Output:

firstname   lastname    newline_col
James   Smith   "James
Smith"
Michael Rose    "Michael
Rose"

Viewed in Excel: (screenshot not reproduced)
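The quotes Spark adds on write are not lost data: any CSV reader that honours quoting recovers the original multi-line value. A small stdlib sketch, using file content that mirrors the expected output above:

```python
import csv
import io

# Simulated file content: what the fixed pipeline writes for one row.
content = 'firstname\tlastname\tnewline_col\nJames\tSmith\t"James\nSmith"\n'

rows = list(csv.reader(io.StringIO(content), delimiter="\t"))
header, first = rows[0], rows[1]
print(first[2])  # the embedded newline survives the round trip
```

This is also why Excel displays the two names inside a single cell: the quotes mark where the multi-line field begins and ends.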

