Reading files with different column orders

Problem

I have several CSV files with headers, but I've noticed that some of them have their columns in a different order. Is there a way to handle this with Spark so that I can define the select order per file, and the master DF doesn't end up with mismatches where col x holds col y's values?

This is how I currently read the files:

 val masterDF = spark.read.option("header", "true").csv(allFiles:_*)

Tags: scala, apache-spark, pyspark

Solution

  • Extract all the file names and store them in a list variable.

  • Define a schema with all the required columns in it.

  • Iterate through the files, reading each one separately with `header` set to `true`.

  • `unionAll` each new dataframe with the accumulated dataframe.

Example:

file_lst = ['<path1>', '<path2>']

from pyspark.sql.types import StructType, StructField, StringType

# define the schema for the required columns
schema = StructType([
    StructField("column1", StringType(), True),
    StructField("column2", StringType(), True),
])

# start from an empty dataframe with the fixed schema
df = spark.createDataFrame([], schema)

for path in file_lst:
    # read each file with its own header, then select the columns in the fixed order
    tmp_df = spark.read.option("header", "true").csv(path).select("column1", "column2")
    df = df.unionAll(tmp_df)  # unionAll is an alias of union in Spark 2+

# display the result
df.show()
