Reading a file with multiple section headers in Apache Spark with variable section content

Problem Description

Is it possible to use the Spark APIs to read a large CSV file containing multiple sections, each with a different header? The structure of the file is as follows:

BatchCode#1
Name,Surname,Address
AA1,BBB,CCC
AA2,BBB,CCC
AA3,BBB,CCC

BatchCode#2
Name,Surname,Address,Phone
XY1,BBB,CCC,DDD
XY2,BBB,CCC,DDD
XY3,BBB,CCC,DDD

While reading the records, we need to be careful with the headers, since the format can differ from section to section. The BatchCode needs to be extracted from the section header and become part of every record within that section - for example, the first data line should be parsed as:

Name: AA1
Surname: BBB
Address: CCC
BatchCode: 1

The following options come to mind, but I am not completely sure whether they would create significant problems:

  1. Reading the file using wholeTextFiles. This reads the file in a single task, but it loads the entire file into memory and could cause memory issues with large files.
  2. Forcing Spark to read the file in a single partition by calling coalesce(1) on sc.textFile. I am not sure the line order is always guaranteed. Once we have the file as an RDD, we would cache the header rows while reading and merge them with their corresponding data records (see the sketch after this list).
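
For option 2, here is a minimal sketch of the header-caching pass. It assumes the Spark 2.x Java API, that coalesce(1) yields a single partition whose iterator follows file order, and a hypothetical Record POJO holding batch code, header, and values; sc and file come from the surrounding context, and the "BatchCode#" prefix handling is inferred from the sample file above:

import java.util.ArrayList;
import java.util.List;
import org.apache.spark.api.java.JavaRDD;

// One partition so a single task sees the lines in file order.
JavaRDD<String> lines = sc.textFile(file).coalesce(1);

JavaRDD<Record> records = lines.mapPartitions(it -> {
    List<Record> out = new ArrayList<>();
    String batchCode = null;   // current section's batch code
    String[] header = null;    // cached header row of the current section
    while (it.hasNext()) {
        String line = it.next().trim();
        if (line.isEmpty()) {
            continue;                                  // skip blank separators
        } else if (line.startsWith("BatchCode#")) {
            batchCode = line.substring("BatchCode#".length());
            header = null;                             // next line is the new header
        } else if (header == null) {
            header = line.split(",");                  // cache the section header
        } else {
            out.add(new Record(batchCode, header, line.split(",")));
        }
    }
    return out.iterator();
});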

Even if the above approaches work, would they be efficient? What would be the most efficient way?

Tags: java, scala, csv, apache-spark

Solution


The following program worked for me:

// Read each file as a single binary stream so section order is preserved.
JavaPairRDD<String, PortableDataStream> binaryFiles = sc.binaryFiles(file);

// Expand every stream into one Record per data line, keyed by file path.
PortableRecordReader reader = new PortableRecordReader();
JavaPairRDD<String, Record> fileAndLines = binaryFiles.flatMapValues(reader);

PortableRecordReader opens the DataInputStream, wraps it in an InputStreamReader, and then uses a CSV parser to turn the lines into the expected output as Record objects, merging in the headers.
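
The answer does not show the reader itself. Below is a minimal sketch of what PortableRecordReader and Record might look like, assuming the Spark 2.x Java API (where flatMapValues takes a Function<V, Iterable<U>>); a plain String.split stands in for the CSV parser the author mentions, and the "BatchCode#" handling is inferred from the sample file:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.Serializable;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.input.PortableDataStream;

// Hypothetical POJO: one data line plus its section's header and batch code.
class Record implements Serializable {
    final String batchCode;
    final String[] header;
    final String[] values;
    Record(String batchCode, String[] header, String[] values) {
        this.batchCode = batchCode;
        this.header = header;
        this.values = values;
    }
}

public class PortableRecordReader
        implements Function<PortableDataStream, Iterable<Record>> {
    @Override
    public Iterable<Record> call(PortableDataStream stream) throws Exception {
        List<Record> records = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(stream.open(), StandardCharsets.UTF_8))) {
            String batchCode = null;
            String[] header = null;
            String line;
            while ((line = reader.readLine()) != null) {
                line = line.trim();
                if (line.isEmpty()) {
                    continue;                       // skip blank separators
                } else if (line.startsWith("BatchCode#")) {
                    batchCode = line.substring("BatchCode#".length());
                    header = null;                  // next line is the new header
                } else if (header == null) {
                    header = line.split(",");       // cache the section header
                } else {
                    records.add(new Record(batchCode, header, line.split(",")));
                }
            }
        }
        return records;
    }
}

Because binaryFiles hands each file to one task as a lazily read stream, section order is preserved without materializing the whole file as a single String, which is the advantage over wholeTextFiles.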

