java - Reading a file with multiple section headers in Apache Spark with variable section content
Problem description
Is it possible to use the Spark APIs to read a large CSV file that contains multiple sections, each with a different header? The file is structured as follows:
BatchCode#1
Name,Surname,Address
AA1,BBB,CCC
AA2,BBB,CCC
AA3,BBB,CCC
BatchCode#2
Name,Surname,Address,Phone
XY1,BBB,CCC,DDD
XY2,BBB,CCC,DDD
XY3,BBB,CCC,DDD
While reading the records, we need to be careful with the headers, because the format can differ between sections. The BatchCode must be extracted from the section header and attached to every record in that section. For example, the first data line should be parsed as:
Name: AA1
Surname: BBB
Address: CCC
BatchCode: 1
The following options come to mind, but I am not sure whether they would cause significant problems:
- Reading the file using wholeTextFiles. This reads the file in a single thread, but it loads the entire file into memory and could cause memory issues with large files.
- Forcing Spark to read the file in a single thread using coalesce(1) on sc.textFile. I am not sure whether the line order is always guaranteed. Once we have the file as an RDD, we would cache the header rows while reading the file and merge them with their corresponding data records.
Even if the above approaches work, would they be efficient? What would be the most efficient way?
Solution
The following program worked for me:
JavaPairRDD<String, PortableDataStream> binaryFiles = sc.binaryFiles(file);
PortableRecordReader reader = new PortableRecordReader();
JavaPairRDD<String, Record> fileAndLines = binaryFiles.flatMapValues(reader);
PortableRecordReader opens a DataInputStream and wraps it in an InputStreamReader, then uses a CSV parser to convert the lines into the expected output as Record objects, merging in the header.
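The source of PortableRecordReader and Record is not shown above, so the following is only a minimal sketch of the kind of per-file parsing such a reader might do. The class name SectionedCsvParser is hypothetical, a plain Map stands in for Record, and a naive split on commas stands in for a real CSV parser (so quoted fields are not handled). It assumes every section starts with a line beginning with "BatchCode#" and that the next non-empty line is that section's header.

```java
import java.io.BufferedReader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not the PortableRecordReader from the answer.
class SectionedCsvParser {
    // Assumption: section marker lines look like "BatchCode#<code>".
    private static final String SECTION_PREFIX = "BatchCode#";

    static List<Map<String, String>> parse(BufferedReader in) {
        List<Map<String, String>> records = new ArrayList<>();
        String batchCode = null; // code of the section currently being read
        String[] header = null;  // column names of the current section

        Iterator<String> lines = in.lines().iterator();
        while (lines.hasNext()) {
            String line = lines.next().trim();
            if (line.isEmpty()) {
                continue;
            }
            if (line.startsWith(SECTION_PREFIX)) {
                // New section: remember its code and wait for its header line.
                batchCode = line.substring(SECTION_PREFIX.length());
                header = null;
                continue;
            }
            if (batchCode == null) {
                throw new IllegalStateException("Data before first section marker: " + line);
            }
            // Naive split; a real CSV parser would be needed for quoted fields.
            String[] fields = line.split(",", -1);
            if (header == null) {
                header = fields; // first line after the marker is the header
                continue;
            }
            if (fields.length != header.length) {
                throw new IllegalStateException("Column count mismatch in batch " + batchCode + ": " + line);
            }
            // Merge the header names and the batch code into the record.
            Map<String, String> record = new LinkedHashMap<>();
            for (int i = 0; i < header.length; i++) {
                record.put(header[i], fields[i]);
            }
            record.put("BatchCode", batchCode);
            records.add(record);
        }
        return records;
    }

    public static void main(String[] args) {
        String sample = "BatchCode#1\nName,Surname,Address\nAA1,BBB,CCC\n"
                + "BatchCode#2\nName,Surname,Address,Phone\nXY1,BBB,CCC,DDD\n";
        for (Map<String, String> r : parse(new BufferedReader(new StringReader(sample)))) {
            System.out.println(r);
        }
    }
}
```

Inside a Spark job, a function passed to flatMapValues could presumably wrap the stream returned by PortableDataStream.open() in an InputStreamReader and a BufferedReader and hand it to a parser like this one. Because each file is read sequentially by one task, the section order is preserved. Note that this sketch collects all records of a file in a list, so for very large files a lazy iterator over the records would be a better fit.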