apache-spark - How to read compressed (gzip) files without an extension in Spark
Problem description
I'm new to Spark and I have an interesting task at hand: I have to read a bunch of files from S3 which contain some XML content.
These files are compressed (gzip), but they don't have that extension.
I've read some questions on here about this where people suggest extending the default codec in Spark and forcing a different extension.
But in my case there is no extension at all; the files are named in some 16-digit UUID-like format, e.g. 2c7358ca472ad91057da84adfba.
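For reference, the suggestion people make usually boils down to a small codec subclass along these lines (a sketch; the class name here is made up):

import org.apache.hadoop.io.compress.GzipCodec

// Sketch of the "force a different extension" suggestion (illustrative name):
// the codec factory matches codecs by file-name suffix, so this subclass
// makes files ending in ".mygz" decode as gzip. It would then need to be
// registered via the io.compression.codecs Hadoop configuration property.
class GzipCodecWithCustomExtension extends GzipCodec {
  override def getDefaultExtension(): String = ".mygz"
}

But since my files have no extension at all, there is no suffix for the codec factory to match on.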
Solution
You can use newAPIHadoopFile (instead of textFile) with a custom/modified TextInputFormat which forces the use of the GzipCodec.
Instead of calling sparkContext.textFile,

// gzip compressed but no .gz extension:
sparkContext.textFile("s3://mybucket/uuid")

we can use the lower-level sparkContext.newAPIHadoopFile, which allows us to specify how to read the input:
import org.apache.hadoop.mapreduce.lib.input.GzipInputFormatWithoutExtention
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}

sparkContext
  .newAPIHadoopFile(
    "s3://mybucket/uuid",
    classOf[GzipInputFormatWithoutExtention], // This is our custom reader
    classOf[LongWritable],
    classOf[Text],
    new Configuration(sparkContext.hadoopConfiguration)
  )
  .map { case (_, text) => text.toString }
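Note that, just as with textFile, the path passed to newAPIHadoopFile should also accept a directory or a glob such as s3://mybucket/* (Hadoop's FileInputFormat expands glob patterns in input paths), so the whole bunch of UUID-named files can presumably be read in one call.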
The usual way of calling newAPIHadoopFile would be with TextInputFormat. This is the part that wraps how a file is read and how the compression codec is selected based on the file extension.
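To see why an extension-less file slips through, here is a quick check (illustrative, not part of the original answer) against Hadoop's CompressionCodecFactory, which TextInputFormat's record reader relies on and which resolves codecs purely from the file-name suffix:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.compress.CompressionCodecFactory

val factory = new CompressionCodecFactory(new Configuration())
factory.getCodec(new Path("s3://mybucket/data.gz"))                     // GzipCodec
factory.getCodec(new Path("s3://mybucket/2c7358ca472ad91057da84adfba")) // null: read as plain text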
Let's call ours GzipInputFormatWithoutExtention and implement it as the following extension of TextInputFormat (this is a Java file, and we put it in the package src/main/java/org/apache/hadoop/mapreduce/lib/input):
package org.apache.hadoop.mapreduce.lib.input;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;

import com.google.common.base.Charsets;

public class GzipInputFormatWithoutExtention extends TextInputFormat {

  @Override
  public RecordReader<LongWritable, Text> createRecordReader(
      InputSplit split, TaskAttemptContext context) {
    String delimiter =
        context.getConfiguration().get("textinputformat.record.delimiter");
    byte[] recordDelimiterBytes = null;
    if (null != delimiter)
      recordDelimiterBytes = delimiter.getBytes(Charsets.UTF_8);
    // Here we use our custom `GzipWithoutExtentionLineRecordReader`
    // instead of `LineRecordReader`:
    return new GzipWithoutExtentionLineRecordReader(recordDelimiterBytes);
  }

  @Override
  protected boolean isSplitable(JobContext context, Path file) {
    return false; // gzip isn't a splittable codec (as opposed to bzip2)
  }
}
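Returning false from isSplitable matters here: a gzip stream can only be decompressed from its very beginning, so each file has to be processed by a single task in its entirety (bzip2, by contrast, is splittable).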
In fact, we have to go one level deeper and also replace the default LineRecordReader with our own (a Java class we'll call GzipWithoutExtentionLineRecordReader).
Since it's difficult to inherit from LineRecordReader, we can copy it (into src/main/java/org/apache/hadoop/mapreduce/lib/input) and slightly modify (and simplify) its initialize(InputSplit genericSplit, TaskAttemptContext context) method, forcing the use of the Gzip codec:
(the only changes compared to the original LineRecordReader have been given comments explaining what's happening)
package org.apache.hadoop.mapreduce.lib.input;

import java.io.IOException;

import org.apache.hadoop.classification.InterfaceAudience;
import org.apache.hadoop.classification.InterfaceStability;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.Seekable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.*;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

@InterfaceAudience.LimitedPrivate({"MapReduce", "Pig"})
@InterfaceStability.Evolving
public class GzipWithoutExtentionLineRecordReader
    extends RecordReader<LongWritable, Text> {
  private static final Logger LOG =
      LoggerFactory.getLogger(GzipWithoutExtentionLineRecordReader.class);
  public static final String MAX_LINE_LENGTH =
      "mapreduce.input.linerecordreader.line.maxlength";

  private long start;
  private long pos;
  private long end;
  private SplitLineReader in;
  private FSDataInputStream fileIn;
  private Seekable filePosition;
  private int maxLineLength;
  private LongWritable key;
  private Text value;
  private boolean isCompressedInput;
  private Decompressor decompressor;
  private byte[] recordDelimiterBytes;

  public GzipWithoutExtentionLineRecordReader(byte[] recordDelimiter) {
    this.recordDelimiterBytes = recordDelimiter;
  }

  @Override
  public void initialize(InputSplit genericSplit, TaskAttemptContext context)
      throws IOException {
    FileSplit split = (FileSplit) genericSplit;
    Configuration job = context.getConfiguration();
    this.maxLineLength = job.getInt(MAX_LINE_LENGTH, Integer.MAX_VALUE);
    start = split.getStart();
    end = start + split.getLength();
    final Path file = split.getPath();

    // open the file and seek to the start of the split
    final FileSystem fs = file.getFileSystem(job);
    fileIn = fs.open(file);

    // This line is modified to force the use of the GzipCodec:
    // CompressionCodec codec = new CompressionCodecFactory(job).getCodec(file);
    CompressionCodecFactory ccf = new CompressionCodecFactory(job);
    CompressionCodec codec = ccf.getCodecByClassName(GzipCodec.class.getName());

    // This part has been extremely simplified as we don't have to handle
    // all the different codecs:
    isCompressedInput = true;
    decompressor = CodecPool.getDecompressor(codec);
    if (start != 0) {
      throw new IOException(
          "Cannot seek in " + codec.getClass().getSimpleName() + " compressed stream");
    }
    in = new SplitLineReader(
        codec.createInputStream(fileIn, decompressor), job, this.recordDelimiterBytes);
    filePosition = fileIn;
    if (start != 0) {
      // (never reached here: start is always 0 since gzip isn't splittable)
      start += in.readLine(new Text(), 0, maxBytesToConsume(start));
    }
    this.pos = start;
  }

  private int maxBytesToConsume(long pos) {
    return isCompressedInput
        ? Integer.MAX_VALUE
        : (int) Math.max(Math.min(Integer.MAX_VALUE, end - pos), maxLineLength);
  }

  private long getFilePosition() throws IOException {
    long retVal;
    if (isCompressedInput && null != filePosition) {
      retVal = filePosition.getPos();
    } else {
      retVal = pos;
    }
    return retVal;
  }

  private int skipUtfByteOrderMark() throws IOException {
    int newMaxLineLength = (int) Math.min(3L + (long) maxLineLength,
        Integer.MAX_VALUE);
    int newSize = in.readLine(value, newMaxLineLength, maxBytesToConsume(pos));
    pos += newSize;
    int textLength = value.getLength();
    byte[] textBytes = value.getBytes();
    if ((textLength >= 3) && (textBytes[0] == (byte) 0xEF) &&
        (textBytes[1] == (byte) 0xBB) && (textBytes[2] == (byte) 0xBF)) {
      LOG.info("Found UTF-8 BOM and skipped it");
      textLength -= 3;
      newSize -= 3;
      if (textLength > 0) {
        textBytes = value.copyBytes();
        value.set(textBytes, 3, textLength);
      } else {
        value.clear();
      }
    }
    return newSize;
  }

  @Override
  public boolean nextKeyValue() throws IOException {
    if (key == null) {
      key = new LongWritable();
    }
    key.set(pos);
    if (value == null) {
      value = new Text();
    }
    int newSize = 0;
    while (getFilePosition() <= end || in.needAdditionalRecordAfterSplit()) {
      if (pos == 0) {
        newSize = skipUtfByteOrderMark();
      } else {
        newSize = in.readLine(value, maxLineLength, maxBytesToConsume(pos));
        pos += newSize;
      }
      if ((newSize == 0) || (newSize < maxLineLength)) {
        break;
      }
      LOG.info("Skipped line of size " + newSize + " at pos " +
          (pos - newSize));
    }
    if (newSize == 0) {
      key = null;
      value = null;
      return false;
    } else {
      return true;
    }
  }

  @Override
  public LongWritable getCurrentKey() {
    return key;
  }

  @Override
  public Text getCurrentValue() {
    return value;
  }

  @Override
  public float getProgress() throws IOException {
    if (start == end) {
      return 0.0f;
    } else {
      return Math.min(1.0f, (getFilePosition() - start) / (float) (end - start));
    }
  }

  @Override
  public synchronized void close() throws IOException {
    try {
      if (in != null) {
        in.close();
      }
    } finally {
      if (decompressor != null) {
        CodecPool.returnDecompressor(decompressor);
        decompressor = null;
      }
    }
  }
}
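As a quick sanity check (this snippet is not part of the original answer; paths and names are illustrative, assuming a local run), one can write a gzip-compressed file with Spark, strip its .gz extension, and read it back through the custom input format:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.io.compress.GzipCodec
import org.apache.hadoop.mapreduce.lib.input.GzipInputFormatWithoutExtention

// Write a small gzip-compressed text file (single partition -> part-00000.gz):
sparkContext.parallelize(Seq("<a>1</a>", "<a>2</a>"), 1)
  .saveAsTextFile("/tmp/gzip-test", classOf[GzipCodec])

// Rename the part file so it loses its .gz extension:
val fs = FileSystem.get(sparkContext.hadoopConfiguration)
fs.rename(new Path("/tmp/gzip-test/part-00000.gz"),
          new Path("/tmp/gzip-test-noext"))

// Read it back with the custom input format; this should print the two lines:
sparkContext
  .newAPIHadoopFile(
    "/tmp/gzip-test-noext",
    classOf[GzipInputFormatWithoutExtention],
    classOf[LongWritable],
    classOf[Text],
    new Configuration(sparkContext.hadoopConfiguration))
  .map { case (_, text) => text.toString }
  .collect()
  .foreach(println)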