首页 > 解决方案 > Spark Scala EMR 作业无法从 S3 下载文件

问题描述

我有一个 spark scala 作业,在 AWS EMR 生产中一直失败,我在执行程序中看到的第一个错误是 this Download failed error。我在 S3 中查看了这些文件,我什至将文件复制到了较低的环境并针对它运行了相同的工作,一切都按预期工作。较低的环境需要处理的数据较少,但除此之外,我不确定为什么会遇到这个问题。生产文件夹确实有一个每小时运行一次并写入新数据的 Glue 作业,但我尝试在胶水作业暂停的情况下运行 emr 作业,但仍然遇到此错误。除此之外,没有编写任何新内容,并且尝试访问的一些文件已有数月之久并且存在数据。

2021-05-20 00:40:15 ERROR S3FSInputStream:295 - Unable to recover reading from stream
2021-05-20 00:40:15 ERROR AsyncFileDownloader:91 - TID: 3497 - Download failed for file path: s3://bucket/folder/part-000.snappy.parquet, range: 0-20427, partition values: [empty row], isDataPresent: false
java.io.IOException: Unexpected end of stream pos=4, contentLength=20427
    at com.amazon.ws.emr.hadoop.fs.s3.S3FSInputStream.read(S3FSInputStream.java:296)
    at org.apache.commons.io.IOUtils.read(IOUtils.java:2454)
    at org.apache.commons.io.IOUtils.readFully(IOUtils.java:2537)
    at org.apache.hadoop.util.ByteBufferIOUtils.readFullyHeapBuffer(ByteBufferIOUtils.java:89)
    at org.apache.hadoop.util.ByteBufferIOUtils.readFully(ByteBufferIOUtils.java:53)
    at com.amazon.ws.emr.hadoop.fs.s3.AbstractS3FSInputStream.readFullyIntoBuffers(AbstractS3FSInputStream.java:97)
    at org.apache.hadoop.fs.BufferedFSInputStream.readFullyIntoBuffers(BufferedFSInputStream.java:137)
    at org.apache.hadoop.fs.FSDataInputStream.readFullyIntoBuffers(FSDataInputStream.java:270)
    at org.apache.parquet.hadoop.util.H1SeekableInputStream.readFullyIntoBuffers(H1SeekableInputStream.java:64)
    at org.apache.parquet.hadoop.ParquetFileReader$ConsecutiveChunkList.readAll(ParquetFileReader.java:1181)
    at org.apache.parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:806)
    at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildPrefetcherWithPartitionValues$1.apply(ParquetFileFormat.scala:634)
    at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildPrefetcherWithPartitionValues$1.apply(ParquetFileFormat.scala:576)
    at org.apache.spark.sql.execution.datasources.AsyncFileDownloader.org$apache$spark$sql$execution$datasources$AsyncFileDownloader$$downloadFile(AsyncFileDownloader.scala:93)
    at org.apache.spark.sql.execution.datasources.AsyncFileDownloader$$anonfun$initiateFilesDownload$2$$anon$1.call(AsyncFileDownloader.scala:73)
    at org.apache.spark.sql.execution.datasources.AsyncFileDownloader$$anonfun$initiateFilesDownload$2$$anon$1.call(AsyncFileDownloader.scala:72)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.AbortedException: 
    at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleInterruptedException(AmazonHttpClient.java:868)
    at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:746)
    at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.access$500(AmazonHttpClient.java:704)
    at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutionBuilderImpl.execute(AmazonHttpClient.java:686)
    at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:550)
    at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:530)
    at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:5140)
    at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:5086)
    at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.AmazonS3Client.getObject(AmazonS3Client.java:1490)
    at com.amazon.ws.emr.hadoop.fs.s3.lite.call.GetObjectCall.perform(GetObjectCall.java:24)
    at com.amazon.ws.emr.hadoop.fs.s3.lite.call.GetObjectCall.perform(GetObjectCall.java:8)
    at com.amazon.ws.emr.hadoop.fs.s3.lite.executor.GlobalS3Executor.execute(GlobalS3Executor.java:114)
    at com.amazon.ws.emr.hadoop.fs.s3.lite.AmazonS3LiteClient.invoke(AmazonS3LiteClient.java:191)
    at com.amazon.ws.emr.hadoop.fs.s3.lite.AmazonS3LiteClient.getObject(AmazonS3LiteClient.java:102)
    at com.amazon.ws.emr.hadoop.fs.s3.GetObjectInputStreamWithInfoFactory.create(GetObjectInputStreamWithInfoFactory.java:63)
    at com.amazon.ws.emr.hadoop.fs.s3.S3FSInputStream.open(S3FSInputStream.java:199)
    at com.amazon.ws.emr.hadoop.fs.s3.S3FSInputStream.retrieveInputStreamWithInfo(S3FSInputStream.java:390)
    at com.amazon.ws.emr.hadoop.fs.s3.S3FSInputStream.reopenStream(S3FSInputStream.java:377)
    at com.amazon.ws.emr.hadoop.fs.s3.S3FSInputStream.read(S3FSInputStream.java:259)
    ... 19 more
Caused by: com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.timers.client.SdkInterruptedException
    at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.checkInterrupted(AmazonHttpClient.java:923)
    at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.afterAttempt(AmazonHttpClient.java:1073)
    at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1196)
    at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:802)
    at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeWithTimer(AmazonHttpClient.java:770)
    at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:744)
    ... 36 more

标签: amazon-web-servicesapache-sparkamazon-s3parquetsnappy

解决方案


推荐阅读