amazon-web-services - Spark Scala EMR job fails to download files from S3
Problem description
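I have a Spark Scala job that keeps failing in AWS EMR production. The first error I see in the executors is the "Download failed" error shown below. I checked the files in S3, and I even copied them to a lower environment and ran the same job against them; everything worked as expected there. The lower environment has less data to process, but other than that I am not sure why I am running into this problem. The production folder does have a Glue job that runs every hour and writes new data, but I tried running the EMR job with the Glue job paused and still hit this error. Apart from that Glue job, nothing else writes to the folder, and some of the files the job tries to access are months old and do contain data.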
2021-05-20 00:40:15 ERROR S3FSInputStream:295 - Unable to recover reading from stream
2021-05-20 00:40:15 ERROR AsyncFileDownloader:91 - TID: 3497 - Download failed for file path: s3://bucket/folder/part-000.snappy.parquet, range: 0-20427, partition values: [empty row], isDataPresent: false
java.io.IOException: Unexpected end of stream pos=4, contentLength=20427
at com.amazon.ws.emr.hadoop.fs.s3.S3FSInputStream.read(S3FSInputStream.java:296)
at org.apache.commons.io.IOUtils.read(IOUtils.java:2454)
at org.apache.commons.io.IOUtils.readFully(IOUtils.java:2537)
at org.apache.hadoop.util.ByteBufferIOUtils.readFullyHeapBuffer(ByteBufferIOUtils.java:89)
at org.apache.hadoop.util.ByteBufferIOUtils.readFully(ByteBufferIOUtils.java:53)
at com.amazon.ws.emr.hadoop.fs.s3.AbstractS3FSInputStream.readFullyIntoBuffers(AbstractS3FSInputStream.java:97)
at org.apache.hadoop.fs.BufferedFSInputStream.readFullyIntoBuffers(BufferedFSInputStream.java:137)
at org.apache.hadoop.fs.FSDataInputStream.readFullyIntoBuffers(FSDataInputStream.java:270)
at org.apache.parquet.hadoop.util.H1SeekableInputStream.readFullyIntoBuffers(H1SeekableInputStream.java:64)
at org.apache.parquet.hadoop.ParquetFileReader$ConsecutiveChunkList.readAll(ParquetFileReader.java:1181)
at org.apache.parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:806)
at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildPrefetcherWithPartitionValues$1.apply(ParquetFileFormat.scala:634)
at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildPrefetcherWithPartitionValues$1.apply(ParquetFileFormat.scala:576)
at org.apache.spark.sql.execution.datasources.AsyncFileDownloader.org$apache$spark$sql$execution$datasources$AsyncFileDownloader$$downloadFile(AsyncFileDownloader.scala:93)
at org.apache.spark.sql.execution.datasources.AsyncFileDownloader$$anonfun$initiateFilesDownload$2$$anon$1.call(AsyncFileDownloader.scala:73)
at org.apache.spark.sql.execution.datasources.AsyncFileDownloader$$anonfun$initiateFilesDownload$2$$anon$1.call(AsyncFileDownloader.scala:72)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.AbortedException:
at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleInterruptedException(AmazonHttpClient.java:868)
at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:746)
at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.access$500(AmazonHttpClient.java:704)
at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutionBuilderImpl.execute(AmazonHttpClient.java:686)
at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:550)
at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:530)
at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:5140)
at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:5086)
at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.AmazonS3Client.getObject(AmazonS3Client.java:1490)
at com.amazon.ws.emr.hadoop.fs.s3.lite.call.GetObjectCall.perform(GetObjectCall.java:24)
at com.amazon.ws.emr.hadoop.fs.s3.lite.call.GetObjectCall.perform(GetObjectCall.java:8)
at com.amazon.ws.emr.hadoop.fs.s3.lite.executor.GlobalS3Executor.execute(GlobalS3Executor.java:114)
at com.amazon.ws.emr.hadoop.fs.s3.lite.AmazonS3LiteClient.invoke(AmazonS3LiteClient.java:191)
at com.amazon.ws.emr.hadoop.fs.s3.lite.AmazonS3LiteClient.getObject(AmazonS3LiteClient.java:102)
at com.amazon.ws.emr.hadoop.fs.s3.GetObjectInputStreamWithInfoFactory.create(GetObjectInputStreamWithInfoFactory.java:63)
at com.amazon.ws.emr.hadoop.fs.s3.S3FSInputStream.open(S3FSInputStream.java:199)
at com.amazon.ws.emr.hadoop.fs.s3.S3FSInputStream.retrieveInputStreamWithInfo(S3FSInputStream.java:390)
at com.amazon.ws.emr.hadoop.fs.s3.S3FSInputStream.reopenStream(S3FSInputStream.java:377)
at com.amazon.ws.emr.hadoop.fs.s3.S3FSInputStream.read(S3FSInputStream.java:259)
... 19 more
Caused by: com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.timers.client.SdkInterruptedException
at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.checkInterrupted(AmazonHttpClient.java:923)
at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.afterAttempt(AmazonHttpClient.java:1073)
at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1196)
at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:802)
at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeWithTimer(AmazonHttpClient.java:770)
at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:744)
... 36 more
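Reading the trace above from the bottom up, the `IOException: Unexpected end of stream` is caused by an `AbortedException`, which in turn is caused by an `SdkInterruptedException`. That pattern suggests the download thread was interrupted while the S3 request was in flight, rather than the object being missing or corrupt, although the trace alone does not say what interrupted it. The following is a minimal sketch, not part of the original job, for classifying such failures when triaging executor logs. The object name `CauseChain` and its methods are hypothetical helpers introduced for illustration, and matching on class-name suffixes is an assumption made because the AWS SDK classes are shaded inside EMRFS.

```scala
import scala.annotation.tailrec

// Hypothetical helper (not from the original post): inspects an exception's
// cause chain to tell interrupted/aborted requests apart from other I/O errors.
object CauseChain {

  /** Returns the exception followed by all of its causes, outermost first. */
  def causes(t: Throwable): List[Throwable] = {
    @tailrec
    def loop(cur: Throwable, acc: List[Throwable]): List[Throwable] =
      // Stop at the end of the chain, or if a cause refers back to itself.
      if (cur == null || acc.exists(_ eq cur)) acc.reverse
      else loop(cur.getCause, cur :: acc)
    loop(t, Nil)
  }

  /** True if any link in the chain looks like a thread interruption.
    * Class names are matched by suffix because the SDK is shaded
    * (com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws...). */
  def looksInterrupted(t: Throwable): Boolean =
    causes(t).exists { c =>
      val name = c.getClass.getName
      c.isInstanceOf[InterruptedException] ||
        name.endsWith("AbortedException") ||
        name.endsWith("SdkInterruptedException")
    }
}
```

For example, `CauseChain.looksInterrupted(e)` could be called in a `catch` block around a read to decide whether to log the failure as a cancellation or as a genuine read error. It only classifies the exception; it does not explain why the thread was interrupted.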
Solution