scala - ORC read throws "java.lang.NegativeArraySizeException" when using the native ORC impl
Problem description
The Spark native ORC reader is not working as expected. Please find the details below.
import org.apache.spark.sql.{Dataset, Encoders, SparkSession}
case class GateDoc(var xml: Array[Byte], var cknid: String = null)
spark.conf.set("spark.sql.orc.impl","native")
import spark.implicits._
val df = spark.read.schema(Encoders.product[GateDoc].schema).orc(inputFile).as[GateDoc] // problem here: reading throws the exception below
df.write.orc(op)
Throws:
java.lang.NegativeArraySizeException
at org.apache.orc.impl.TreeReaderFactory$BytesColumnVectorUtil.commonReadByteArrays(TreeReaderFactory.java:1506)
at org.apache.orc.impl.TreeReaderFactory$BytesColumnVectorUtil.readOrcByteArrays(TreeReaderFactory.java:1528)
at org.apache.orc.impl.TreeReaderFactory$BinaryTreeReader.nextVector(TreeReaderFactory.java:878)
at org.apache.orc.impl.TreeReaderFactory$StructTreeReader.nextBatch(TreeReaderFactory.java:2012)
at org.apache.orc.impl.RecordReaderImpl.nextBatch(RecordReaderImpl.java:1284)
at org.apache.spark.sql.execution.datasources.orc.OrcColumnarBatchReader.nextBatch(OrcColumnarBatchReader.java:227)
at org.apache.spark.sql.execution.datasources.orc.OrcColumnarBatchReader.nextKeyValue(OrcColumnarBatchReader.java:109)
at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:130)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:215)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:130)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.scan_nextBatch_0$(Unknown Source)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:232)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:170)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:169)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:123)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
However, it works fine with spark.sql.orc.impl = hive:
import org.apache.spark.sql.{Dataset, Encoders, SparkSession}
case class GateDoc(var xml: Array[Byte], var cknid: String = null)
spark.conf.set("spark.sql.orc.impl","hive")
import spark.implicits._
val df = spark.read.schema(Encoders.product[GateDoc].schema).orc(inputFile).as[GateDoc]
df.write.orc(op)
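Since the stack trace goes through OrcColumnarBatchReader (the vectorized path of the native reader), another workaround worth trying — an assumption on my part, not a confirmed fix for this bug — is to keep the native impl but disable the vectorized ORC reader via the standard `spark.sql.orc.enableVectorizedReader` config:

```scala
import org.apache.spark.sql.{Encoders, SparkSession}

// Hypothetical sketch: keep the native ORC impl but fall back to the
// non-vectorized (row-based) read path, bypassing OrcColumnarBatchReader.
// `inputFile` and `op` are placeholders from the original repro.
spark.conf.set("spark.sql.orc.impl", "native")
spark.conf.set("spark.sql.orc.enableVectorizedReader", "false")

import spark.implicits._
val df = spark.read.schema(Encoders.product[GateDoc].schema).orc(inputFile).as[GateDoc]
df.write.orc(op)
```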
I understand why a java.lang.NegativeArraySizeException would be thrown in my use case, but why is the reader computing a negative length for my case class's array-type field in the first place?
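For context, this exception is not Spark-specific: the JVM throws NegativeArraySizeException whenever an array is allocated with a negative length, which is what happens inside commonReadByteArrays when the reader ends up with a miscomputed byte-array length. A minimal illustration:

```scala
// Minimal sketch: allocating an array with a negative length — e.g. a
// corrupt or miscomputed ORC byte-array length — throws
// java.lang.NegativeArraySizeException at the allocation site.
object NegativeAlloc {
  def main(args: Array[String]): Unit = {
    val badLength = -1 // stand-in for a miscomputed length
    try {
      val arr = new Array[Byte](badLength)
      println(arr.length) // never reached
    } catch {
      case e: NegativeArraySizeException =>
        println(s"caught: ${e.getClass.getSimpleName}")
    }
  }
}
```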
I also checked the metadata of some partitions, as shown below:
java -jar /usr/lib/spark/jars/orc-tools-1.5.5-uber.jar data part-00.snappy.orc
java -jar /usr/lib/spark/jars/orc-tools-1.5.5-uber.jar meta part-00.snappy.orc
and the output looks fine.
More details about the environment:
Scala version 2.11.12
Spark Version 2.4.4
Orc Version 1.5.5
EMR emr-5.29.0
Please help. I suspect there is a bug in the native ORC reader.
Solution