scala - Spark Scala Avro write fails with AbstractMethodError
Problem
我试图从 avro 读取数据,按字段重新分区数据并将其保存为 avro 格式。下面是我的示例代码。在调试过程中,我无法在我的数据帧上执行 show(10)。它失败并出现以下错误。有人可以帮我理解我在我的代码行中做错了什么吗?
代码:
import org.apache.spark.sql.avro._
val df = spark.read.format("avro").load("s3://test-bucket/source.avro")
df.show(10)
df.write.partitionBy("partitioning_column").format("avro").save("s3://test-bucket/processed/processed.avro")
Both show and write fail with the following error:
java.lang.AbstractMethodError: org.apache.spark.sql.avro.AvroFileFormat.shouldPrefetchData(Lorg/apache/spark/sql/SparkSession;Lorg/apache/spark/sql/types/StructType;Lorg/apache/spark/sql/types/StructType;)Z
at org.apache.spark.sql.execution.FileSourceScanExec.inputRDD$lzycompute(DataSourceScanExec.scala:309)
at org.apache.spark.sql.execution.FileSourceScanExec.inputRDD(DataSourceScanExec.scala:305)
at org.apache.spark.sql.execution.FileSourceScanExec.doExecute(DataSourceScanExec.scala:404)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:156)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
at org.apache.spark.sql.execution.ProjectExec.doExecute(basicPhysicalOperators.scala:70)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:156)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
at org.apache.spark.sql.execution.SparkPlan.getByteArrayRdd(SparkPlan.scala:283)
at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:375)
at org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:38)
at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collectFromPlan(Dataset.scala:3389)
at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2550)
at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2550)
at org.apache.spark.sql.Dataset$$anonfun$52.apply(Dataset.scala:3370)
at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3369)
at org.apache.spark.sql.Dataset.head(Dataset.scala:2550)
at org.apache.spark.sql.Dataset.take(Dataset.scala:2764)
at org.apache.spark.sql.Dataset.getRows(Dataset.scala:254)
at org.apache.spark.sql.Dataset.showString(Dataset.scala:291)
at org.apache.spark.sql.Dataset.show(Dataset.scala:751)
at org.apache.spark.sql.Dataset.show(Dataset.scala:710)
at org.apache.spark.sql.Dataset.show(Dataset.scala:719)
... 85 elided
Solution
This is caused by an unintentional binary-incompatible change to FileFormat in emr-5.28.0, which will be fixed in the emr-5.29.0 release. Fortunately, for the Avro format there is an easy workaround on emr-5.28.0: instead of using the spark-avro jar bundled with EMR, use the spark-avro release from Maven Central. That is, use --packages org.apache.spark:spark-avro_2.11:2.4.4 instead of something like --jars /usr/lib/spark/external/lib/spark-avro.jar.
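As a concrete sketch, the workaround described above would look like this when launching a Spark shell on emr-5.28.0 (the exact Spark version on your cluster may differ; 2.4.4 here matches the artifact named in the answer):

```shell
# Pull spark-avro 2.4.4 (Scala 2.11) from Maven Central at startup,
# instead of pointing --jars at the binary-incompatible jar that EMR
# bundles under /usr/lib/spark/external/lib/
spark-shell --packages org.apache.spark:spark-avro_2.11:2.4.4

# The same flag works for batch jobs:
spark-submit --packages org.apache.spark:spark-avro_2.11:2.4.4 my_job.jar
```

With --packages, Spark resolves the artifact and its dependencies from Maven Central and prepends them to the driver and executor classpaths, so the bundled jar is never loaded.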