apache-spark - Excluding a spark-core dependency in CDH
Problem description
I am using Spark Structured Streaming to write data coming from Kafka into HBase.
My cluster distribution is Hadoop 3.0.0-cdh6.2.0, and I am using Spark 2.4.0.
My code is as follows:
// Read from Kafka and cast the key/value to strings
val df = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", bootstrapServers)
  .option("subscribe", topic)
  .option("failOnDataLoss", false)
  .load()
  .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")

// Write each micro-batch to HBase through the HBase datasource
df.writeStream
  .foreachBatch { (batchDF: Dataset[Row], batchId: Long) =>
    batchDF.write
      .options(Map(HBaseTableCatalog.tableCatalog -> catalog, HBaseTableCatalog.newTable -> "6"))
      .format("org.apache.spark.sql.execution.datasources.hbase")
      .save()
  }
  .option("checkpointLocation", checkpointDirectory)
  .start()
  .awaitTermination()
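The catalog referenced above is not shown in the post; it is the HBase table catalog JSON that HBaseTableCatalog parses, which is exactly where json4s comes in. A minimal sketch, assuming one column family and placeholder table/column names:

// Hypothetical catalog: maps the HBase rowkey and one column family "cf"
// to the DataFrame columns "key" and "value"
val catalog: String =
  s"""{
     |  "table":   {"namespace": "default", "name": "my_table"},
     |  "rowkey":  "key",
     |  "columns": {
     |    "key":   {"cf": "rowkey", "col": "key",   "type": "string"},
     |    "value": {"cf": "cf",     "col": "value", "type": "string"}
     |  }
     |}""".stripMargin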
HBaseTableCatalog uses the json4s-jackson_2.11 library. That library is already pulled in by Spark Core, but at the wrong version, which creates a conflict...
To work around this, I excluded json4s-jackson_2.11 from spark-core and added a downgraded version to my pom:
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-core_2.11</artifactId>
  <version>2.4.0-cdh6.2.0</version>
  <exclusions>
    <exclusion>
      <groupId>org.json4s</groupId>
      <artifactId>json4s-jackson_2.11</artifactId>
    </exclusion>
  </exclusions>
</dependency>
<dependency>
  <groupId>org.json4s</groupId>
  <artifactId>json4s-jackson_2.11</artifactId>
  <version>3.2.11</version>
</dependency>
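You can verify that the exclusion took effect by inspecting the resolved dependency tree, for example:

mvn dependency:tree -Dincludes=org.json4s

This should show only the 3.2.11 version declared in the pom.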
When I run the code on my local machine it works fine, but when I submit it to the Cloudera cluster I get this first library-conflict error:
Caused by: java.lang.NoSuchMethodError: org.json4s.jackson.JsonMethods$.parse(Lorg/json4s/JsonInput;Z)Lorg/json4s/JsonAST$JValue;
at org.apache.spark.sql.execution.datasources.hbase.HBaseTableCatalog$.apply(HBaseTableCatalog.scala:257)
at org.apache.spark.sql.execution.datasources.hbase.HBaseRelation.<init>(HBaseRelation.scala:80)
at org.apache.spark.sql.execution.datasources.hbase.DefaultSource.createRelation(HBaseRelation.scala:59)
at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:86)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:668)
at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:668)
at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:668)
at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:276)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:270)
at com.App$$anonfun$main$1.apply(App.scala:129)
at com.App$$anonfun$main$1.apply(App.scala:126)
I know the cluster has its own Hadoop and Spark libraries and uses them, so in the spark-submit I set the confs spark.driver.userClassPathFirst and spark.executor.userClassPathFirst to true, but then I got another error that I don't understand:
Exception in thread "main" java.lang.ExceptionInInitializerError
at org.apache.spark.deploy.yarn.YarnSparkHadoopUtil$.<init>(YarnSparkHadoopUtil.scala:48)
at org.apache.spark.deploy.yarn.YarnSparkHadoopUtil$.<clinit>(YarnSparkHadoopUtil.scala)
at org.apache.spark.deploy.yarn.Client$$anonfun$1.apply$mcJ$sp(Client.scala:83)
at org.apache.spark.deploy.yarn.Client$$anonfun$1.apply(Client.scala:83)
at org.apache.spark.deploy.yarn.Client$$anonfun$1.apply(Client.scala:83)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.deploy.yarn.Client.<init>(Client.scala:82)
at org.apache.spark.deploy.yarn.YarnClusterApplication.start(Client.scala:1603)
at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:851)
at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:167)
at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:195)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:926)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:935)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.ClassCastException: org.apache.hadoop.yarn.api.records.impl.pb.PriorityPBImpl cannot be cast to org.apache.hadoop.yarn.api.records.Priority
at org.apache.hadoop.yarn.api.records.Priority.newInstance(Priority.java:39)
at org.apache.hadoop.yarn.api.records.Priority.<clinit>(Priority.java:34)
... 15 more
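For reference, the submit that triggers this would look roughly like the sketch below (the jar name, master, and deploy mode are placeholders, not taken from the original post; only com.App appears in the stack trace):

spark-submit \
  --master yarn \
  --deploy-mode client \
  --conf spark.driver.userClassPathFirst=true \
  --conf spark.executor.userClassPathFirst=true \
  --class com.App \
  my-streaming-app.jar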
In the end, what I want is to make Spark use the json4s-jackson_2.11 from my pom instead of the one bundled with Spark Core.
Solution
To solve this, do not use spark.driver.userClassPathFirst and spark.executor.userClassPathFirst; use spark.driver.extraClassPath and spark.executor.extraClassPath instead.
Definition from the official documentation: "Extra classpath entries to prepend to the classpath of the driver."
Note "prepend", i.e. these entries are put ahead of Spark's core classpath.
Example:
--conf spark.driver.extraClassPath=C:\Users\Khalid\Documents\Projects\libs\jackson-annotations-2.6.0.jar;C:\Users\Khalid\Documents\Projects\libs\jackson-core-2.6.0.jar;C:\Users\Khalid\Documents\Projects\libs\jackson-databind-2.6.0.jar
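The example above is from a Windows machine, where the classpath separator is ;. On a Linux cluster the separator is :, and the executors need the same setting; a sketch with placeholder paths:

--conf spark.driver.extraClassPath=/opt/libs/jackson-annotations-2.6.0.jar:/opt/libs/jackson-core-2.6.0.jar:/opt/libs/jackson-databind-2.6.0.jar \
--conf spark.executor.extraClassPath=/opt/libs/jackson-annotations-2.6.0.jar:/opt/libs/jackson-core-2.6.0.jar:/opt/libs/jackson-databind-2.6.0.jar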
This solved my problem (the conflict between the Jackson version I wanted to use and the one Spark was using).
Hope this helps.