apache-spark - How to resolve NoClassDefFoundError: org/apache/spark/sql/types/DataType in an AWS EMR cluster?
Problem description
When submitting a Spark job on AWS EMR (v5.23.0), I get the following error:
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/spark/sql/types/DataType
at etl.SparkDataProcessor$.processTransactionData(SparkDataProcessor.scala:51)
at etl.SparkDataProcessor$.delayedEndpoint$etl$SparkDataProcessor$1(SparkDataProcessor.scala:17)
at etl.SparkDataProcessor$delayedInit$body.apply(SparkDataProcessor.scala:11)
at scala.Function0$class.apply$mcV$sp(Function0.scala:40)
at scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:12)
at scala.App$$anonfun$main$1.apply(App.scala:76)
at scala.App$$anonfun$main$1.apply(App.scala:76)
at scala.collection.immutable.List.foreach(List.scala:383)
at scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:35)
at scala.App$class.main(App.scala:76)
at etl.SparkDataProcessor$.main(SparkDataProcessor.scala:11)
at etl.SparkDataProcessor.main(SparkDataProcessor.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hadoop.util.RunJar.run(RunJar.java:239)
at org.apache.hadoop.util.RunJar.main(RunJar.java:153)
Caused by: java.lang.ClassNotFoundException: org.apache.spark.sql.types.DataType
at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
at java.lang.ClassLoader.loadClass(ClassLoader.java:419)
at java.lang.ClassLoader.loadClass(ClassLoader.java:352)
... 18 more
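A NoClassDefFoundError here means the JVM that launched the job could not find the Spark SQL classes at runtime. One quick sanity check is to list the fat JAR's contents (the path below is an assumption based on sbt's default output layout for Scala 2.11 and the jar name in the build file):

```shell
# Check whether the Spark SQL classes made it into the assembly JAR.
# With spark-sql marked "provided" they should NOT be in the JAR, so the
# launcher (spark-submit) must supply them at runtime instead.
jar tf target/scala-2.11/blah.jar | grep 'org/apache/spark/sql/types/DataType' \
  || echo "DataType not in the fat JAR - it must come from the runtime classpath"
```

If the class is absent from the JAR, the error points at how the JAR is launched, not at the build.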
I tried the fixes from other Stack Overflow posts about the same error, but with no luck. Running the application locally in IntelliJ works fine, and I build the fat JAR with sbt assembly. Below is my build.sbt file.
Note: I even added assemblyExcludedJars to see whether it would help; previously it was not there.
name := "blah"
version := "0.1"
scalaVersion := "2.11.0"
sparkVersion := "2.4.0"

artifactName := { (sv: ScalaVersion, module: ModuleID, artifact: Artifact) =>
  artifact.name + "_" + sv.binary + "-" + sparkVersion.value + "_" + module.revision + "." + artifact.extension
}

lazy val doobieVersion = "0.8.6"

// Dependencies
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "2.4.0" % "provided",
  "org.apache.spark" %% "spark-sql" % "2.4.0" % "provided",
  "org.scalatest" %% "scalatest" % "3.0.8",
  "org.apache.hadoop" % "hadoop-common" % "2.9.2" % "provided",
  "org.apache.hadoop" % "hadoop-aws" % "2.9.2" % "provided",
  "com.amazonaws" % "aws-java-sdk-s3" % "1.11.46",
  "com.google.guava" % "guava" % "19.0",
  "com.typesafe.slick" %% "slick" % "3.3.1",
  "com.typesafe.slick" %% "slick-hikaricp" % "3.3.1",
  "mysql" % "mysql-connector-java" % "6.0.6",
  "com.microsoft.sqlserver" % "mssql-jdbc" % "8.2.0.jre8",
  // "com.github.geirolz" %% "advxml" % "2.0.0-RC1",
  "org.scalaj" %% "scalaj-http" % "2.4.2",
  "org.json4s" %% "json4s-native" % "3.6.7",
  "io.jvm.uuid" %% "scala-uuid" % "0.3.1"
)

// JVM Options
javaOptions ++= Seq("-Xms512m", "-Xmx2048M", "-XX:+CMSClassUnloadingEnabled")

// SBT Test Options
fork in Test := true
testOptions in Test += Tests.Argument(TestFrameworks.ScalaTest, "-oD")

assemblyExcludedJars in assembly := {
  // Exclude conflicting jars
  val cp = (fullClasspath in assembly).value
  cp.filter { f =>
    f.data.getName.contains("spark-core") ||
    f.data.getName.contains("spark-sql")
  }
}

// SBT Assembly Options
assemblyJarName in assembly := "blah.jar"
assemblyMergeStrategy in assembly := {
  case PathList("META-INF", xs @ _*) => MergeStrategy.discard
  case "reference.conf"              => MergeStrategy.concat
  case x                             => MergeStrategy.first
}
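Because spark-core and spark-sql are marked "provided", the assembly JAR only runs where a Spark distribution supplies those classes. A local smoke test of the packaged JAR (class name taken from the stack trace, jar path assumed from sbt's default output layout, spark-submit assumed to be on PATH) might look like:

```shell
# Build the fat JAR, then run it through spark-submit so the local Spark
# installation provides the "provided" dependencies (spark-core, spark-sql).
# Launching it with plain `java -jar` or `hadoop jar` instead would
# reproduce the NoClassDefFoundError, since Spark classes are excluded.
sbt assembly
spark-submit \
  --class etl.SparkDataProcessor \
  --master 'local[*]' \
  target/scala-2.11/blah.jar
```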
Solution
I was able to get my program running via the SDK. I needed to make two adjustments:
1) Add an extra command to my env.sh step to update HADOOP_CLASSPATH (the stack trace shows the JAR being launched through org.apache.hadoop.util.RunJar, which only sees Spark's classes if they are on the Hadoop classpath):
echo 'export HADOOP_CLASSPATH="$HADOOP_CLASSPATH:/usr/lib/spark/jars/*"' | sudo tee -a /etc/hadoop/conf/hadoop-env.sh
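After that step runs, the appended export can be verified on the master node (the path below is the default EMR location of hadoop-env.sh; adjust if your distribution differs):

```shell
# Confirm the export was appended to hadoop-env.sh on the master node
grep 'spark/jars' /etc/hadoop/conf/hadoop-env.sh
```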
2) Update my step by removing certain arguments (commented out below):
$flowObj->Steps = array(
    [
        'ActionOnFailure' => 'CONTINUE', // or TERMINATE_CLUSTER
        'HadoopJarStep' => array(
            'Args' => array(
                's3://parentFolder/subFolder/env.sh',
            ),
            'Jar' => 's3://us-east-1.elasticmapreduce/libs/script-runner/script-runner.jar'
        ),
        'Name' => 'hadoop-env.sh step' // The name of the step.
    ],
    [
        'ActionOnFailure' => 'TERMINATE_CLUSTER',
        'HadoopJarStep' => array(
            'Args' => array(
                'spark-submit',
                //'--deploy-mode',
                //'cluster',
                //'yarn',
                '--class',
                'project.DataProcessor',
                's3://parentFolder/subFolder/Project.jar'
            ), // A list of command line arguments passed to the JAR file's main function when executed.
            'Jar' => 'command-runner.jar', // A path to a JAR file run during the step.
            //'MainClass' => 'project.DataProcessor', // already specified in the Manifest of the fat JAR
        ),
        'Name' => 'Spark Step' // The name of the step.
    ]);
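The same two steps can also be added from the AWS CLI instead of the SDK; a sketch follows (the cluster id is a placeholder, the S3 paths and jar names are taken from the SDK snippet above, and the `--steps` shorthand follows `aws emr add-steps`):

```shell
# Equivalent of the two SDK steps above, via the AWS CLI.
# j-XXXXXXXXXXXXX is a placeholder cluster id.
aws emr add-steps --cluster-id j-XXXXXXXXXXXXX --steps \
  'Type=CUSTOM_JAR,Name=hadoop-env.sh step,ActionOnFailure=CONTINUE,Jar=s3://us-east-1.elasticmapreduce/libs/script-runner/script-runner.jar,Args=[s3://parentFolder/subFolder/env.sh]' \
  'Type=CUSTOM_JAR,Name=Spark Step,ActionOnFailure=TERMINATE_CLUSTER,Jar=command-runner.jar,Args=[spark-submit,--class,project.DataProcessor,s3://parentFolder/subFolder/Project.jar]'
```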