Zeppelin 0.9 runs Spark notebooks in YARN client mode, but not in YARN cluster mode

Problem description

I have just set up Zeppelin (version 0.9) to run with Hadoop (3.3.0) and Spark (3.1.2). I am trying to run the example code given on the Zeppelin website:

val bankText = sc.textFile("file:///path/to/bank/bank-full.csv")

case class Bank(age:Integer, job:String, marital : String, education : String, balance : Integer)

// split each line, filter out header (starts with "age"), and map it into Bank case class
val bank = bankText.map(s=>s.split(";")).filter(s=>s(0)!="\"age\"").map(
    s=>Bank(s(0).toInt, 
            s(1).replaceAll("\"", ""),
            s(2).replaceAll("\"", ""),
            s(3).replaceAll("\"", ""),
            s(5).replaceAll("\"", "").toInt
        )
)

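// convert to DataFrame and register it as a temporary view
// (createOrReplaceTempView replaces registerTempTable, deprecated since Spark 2.0)
bank.toDF().createOrReplaceTempView("bank")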

Then:

%sql select age, count(1) from bank where age < 30 group by age order by age

When I run it in YARN client mode, it works and produces the expected output. When I try to run it in YARN cluster mode, however, it fails with the following error:

 INFO [2021-07-23 06:05:49,623] ({SchedulerFactory6} ProcessLauncher.java[transition]:109) - Process state is transitioned to LAUNCHED
 INFO [2021-07-23 06:05:49,624] ({SchedulerFactory6} ProcessLauncher.java[launch]:96) - Process is launched: [/opt/zeppelin/bin/interpreter.sh, -d, /opt/zeppelin/interpreter/spark, -c, 10.0.15.21, -p, 37921, -r, :, -i, spark-shared_process, -l, /opt/zeppelin/local-repo/spark, -g, spark]
 WARN [2021-07-23 06:05:59,666] ({Exec Default Executor} RemoteInterpreterManagedProcess.java[onProcessComplete]:255) - Process is exited with exit value 0
 INFO [2021-07-23 06:05:59,667] ({Exec Default Executor} ProcessLauncher.java[transition]:109) - Process state is transitioned to COMPLETED
 INFO [2021-07-23 06:06:06,192] ({pool-7-thread-5} RemoteInterpreterEventServer.java[registerInterpreterProcess]:176) - Register interpreter process: 10.0.15.21:43289, interpreterGroup: spark-shared_process
 INFO [2021-07-23 06:06:06,193] ({pool-7-thread-5} ProcessLauncher.java[transition]:109) - Process state is transitioned to RUNNING
 INFO [2021-07-23 06:06:06,193] ({SchedulerFactory6} RemoteInterpreterManagedProcess.java[start]:132) - Detected yarn app: application_1627000352599_0011, add it to YarnAppMonitor
ERROR [2021-07-23 06:06:06,194] ({SchedulerFactory6} Job.java[run]:174) - Job failed
java.lang.NoClassDefFoundError: org/apache/hadoop/conf/Configuration
    at org.apache.zeppelin.interpreter.remote.RemoteInterpreterManagedProcess.start(RemoteInterpreterManagedProcess.java:133)
    at org.apache.zeppelin.interpreter.ManagedInterpreterGroup.getOrCreateInterpreterProcess(ManagedInterpreterGroup.java:68)
    at org.apache.zeppelin.interpreter.remote.RemoteInterpreter.getOrCreateInterpreterProcess(RemoteInterpreter.java:104)
    at org.apache.zeppelin.interpreter.remote.RemoteInterpreter.internal_create(RemoteInterpreter.java:154)
    at org.apache.zeppelin.interpreter.remote.RemoteInterpreter.open(RemoteInterpreter.java:126)
    at org.apache.zeppelin.interpreter.remote.RemoteInterpreter.getFormType(RemoteInterpreter.java:271)
    at org.apache.zeppelin.notebook.Paragraph.jobRun(Paragraph.java:444)
    at org.apache.zeppelin.notebook.Paragraph.jobRun(Paragraph.java:72)
    at org.apache.zeppelin.scheduler.Job.run(Job.java:172)
    at org.apache.zeppelin.scheduler.AbstractScheduler.runJob(AbstractScheduler.java:132)
    at org.apache.zeppelin.scheduler.RemoteScheduler$JobRunner.run(RemoteScheduler.java:182)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.conf.Configuration
    at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
    ... 18 more
ERROR [2021-07-23 06:06:06,195] ({SchedulerFactory6} NotebookServer.java[onStatusChange]:1920) - Error
java.lang.NoClassDefFoundError: org/apache/hadoop/conf/Configuration
    at org.apache.zeppelin.interpreter.remote.RemoteInterpreterManagedProcess.start(RemoteInterpreterManagedProcess.java:133)
    at org.apache.zeppelin.interpreter.ManagedInterpreterGroup.getOrCreateInterpreterProcess(ManagedInterpreterGroup.java:68)
    at org.apache.zeppelin.interpreter.remote.RemoteInterpreter.getOrCreateInterpreterProcess(RemoteInterpreter.java:104)
    at org.apache.zeppelin.interpreter.remote.RemoteInterpreter.internal_create(RemoteInterpreter.java:154)
    at org.apache.zeppelin.interpreter.remote.RemoteInterpreter.open(RemoteInterpreter.java:126)
    at org.apache.zeppelin.interpreter.remote.RemoteInterpreter.getFormType(RemoteInterpreter.java:271)
    at org.apache.zeppelin.notebook.Paragraph.jobRun(Paragraph.java:444)
    at org.apache.zeppelin.notebook.Paragraph.jobRun(Paragraph.java:72)
    at org.apache.zeppelin.scheduler.Job.run(Job.java:172)
    at org.apache.zeppelin.scheduler.AbstractScheduler.runJob(AbstractScheduler.java:132)
    at org.apache.zeppelin.scheduler.RemoteScheduler$JobRunner.run(RemoteScheduler.java:182)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.conf.Configuration
    at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:351)

I am not sure what to do. I tried getting the Hadoop classpath by running hadoop classpath, which returned:

/opt/hadoop/etc/hadoop:/opt/hadoop/share/hadoop/common/lib/*:/opt/hadoop/share/hadoop/common/*:/opt/hadoop/share/hadoop/hdfs:/opt/hadoop/share/hadoop/hdfs/lib/*:/opt/hadoop/share/hadoop/hdfs/*:/opt/hadoop/share/hadoop/mapreduce/*:/opt/hadoop/share/hadoop/yarn:/opt/hadoop/share/hadoop/yarn/lib/*:/opt/hadoop/share/hadoop/yarn/*
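A classpath string like the one above can be checked mechanically to see which entries actually provide a given class. The following is a minimal sketch, not part of the original question: the names ClasspathScan, expand, contains and providers are hypothetical, and it assumes that a trailing /* means "all jars in that directory", as the JVM treats classpath wildcards. It avoids scala.util.Using so that it also compiles on Scala 2.12.

```scala
import java.io.{File, IOException}
import java.util.jar.JarFile

// Hypothetical helper: finds which classpath entries provide a given class.
object ClasspathScan {

  /** Expands one entry: "dir/*" becomes the jar files in dir; anything else is kept as-is. */
  def expand(entry: String): Seq[File] =
    if (entry.endsWith("/*")) {
      val dir = new File(entry.dropRight(2))
      // listFiles() returns null when dir does not exist or is not a directory
      Option(dir.listFiles()).map(_.toSeq).getOrElse(Seq.empty)
        .filter(f => f.isFile && f.getName.toLowerCase.endsWith(".jar"))
        .sortBy(_.getName)
    } else Seq(new File(entry))

  /** True if the directory or jar contains the resource path (e.g. "a/b/C.class"). */
  def contains(location: File, resource: String): Boolean =
    if (location.isDirectory) new File(location, resource).isFile
    else if (location.isFile) {
      try {
        val jar = new JarFile(location)
        try jar.getEntry(resource) != null finally jar.close()
      } catch {
        case _: IOException => false // not a readable jar; treat as "does not contain"
      }
    } else false

  /** Lists every classpath location that provides className. */
  def providers(classpath: String, className: String): Seq[File] = {
    val resource = className.replace('.', '/') + ".class"
    classpath.split(File.pathSeparator).toSeq
      .filter(_.nonEmpty)
      .flatMap(expand)
      .filter(contains(_, resource))
  }
}
```

Calling ClasspathScan.providers with the hadoop classpath output as the first argument and "org.apache.hadoop.conf.Configuration" as the second lists the jars or directories on that classpath that contain the class. An empty result would mean the class is not reachable through that classpath at all. A non-empty result only shows that the jar exists on disk, not that any particular JVM was started with it.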

I then added it to spark.driver.extraClassPath and spark.executor.extraClassPath in spark-defaults.conf, since org.apache.hadoop.conf.Configuration is inside hadoop-common-3.3.0.jar. I also made sure that zeppelin-env.sh contains the appropriate settings:

export HADOOP_HOME=/opt/hadoop
export HADOOP_CONF_DIR=/opt/hadoop/etc/hadoop/
export USE_HADOOP=true
export SPARK_HOME=/opt/spark

But for some reason I cannot get Zeppelin to deploy in YARN cluster mode (if I use spark-submit myself, I can).
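Since spark-submit works while Zeppelin does not, it can help to check, JVM by JVM, whether org.apache.hadoop.conf.Configuration is loadable and where it is loaded from. The sketch below is an illustration only: ClassCheck and locate are hypothetical names, and the snippet can only test the JVM it runs in (for example a %spark paragraph in client mode, or a Scala REPL started with the classpath under test). Note, as an inference from the log rather than a confirmed diagnosis, that the failing frames are in RemoteInterpreterManagedProcess.start and are logged next to NotebookServer.java on the same thread, which suggests the class may be missing from the Zeppelin server process rather than from the Spark driver or executors. If so, spark.driver.extraClassPath and spark.executor.extraClassPath would not affect it.

```scala
// Hypothetical helper: reports whether a class is loadable in the current JVM
// and, if so, which jar or directory it was loaded from.
object ClassCheck {

  /** Returns Some(location) if className loads in this JVM, None otherwise. */
  def locate(className: String): Option[String] =
    try {
      // initialize = false: load the class without running its static initializers
      val cls = Class.forName(className, false, Thread.currentThread().getContextClassLoader)
      // Classes from the bootstrap loader have no code source, so guard against null
      val source = Option(cls.getProtectionDomain.getCodeSource)
        .flatMap(cs => Option(cs.getLocation))
        .map(_.toString)
      Some(source.getOrElse("<bootstrap or unknown source>"))
    } catch {
      case _: ClassNotFoundException | _: NoClassDefFoundError => None
    }
}
```

For example, println(ClassCheck.locate("org.apache.hadoop.conf.Configuration")) prints None when the class is not visible to that JVM, and otherwise the location it was loaded from.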

Note: I disabled zeppelin.spark.enableSupportedVersionCheck because I am using Spark 3.1.2, while Zeppelin officially supports Spark 3.0.

Tags: apache-spark, hadoop, hdfs, hadoop-yarn, apache-zeppelin

Solution

