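apache-spark - Zeppelin 0.9 runs Spark notebooks in YARN client mode, but not in YARN cluster mode
Question
I have just set up Zeppelin (version 0.9) to run with Hadoop (3.3.0) and Spark (3.1.2). I am trying to run the sample code given on the Zeppelin website: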
val bankText = sc.textFile("file:///path/to/bank/bank-full.csv")
case class Bank(age:Integer, job:String, marital : String, education : String, balance : Integer)
// split each line, filter out header (starts with "age"), and map it into Bank case class
val bank = bankText.map(s=>s.split(";")).filter(s=>s(0)!="\"age\"").map(
s=>Bank(s(0).toInt,
s(1).replaceAll("\"", ""),
s(2).replaceAll("\"", ""),
s(3).replaceAll("\"", ""),
s(5).replaceAll("\"", "").toInt
)
)
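// convert to DataFrame and register it as a temporary view
// (registerTempTable is deprecated since Spark 2.0; createOrReplaceTempView is its replacement)
bank.toDF().createOrReplaceTempView("bank")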
Then:
%sql select age, count(1) from bank where age < 30 group by age order by age
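For reference, the parsing step of the sample can be tried without a SparkContext. The following is only a minimal sketch: the object name BankParsing, the helper names, and the two sample rows are hypothetical and shortened, chosen to mirror the column layout the notebook code assumes (field 0 is age, field 5 is balance).

```scala
// Plain-Scala sketch of the parsing step; no Spark required.
case class Bank(age: Int, job: String, marital: String, education: String, balance: Int)

object BankParsing {
  // Strip the double quotes that wrap string fields in the CSV.
  private def unquote(s: String): String = s.replaceAll("\"", "")

  // Returns None for the header row, Some(Bank) for a data row.
  def parseLine(line: String): Option[Bank] = {
    val f = line.split(";")
    if (f(0) == "\"age\"") None
    else Some(Bank(f(0).toInt, unquote(f(1)), unquote(f(2)), unquote(f(3)), unquote(f(5)).toInt))
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical, shortened rows in the layout the notebook code assumes.
    val header = "\"age\";\"job\";\"marital\";\"education\";\"default\";\"balance\""
    val row = "58;\"management\";\"married\";\"tertiary\";\"no\";2143"
    println(parseLine(header))
    println(parseLine(row))
  }
}
```

Since this part behaves the same in both deploy modes, the failure described below is unlikely to come from the notebook code itself.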
When I run it in YARN client mode, it works and produces the expected output. But when I try to run it in YARN cluster mode, I get the following error:
INFO [2021-07-23 06:05:49,623] ({SchedulerFactory6} ProcessLauncher.java[transition]:109) - Process state is transitioned to LAUNCHED
INFO [2021-07-23 06:05:49,624] ({SchedulerFactory6} ProcessLauncher.java[launch]:96) - Process is launched: [/opt/zeppelin/bin/interpreter.sh, -d, /opt/zeppelin/interpreter/spark, -c, 10.0.15.21, -p, 37921, -r, :, -i, spark-shared_process, -l, /opt/zeppelin/local-repo/spark, -g, spark]
WARN [2021-07-23 06:05:59,666] ({Exec Default Executor} RemoteInterpreterManagedProcess.java[onProcessComplete]:255) - Process is exited with exit value 0
INFO [2021-07-23 06:05:59,667] ({Exec Default Executor} ProcessLauncher.java[transition]:109) - Process state is transitioned to COMPLETED
INFO [2021-07-23 06:06:06,192] ({pool-7-thread-5} RemoteInterpreterEventServer.java[registerInterpreterProcess]:176) - Register interpreter process: 10.0.15.21:43289, interpreterGroup: spark-shared_process
INFO [2021-07-23 06:06:06,193] ({pool-7-thread-5} ProcessLauncher.java[transition]:109) - Process state is transitioned to RUNNING
INFO [2021-07-23 06:06:06,193] ({SchedulerFactory6} RemoteInterpreterManagedProcess.java[start]:132) - Detected yarn app: application_1627000352599_0011, add it to YarnAppMonitor
ERROR [2021-07-23 06:06:06,194] ({SchedulerFactory6} Job.java[run]:174) - Job failed
java.lang.NoClassDefFoundError: org/apache/hadoop/conf/Configuration
at org.apache.zeppelin.interpreter.remote.RemoteInterpreterManagedProcess.start(RemoteInterpreterManagedProcess.java:133)
at org.apache.zeppelin.interpreter.ManagedInterpreterGroup.getOrCreateInterpreterProcess(ManagedInterpreterGroup.java:68)
at org.apache.zeppelin.interpreter.remote.RemoteInterpreter.getOrCreateInterpreterProcess(RemoteInterpreter.java:104)
at org.apache.zeppelin.interpreter.remote.RemoteInterpreter.internal_create(RemoteInterpreter.java:154)
at org.apache.zeppelin.interpreter.remote.RemoteInterpreter.open(RemoteInterpreter.java:126)
at org.apache.zeppelin.interpreter.remote.RemoteInterpreter.getFormType(RemoteInterpreter.java:271)
at org.apache.zeppelin.notebook.Paragraph.jobRun(Paragraph.java:444)
at org.apache.zeppelin.notebook.Paragraph.jobRun(Paragraph.java:72)
at org.apache.zeppelin.scheduler.Job.run(Job.java:172)
at org.apache.zeppelin.scheduler.AbstractScheduler.runJob(AbstractScheduler.java:132)
at org.apache.zeppelin.scheduler.RemoteScheduler$JobRunner.run(RemoteScheduler.java:182)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.conf.Configuration
at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352)
at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
... 18 more
ERROR [2021-07-23 06:06:06,195] ({SchedulerFactory6} NotebookServer.java[onStatusChange]:1920) - Error
java.lang.NoClassDefFoundError: org/apache/hadoop/conf/Configuration
at org.apache.zeppelin.interpreter.remote.RemoteInterpreterManagedProcess.start(RemoteInterpreterManagedProcess.java:133)
at org.apache.zeppelin.interpreter.ManagedInterpreterGroup.getOrCreateInterpreterProcess(ManagedInterpreterGroup.java:68)
at org.apache.zeppelin.interpreter.remote.RemoteInterpreter.getOrCreateInterpreterProcess(RemoteInterpreter.java:104)
at org.apache.zeppelin.interpreter.remote.RemoteInterpreter.internal_create(RemoteInterpreter.java:154)
at org.apache.zeppelin.interpreter.remote.RemoteInterpreter.open(RemoteInterpreter.java:126)
at org.apache.zeppelin.interpreter.remote.RemoteInterpreter.getFormType(RemoteInterpreter.java:271)
at org.apache.zeppelin.notebook.Paragraph.jobRun(Paragraph.java:444)
at org.apache.zeppelin.notebook.Paragraph.jobRun(Paragraph.java:72)
at org.apache.zeppelin.scheduler.Job.run(Job.java:172)
at org.apache.zeppelin.scheduler.AbstractScheduler.runJob(AbstractScheduler.java:132)
at org.apache.zeppelin.scheduler.RemoteScheduler$JobRunner.run(RemoteScheduler.java:182)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.conf.Configuration
at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352)
at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
I am not sure what to do. I tried getting the Hadoop classpath by running hadoop classpath, which returned:
/opt/hadoop/etc/hadoop:/opt/hadoop/share/hadoop/common/lib/*:/opt/hadoop/share/hadoop/common/*:/opt/hadoop/share/hadoop/hdfs:/opt/hadoop/share/hadoop/hdfs/lib/*:/opt/hadoop/share/hadoop/hdfs/*:/opt/hadoop/share/hadoop/mapreduce/*:/opt/hadoop/share/hadoop/yarn:/opt/hadoop/share/hadoop/yarn/lib/*:/opt/hadoop/share/hadoop/yarn/*
I then added it to spark-defaults.conf as spark.driver.extraClassPath and spark.executor.extraClassPath, since org.apache.hadoop.conf.Configuration is inside hadoop-common-3.3.0.jar. I also made sure that zeppelin-env.sh contains the appropriate settings:
export HADOOP_HOME=/opt/hadoop
export HADOOP_CONF_DIR=/opt/hadoop/etc/hadoop/
export USE_HADOOP=true
export SPARK_HOME=/opt/spark
But for some reason I cannot get Zeppelin to deploy in YARN cluster mode (I can if I use spark-submit myself).
Note: I disabled zeppelin.spark.enableSupportedVersionCheck because I am using Spark 3.1.2, while Zeppelin officially supports Spark 3.0.
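The stack trace is raised in RemoteInterpreterManagedProcess.start and reported by NotebookServer, which suggests the class is missing from the classpath of the JVM that launches the interpreter rather than from the Spark driver or executors. A small check like the one below can show whether a given JVM sees the class and which jar it comes from. This is a sketch only: the object name ClasspathCheck and the method locate are hypothetical, and the code has to run on the same classpath as the process being investigated to be meaningful.

```scala
import scala.util.{Failure, Success, Try}

object ClasspathCheck {
  // Returns the location a class was loaded from, or a message if it cannot be found.
  def locate(className: String): String =
    Try(Class.forName(className)) match {
      case Success(cls) =>
        // CodeSource is null for classes loaded by the bootstrap class loader.
        Option(cls.getProtectionDomain.getCodeSource)
          .map(_.getLocation.toString)
          .getOrElse("bootstrap class loader")
      case Failure(e) => s"not found: ${e.getClass.getSimpleName}"
    }

  def main(args: Array[String]): Unit = {
    // The class named in the NoClassDefFoundError above.
    println(locate("org.apache.hadoop.conf.Configuration"))
  }
}
```

If the class resolves in a Spark paragraph in client mode but the error above persists in cluster mode, that would be consistent with spark.driver.extraClassPath and spark.executor.extraClassPath not affecting the process that throws the exception.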
Solution