apache-spark - Connecting to a Spark cluster from a local Jupyter notebook
Problem description
I am trying to connect to a remote Spark master from a notebook on my local machine.
When I try to create the SparkContext with
import pyspark

sc = pyspark.SparkContext(master="spark://remote-spark-master-hostname:7077",
                          appName="jupyter notebook_test")
I get the following exception:
/opt/.venv/lib/python3.7/site-packages/pyspark/context.py in __init__(self, master, appName, sparkHome, pyFiles, environment, batchSize, serializer, conf, gateway, jsc, profiler_cls)
134 try:
135 self._do_init(master, appName, sparkHome, pyFiles, environment, batchSize, serializer,
--> 136 conf, jsc, profiler_cls)
137 except:
138 # If an error occurs, clean up in order to allow future SparkContext creation:
/opt/.venv/lib/python3.7/site-packages/pyspark/context.py in _do_init(self, master, appName, sparkHome, pyFiles, environment, batchSize, serializer, conf, jsc, profiler_cls)
196
197 # Create the Java SparkContext through Py4J
--> 198 self._jsc = jsc or self._initialize_context(self._conf._jconf)
199 # Reset the SparkConf to the one actually used by the SparkContext in JVM.
200 self._conf = SparkConf(_jconf=self._jsc.sc().conf())
/opt/.venv/lib/python3.7/site-packages/pyspark/context.py in _initialize_context(self, jconf)
304 Initialize SparkContext in function to allow subclass specific initialization
305 """
--> 306 return self._jvm.JavaSparkContext(jconf)
307
308 @classmethod
/opt/.venv/lib/python3.7/site-packages/py4j/java_gateway.py in __call__(self, *args)
1523 answer = self._gateway_client.send_command(command)
1524 return_value = get_return_value(
--> 1525 answer, self._gateway_client, None, self._fqn)
1526
1527 for temp_arg in temp_args:
/opt/.venv/lib/python3.7/site-packages/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
326 raise Py4JJavaError(
327 "An error occurred while calling {0}{1}{2}.\n".
--> 328 format(target_id, ".", name), value)
329 else:
330 raise Py4JError(
Py4JJavaError: An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext.
: java.lang.IllegalArgumentException: requirement failed: Can only call getServletHandlers on a running MetricsSystem
at scala.Predef$.require(Predef.scala:224)
at org.apache.spark.metrics.MetricsSystem.getServletHandlers(MetricsSystem.scala:91)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:516)
at org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:58)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:247)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:238)
at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)
at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:745)
At the same time, I can create a SparkContext with the same interpreter in interactive mode.
What should I do to connect to the remote Spark master from my local Jupyter notebook?
Solution
I solved my problem using the advice from @HristoIliev. In my case, PYSPARK_PYTHON
was not set in the Jupyter environment. The simple solution:
import os

# Python executable Spark should use (the same venv the notebook kernel runs in)
os.environ["PYSPARK_PYTHON"] = '/opt/.venv/bin/python'
# Location of the local Spark installation
os.environ["SPARK_HOME"] = '/opt/spark'
You could also use findspark to do this, but I haven't tested it.
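
For completeness, an untested sketch of the findspark variant: findspark.init() locates the Spark installation (here given the same SPARK_HOME path as above) and adds pyspark to sys.path before it is imported:

import findspark
findspark.init('/opt/spark')  # make pyspark from this Spark installation importable

import pyspark
sc = pyspark.SparkContext(master="spark://remote-spark-master-hostname:7077",
                          appName="jupyter notebook_test")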