python - Databricks Connect & PyCharm & remote SSH connection
Question
Hey StackOverflowers!
I have run into a problem.
I have set up PyCharm to connect to an (Azure) VM via an SSH connection.
I set up the deployment mapping.
I created a conda environment by opening a terminal in the VM, then installed and configured databricks-connect. I tested it in the terminal and it works fine.
But when I try to start a Spark session (spark = SparkSession.builder.getOrCreate()), databricks-connect searches for the .databricks-connect file in the wrong folder and gives me the following error:
Caused by: java.lang.RuntimeException: Config file /root/.databricks-connect not found. Please run `databricks-connect configure` to accept the end user license agreement and configure Databricks Connect. A copy of the EULA is provided below: Copyright (2018) Databricks, Inc.
And the full error, plus some warnings:
20/07/10 17:23:05 WARN Utils: Your hostname, george resolves to a loopback address: 127.0.0.1; using 10.0.0.4 instead (on interface eth0)
20/07/10 17:23:05 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
20/07/10 17:23:05 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Traceback (most recent call last):
File "/anaconda/envs/py37/lib/python3.7/site-packages/IPython/core/interactiveshell.py", line 3331, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "<ipython-input-2-23fe18298795>", line 1, in <module>
runfile('/home/azureuser/code/model/check_vm.py')
File "/home/azureuser/.pycharm_helpers/pydev/_pydev_bundle/pydev_umd.py", line 197, in runfile
pydev_imports.execfile(filename, global_vars, local_vars) # execute the script
File "/home/azureuser/.pycharm_helpers/pydev/_pydev_imps/_pydev_execfile.py", line 18, in execfile
exec(compile(contents+"\n", file, 'exec'), glob, loc)
File "/home/azureuser/code/model/check_vm.py", line 13, in <module>
spark = SparkSession.builder.getOrCreate()
File "/anaconda/envs/py37/lib/python3.7/site-packages/pyspark/sql/session.py", line 185, in getOrCreate
sc = SparkContext.getOrCreate(sparkConf)
File "/anaconda/envs/py37/lib/python3.7/site-packages/pyspark/context.py", line 373, in getOrCreate
SparkContext(conf=conf or SparkConf())
File "/anaconda/envs/py37/lib/python3.7/site-packages/pyspark/context.py", line 137, in __init__
conf, jsc, profiler_cls)
File "/anaconda/envs/py37/lib/python3.7/site-packages/pyspark/context.py", line 199, in _do_init
self._jsc = jsc or self._initialize_context(self._conf._jconf)
File "/anaconda/envs/py37/lib/python3.7/site-packages/pyspark/context.py", line 312, in _initialize_context
return self._jvm.JavaSparkContext(jconf)
File "/anaconda/envs/py37/lib/python3.7/site-packages/py4j/java_gateway.py", line 1525, in __call__
answer, self._gateway_client, None, self._fqn)
File "/anaconda/envs/py37/lib/python3.7/site-packages/py4j/protocol.py", line 328, in get_return_value
format(target_id, ".", name), value)
py4j.protocol.Py4JJavaError: An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext.
: java.lang.ExceptionInInitializerError
at org.apache.spark.SparkContext.<init>(SparkContext.scala:99)
at org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:61)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:247)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:380)
at py4j.Gateway.invoke(Gateway.java:250)
at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)
at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)
at py4j.GatewayConnection.run(GatewayConnection.java:251)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.RuntimeException: Config file /root/.databricks-connect not found. Please run `databricks-connect configure` to accept the end user license agreement and configure Databricks Connect. A copy of the EULA is provided below: Copyright (2018) Databricks, Inc.
This library (the "Software") may not be used except in connection with the Licensee's use of the Databricks Platform Services pursuant to an Agreement (defined below) between Licensee (defined below) and Databricks, Inc. ("Databricks"). This Software shall be deemed part of the “Subscription Services” under the Agreement, or if the Agreement does not define Subscription Services, then the term in such Agreement that refers to the applicable Databricks Platform Services (as defined below) shall be substituted herein for “Subscription Services.” Licensee's use of the Software must comply at all times with any restrictions applicable to the Subscription Services, generally, and must be used in accordance with any applicable documentation. If you have not agreed to an Agreement or otherwise do not agree to these terms, you may not use the Software. This license terminates automatically upon the termination of the Agreement or Licensee's breach of these terms.
Agreement: the agreement between Databricks and Licensee governing the use of the Databricks Platform Services, which shall be, with respect to Databricks, the Databricks Terms of Service located at www.databricks.com/termsofservice, and with respect to Databricks Community Edition, the Community Edition Terms of Service located at www.databricks.com/ce-termsofuse, in each case unless Licensee has entered into a separate written agreement with Databricks governing the use of the applicable Databricks Platform Services. Databricks Platform Services: the Databricks services or the Databricks Community Edition services, according to where the Software is used.
Licensee: the user of the Software, or, if the Software is being used on behalf of a company, the company.
To accept this agreement and start using Databricks Connect, run `databricks-connect configure` in a shell.
at com.databricks.spark.util.DatabricksConnectConf$.checkEula(DatabricksConnectConf.scala:41)
at org.apache.spark.SparkContext$.<init>(SparkContext.scala:2679)
at org.apache.spark.SparkContext$.<clinit>(SparkContext.scala)
... 13 more
However, I don't have access to that folder, so I can't put the databricks-connect file there.
Also strange: if I go PyCharm -> ssh terminal -> activate conda env -> python and run the following, it works fine.
Is there a way to:
1. Point out to Java where the databricks-connect file is
2. Configure databricks-connect another way, through the script or environment variables inside PyCharm
3. Some other way?
Or am I missing something?
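Regarding option 2 above: the legacy Databricks Connect client also documents environment-variable equivalents of the config file, which can be set in the script or in a PyCharm run configuration. A minimal sketch, assuming the variable names from the Databricks Connect docs (all values here are placeholders, not real credentials):

```python
import os

# Environment-variable alternative to ~/.databricks-connect
# (legacy Databricks Connect client). Values are placeholders.
os.environ["DATABRICKS_ADDRESS"] = "https://adb-1234567890123456.7.azuredatabricks.net"
os.environ["DATABRICKS_API_TOKEN"] = "dapi0000000000000000"   # placeholder token
os.environ["DATABRICKS_CLUSTER_ID"] = "0921-001415-jelly628"  # example ID from the wizard
os.environ["DATABRICKS_ORG_ID"] = "0"                         # Azure only, ?o= in the URL
os.environ["DATABRICKS_PORT"] = "15001"

# SparkSession.builder.getOrCreate() would then pick these up,
# without needing /root/.databricks-connect to exist.
```

Setting these in PyCharm's Run > Edit Configurations has the advantage of keeping the token out of the script itself.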
Solution
From the error I can see that, first, you need to accept the Databricks terms and conditions, and second, follow these instructions from the Databricks docs for the PyCharm IDE:
CLI
Run:
databricks-connect configure
The license displays:
Copyright (2018) Databricks, Inc.
This library (the "Software") may not be used except in connection with the Licensee's use of the Databricks Platform Services...
Accept the license and supply the configuration values.
Do you accept the above agreement? [y/N] y
Set new config values (leave input empty to accept default):
Databricks Host [no current value, must start with https://]:
Databricks Token [no current value]:
Cluster ID (e.g., 0921-001415-jelly628) [no current value]:
Org ID (Azure-only, see ?o=orgId in URL) [0]:
Port [15001]:
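On a successful run, the wizard stores those answers in a small JSON file named `.databricks-connect` in the home directory of the user who ran it, which is why the error above looked in /root when Spark ran as root. If the interactive wizard cannot be run as the right user, a sketch of writing an equivalent file by hand (all values are placeholders):

```python
import json
from pathlib import Path

# Placeholder values -- replace with your workspace's details.
config = {
    "host": "https://adb-1234567890123456.7.azuredatabricks.net",  # hypothetical URL
    "token": "dapi0000000000000000",       # personal access token (placeholder)
    "cluster_id": "0921-001415-jelly628",  # example ID from the wizard prompt
    "org_id": "0",                         # Azure org ID (?o= in the URL)
    "port": "15001",
}

# databricks-connect resolves this file against the *current user's* home,
# so run this as the same user that will run the Spark script.
path = Path.home() / ".databricks-connect"
path.write_text(json.dumps(config, indent=2))
print(f"wrote {path}")
```

Running it as azureuser (rather than root) makes the file land where the client in your conda environment actually looks.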
The Databricks Connect configuration script automatically adds the package to your project configuration.
Python 3 clusters: Go to Run > Edit Configurations.
Add PYSPARK_PYTHON=python3 as an environment variable.
Python 3 cluster configuration
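To confirm the PyCharm run configuration is actually applied on the remote side, a quick hypothetical sanity-check script (stdlib only) can be run through the same configuration before retrying the Spark session:

```python
import os
import sys

# Prints which Python interpreter the run configuration uses and whether
# PYSPARK_PYTHON was picked up from Run > Edit Configurations.
print("interpreter:", sys.executable)
print("PYSPARK_PYTHON:", os.environ.get("PYSPARK_PYTHON", "<not set>"))
print("HOME:", os.environ.get("HOME", "<not set>"))  # where .databricks-connect is searched
```

If HOME prints /root here, Spark will keep looking for /root/.databricks-connect regardless of where you configured the client.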