pyspark - PySpark error on Dataproc when creating a DataFrame with a schema
Problem
I have a Dataproc cluster with Anaconda. I created a virtual environment, my-env, inside Anaconda because I need to install the open-source RDKit there, and for that reason I installed PySpark again inside my-env (instead of using the pre-installed one). Now, running the code below, I get an error inside my-env but not outside of it.
Code:
from pyspark.sql.types import StructField, StructType, StringType, LongType
from pyspark.sql import SparkSession
from py4j.protocol import Py4JJavaError

spark = SparkSession.builder.appName("test").getOrCreate()

fields = [StructField("col0", StringType(), True),
          StructField("col1", StringType(), True),
          StructField("col2", StringType(), True),
          StructField("col3", StringType(), True)]
schema = StructType(fields)
chem_info = spark.createDataFrame([], schema)
This is the error I get:
File "/home/.conda/envs/my-env/lib/python3.6/site-packages/pyspark/sql/session.py", line 749, in createDataFrame
    jrdd = self._jvm.SerDeUtil.toJavaArray(rdd._to_java_object_rdd())
File "/home/.conda/envs/my-env/lib/python3.6/site-packages/pyspark/rdd.py", line 2297, in _to_java_object_rdd
    rdd = self._pickled()
File "/home/.conda/envs/my-env/lib/python3.6/site-packages/pyspark/rdd.py", line 196, in _pickled
    return self._reserialize(AutoBatchedSerializer(PickleSerializer()))
File "/home/.conda/envs/my-env/lib/python3.6/site-packages/pyspark/rdd.py", line 594, in _reserialize
    self = self.map(lambda x: x, preservesPartitioning=True)
File "/home/.conda/envs/my-env/lib/python3.6/site-packages/pyspark/rdd.py", line 325, in map
    return self.mapPartitionsWithIndex(func, preservesPartitioning)
File "/home/.conda/envs/my-env/lib/python3.6/site-packages/pyspark/rdd.py", line 365, in mapPartitionsWithIndex
    return PipelinedRDD(self, f, preservesPartitioning)
File "/home/.conda/envs/my-env/lib/python3.6/site-packages/pyspark/rdd.py", line 2514, in __init__
    self.is_barrier = prev._is_barrier() or isFromBarrier
File "/home/.conda/envs/my-env/lib/python3.6/site-packages/pyspark/rdd.py", line 2414, in _is_barrier
    return self._jrdd.rdd().isBarrier()
File "/home/.conda/envs/my-env/lib/python3.6/site-packages/py4j/java_gateway.py", line 1257, in __call__
    answer, self.gateway_client, self.target_id, self.name)
File "/home/.conda/envs/my-env/lib/python3.6/site-packages/pyspark/sql/utils.py", line 63, in deco
    return f(*a, **kw)
File "/home/.conda/envs/my-env/lib/python3.6/site-packages/py4j/protocol.py", line 332, in get_return_value
    format(target_id, ".", name, value))
py4j.protocol.Py4JError: An error occurred while calling o57.isBarrier. Trace:
py4j.Py4JException: Method isBarrier([]) does not exist
    at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318)
    at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:326)
    at py4j.Gateway.invoke(Gateway.java:274)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:238)
    at java.lang.Thread.run(Thread.java:748)
Can you help me fix this?
Solution
As described in the question pyspark: Method isBarrier([]) does not exist, this error is caused by an incompatibility between the Spark version installed on the Dataproc cluster and the PySpark you installed manually into the conda environment.
To resolve it, check the Spark version on the cluster and install the matching PySpark version:
$ spark-submit --version
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 2.4.4
/_/
Using Scala version 2.12.10, OpenJDK 64-Bit Server VM, 1.8.0_232
$ conda install pyspark==2.4.4
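To catch this kind of mismatch before it surfaces as an opaque Py4J error, you can compare the two versions yourself. Below is a minimal sketch; the helper `versions_compatible` is hypothetical (not part of PySpark) and simply checks that the client and cluster agree on the major.minor component, which is the level at which the pyspark client and the Spark JVM need to match:

```python
def versions_compatible(client_version: str, cluster_version: str) -> bool:
    """Return True if the two version strings share the same major.minor prefix."""
    # Compare only the first two components, e.g. "2.4" from "2.4.4".
    return client_version.split(".")[:2] == cluster_version.split(".")[:2]

# isBarrier() was added to Spark's RDD API in 2.4, so a 2.4.x pyspark client
# talking to an older Spark JVM is exactly the kind of pairing that raises
# "Method isBarrier([]) does not exist":
print(versions_compatible("2.4.4", "2.3.2"))  # mismatched major.minor
print(versions_compatible("2.4.4", "2.4.4"))  # matching versions

# In a live session you could compare the two sides directly (sketch):
#   import pyspark
#   spark = pyspark.sql.SparkSession.builder.getOrCreate()
#   assert versions_compatible(pyspark.__version__, spark.version)
```

Here `pyspark.__version__` reflects the package installed in the conda environment, while `spark.version` is reported by the JVM, so the assertion fails exactly when the two installations have drifted apart.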