Py4JJavaError when testing Pyspark in a Jupyter notebook on a single machine

Problem description

I am new to Spark and recently installed it on my Mac (whose system Python is 2.7) using Homebrew:

brew install apache-spark

Then I installed Pyspark with pip3 inside a virtual environment that has Python 3.6:

/Users/xxx/venv/bin/python /Users/xxx/venv/bin/pip3 install pyspark
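
As a sanity check on that environment (the expected values are simply the ones described above), the interpreter and the PySpark build it resolves can be printed with the venv's Python:

# Confirm which interpreter and PySpark version the virtualenv resolves.
import sys
import pyspark

print(sys.executable)                    # should point into /Users/xxx/venv/
print("%d.%d" % sys.version_info[:2])    # 3.6 is expected here
print(pyspark.__version__)               # ideally matches the Homebrew Spark release (3.0.1)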

When I run the following code in a Jupyter Notebook to test whether Spark works on a single machine:

from pyspark.context import SparkContext
sc = SparkContext.getOrCreate()

import random
num_samples = 100000000

# Monte Carlo estimate of pi: the fraction of random points in the unit
# square that land inside the quarter circle approaches pi/4.
def inside(p):
    # p is the RDD element; it is ignored, only the two random draws matter
    x, y = random.random(), random.random()
    return x*x + y*y < 1

count = sc.parallelize(range(0, num_samples)).filter(inside).count()

pi = 4 * count / num_samples
print(pi)

sc.stop()

I ran into the following error at sc.parallelize:

Py4JJavaError                             Traceback (most recent call last)
<ipython-input-3-482026ac7386> in <module>
      8     return x*x + y*y < 1
      9 
---> 10 count = sc.parallelize(range(0, num_samples)).filter(inside).count()
     11 
     12 pi = 4 * count / num_samples
~/venv/deep_learning/lib/python3.6/site-packages/pyspark/rdd.py in count(self)
   1139         3
   1140         """
-> 1141         return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum()
   1142 
   1143     def stats(self):

~/venv/deep_learning/lib/python3.6/site-packages/pyspark/rdd.py in sum(self)
   1130         6.0
   1131         """
-> 1132         return self.mapPartitions(lambda x: [sum(x)]).fold(0, operator.add)
   1133 
   1134     def count(self):

~/venv/deep_learning/lib/python3.6/site-packages/pyspark/rdd.py in fold(self, zeroValue, op)
   1001         # zeroValue provided to each partition is unique from the one provided
   1002         # to the final reduce call
-> 1003         vals = self.mapPartitions(func).collect()
   1004         return reduce(op, vals, zeroValue)
   1005 

~/venv/deep_learning/lib/python3.6/site-packages/pyspark/rdd.py in collect(self)
    887         
    888         with SCCallSiteSync(self.context) as css:
-> 889             sock_info = self.ctx._jvm.PythonRDD.collectAndServe(self._jrdd.rdd())
    890         return list(_load_from_socket(sock_info, self._jrdd_deserializer))
    891 

~/venv/deep_learning/lib/python3.6/site-packages/py4j/java_gateway.py in __call__(self, *args)
   1303         answer = self.gateway_client.send_command(command)
   1304         return_value = get_return_value(
-> 1305             answer, self.gateway_client, self.target_id, self.name)
   1306 
   1307         for temp_arg in temp_args:

~/venv/deep_learning/lib/python3.6/site-packages/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
    326                 raise Py4JJavaError(
    327                     "An error occurred while calling {0}{1}{2}.\n".
-> 328                     format(target_id, ".", name), value)
    329             else:
    330                 raise Py4JError(

Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 2 in stage 0.0 failed 1 times, most recent failure: Lost task 2.0 in stage 0.0 (TID 2, 192.168.0.15, executor driver): org.apache.spark.api.python.PythonException: Traceback (most recent call last):

  File "/Users/xxx/venv/deep_learning/lib/python3.6/site-packages/pyspark/python/lib/pyspark.zip/pyspark/worker.py", line 477, in main
    ("%d.%d" % sys.version_info[:2], version))

Exception: Python in worker has different version 2.7 than that in driver 3.6, PySpark cannot run with different minor versions. Please check environment variables PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON are correctly set.

I configured the Pyspark paths in ~/.bash_profile as follows:

export SPARK_HOME=/usr/local/Cellar/apache-spark/3.0.1/libexec
export PYTHONPATH=/usr/local/Cellar/apache-spark/3.0.1/libexec/python/:$PYTHONPATH
export PYSPARK_PYTHON=/Users/xxx/venv/bin/python
export PYSPARK_DRIVER_PYTHON=/Users/xxx/venv/bin/python
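
Whether the Jupyter kernel actually sees these exports depends on how it was launched; as a quick sanity check (standard library only, nothing Spark-specific), the values can be printed from inside the notebook:

# Check whether the notebook kernel actually picked up the ~/.bash_profile exports.
import os, sys

print(sys.executable)                          # the driver's Python interpreter
print(os.environ.get("SPARK_HOME"))            # None means the export was not inherited
print(os.environ.get("PYSPARK_PYTHON"))
print(os.environ.get("PYSPARK_DRIVER_PYTHON"))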

Does anyone know what I am doing wrong here? Any suggestions would be greatly appreciated.

Tags: apache-spark, pyspark, jupyter-notebook, homebrew

Solution


This problem seems to be specific to how Pyspark is located by the notebook. It can be solved with the findspark package. Here is a quote from the findspark README:

PySpark isn't on sys.path by default, but that doesn't mean it can't be used as a regular library. You can address this by either symlinking pyspark into your site-packages, or adding pyspark to sys.path at runtime. findspark does the latter.
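
For illustration, the "adding pyspark to sys.path at runtime" route can be done by hand roughly as below; the SPARK_HOME path and the py4j zip name are assumptions taken from the Homebrew layout quoted above, and findspark simply automates this lookup:

# A rough manual sketch of what findspark automates: putting Spark's bundled
# Python packages on sys.path. The py4j zip name varies between Spark releases.
import sys, os, glob

spark_home = "/usr/local/Cellar/apache-spark/3.0.1/libexec"   # assumed install location
sys.path.insert(0, os.path.join(spark_home, "python"))
sys.path.insert(0, glob.glob(os.path.join(spark_home, "python", "lib", "py4j-*.zip"))[0])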

Adding the code below before starting the SparkContext solves the problem:

import findspark
findspark.init()
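
Putting it together in the notebook, the findspark call goes before SparkContext.getOrCreate(); if Spark is not found automatically, findspark.init() also accepts the Spark home explicitly (the commented path below is just the Homebrew location quoted earlier, used as an assumption):

import findspark
# If auto-detection fails, pass the Spark home explicitly, e.g.:
# findspark.init("/usr/local/Cellar/apache-spark/3.0.1/libexec")
findspark.init()

from pyspark.context import SparkContext
sc = SparkContext.getOrCreate()   # the rest of the pi example runs unchanged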
