Cannot run PySpark correctly in PyCharm

Problem description

I have PySpark 3.1.2 and Python 3.8.3 installed on Windows. All paths are also set correctly in my environment variables: SPARK_HOME, HADOOP_HOME, and Path. But when I try to run the code below, I still get the following error: "The system cannot find the file specified."

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder.appName('SparkByExamples.com').getOrCreate()

data2 = [("James", "abs"),
         ("Michael", "Rose")]

schema = StructType([
    StructField("firstname", StringType(), True),
    StructField("middlename", StringType(), True),
])

df = spark.createDataFrame(data=data2, schema=schema)
df.printSchema()
df.show(truncate=False)

The error is as follows.

21/09/01 14:36:19 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
java.io.IOException: Cannot run program "python3": CreateProcess error=2, The system cannot 
find the file specified
at java.base/java.lang.ProcessBuilder.start(ProcessBuilder.java:1128)
at java.base/java.lang.ProcessBuilder.start(ProcessBuilder.java:1071)
at org.apache.spark.api.python.PythonWorkerFactory.createSimpleWorker(PythonWorkerFactory.scala:165)
.....
.....
21/09/01 14:36:19 ERROR TaskSetManager: Task 0 in stage 0.0 failed 1 times; aborting job
Traceback (most recent call last):
File "C:/Users/abc123/PycharmProjects/pythonProject/test.py", line 18, in <module>
df.show(truncate=False)
File "C:\Spark\spark-3.1.2-bin-hadoop3.2\python\lib\pyspark.zip\pyspark\sql\dataframe.py", 
line 486, in show
File "C:\Spark\spark-3.1.2-bin-hadoop3.2\python\lib\py4j-0.10.9-src.zip\py4j\java_gateway.py", 
line 1304, in __call__
File "C:\Spark\spark-3.1.2-bin-hadoop3.2\python\lib\pyspark.zip\pyspark\sql\utils.py", line 
111, in deco
File "C:\Spark\spark-3.1.2-bin-hadoop3.2\python\lib\py4j-0.10.9-src.zip\py4j\protocol.py", 
line 326, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o39.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 
failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0) (B****.a*.******.com 
executor driver): java.io.IOException: Cannot run program "python3": CreateProcess error=2, 
The system cannot find the file specified
at java.base/java.lang.ProcessBuilder.start(ProcessBuilder.java:1128)
at java.base/java.lang.ProcessBuilder.start(ProcessBuilder.java:1071)
....

Everything works fine up to df.printSchema(), but as soon as I run an action such as df.show() or df.count(), the error above appears. All paths are set correctly in my environment variables, and Python itself runs fine, yet I still cannot resolve this. Please guide me on how to fix this problem.

Tags: python, windows, apache-spark, pyspark, pycharm

Solution
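The traceback shows that the executor cannot launch a program named "python3". On Windows the interpreter executable is normally python.exe, so Spark's default worker command fails. This also explains why df.printSchema() works (it runs entirely on the driver) while actions like df.show() fail (they spawn Python worker processes). A common fix is to point PySpark at the interpreter you are actually running under, by setting PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON before the SparkSession is created. A minimal sketch, assuming the PyCharm interpreter is the Python you want the workers to use:

```python
import os
import sys

# Must run before SparkSession.builder...getOrCreate() is called.
# sys.executable is the full path to the interpreter running this
# script, so Spark no longer needs to find a "python3" on PATH.
os.environ["PYSPARK_PYTHON"] = sys.executable
os.environ["PYSPARK_DRIVER_PYTHON"] = sys.executable
```

Alternatively, set PYSPARK_PYTHON as a Windows environment variable (or in the PyCharm run configuration's environment variables) pointing at your python.exe, so every run picks it up without code changes.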

