Spark error on Linux: Exception in thread "main" java.io.IOException: Cannot run program "python": error=2, No such file or directory

Problem description

I am working through Chapter 2 of Learning Spark, 2nd Edition. When I run the example mnmcount.py script, I get the following error:

21/02/08 11:40:12 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

Exception in thread "main" java.io.IOException: Cannot run program "python": error=2, No such file or directory

The command I use to run the script is:

$SPARK_HOME/bin/spark-submit mnmcount.py data/mnm_dataset.csv 

I run it from the LearningSparkV2-master/chapter2/py/src directory.

In my .bashrc file I added the following lines and then sourced the file.

SPARK_HOME="/usr/local/spark"
alias python="python3"
export JAVA_HOME="/usr/lib/jvm/java-11-openjdk-amd64"

The complete code of the mnmcount.py script is as follows.

from __future__ import print_function

import sys

from pyspark.sql import SparkSession
from pyspark.sql.functions import count

if __name__ == "__main__":
    if len(sys.argv) != 2:
        print("Usage: mnmcount <file>", file=sys.stderr)
        sys.exit(-1)

    spark = (SparkSession
        .builder
        .appName("PythonMnMCount")
        .getOrCreate())
    # get the M&M data set file name
    mnm_file = sys.argv[1]
    # read the file into a Spark DataFrame
    mnm_df = (spark.read.format("csv")
        .option("header", "true")
        .option("inferSchema", "true")
        .load(mnm_file))
    mnm_df.show(n=5, truncate=False)

    # aggregate count of all colors and groupBy state and color
    # orderBy descending order
    count_mnm_df = (mnm_df.select("State", "Color", "Count")
                    .groupBy("State", "Color")
                    .sum("Count")
                    .orderBy("sum(Count)", ascending=False))

    # show the resulting aggregation for all the states and colors
    count_mnm_df.show(n=60, truncate=False)
    print("Total Rows = %d" % (count_mnm_df.count()))

    # find the aggregate count for California by filtering
    ca_count_mnm_df = (mnm_df.select("*")
                       .where(mnm_df.State == 'CA')
                       .groupBy("State", "Color")
                       .sum("Count")
                       .orderBy("sum(Count)", ascending=False))

    # show the resulting aggregation for California
    ca_count_mnm_df.show(n=10, truncate=False)

Tags: apache-spark, pyspark, linux-mint

Solution


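After adding

export PYSPARK_PYTHON=python3

to .bashrc, the problem was solved. The alias python="python3" has no effect here: a shell alias is expanded only by the interactive shell itself and is not visible to the processes that spark-submit starts, so Spark still looked for an executable literally named python and found none.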
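To check which interpreter would be picked up before running spark-submit again, a small diagnostic along these lines may help. It is only a sketch: the function name resolve_spark_python is hypothetical, and the lookup order it uses (PYSPARK_DRIVER_PYTHON, then PYSPARK_PYTHON, then the bare name python) is an assumption based on the error above and may differ between Spark versions.

```python
import os
import shutil


def resolve_spark_python(env, which=shutil.which):
    """Return (command, full_path) for the interpreter Spark would likely launch.

    Assumed lookup order (not taken from the question): PYSPARK_DRIVER_PYTHON,
    then PYSPARK_PYTHON, then the bare name "python" seen in the error message.
    full_path is None when the command cannot be found on PATH.
    """
    command = (env.get("PYSPARK_DRIVER_PYTHON")
               or env.get("PYSPARK_PYTHON")
               or "python")
    # Shell aliases are invisible here, just as they are to spark-submit:
    # only real executables on PATH are found.
    return command, which(command, path=env.get("PATH"))


if __name__ == "__main__":
    command, full_path = resolve_spark_python(os.environ)
    if full_path is None:
        print("'%s' is not on PATH; try: export PYSPARK_PYTHON=python3" % command)
    else:
        print("Spark would launch '%s' -> %s" % (command, full_path))
```

Run it with python3 from the same terminal used for spark-submit. If it reports that python is not on PATH, the export line above (followed by sourcing .bashrc or opening a new terminal) should change the result.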

