PySpark EMR step fails with exit code 1

Problem description

I'm using AWS EMR for the first time to submit a Spark job. The script I'm using is short (restaurant.py):

from pyspark import SparkContext
from pyspark.sql import SparkSession, SQLContext


class SparkRawConsumer:

    def __init__(self):
        self.sparkContext = SparkContext.getOrCreate()
        self.sparkContext.setLogLevel("ERROR")
        self.sqlContext = SQLContext(self.sparkContext)
        self.df = self.sqlContext.read.format('com.databricks.spark.csv').options(header='true', inferschema='true').load('zomato.csv')


if __name__ == "__main__":
    sparkConsumer = SparkRawConsumer()
    print(sparkConsumer.df.count())
    sparkConsumer.df.groupBy("City").agg({"Average Cost for two": "avg", "Aggregate rating": "avg"})

I submit the step through the AWS console GUI; the CLI export of the step is:

spark-submit --deploy-mode cluster s3://data-pipeline-testing-yu-chen/dependencies/restaurant.py -files s3://data-pipeline-testing-yu-chen/dependencies/zomato.csv

However, the step runs for a few minutes and then returns exit code 1. I'm confused about what is actually happening and am finding it hard to interpret the stderr output:

18/07/28 06:40:10 INFO Client: Application report for application_1532756827478_0012 (state: RUNNING)
18/07/28 06:40:11 INFO Client: Application report for application_1532756827478_0012 (state: RUNNING)
18/07/28 06:40:12 INFO Client: Application report for application_1532756827478_0012 (state: RUNNING)
18/07/28 06:40:13 INFO Client: Application report for application_1532756827478_0012 (state: FINISHED)
18/07/28 06:40:13 INFO Client: 
     client token: N/A
     diagnostics: User application exited with status 1
     ApplicationMaster host: myip
     ApplicationMaster RPC port: 0
     queue: default
     start time: 1532759825922
     final status: FAILED
     tracking URL: http://myip.compute.internal:20888/proxy/application_1532756827478_0012/
     user: hadoop
18/07/28 06:40:13 INFO Client: Deleted staging directory hdfs://myip.compute.internal:8020/user/hadoop/.sparkStaging/application_1532756827478_0012
Exception in thread "main" org.apache.spark.SparkException: Application application_1532756827478_0012 finished with failed status
    at org.apache.spark.deploy.yarn.Client.run(Client.scala:1165)
    at org.apache.spark.deploy.yarn.YarnClusterApplication.start(Client.scala:1520)
    at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:894)
    at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:198)
    at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:228)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:137)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
18/07/28 06:40:13 INFO ShutdownHookManager: Shutdown hook called
18/07/28 06:40:13 INFO ShutdownHookManager: Deleting directory /mnt/tmp/spark-dedwd323x
18/07/28 06:40:13 INFO ShutdownHookManager: Deleting directory /mnt/tmp/spark-dedwd323x
Command exiting with ret '1'

I can run the script by SSHing into my master instance and issuing spark-submit restaurant.py. I loaded the CSV file onto my master instance using:

[hadoop@my-ip ~]$ aws s3 sync s3://data-pipeline-testing-yu-chen/dependencies/ .

I then loaded my zomato.csv file into HDFS:

hadoop fs -put zomato.csv ./zomato.csv

My guess is that the -files option I'm passing isn't being applied the way I intend, but I really don't know how to interpret the console output or where to start debugging.
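[Editor's note: a hedged sketch of one likely fix, not a confirmed diagnosis. spark-submit stops parsing its own options at the first positional argument (the application file), so an option placed after restaurant.py is handed to the script as ordinary arguments rather than parsed; the flag is also spelled with two dashes, --files. A corrected submission might look like:]

```shell
# Sketch: all spark-submit options must precede the application file;
# note --files (two dashes). Anything placed after restaurant.py becomes
# sys.argv for the script instead of being parsed by spark-submit.
spark-submit --deploy-mode cluster \
  --files s3://data-pipeline-testing-yu-chen/dependencies/zomato.csv \
  s3://data-pipeline-testing-yu-chen/dependencies/restaurant.py
```

With --files parsed correctly, YARN ships zomato.csv into each container's working directory, so the bare relative path 'zomato.csv' in the script can resolve even in cluster deploy mode.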

Tags: apache-spark, pyspark, amazon-emr

Solution
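[Editor's note: the original page carried no answer here. The following is a hedged sketch of a rewrite, assuming Spark 2.x on EMR and that zomato.csv was already pushed into HDFS under the hadoop user's home directory, as shown in the question. It reads the file through a fully qualified path, which resolves identically on every node, so the script no longer depends on the driver's working directory in --deploy-mode cluster; the built-in csv reader replaces the external com.databricks.spark.csv package, which is unnecessary in Spark 2.x.]

```python
# Hedged sketch of restaurant.py -- assumes Spark 2.x and that zomato.csv
# was uploaded to HDFS via: hadoop fs -put zomato.csv ./zomato.csv
from pyspark.sql import SparkSession

if __name__ == "__main__":
    spark = SparkSession.builder.appName("restaurant").getOrCreate()
    spark.sparkContext.setLogLevel("ERROR")

    # Qualified path: resolves the same way on every node, unlike a bare
    # relative 'zomato.csv' when the driver runs on an arbitrary core node.
    df = (spark.read
          .option("header", "true")
          .option("inferSchema", "true")
          .csv("hdfs:///user/hadoop/zomato.csv"))

    print(df.count())
    # show() forces the aggregation to actually execute; the original
    # script built the aggregated DataFrame but never materialized it.
    df.groupBy("City").agg({"Average Cost for two": "avg",
                            "Aggregate rating": "avg"}).show()
```

Alternatively, keeping the relative path and passing the file with a correctly placed --files flag (before the application file in the spark-submit command) would also work, since YARN copies --files entries into each container's working directory.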
