apache-spark - PySpark EMR step fails with exit code 1
Problem description
I am submitting my Spark job to AWS EMR for the first time. The script I am using is short (restaurant.py):
from pyspark import SparkContext, SQLContext
from pyspark.sql import SparkSession

class SparkRawConsumer:
    def __init__(self):
        self.sparkContext = SparkContext.getOrCreate()
        self.sparkContext.setLogLevel("ERROR")
        self.sqlContext = SQLContext(self.sparkContext)
        self.df = (self.sqlContext.read
                   .format('com.databricks.spark.csv')
                   .options(header='true', inferschema='true')
                   .load('zomato.csv'))

if __name__ == "__main__":
    sparkConsumer = SparkRawConsumer()
    print(sparkConsumer.df.count())
    sparkConsumer.df.groupBy("City").agg({"Average Cost for two": "avg",
                                          "Aggregate rating": "avg"})
I submit the step through the AWS console GUI; the CLI export of that step is:
spark-submit --deploy-mode cluster s3://data-pipeline-testing-yu-chen/dependencies/restaurant.py -files s3://data-pipeline-testing-yu-chen/dependencies/zomato.csv
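One thing worth checking in this export: spark-submit stops parsing its own options at the first positional argument (the application file), so anything placed after restaurant.py is handed to the script as an ordinary program argument rather than interpreted as a flag, and the documented flag is also `--files` with two dashes. A hedged sketch of the submission with the flag moved before the script (paths taken from the question):

```shell
# --files must come before the application file and use two dashes;
# as exported above, "-files ..." is delivered to restaurant.py as a
# plain argument and is never seen by spark-submit itself.
spark-submit \
  --deploy-mode cluster \
  --files s3://data-pipeline-testing-yu-chen/dependencies/zomato.csv \
  s3://data-pipeline-testing-yu-chen/dependencies/restaurant.py
```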
However, the step runs for a few minutes and then returns exit code 1. I am confused about what is actually happening, and I find the stderr output hard to interpret:
18/07/28 06:40:10 INFO Client: Application report for application_1532756827478_0012 (state: RUNNING)
18/07/28 06:40:11 INFO Client: Application report for application_1532756827478_0012 (state: RUNNING)
18/07/28 06:40:12 INFO Client: Application report for application_1532756827478_0012 (state: RUNNING)
18/07/28 06:40:13 INFO Client: Application report for application_1532756827478_0012 (state: FINISHED)
18/07/28 06:40:13 INFO Client:
client token: N/A
diagnostics: User application exited with status 1
ApplicationMaster host: myip
ApplicationMaster RPC port: 0
queue: default
start time: 1532759825922
final status: FAILED
tracking URL: http://myip.compute.internal:20888/proxy/application_1532756827478_0012/
user: hadoop
18/07/28 06:40:13 INFO Client: Deleted staging directory hdfs://myip.compute.internal:8020/user/hadoop/.sparkStaging/application_1532756827478_0012
Exception in thread "main" org.apache.spark.SparkException: Application application_1532756827478_0012 finished with failed status
at org.apache.spark.deploy.yarn.Client.run(Client.scala:1165)
at org.apache.spark.deploy.yarn.YarnClusterApplication.start(Client.scala:1520)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:894)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:198)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:228)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:137)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
18/07/28 06:40:13 INFO ShutdownHookManager: Shutdown hook called
18/07/28 06:40:13 INFO ShutdownHookManager: Deleting directory /mnt/tmp/spark-dedwd323x
18/07/28 06:40:13 INFO ShutdownHookManager: Deleting directory /mnt/tmp/spark-dedwd323x
Command exiting with ret '1'
I can run the script by SSHing into my master instance and issuing spark-submit restaurant.py. I loaded the CSV file onto my master instance using:
[hadoop@my-ip ~]$ aws s3 sync s3://data-pipeline-testing-yu-chen/dependencies/ .
Then I loaded my zomato.csv file into HDFS:
hadoop fs -put zomato.csv ./zomato.csv
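This is likely why the SSH run succeeds: a bare relative path like 'zomato.csv' in a Spark read is resolved against the submitting user's HDFS home directory, not the local working directory, so after the `-put` the local run finds the file there. A quick check, assuming the default EMR layout where the hadoop user's home is /user/hadoop:

```shell
# A relative path in the Spark job resolves against the HDFS home
# directory of the submitting user (hadoop here), so 'zomato.csv'
# in the script points at this file after the -put above:
hadoop fs -ls /user/hadoop/zomato.csv
```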
My guess is that the -files option I am passing is not being used the way I intend, but I genuinely don't know how to interpret the console output and start debugging.
Solution
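One way to sidestep the file-shipping question entirely is to read the CSV straight from S3, since EMR clusters can read s3:// paths natively. A sketch of restaurant.py rewritten that way, under the assumption of a Spark 2.x EMR cluster (bucket path taken from the question; SparkSession and the built-in CSV reader replace SQLContext and the com.databricks.spark.csv package, which modern Spark no longer needs):

```python
from pyspark.sql import SparkSession

if __name__ == "__main__":
    spark = SparkSession.builder.appName("restaurant").getOrCreate()
    spark.sparkContext.setLogLevel("ERROR")

    # Reading directly from S3 avoids both --files shipping and HDFS
    # staging; the path works the same in client and cluster deploy mode.
    df = (spark.read
          .option("header", "true")
          .option("inferSchema", "true")
          .csv("s3://data-pipeline-testing-yu-chen/dependencies/zomato.csv"))

    print(df.count())
    df.groupBy("City").agg({"Average Cost for two": "avg",
                            "Aggregate rating": "avg"}).show()
```

The same s3:// path could equally be used in the original SQLContext-based script; the essential change is fully qualifying the input path instead of relying on a relative one.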