python - Connecting with PySpark is very slow
Problem description
I am experimenting with PySpark using the following code:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("Scoring System").getOrCreate()
df = spark.read.csv('output.csv')
df.show()
After I run python trial.py from the command line, nothing progresses for about 5 to 10 minutes:
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
2019-05-05 22:58:31 WARN Utils:66 - Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
2019-05-05 22:58:32 WARN Client:66 - Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
[Stage 0:> (0 + 0) / 1]2019-05-05 23:00:08 WARN YarnScheduler:66 - Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
2019-05-05 23:00:23 WARN YarnScheduler:66 - Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
2019-05-05 23:00:38 WARN YarnScheduler:66 - Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
2019-05-05 23:00:53 WARN YarnScheduler:66 - Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
[Stage 0:> (0 + 0) / 1]2019-05-05 23:01:08 WARN YarnScheduler:66 - Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
2019-05-05 23:01:23 WARN YarnScheduler:66 - Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
2019-05-05 23:01:38 WARN YarnScheduler:66 - Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
My hunch is that my worker nodes are short on resources (?), or am I missing something?
Solution
Try increasing the number of executors and the executor memory:
pyspark --num-executors 5 --executor-memory 1G
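Since the question runs the script with python trial.py rather than through the pyspark shell, the same settings can probably be supplied on the session builder instead. The following is a minimal sketch, assuming the standard Spark properties spark.executor.instances and spark.executor.memory are the counterparts of the two flags above. The names EXECUTOR_CONF and build_session are hypothetical helpers introduced only for illustration, and the values 5 and 1g are simply copied from the answer, not tuned for any particular cluster.

```python
# Executor settings mirroring: --num-executors 5 --executor-memory 1G
# (values taken from the answer above; adjust to what the cluster can grant)
EXECUTOR_CONF = {
    "spark.executor.instances": "5",
    "spark.executor.memory": "1g",
}


def build_session(app_name="Scoring System", conf=None):
    """Create a SparkSession with explicit executor settings.

    Hypothetical helper: pyspark is imported lazily so that the
    configuration above can be inspected without Spark installed.
    """
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName(app_name)
    # Apply each property before the session is created; settings passed
    # after getOrCreate() may not affect executor allocation.
    for key, value in (conf or EXECUTOR_CONF).items():
        builder = builder.config(key, value)
    return builder.getOrCreate()
```

In trial.py, spark = build_session() would then stand in for the original builder line. If the warning persists, the requested executors may still exceed what YARN can allocate, so checking the cluster UI as the log message suggests remains worthwhile.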