python - Spark df.cache() causes org.apache.spark.memory.SparkOutOfMemoryError
Question
I've run into a problem: everything works fine, but as soon as I call df.cache(), the job fails with org.apache.spark.memory.SparkOutOfMemoryError.
Has anyone seen something similar?
Thanks!
Here is the code that triggers the problem:
df.cache()
df = df.select(
    *df.columns,
    f.greatest(*custom_columns).alias("custom_max"),
    f.least(*custom_columns).alias("custom_min"),
)
return df
Here is the error log:
[2021-04-17 08:54:35,894] {base_task_runner.py:115} INFO - Job 19: Subtask compute_outcomes [Stage 2:=======================================> (144 + 4) / 200]
[2021-04-17 08:54:54,481] {base_task_runner.py:115} INFO - Job 19: Subtask compute_outcomes [Stage 2:========================================> (148 + 4) / 200]2021-04-17 08:54:54 WARN MemoryStore:66 - Failed to reserve initial memory threshold of 1024.0 KB for computing block rdd_173_11 in memory.
[2021-04-17 08:54:54,494] {base_task_runner.py:115} INFO - Job 19: Subtask compute_outcomes 2021-04-17 08:54:54 WARN MemoryStore:66 - Not enough space to cache rdd_173_11 in memory! (computed 384.0 B so far)
[2021-04-17 08:54:54,495] {base_task_runner.py:115} INFO - Job 19: Subtask compute_outcomes 2021-04-17 08:54:54 WARN MemoryStore:66 - Failed to reserve initial memory threshold of 1024.0 KB for computing block rdd_173_17 in memory.
[2021-04-17 08:54:54,496] {base_task_runner.py:115} INFO - Job 19: Subtask compute_outcomes 2021-04-17 08:54:54 WARN MemoryStore:66 - Not enough space to cache rdd_173_17 in memory! (computed 384.0 B so far)
[2021-04-17 08:54:54,505] {base_task_runner.py:115} INFO - Job 19: Subtask compute_outcomes 2021-04-17 08:54:54 WARN BlockManager:66 - Persisting block rdd_173_11 to disk instead.
====
(many similar WARN lines omitted here)
====
[2021-04-17 08:55:13,178] {base_task_runner.py:115} INFO - Job 19: Subtask compute_outcomes 2021-04-17 08:55:13 WARN MemoryStore:66 - Not enough space to cache rdd_285_0 in memory! (computed 384.0 B so far)
[2021-04-17 08:55:13,242] {base_task_runner.py:115} INFO - Job 19: Subtask compute_outcomes 2021-04-17 08:55:13 ERROR Executor:91 - Exception in task 4.0 in stage 2.0 (TID 154)
[2021-04-17 08:55:13,242] {base_task_runner.py:115} INFO - Job 19: Subtask compute_outcomes org.apache.spark.memory.SparkOutOfMemoryError: Unable to acquire 65536 bytes of memory, got 0
[2021-04-17 08:55:13,242] {base_task_runner.py:115} INFO - Job 19: Subtask compute_outcomes at org.apache.spark.memory.MemoryConsumer.throwOom(MemoryConsumer.java:157)
[2021-04-17 08:55:13,242] {base_task_runner.py:115} INFO - Job 19: Subtask compute_outcomes at org.apache.spark.memory.MemoryConsumer.allocateArray(MemoryConsumer.java:98)
[2021-04-17 08:55:13,242] {base_task_runner.py:115} INFO - Job 19: Subtask compute_outcomes at org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter.<init>(UnsafeInMemorySorter.java:128)
[2021-04-17 08:55:13,242] {base_task_runner.py:115} INFO - Job 19: Subtask compute_outcomes at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.<init>(UnsafeExternalSorter.java:161)
[2021-04-17 08:55:13,242] {base_task_runner.py:115} INFO - Job 19: Subtask compute_outcomes at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.create(UnsafeExternalSorter.java:128)
[2021-04-17 08:55:13,243] {base_task_runner.py:115} INFO - Job 19: Subtask compute_outcomes at org.apache.spark.sql.execution.UnsafeExternalRowSorter.<init>(UnsafeExternalRowSorter.java:108)
[2021-04-17 08:55:13,243] {base_task_runner.py:115} INFO - Job 19: Subtask compute_outcomes at org.apache.spark.sql.execution.UnsafeExternalRowSorter.create(UnsafeExternalRowSorter.java:93)
[2021-04-17 08:55:13,243] {base_task_runner.py:115} INFO - Job 19: Subtask compute_outcomes at org.apache.spark.sql.execution.SortExec.createSorter(SortExec.scala:87)
[2021-04-17 08:55:13,243] {base_task_runner.py:115} INFO - Job 19: Subtask compute_outcomes at org.apache.spark.sql.execution.SortExec$$anonfun$1.apply(SortExec.scala:102)
[2021-04-17 08:55:13,243] {base_task_runner.py:115} INFO - Job 19: Subtask compute_outcomes at org.apache.spark.sql.execution.SortExec$$anonfun$1.apply(SortExec.scala:101)
[2021-04-17 08:55:13,243] {base_task_runner.py:115} INFO - Job 19: Subtask compute_outcomes at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:836)
[2021-04-17 08:55:13,243] {base_task_runner.py:115} INFO - Job 19: Subtask compute_outcomes at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:836)
[2021-04-17 08:55:13,243] {base_task_runner.py:115} INFO - Job 19: Subtask compute_outcomes at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:49)
[2021-04-17 08:55:13,243] {base_task_runner.py:115} INFO - Job 19: Subtask compute_outcomes at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
[2021-04-17 08:55:13,243] {base_task_runner.py:115} INFO - Job 19: Subtask compute_outcomes at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
[2021-04-17 08:55:13,244] {base_task_runner.py:115} INFO - Job 19: Subtask compute_outcomes at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
[2021-04-17 08:55:13,244] {base_task_runner.py:115} INFO - Job 19: Subtask compute_outcomes at org.apache.spark.scheduler.Task.run(Task.scala:109)
[2021-04-17 08:55:13,244] {base_task_runner.py:115} INFO - Job 19: Subtask compute_outcomes at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
[2021-04-17 08:55:13,244] {base_task_runner.py:115} INFO - Job 19: Subtask compute_outcomes at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
[2021-04-17 08:55:13,244] {base_task_runner.py:115} INFO - Job 19: Subtask compute_outcomes at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
[2021-04-17 08:55:13,244] {base_task_runner.py:115} INFO - Job 19: Subtask compute_outcomes at java.lang.Thread.run(Thread.java:748)
Solution
df.cache() stores your entire DataFrame in the executors' memory. This error means your executors don't have enough memory; you need to give them more with --executor-memory.
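A minimal sketch of that fix, assuming the job is launched with spark-submit (the memory value and the script name your_job.py are placeholders to size for your own cluster and data):

```shell
# Raise per-executor memory so the cached DataFrame and the sort's
# execution memory both fit (8g is an example value, not a recommendation).
spark-submit \
  --executor-memory 8g \
  your_job.py
```

The same setting can also be passed as --conf spark.executor.memory=8g, or set on the SparkSession builder before the session is created.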