Optimizing an EMR job with high load and CPU utilization

Problem description

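I want to optimize an EMR job. I checked the Ganglia report (attached), and it shows high CPU utilization. Can anyone recommend ways to optimize it? The job does the following: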

  1. The code performs multiple joins (6 joins: some sort-merge, some broadcast)
  2. It writes the result to S3

Spark parameters:

    from pyspark import SparkConf

    # conf is assumed to be a plain SparkConf; its creation was not shown in the original post
    conf = SparkConf()
    conf.set("spark.pyspark.python","python3")
    conf.set("spark.executor.memory","18G")
    conf.set("spark.driver.memory","18G")
    conf.set("spark.executor.cores","5")
    conf.set("spark.num.executors","209")
    conf.set("spark.driver.maxResultSize","2G")
    conf.set("spark.yarn.executor.memoryOverhead","2G")
    conf.set("spark.yarn.driver.memoryOverhead","2G")
    conf.set("spark.serializer","org.apache.spark.serializer.KryoSerializer")
    conf.set("spark.memory.storageFraction","0.30")
    conf.set("spark.yarn.scheduler.reporterThread.maxFailures","5")
    conf.set("spark.storage.level","MEMORY_AND_DISK_SER")
    conf.set("spark.rdd.compress","true")
    conf.set("spark.shuffle.compress","true")
    conf.set("spark.shuffle.spill.compress","true")
    conf.set("spark.default.parallelism","2100")
    conf.set("spark.sql.shuffle.partitions","2100")
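
As a rough sanity check on the settings above, the sketch below works out how many task slots and how much executor memory they request. It assumes all 209 executors are actually granted. Note that `spark.num.executors` is not a key I can find in the Spark configuration reference (the documented property is `spark.executor.instances`, or `--num-executors` on `spark-submit`), so the real executor count may differ. The helper `parse_gib` and the dictionary keys are hypothetical names used only for illustration.

```python
import math

def parse_gib(value: str) -> int:
    """Convert a Spark-style size such as "18G" to whole GiB (hypothetical helper)."""
    if not value.upper().endswith("G"):
        raise ValueError(f"expected a size ending in 'G', got {value!r}")
    return int(value[:-1])

# Values copied from the posted configuration.
# Assumption: 209 executors are really allocated by YARN.
settings = {
    "executors": 209,
    "executor_cores": 5,
    "executor_memory": "18G",
    "memory_overhead": "2G",
    "shuffle_partitions": 2100,
}

# Concurrent task slots = executors * cores per executor.
total_cores = settings["executors"] * settings["executor_cores"]

# Each YARN container needs heap plus off-heap overhead.
container_gib = parse_gib(settings["executor_memory"]) + parse_gib(settings["memory_overhead"])
cluster_executor_gib = settings["executors"] * container_gib

# How many rounds ("waves") of tasks a 2100-partition stage needs.
waves = math.ceil(settings["shuffle_partitions"] / total_cores)
last_wave_tasks = settings["shuffle_partitions"] - (waves - 1) * total_cores

print("task slots:", total_cores)
print("GiB per executor container:", container_gib)
print("GiB requested for all executors:", cluster_executor_gib)
print("waves per stage:", waves, "| tasks in last wave:", last_wave_tasks)
```

With these numbers, a 2100-partition stage runs as two full waves plus a third wave of only 10 tasks, assuming tasks take roughly equal time. A partition count that is a multiple of the slot count may be worth trying, though whether it changes the CPU load seen in Ganglia would need to be measured.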

Tags: apache-spark, amazon-emr, ganglia

Solution

