Why doesn't a pyspark script's performance improve as the number of cores and executors increases?

Problem description

I have a script that performs binary classification by loading a pre-trained model. I would like to know why I always get roughly the same performance when I try different combinations of num-executors and executor-cores. Here are the important lines of my pyspark script:

import time

from pyspark.sql import Window
from pyspark.sql.functions import col, desc, expr, row_number, when

start = time.time()

# extract evaluation pairs
#
aug_comb_mldf =  dfml_partial.select('eventId').crossJoin(df_id_zero_aug.select('eventId').withColumnRenamed('eventId', 'eventIde'))

pt1 = time.time()

# feature engineering
#
feats_titles = ["feat1f", "feat2f", "feat3f",
                "feat4f", "feat5f", "feat6f"]
augfldf = aug_comb_mldf.join(dfml_partial.withColumnRenamed('eventId', 'eventId').alias('a'), ['eventId'], 'inner') \
   .join(dfallaug.withColumnRenamed('eventId', 'eventIde').drop('id').alias('b'), ['eventIde'], 'inner')\
   .withColumn('feat1f', when(expr('a.feat1 = b.feat1'), 1).otherwise(0))\
   .withColumn('feat2f', when(expr('a.feat2 = b.feat2'), 1).otherwise(0))\
   .withColumn('feat3f', when(expr('a.feat3 = b.feat3'), 1).otherwise(0))\
   .withColumn('feat4f', when(expr('a.feat4 = b.feat4'), 1).otherwise(0))\
   .withColumn('feat5f', when(expr('a.feat5 = b.feat5'), 1).otherwise(0))\
   .withColumn('feat6f', when(expr('a.feat6 = b.feat6'), 1).otherwise(0))\
   .select(feats_titles)

pt2 = time.time()

# Make predictions.
#
aug_predictions = model.transform(augfldf)

pt3 = time.time()

aug_predictions_true = aug_predictions.select("eventId", "eventIde", "id", "probability")
aug_predictions_true = aug_predictions_true.filter((aug_predictions.predictedLabel != '0'))

# find highest prob
#
w = Window.partitionBy("eventIde")

aug_predictions_true = aug_predictions_true.withColumn("rank", row_number().over(w.orderBy(desc("probability"))))\
        .filter(col("rank")==1)\
        .drop("rank")

pt4 = time.time()

print ("pt1-start = ", pt1-start)
print ("pt2-start = ", pt2-pt1)
print ("pt3-start = ", pt3-pt2)
print ("pt4-start = ", pt4-pt3)
print ("total = ", pt4-start)

Here are the timings:

('pt1-start = ', 0.034136056900024414)
('pt2-start = ', 0.41227102279663086)
('pt3-start = ', 0.12337303161621094)
('pt4-start = ', 0.1068110466003418)
('total = ', 0.676591157913208)

This is how I run the script:

spark-submit --master yarn myapp.py --num-executors 16 --executor-cores 4 --executor-memory 12g --driver-memory 6g

I ran spark-submit with different combinations of the four settings you see above, and I always got roughly the same performance.

Tags: performance, apache-spark, pyspark

Solution


--executor-cores specifies the number of parallel threads that will run inside each executor. In Python, however, because of the GIL (Global Interpreter Lock) there is no true thread-level parallelism, so those extra threads will not actually run in parallel.
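
To see the GIL effect this answer refers to in isolation, here is a standalone sketch (not part of the asker's script): a pure-Python, CPU-bound function takes roughly as long when split across two threads as when run sequentially, because only one thread can execute Python bytecode at a time.

import time
from concurrent.futures import ThreadPoolExecutor

def cpu_bound(n):
    # Pure-Python, CPU-bound loop; it holds the GIL while it runs.
    total = 0
    for i in range(n):
        total += i * i
    return total

N = 5000000

# Run the work twice sequentially.
start = time.time()
cpu_bound(N)
cpu_bound(N)
print("sequential  =", time.time() - start)

# Run the same work on two threads: the GIL serializes the bytecode,
# so the wall-clock time stays roughly the same as the sequential run.
start = time.time()
with ThreadPoolExecutor(max_workers=2) as pool:
    list(pool.map(cpu_bound, [N, N]))
print("two threads =", time.time() - start)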

So, to improve runtime performance, I would suggest using a larger number of executors rather than increasing the number of cores per executor.
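
If you want to experiment with that suggestion while keeping the total resource budget roughly the same, one way is to set the standard Spark properties when the SparkSession is created (spark.executor.instances, spark.executor.cores and spark.executor.memory are the properties behind --num-executors, --executor-cores and --executor-memory). This is only a sketch with illustrative numbers, assuming the job runs on YARN without dynamic allocation as in the question:

from pyspark.sql import SparkSession

# Illustrative numbers only: trade cores per executor for executor count,
# keeping the overall core budget (16 x 4 = 32 x 2 = 64 cores) about the same.
spark = (
    SparkSession.builder
    .appName("myapp")
    .config("spark.executor.instances", "32")  # more executors ...
    .config("spark.executor.cores", "2")       # ... with fewer cores each
    .config("spark.executor.memory", "6g")
    .getOrCreate()
)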

