PySpark script is slow despite a very small dataset

Problem description

The following Python script takes about 40 minutes to run on an 8-core/32 GB Windows machine. Why is it so slow?

from pyspark.sql import functions

for he in range(1, 25):
    he_str = str(he)
    ### df_all is a DataFrame that contains only 3200 records ###
    ### df_all does have 146 columns, though. Maybe this is why? ###
    df_all = df_all.withColumn('PROFIT_INC_HE' + he_str, functions.lit(0))
    df_all = df_all.withColumn('PROFIT_DEC_HE' + he_str, functions.lit(0))

    ### TIER_PRICE_FACTORS is a list of 4 elements ###
    for tiers in TIER_PRICE_FACTORS:
        tiers_str = str(tiers).replace('.', '')

        df_all = df_all.withColumn('PROFIT_INC_HE' + he_str,
                                   functions.col('PROFIT_INC_HE' + he_str)
                                   + functions.col('BID_PROFIT_INC_HE' + he_str + '_' + tiers_str))

        df_all = df_all.withColumn('PROFIT_DEC_HE' + he_str,
                                   functions.col('PROFIT_DEC_HE' + he_str)
                                   + functions.col('BID_PROFIT_DEC_HE' + he_str + '_' + tiers_str))

Tags: apache-spark, pyspark

Solution
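
The record count is almost certainly not the problem; the cost is in query planning. Each `withColumn` call stacks a new projection on top of the previous logical plan, and this loop chains roughly 240 of them (24 hours × 10 calls) over a 146-column DataFrame, so Catalyst's analysis time grows sharply with plan depth. A common remedy is to build every derived column expression up front and apply them all in a single `selectExpr`. A minimal sketch of the expression-building step, assuming example values for `TIER_PRICE_FACTORS` (the real values are not shown in the question):

```python
# Build all 48 derived columns as SQL expression strings, then apply them
# in one selectExpr call instead of ~240 chained withColumn calls.

TIER_PRICE_FACTORS = [0.9, 0.95, 1.05, 1.1]  # assumed example values

def build_profit_exprs(hours=range(1, 25), tiers=TIER_PRICE_FACTORS):
    exprs = []
    for he in hours:
        he_str = str(he)
        for direction in ('INC', 'DEC'):
            # Sum the four tier columns directly; no lit(0) seed is needed.
            terms = ['BID_PROFIT_{}_HE{}_{}'.format(direction, he_str,
                                                    str(t).replace('.', ''))
                     for t in tiers]
            exprs.append('{} AS PROFIT_{}_HE{}'.format(' + '.join(terms),
                                                       direction, he_str))
    return exprs

exprs = build_profit_exprs()
# df_all = df_all.selectExpr('*', *exprs)  # one planning step, same result
```

Because the sums are expressed directly, the `lit(0)` initialization columns also disappear, and Spark analyzes a single projection instead of a deep chain of them.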

