I want to merge all the multiple files in each partition written by Spark into one, without using repartition or coalesce, using maxRecordsPerFile

Problem Description

Hi, I am using the command below in PySpark to write my table with only one file per partition. That is why I set maxRecordsPerFile to 25 million, while my daily partitions hold only about 15 million records, so it should always create exactly 1 file per partition. In my case, however, it creates roughly 20 files in each partition; earlier it wrote 200 files because of shuffling.

I have already tried repartition and coalesce(1), but the job hangs because of the huge shuffle.

final_dedup_df.write.option("maxRecordsPerFile", 25000000).format("parquet").mode('append').insertInto("%s.%s" % (db_name, table_name), overwrite=True)

Output:

113.7 M hdfs://labshdpds2/apps/hive/warehouse/dwh_staging.db/dwd_trd_aqc_trade_di/updated_dt=2019-07-01/part-00009-dcab0779-d49a-4dd7-b8fd-76001b02bc04.c000
13.6 M hdfs://labshdpds2/apps/hive/warehouse/dwh_staging.db/dwd_trd_aqc_trade_di/updated_dt=2019-07-01/part-00019-dcab0779-d49a-4dd7-b8fd-76001b02bc04.c000
113.2 M hdfs://labshdpds2/apps/hive/warehouse/dwh_staging.db/dwd_trd_aqc_trade_di/updated_dt=2019-07-01/part-00029-dcab0779-d49a-4dd7-b8fd-76001b02bc04.c000
113.0 M hdfs://labshdpds2/apps/hive/warehouse/dwh_staging.db/dwd_trd_aqc_trade_di/updated_dt=2019-07-01/part-00047-dcab0779-d49a-4dd7-b8fd-76001b02bc04.c000
112.9 M hdfs://labshdpds2/apps/hive/warehouse/dwh_staging.db/dwd_trd_aqc_trade_di/updated_dt=2019-07-01/part-00065-dcab0779-d49a-4dd7-b8fd-76001b02bc04.c000
112.7 M hdfs://labshdpds2/apps/hive/warehouse/dwh_staging.db/dwd_trd_aqc_trade_di/updated_dt=2019-07-01/part-00083-dcab0779-d49a-4dd7-b8fd-76001b02bc04.c000
112.7 M hdfs://labshdpds2/apps/hive/warehouse/dwh_staging.db/dwd_trd_aqc_trade_di/updated_dt=2019-07-01/part-00101-dcab0779-d49a-4dd7-b8fd-76001b02bc04.c000
112.7 M hdfs://labshdpds2/apps/hive/warehouse/dwh_staging.db/dwd_trd_aqc_trade_di/updated_dt=2019-07-01/part-00119-dcab0779-d49a-4dd7-b8fd-76001b02bc04.c000
112.7 M hdfs://labshdpds2/apps/hive/warehouse/dwh_staging.db/dwd_trd_aqc_trade_di/updated_dt=2019-07-01/part-00137-dcab0779-d49a-4dd7-b8fd-76001b02bc04.c000
112.7 M hdfs://labshdpds2/apps/hive/warehouse/dwh_staging.db/dwd_trd_aqc_trade_di/updated_dt=2019-07-01/part-00155-dcab0779-d49a-4dd7-b8fd-76001b02bc04.c000
112.6 M hdfs://labshdpds2/apps/hive/warehouse/dwh_staging.db/dwd_trd_aqc_trade_di/updated_dt=2019-07-01/part-00173-dcab0779-d49a-4dd7-b8fd-76001b02bc04.c000
112.6 M hdfs://labshdpds2/apps/hive/warehouse/dwh_staging.db/dwd_trd_aqc_trade_di/updated_dt=2019-07-01/part-00191-dcab0779-d49a-4dd7-b8fd-76001b02bc04.c000
112.6 M hdfs://labshdpds2/apps/hive/warehouse/dwh_staging.db/dwd_trd_aqc_trade_di/updated_dt=2019-07-01/part-00209-dcab0779-d49a-4dd7-b8fd-76001b02bc04.c000
75.8 M hdfs://labshdpds2/apps/hive/warehouse/dwh_staging.db/dwd_trd_aqc_trade_di/updated_dt=2019-07-01/part-00224-dcab0779-d49a-4dd7-b8fd-76001b02bc04.c000
62.5 M hdfs://labshdpds2/apps/hive/warehouse/dwh_staging.db/dwd_trd_aqc_trade_di/updated_dt=2019-07-01/part-00235-dcab0779-d49a-4dd7-b8fd-76001b02bc04.c000
62.5 M hdfs://labshdpds2/apps/hive/warehouse/dwh_staging.db/dwd_trd_aqc_trade_di/updated_dt=2019-07-01/part-00245-dcab0779-d49a-4dd7-b8fd-76001b02bc04.c000
62.4 M hdfs://labshdpds2/apps/hive/warehouse/dwh_staging.db/dwd_trd_aqc_trade_di/updated_dt=2019-07-01/part-00255-dcab0779-d49a-4dd7-b8fd-76001b02bc04.c000
62.3 M hdfs://labshdpds2/apps/hive/warehouse/dwh_staging.db/dwd_trd_aqc_trade_di/updated_dt=2019-07-01/part-00265-dcab0779-d49a-4dd7-b8fd-76001b02bc04.c000
62.3 M hdfs://labshdpds2/apps/hive/warehouse/dwh_staging.db/dwd_trd_aqc_trade_di/updated_dt=2019-07-01/part-00275-dcab0779-d49a-4dd7-b8fd-76001b02bc04.c000
62.3 M hdfs://labshdpds2/apps/hive/warehouse/dwh_staging.db/dwd_trd_aqc_trade_di/updated_dt=2019-07-01/part-00285-dcab0779-d49a-4dd7-b8fd-76001b02bc04.c000
62.3 M hdfs://labshdpds2/apps/hive/warehouse/dwh_staging.db/dwd_trd_aqc_trade_di/updated_dt=2019-07-01/part-00295-dcab0779-d49a-4dd7-b8fd-76001b02bc04.c000
62.1 M hdfs://labshdpds2/apps/hive/warehouse/dwh_staging.db/dwd_trd_aqc_trade_di/updated_dt=2019-07-01/part-00305-dcab0779-d49a-4dd7-b8fd-76001b02bc04.c000

Tags: apache-spark

Solution
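
maxRecordsPerFile is only an upper bound: it splits one task's output into several files once the cap is exceeded, but it never merges the output of different tasks. The number of files per Hive partition therefore equals the number of write tasks that hold rows for that partition; the 200 files seen earlier match the default spark.sql.shuffle.partitions, and the ~20 files now likely mean about 20 tasks carry data for each updated_dt. To get one file per partition without funneling everything through a single task the way coalesce(1) does, a common approach is to repartition by the partition column just before the write. A minimal sketch, assuming the table is partitioned by updated_dt (as the HDFS paths suggest) and that final_dedup_df, db_name and table_name are defined as in the question:

(final_dedup_df
    .repartition("updated_dt")              # one hash shuffle: all rows of a
                                            # given updated_dt go to one task
    .write
    .option("maxRecordsPerFile", 25000000)  # kept as a safety cap; splits a
                                            # file only above 25M rows
    .insertInto("%s.%s" % (db_name, table_name), overwrite=True))

Unlike coalesce(1), this shuffle is spread across many tasks (one group per date value), so it should not hang the way a single-task merge does. Note that insertInto matches columns by position, and with overwrite=True which partitions get replaced depends on spark.sql.sources.partitionOverwriteMode.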


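If the insert is easier to express in SQL, DISTRIBUTE BY on the partition column gives the same clustering before the write. A hypothetical equivalent, assuming dynamic partition inserts are enabled on the cluster and that updated_dt is the last column of final_dedup_df; the table name is taken from the question's HDFS paths, and the temp view name is illustrative:

final_dedup_df.createOrReplaceTempView("final_dedup")  # illustrative view name
spark.sql("""
    INSERT OVERWRITE TABLE dwh_staging.dwd_trd_aqc_trade_di
    PARTITION (updated_dt)
    SELECT * FROM final_dedup
    DISTRIBUTE BY updated_dt  -- clusters each date into a single task before writing
""")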