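apache-spark - How can I merge the multiple files Spark writes into each partition using maxRecordsPerFile, without repartition or coalesce?
Problem description
Hi, I am using the command below in PySpark to write my table, and I want only one file per partition. That is why I set maxRecordsPerFile to 25 million: my daily partition holds only 15 million records, so the write should always create one file per partition. In my case, however, it creates about 20 files in each partition. Earlier it wrote 200 files because of the shuffle.
I have already tried repartition and coalesce(1), but the job hangs because of the huge shuffle.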
final_dedup_df.write.option("maxRecordsPerFile", 25000000).format("parquet").mode('append').insertInto("%s.%s" % (db_name, table_name), overwrite=True)
Output:
113.7 M hdfs://labshdpds2/apps/hive/warehouse/dwh_staging.db/dwd_trd_aqc_trade_di/updated_dt=2019-07-01/part-00009-dcab0779-d49a-4dd7-b8fd-76001b02bc04.c000
13.6 M hdfs://labshdpds2/apps/hive/warehouse/dwh_staging.db/dwd_trd_aqc_trade_di/updated_dt=2019-07-01/part-00019-dcab0779-d49a-4dd7-b8fd-76001b02bc04.c000
113.2 M hdfs://labshdpds2/apps/hive/warehouse/dwh_staging.db/dwd_trd_aqc_trade_di/updated_dt=2019-07-01/part-00029-dcab0779-d49a-4dd7-b8fd-76001b02bc04.c000
113.0 M hdfs://labshdpds2/apps/hive/warehouse/dwh_staging.db/dwd_trd_aqc_trade_di/updated_dt=2019-07-01/part-00047-dcab0779-d49a-4dd7-b8fd-76001b02bc04.c000
112.9 M hdfs://labshdpds2/apps/hive/warehouse/dwh_staging.db/dwd_trd_aqc_trade_di/updated_dt=2019-07-01/part-00065-dcab0779-d49a-4dd7-b8fd-76001b02bc04.c000
112.7 M hdfs://labshdpds2/apps/hive/warehouse/dwh_staging.db/dwd_trd_aqc_trade_di/updated_dt=2019-07-01/part-00083-dcab0779-d49a-4dd7-b8fd-76001b02bc04.c000
112.7 M hdfs://labshdpds2/apps/hive/warehouse/dwh_staging.db/dwd_trd_aqc_trade_di/updated_dt=2019-07-01/part-00101-dcab0779-d49a-4dd7-b8fd-76001b02bc04.c000
112.7 M hdfs://labshdpds2/apps/hive/warehouse/dwh_staging.db/dwd_trd_aqc_trade_di/updated_dt=2019-07-01/part-00119-dcab0779-d49a-4dd7-b8fd-76001b02bc04.c000
112.7 M hdfs://labshdpds2/apps/hive/warehouse/dwh_staging.db/dwd_trd_aqc_trade_di/updated_dt=2019-07-01/part-00137-dcab0779-d49a-4dd7-b8fd-76001b02bc04.c000
112.7 M hdfs://labshdpds2/apps/hive/warehouse/dwh_staging.db/dwd_trd_aqc_trade_di/updated_dt=2019-07-01/part-00155-dcab0779-d49a-4dd7-b8fd-76001b02bc04.c000
112.6 M hdfs://labshdpds2/apps/hive/warehouse/dwh_staging.db/dwd_trd_aqc_trade_di/updated_dt=2019-07-01/part-00173-dcab0779-d49a-4dd7-b8fd-76001b02bc04.c000
112.6 M hdfs://labshdpds2/apps/hive/warehouse/dwh_staging.db/dwd_trd_aqc_trade_di/updated_dt=2019-07-01/part-00191-dcab0779-d49a-4dd7-b8fd-76001b02bc04.c000
112.6 M hdfs://labshdpds2/apps/hive/warehouse/dwh_staging.db/dwd_trd_aqc_trade_di/updated_dt=2019-07-01/part-00209-dcab0779-d49a-4dd7-b8fd-76001b02bc04.c000
75.8 M hdfs://labshdpds2/apps/hive/warehouse/dwh_staging.db/dwd_trd_aqc_trade_di/updated_dt=2019-07-01/part-00224-dcab0779-d49a-4dd7-b8fd-76001b02bc04.c000
62.5 M hdfs://labshdpds2/apps/hive/warehouse/dwh_staging.db/dwd_trd_aqc_trade_di/updated_dt=2019-07-01/part-00235-dcab0779-d49a-4dd7-b8fd-76001b02bc04.c000
62.5 M hdfs://labshdpds2/apps/hive/warehouse/dwh_staging.db/dwd_trd_aqc_trade_di/updated_dt=2019-07-01/part-00245-dcab0779-d49a-4dd7-b8fd-76001b02bc04.c000
62.4 M hdfs://labshdpds2/apps/hive/warehouse/dwh_staging.db/dwd_trd_aqc_trade_di/updated_dt=2019-07-01/part-00255-dcab0779-d49a-4dd7-b8fd-76001b02bc04.c000
62.3 M hdfs://labshdpds2/apps/hive/warehouse/dwh_staging.db/dwd_trd_aqc_trade_di/updated_dt=2019-07-01/part-00265-dcab0779-d49a-4dd7-b8fd-76001b02bc04.c000
62.3 M hdfs://labshdpds2/apps/hive/warehouse/dwh_staging.db/dwd_trd_aqc_trade_di/updated_dt=2019-07-01/part-00275-dcab0779-d49a-4dd7-b8fd-76001b02bc04.c000
62.3 M hdfs://labshdpds2/apps/hive/warehouse/dwh_staging.db/dwd_trd_aqc_trade_di/updated_dt=2019-07-01/part-00285-dcab0779-d49a-4dd7-b8fd-76001b02bc04.c000
62.3 M hdfs://labshdpds2/apps/hive/warehouse/dwh_staging.db/dwd_trd_aqc_trade_di/updated_dt=2019-07-01/part-00295-dcab0779-d49a-4dd7-b8fd-76001b02bc04.c000
62.1 M hdfs://labshdpds2/apps/hive/warehouse/dwh_staging.db/dwd_trd_aqc_trade_di/updated_dt=2019-07-01/part-00305-dcab0779-d49a-4dd7-b8fd-76001b02bc04.c000
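The listing above shows 22 files for a single day, which fits how `maxRecordsPerFile` works: it only caps the number of rows one write task puts into a single file, and it never merges rows held by different tasks. Every task that holds rows for `updated_dt=2019-07-01` therefore writes at least one file of its own. The following is a minimal sketch of that counting rule in plain Python, not Spark code. The function name `files_per_partition` and the way the 15 million rows are split across tasks are assumptions made for illustration; the real per-task row counts are not given in the question.

```python
import math


def files_per_partition(rows_per_task, max_records_per_file):
    """Model how many files land in one table partition.

    rows_per_task: rows each write task holds for this partition
    (hypothetical input). A task with at least one row writes its own
    file and starts another only after max_records_per_file rows.
    """
    return sum(
        math.ceil(rows / max_records_per_file)
        for rows in rows_per_task
        if rows > 0
    )


# Assumed split: 15 million rows spread evenly over 22 tasks.
spread = [15_000_000 // 22] * 22
spread[0] += 15_000_000 - sum(spread)  # put the remainder in the first task

# Same rows held by a single task.
single = [15_000_000]

print(files_per_partition(spread, 25_000_000))  # 22
print(files_per_partition(single, 25_000_000))  # 1
```

Under this model the file count depends on how many tasks hold rows for the partition at write time, so a limit of 25 million on its own cannot bring 22 files down to one.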