首页 > 解决方案 > What determines parquet file buffer sizes

问题描述

I wrote a DataFrame in spark-shell into hdfs and I got the below output. What I want to understand is, what determines the size of the parquet files being written? My dfs.block.size is set to:

scala> spark.sparkContext.hadoopConfiguration.get("dfs.block.size")
res1: String = 134217728

which is 128MB, so why are my files in the 20,000,000 range of bytes?

-rw-r--r--   1 hadoop supergroup          0 2018-11-13 11:51 /new_sample_parquet_test/_SUCCESS
-rw-r--r--   1 hadoop supergroup   23631191 2018-11-13 11:51 /new_sample_parquet_test/part-00000-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r--   1 hadoop supergroup   23435545 2018-11-13 11:51 /new_sample_parquet_test/part-00001-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r--   1 hadoop supergroup   22568091 2018-11-13 11:51 /new_sample_parquet_test/part-00002-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r--   1 hadoop supergroup   23385544 2018-11-13 11:51 /new_sample_parquet_test/part-00003-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r--   1 hadoop supergroup   23335676 2018-11-13 11:51 /new_sample_parquet_test/part-00004-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r--   1 hadoop supergroup   23423372 2018-11-13 11:51 /new_sample_parquet_test/part-00005-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r--   1 hadoop supergroup   22182760 2018-11-13 11:51 /new_sample_parquet_test/part-00006-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r--   1 hadoop supergroup   20906453 2018-11-13 11:51 /new_sample_parquet_test/part-00007-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r--   1 hadoop supergroup   22918107 2018-11-13 11:51 /new_sample_parquet_test/part-00008-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r--   1 hadoop supergroup   21655224 2018-11-13 11:51 /new_sample_parquet_test/part-00009-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r--   1 hadoop supergroup   20366872 2018-11-13 11:51 /new_sample_parquet_test/part-00010-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r--   1 hadoop supergroup   22658141 2018-11-13 11:51 /new_sample_parquet_test/part-00011-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r--   1 hadoop supergroup   22246580 2018-11-13 11:51 /new_sample_parquet_test/part-00012-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r--   1 hadoop supergroup   20648612 2018-11-13 11:51 /new_sample_parquet_test/part-00013-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r--   1 hadoop supergroup   22369663 2018-11-13 11:51 /new_sample_parquet_test/part-00014-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r--   1 hadoop supergroup   23396027 2018-11-13 11:51 /new_sample_parquet_test/part-00015-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r--   1 hadoop supergroup   23382811 2018-11-13 11:51 /new_sample_parquet_test/part-00016-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r--   1 hadoop supergroup   17470540 2018-11-13 11:51 /new_sample_parquet_test/part-00017-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r--   1 hadoop supergroup   22669018 2018-11-13 11:51 /new_sample_parquet_test/part-00018-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r--   1 hadoop supergroup   21899425 2018-11-13 11:51 /new_sample_parquet_test/part-00019-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r--   1 hadoop supergroup   21378060 2018-11-13 11:51 /new_sample_parquet_test/part-00020-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r--   1 hadoop supergroup   21578176 2018-11-13 11:51 /new_sample_parquet_test/part-00021-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r--   1 hadoop supergroup   21079291 2018-11-13 11:51 /new_sample_parquet_test/part-00022-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r--   1 hadoop supergroup   21526313 2018-11-13 11:51 /new_sample_parquet_test/part-00023-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r--   1 hadoop supergroup   22446489 2018-11-13 11:51 /new_sample_parquet_test/part-00024-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r--   1 hadoop supergroup   21770955 2018-11-13 11:51 /new_sample_parquet_test/part-00025-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r--   1 hadoop supergroup   23199003 2018-11-13 11:51 /new_sample_parquet_test/part-00026-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r--   1 hadoop supergroup   21833916 2018-11-13 11:51 /new_sample_parquet_test/part-00027-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r--   1 hadoop supergroup   25090443 2018-11-13 11:51 /new_sample_parquet_test/part-00028-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r--   1 hadoop supergroup   20725755 2018-11-13 11:51 /new_sample_parquet_test/part-00029-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r--   1 hadoop supergroup   20666565 2018-11-13 11:51 /new_sample_parquet_test/part-00030-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r--   1 hadoop supergroup   22299474 2018-11-13 11:51 /new_sample_parquet_test/part-00031-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r--   1 hadoop supergroup   22327133 2018-11-13 11:51 /new_sample_parquet_test/part-00032-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r--   1 hadoop supergroup   22207468 2018-11-13 11:51 /new_sample_parquet_test/part-00033-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r--   1 hadoop supergroup   22630251 2018-11-13 11:51 /new_sample_parquet_test/part-00034-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r--   1 hadoop supergroup   21648270 2018-11-13 11:51 /new_sample_parquet_test/part-00035-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r--   1 hadoop supergroup   22230127 2018-11-13 11:51 /new_sample_parquet_test/part-00036-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r--   1 hadoop supergroup   22439910 2018-11-13 11:51 /new_sample_parquet_test/part-00037-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r--   1 hadoop supergroup   22252551 2018-11-13 11:51 /new_sample_parquet_test/part-00038-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r--   1 hadoop supergroup   22160655 2018-11-13 11:51 /new_sample_parquet_test/part-00039-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r--   1 hadoop supergroup   17637580 2018-11-13 11:51 /new_sample_parquet_test/part-00040-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r--   1 hadoop supergroup   21743969 2018-11-13 11:51 /new_sample_parquet_test/part-00041-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r--   1 hadoop supergroup   22071235 2018-11-13 11:51 /new_sample_parquet_test/part-00042-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r--   1 hadoop supergroup   21854771 2018-11-13 11:51 /new_sample_parquet_test/part-00043-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r--   1 hadoop supergroup   25243330 2018-11-13 11:51 /new_sample_parquet_test/part-00044-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r--   1 hadoop supergroup   22297865 2018-11-13 11:51 /new_sample_parquet_test/part-00045-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r--   1 hadoop supergroup   22070057 2018-11-13 11:51 /new_sample_parquet_test/part-00046-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r--   1 hadoop supergroup   22018671 2018-11-13 11:51 /new_sample_parquet_test/part-00047-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r--   1 hadoop supergroup   21796749 2018-11-13 11:51 /new_sample_parquet_test/part-00048-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r--   1 hadoop supergroup   22597634 2018-11-13 11:51 /new_sample_parquet_test/part-00049-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r--   1 hadoop supergroup   20728588 2018-11-13 11:51 /new_sample_parquet_test/part-00050-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r--   1 hadoop supergroup   22137701 2018-11-13 11:51 /new_sample_parquet_test/part-00051-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r--   1 hadoop supergroup   22387635 2018-11-13 11:51 /new_sample_parquet_test/part-00052-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r--   1 hadoop supergroup   20965957 2018-11-13 11:51 /new_sample_parquet_test/part-00053-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r--   1 hadoop supergroup   20314451 2018-11-13 11:51 /new_sample_parquet_test/part-00054-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r--   1 hadoop supergroup   22538965 2018-11-13 11:51 /new_sample_parquet_test/part-00055-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r--   1 hadoop supergroup   20923261 2018-11-13 11:51 /new_sample_parquet_test/part-00056-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r--   1 hadoop supergroup   20984805 2018-11-13 11:51 /new_sample_parquet_test/part-00057-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r--   1 hadoop supergroup   20513317 2018-11-13 11:51 /new_sample_parquet_test/part-00058-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r--   1 hadoop supergroup   25493903 2018-11-13 11:51 /new_sample_parquet_test/part-00059-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r--   1 hadoop supergroup   21178862 2018-11-13 11:51 /new_sample_parquet_test/part-00060-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r--   1 hadoop supergroup   20696540 2018-11-13 11:51 /new_sample_parquet_test/part-00061-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r--   1 hadoop supergroup   21011416 2018-11-13 11:51 /new_sample_parquet_test/part-00062-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r--   1 hadoop supergroup   15752503 2018-11-13 11:51 /new_sample_parquet_test/part-00063-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet

标签: apache-sparkhadoophdfsparquet

解决方案


Parquet writer is not concerned with HDFS block size as you can save parquet e.g. on a local hard drive. What determines the number and sizes of individual part-*.parquet files is the number of partitions in your data frame (64 in your case). If you would do df.coalesce(1).write.parquet(...), you'll have just one large part file.

If you want the part files to be around 128 Mb each, coalesce parameter should be around 20 * 64 / 128 = 10. The part file size for a given number of coalesced partitions dependency is not strictly linear though. The smaller the number of part files, the more efficient encoding/compression is.

See coalesce method description for details


推荐阅读