I'm using PySpark and I have a large data source that I want to repartition, explicitly specifying the file size per partition.
I know using the repartition(500) function will split my parquet into 500 files with almost equal sizes.
The problem is that new data gets added to this data source every day. On some days there might be a large input, and on some days there might be smaller inputs. So when looking at the partition file size distribution over a period of time, it varies between 200KB and 700KB per file.
I was thinking of specifying the max size per partition so that I get more or less the same size per file every day, irrespective of the number of files. This should help avoid skewed executor times, shuffle times, etc. when I later run my job on this large dataset.
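To make the idea concrete, here is a rough sketch of the workaround I have in mind: derive the partition count from an estimated total size and a target file size, then pass that count to repartition(). The helper name `partitions_for_target_size`, the `estimated_bytes` variable, and the output path are hypothetical, and the size estimate would have to come from somewhere else (for example, the size of the day's input files), because compressed parquet size is hard to predict from the DataFrame alone.

```python
import math


def partitions_for_target_size(total_bytes: int, target_bytes_per_file: int) -> int:
    """Return how many partitions are needed so that each output file
    is at most roughly target_bytes_per_file (assuming evenly sized partitions)."""
    if target_bytes_per_file <= 0:
        raise ValueError("target_bytes_per_file must be positive")
    # Round up so no file exceeds the target; always keep at least one partition.
    return max(1, math.ceil(total_bytes / target_bytes_per_file))


# Hypothetical usage with an existing DataFrame `df`:
# estimated_bytes = ...  # assumption: estimated elsewhere, e.g. from input file sizes
# n = partitions_for_target_size(estimated_bytes, 500 * 1024)  # aim for ~500KB files
# df.repartition(n).write.parquet("/path/to/output")
```

This only keeps file sizes steady if the size estimate is reasonable and the rows are spread evenly across partitions, which is why I would prefer a built-in way to set the size.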
Is there a way to specify it using the repartition() function or while writing the dataframe to parquet?