I'm using PySpark and I have a large data source that I want to repartition, specifying the file size per partition explicitly.

I know that using the repartition(500) function will split my Parquet output into 500 files of almost equal size. The problem is that new data gets added to this data source every day. On some days there might be a large input, and on other days a smaller one. So, looking at the partition file size distribution over a period of time, it varies between 200 KB and 700 KB per file.
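For context, here is a minimal sketch of the fixed-count approach described above (df and the "..." paths are placeholders, not from the original post):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.parquet("...")          # the large data source
(df.repartition(500)                    # always 500 output files, regardless of input volume
   .write
   .parquet("...", compression="snappy"))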

I was thinking of specifying the maximum size per partition so that I get roughly the same file size per file, per day, irrespective of the number of files. This would help me avoid skewed executor and shuffle times when running jobs on this large dataset later on.

Is there a way to specify this using the repartition() function, or while writing the DataFrame to Parquet?

question from: https://stackoverflow.com/questions/65912908/how-to-specify-file-size-using-repartition-in-spark

1 Answer

You could consider writing your result with the write option maxRecordsPerFile, which limits how many records Spark puts into each output file.

storage_location = "..."  # output path (placeholder)
estimated_records_with_desired_size = 2000

(result_df.write
    .option("maxRecordsPerFile", estimated_records_with_desired_size)
    .parquet(storage_location, compression="snappy"))
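The answer leaves estimated_records_with_desired_size hard-coded. One way to derive it (an assumption added here, not part of the original answer) is to divide a target file size by the average serialized size of a record, estimated from files you have already written:

# Hypothetical estimate: aim for ~128 MB files, based on one observed sample file.
target_file_bytes = 128 * 1024 * 1024      # desired size per output file
sample_file_bytes = 700 * 1024             # size of one existing output file
sample_file_records = 2000                 # record count of that same file
bytes_per_record = sample_file_bytes / sample_file_records
estimated_records_with_desired_size = int(target_file_bytes / bytes_per_record)

Note that maxRecordsPerFile only caps how large each file can get; it does not merge small partitions, so it is often combined with repartition() or coalesce() to also control the number of output files.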
