I'm using PySpark and I have a large data source that I want to repartition, specifying the file size per partition explicitly.

I know that using the repartition(500) function will split my Parquet output into 500 files of almost equal size. The problem is that new data gets added to this data source every day. On some days there might be a large input, and on other days a smaller one. So, looking at the partition file size distribution over a period of time, it varies between 200 KB and 700 KB per file.
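For context, here is a minimal sketch of the fixed-count approach described above (df and the "..." paths are placeholders, not from the original post):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.parquet("...")          # the large data source
(df.repartition(500)                    # always 500 output files, regardless of input volume
   .write
   .parquet("...", compression="snappy"))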

I was thinking of specifying the maximum size per partition so that I get roughly the same file size per file, per day, irrespective of the number of files. This would help me avoid skewed executor and shuffle times when running jobs on this large dataset later on.

Is there a way to specify this using the repartition() function, or while writing the DataFrame to Parquet?

question from: https://stackoverflow.com/questions/65912908/how-to-specify-file-size-using-repartition-in-spark

1 Answer

You could consider writing your result with the write option maxRecordsPerFile, which limits how many records Spark puts into each output file.

storage_location = "..."  # output path (placeholder)
estimated_records_with_desired_size = 2000

(result_df.write
    .option("maxRecordsPerFile", estimated_records_with_desired_size)
    .parquet(storage_location, compression="snappy"))
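The answer leaves estimated_records_with_desired_size hard-coded. One way to derive it (an assumption added here, not part of the original answer) is to divide a target file size by the average serialized size of a record, estimated from files you have already written:

# Hypothetical estimate: aim for ~128 MB files, based on one observed sample file.
target_file_bytes = 128 * 1024 * 1024      # desired size per output file
sample_file_bytes = 700 * 1024             # size of one existing output file
sample_file_records = 2000                 # record count of that same file
bytes_per_record = sample_file_bytes / sample_file_records
estimated_records_with_desired_size = int(target_file_bytes / bytes_per_record)

Note that maxRecordsPerFile only caps how large each file can get; it does not merge small partitions, so it is often combined with repartition() or coalesce() to also control the number of output files.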
