I would like to repartition / coalesce my data so that it is saved into one Parquet file per partition. I would also like to use the Spark SQL partitionBy API. So I could do that like this:
df.coalesce(1)
.write
.partitionBy("entity", "year", "month", "day", "status")
.mode(SaveMode.Append)
.parquet(s"$location")
I've tested this and it doesn't seem to perform well. This is because there is only one partition to work on in the dataset and all the partitioning, compression and saving of files has to be done by one CPU core.
I could rewrite this to do the partitioning manually (using filter with the distinct partition values for example) before calling coalesce.
But is there a better way to do this using the standard Spark SQL API?
See Question&Answers more detail:os