Welcome to ShenZhenJia Knowledge Sharing Community for programmer and developer-Open, Learning and Share

I'm wondering if there are differences in performance (when reading) between these two commands:

    df.write.format('parquet').partitionBy(xx).save('/.../xx.parquet')
    df.write.format('parquet').partitionBy(xx).saveAsTable('...')

I understand that for bucketing the question doesn't arise, since bucketing is only supported for managed tables (saveAsTable()); however, I'm a bit confused about partitioning: is one of the two methods preferable to the other?



1 Answer

I've tried to find an answer experimentally on a small dataframe; here are the results:

ENV = Databricks Community edition 
      [Attached to cluster: test, 15.25 GB | 2 Cores | DBR 7.4 | Spark 3.0.1 | Scala 2.12]

sqlContext.setConf("spark.sql.shuffle.partitions", 2)
spark.conf.set("spark.sql.adaptive.enabled", "true")

df.count() = 693243

RESULTS:

As expected, writing with .saveAsTable() takes a bit longer, because it has to run a dedicated CreateDataSourceTableAsSelectCommand to actually create the table. What's interesting is the read side: in this simple example, reading the table written with .saveAsTable() was faster by nearly a factor of 10. I'd be very interested to see the comparison at a much larger scale, if someone has the ability to run it, and to understand what happens under the hood.

[Screenshot of the measured write and read timings]


...