Welcome to ShenZhenJia Knowledge Sharing Community for programmer and developer-Open, Learning and Share

I need the Parquet files produced by my PySpark script to be written with RLE_DICTIONARY encoding (https://www.slideshare.net/databricks/the-parquet-format-and-performance-optimization-opportunities).

Secondly, I need compression to be applied at the row group (split unit) level rather than to the file as a whole, ideally with snappy, so that we can support parallel reads from Redshift Spectrum (https://docs.aws.amazon.com/redshift/latest/dg/c-spectrum-data-files.html).

However, looking at the official Spark docs, there are only a few Parquet-related properties that can be set (https://spark.apache.org/docs/2.4.3/sql-data-sources-parquet.html#configuration). This property:

spark.sql.parquet.compression.codec 

defaults to snappy, but does it apply file-level or split-level compression? That is, does Spark first produce the Parquet file and then snappy-compress the whole thing, or does it snappy-compress the row groups (splits) first and then assemble the file?

What is the default behavior here, and does it meet my requirement of compressing split chunks rather than the whole file? Is RLE_DICTIONARY the default encoding used by Spark? I couldn't find an option to set the encoding itself.

question from:https://stackoverflow.com/questions/65844890/spark-parquet-compression-and-encoding-schemes

