I need the Parquet files produced by my PySpark script to be encoded with RLE_DICTIONARY (https://www.slideshare.net/databricks/the-parquet-format-and-performance-optimization-opportunities).
Second, I need compression applied at the row group (split unit) level rather than to the file as a whole, ideally with Snappy, so that Redshift Spectrum can read the files in parallel (https://docs.aws.amazon.com/redshift/latest/dg/c-spectrum-data-files.html).
However, the official Spark docs list only a few Parquet-related properties that can be set (https://spark.apache.org/docs/2.4.3/sql-data-sources-parquet.html#configuration). This property:
spark.sql.parquet.compression.codec
defaults to snappy, but does it apply file-level or split-level compression? That is, does Spark first produce the Parquet file and then Snappy-compress it as a whole, or does it first Snappy-compress the row groups (splits) and then assemble the file?
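For concreteness, here is a minimal sketch of the two usual ways to select the codec from PySpark: the session-level property named above, or the per-write `compression` option. The helper names (`check_codec`, `set_session_codec`, `write_parquet`) and the output path are hypothetical and only for illustration; the codec list is the one given in the Spark 2.4.3 docs linked above.

```python
# Codec names listed as acceptable in the Spark 2.4.3 Parquet docs.
SUPPORTED_CODECS = {
    "none", "uncompressed", "snappy", "gzip", "lzo", "brotli", "lz4", "zstd",
}

CODEC_PROPERTY = "spark.sql.parquet.compression.codec"


def check_codec(codec):
    """Normalize a codec name and reject anything the docs do not list."""
    name = codec.lower()
    if name not in SUPPORTED_CODECS:
        raise ValueError(f"unsupported Parquet codec: {codec!r}")
    return name


def set_session_codec(spark, codec="snappy"):
    """Set the session-wide default codec (hypothetical helper)."""
    spark.conf.set(CODEC_PROPERTY, check_codec(codec))


def write_parquet(df, path, codec="snappy"):
    """Write one DataFrame with an explicit codec (hypothetical helper).

    The per-write `compression` option takes precedence over the
    session-level property.
    """
    df.write.mode("overwrite").option("compression", check_codec(codec)).parquet(path)


# Usage, assuming pyspark is installed and the path is writable:
#   from pyspark.sql import SparkSession
#   spark = SparkSession.builder.getOrCreate()
#   set_session_codec(spark, "snappy")
#   write_parquet(spark.range(1000), "/tmp/example_output")
```

Neither setting says anything about encoding; both only pick the codec name that is handed to the Parquet writer.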
What is the default behavior here, and does it meet my requirement of compressing each split chunk rather than the whole file? Is RLE_DICTIONARY the default encoding used by Spark? I couldn't find an option to set the encoding itself.
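One way to check what a writer actually produced, rather than relying on documented defaults, is to read the file footer. In the Parquet format the codec and the list of encodings are recorded per column chunk inside each row group, so they can be read back at that level. The sketch below assumes `pyarrow` is installed (it is not part of the setup described above); `summarize` and `inspect_file` are hypothetical helper names.

```python
def summarize(meta):
    """Collect (row_group, column_path, compression, encodings) tuples.

    `meta` is expected to look like pyarrow's FileMetaData: it must offer
    `num_row_groups` and `row_group(i)`, whose result offers `num_columns`
    and `column(j)`.
    """
    rows = []
    for rg_index in range(meta.num_row_groups):
        row_group = meta.row_group(rg_index)
        for col_index in range(row_group.num_columns):
            chunk = row_group.column(col_index)
            rows.append(
                (
                    rg_index,
                    chunk.path_in_schema,
                    chunk.compression,
                    tuple(chunk.encodings),
                )
            )
    return rows


def inspect_file(path):
    """Read the footer of one Parquet part file and summarize it.

    Needs pyarrow; imported lazily so summarize() has no hard dependency.
    """
    import pyarrow.parquet as pq

    return summarize(pq.ParquetFile(path).metadata)
```

Calling `inspect_file` on one of the part files in the Spark output directory lists, for every row group and column, which codec was used and whether a dictionary encoding appears, which settles the question empirically for a given Spark version.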
Question from: https://stackoverflow.com/questions/65844890/spark-parquet-compression-and-encoding-schemes