Welcome to ShenZhenJia Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

I am struggling with a step where I want to write each RDD partition to a separate Parquet file in its own directory. For example:

    <root>
        <entity=entity1>
            <year=2015>
                <week=45>
                    data_file.parquet

The advantage of this format is that SparkSQL can read the directory names directly as columns, so I do not have to repeat this data in the actual file. It would also be a good way to get to a specific partition without storing separate partitioning metadata somewhere else.

As a preceding step, I have all the data loaded from a large number of gzip files and partitioned by the above key.

One possible way would be to get each partition as a separate RDD and then write it, though I couldn't find a good way of doing that.

Any help will be appreciated. By the way, I am new to this stack.


1 Answer

I don't think the accepted answer appropriately answers the question.

Try something like this:

    df.write.partitionBy("year", "month", "day").parquet("/path/to/output")

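This gives you the partitioned directory structure, with one column=value directory level per column passed to partitionBy. For the layout in the question, the columns would be entity, year and week.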
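
As a rough sketch of how this could look for the layout in the question, the PySpark snippet below wraps the partitionBy call in a small helper and adds a helper to read the tree back. The column names entity, year and week come from the question's directory tree. The helper names write_partitioned and read_partitioned are made up for illustration, and the DataFrame is assumed to exist already (for example, built from the RDD with spark.createDataFrame).

```python
# Column names taken from the directory layout in the question (an assumption
# about how the DataFrame columns are actually named).
PARTITION_COLS = ("entity", "year", "week")


def write_partitioned(df, output_path, partition_cols=PARTITION_COLS):
    """Write a Spark DataFrame as Parquet, one directory level per partition column.

    `df` is expected to be a pyspark.sql.DataFrame that contains every column
    listed in `partition_cols`.
    """
    # DataFrameWriter.partitionBy produces <column>=<value> directories and
    # leaves the partition columns out of the data files themselves.
    df.write.partitionBy(*partition_cols).parquet(output_path)


def read_partitioned(spark, output_path):
    """Read the directory tree back with a SparkSession.

    Spark's partition discovery turns the <column>=<value> directory names
    back into ordinary columns of the returned DataFrame.
    """
    return spark.read.parquet(output_path)
```

With a SparkSession named spark and an RDD of (entity, year, week, value) tuples, where value is a placeholder for the real payload, a call might look like write_partitioned(spark.createDataFrame(rdd, ["entity", "year", "week", "value"]), "/path/to/output"). Note that Spark picks the file names itself (part-... files) and may write more than one file per directory, so a fixed name such as data_file.parquet is not something this call gives you.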

