Welcome to ShenZhenJia Knowledge Sharing Community for programmer and developer-Open, Learning and Share

I have a Spark Job that reads data from S3. I apply some transformations and write 2 datasets back to S3. Each write action is treated as a separate job.

Question: Does Spark guarantee that the data is read in the same order each time? For example, if I apply the function:

.withColumn('id', f.monotonically_increasing_id())

Will the id column have the same values for the same records each time?

question from:https://stackoverflow.com/questions/66055679/does-spark-guarantee-consistency-when-reading-data-from-s3


1 Answer

You give few details, but the following is easy to test and should serve as a guideline:

  • If you re-read the same files with the same content, you will get the same blocks / partitions again, and therefore the same id from f.monotonically_increasing_id().

  • If the total number of rows differs on a successive read, or different partitioning is applied before this function, then you will typically get different ids.

  • If you have more data the second time round and apply coalesce(1), the earlier entries will still have the same ids and the newer rows will get other ids. A less than realistic scenario, of course.
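To see why the ids follow the partition layout, here is a minimal pure-Python sketch. It assumes the bit layout that the Spark documentation describes for the current implementation of monotonically_increasing_id (partition ID in the upper 31 bits, record number within the partition in the lower 33 bits). The function emulate_monotonic_ids and the sample records are hypothetical and only emulate that layout; this is not Spark code.

```python
from typing import Dict, Sequence

# Assumption: per the Spark docs, the lower 33 bits hold the record number
# within a partition and the upper bits hold the partition index.
RECORD_BITS = 33


def emulate_monotonic_ids(partitions: Sequence[Sequence[str]]) -> Dict[str, int]:
    """Hypothetical helper: map each record to the id implied by that layout."""
    ids = {}
    for partition_index, rows in enumerate(partitions):
        for row_index, record in enumerate(rows):
            ids[record] = (partition_index << RECORD_BITS) + row_index
    return ids


# Same records in the same partition layout: the ids repeat.
first_read = emulate_monotonic_ids([["a", "b"], ["c", "d"]])
second_read = emulate_monotonic_ids([["a", "b"], ["c", "d"]])

# Same records split into partitions differently: the ids change.
repartitioned = emulate_monotonic_ids([["a"], ["b", "c"], ["d"]])

# Single partition (the coalesce(1) case), assuming earlier rows still come first.
single_before = emulate_monotonic_ids([["a", "b", "c", "d"]])
single_after = emulate_monotonic_ids([["a", "b", "c", "d", "e"]])

print(first_read == second_read)             # True
print(first_read["c"], repartitioned["c"])   # 8589934592 8589934593
print(all(single_after[k] == v for k, v in single_before.items()))  # True
```

The id is a function of where a row sits, not of what the row contains, so anything that moves a row to another partition or another position changes its id.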

Blocks of files at rest generally remain static on HDFS, so partitions 0..N will be the same each time you read the unchanged files. Otherwise zipWithIndex would not be usable either.

I would never rely on the same data being in the same place when read twice unless there were no updates in between (you could also cache the data).

