Welcome to ShenZhenJia Knowledge Sharing Community for programmer and developer-Open, Learning and Share
menu search
person
Welcome To Ask or Share your Answers For Others

Categories

I have trouble to find in the Spark documentation operations that causes a shuffle and operation that does not. In this list, which ones does cause a shuffle and which ones does not?

Map and filter does not. However, I am not sure with the others.

map(func)
filter(func)
flatMap(func)
mapPartitions(func)
mapPartitionsWithIndex(func)
sample(withReplacement, fraction, seed)
union(otherDataset)
intersection(otherDataset)
distinct([numTasks]))
groupByKey([numTasks])
reduceByKey(func, [numTasks])
aggregateByKey(zeroValue)(seqOp, combOp, [numTasks])
sortByKey([ascending], [numTasks])
join(otherDataset, [numTasks])
cogroup(otherDataset, [numTasks])
cartesian(otherDataset)
pipe(command, [envVars])
coalesce(numPartitions)
question from:https://stackoverflow.com/questions/26273664/what-are-the-spark-transformations-that-causes-a-shuffle

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
thumb_up_alt 0 like thumb_down_alt 0 dislike
172 views
Welcome To Ask or Share your Answers For Others

1 Answer

It is actually extremely easy to find this out, without the documentation. For any of these functions just create an RDD and call to debug string, here is one example you can do the rest on ur own.

scala> val a  = sc.parallelize(Array(1,2,3)).distinct
scala> a.toDebugString
MappedRDD[5] at distinct at <console>:12 (1 partitions)
  MapPartitionsRDD[4] at distinct at <console>:12 (1 partitions)
    **ShuffledRDD[3] at distinct at <console>:12 (1 partitions)**
      MapPartitionsRDD[2] at distinct at <console>:12 (1 partitions)
        MappedRDD[1] at distinct at <console>:12 (1 partitions)
          ParallelCollectionRDD[0] at parallelize at <console>:12 (1 partitions)

So as you can see distinct creates a shuffle. It is also particularly important to find out this way rather than docs because there are situations where a shuffle will be required or not required for a certain function. For example join usually requires a shuffle but if you join two RDD's that branch from the same RDD spark can sometimes elide the shuffle.


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
thumb_up_alt 0 like thumb_down_alt 0 dislike
Welcome to ShenZhenJia Knowledge Sharing Community for programmer and developer-Open, Learning and Share

548k questions

547k answers

4 comments

86.3k users

...