I am running Spark-1.4.0 pre-built for Hadoop-2.4 (in local mode) to calculate the sum of squares of a DoubleRDD. My Scala code looks like
sc.parallelize(Array(2., 3.)).fold(0.0)((p, v) => p+v*v)
And it gave a surprising result 97.0
.
This is quite counter-intuitive compared to the Scala version of fold
Array(2., 3.).fold(0.0)((p, v) => p+v*v)
which gives the expected answer 13.0
.
It seems quite likely that I have made some tricky mistakes in the code due to a lack of understanding. I have read about how the function used in RDD.fold()
should be communicative otherwise the result may depend on partitions and etc. So example, if I change the number of partitions to 1,
sc.parallelize(Array(2., 3.), 1).fold(0.0)((p, v) => p+v*v)
the code will give me 169.0
on my machine!
Can someone explain what exactly is happening here?
See Question&Answers more detail:os