Say I have a distributed system on 3 nodes and my data is distributed among those nodes. For example, I have a test.csv file which exists on all 3 nodes and contains 2 columns:
| row   | id | c   |
|-------|----|-----|
| row1  | k1 | c1  |
| row2  | k1 | c2  |
| row3  | k1 | c3  |
| row4  | k2 | c4  |
| row5  | k2 | c5  |
| row6  | k2 | c6  |
| row7  | k3 | c7  |
| row8  | k3 | c8  |
| row9  | k3 | c9  |
| row10 | k4 | c10 |
| row11 | k4 | c11 |
| row12 | k4 | c12 |
Then I use SparkContext.textFile to read the file into an RDD (I sketch this read step right after the list below). As far as I understand, each Spark worker node will read a portion of the file, so right now let's say each node stores:
- node 1: row 1~4
- node 2: row 5~8
- node 3: row 9~12
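To make it concrete, this is roughly the read step I mean (a Scala sketch; the file path, app name, and the comma-splitting of each line are just my assumptions about the example above):

```scala
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setAppName("groupVsAggregate")  // hypothetical app name
val sc = new SparkContext(conf)

// Each partition (and therefore each worker) reads a slice of the file's lines.
val lines = sc.textFile("test.csv")

// Parse lines like "k1 , c1" into (id, c) pairs -- assuming comma-separated columns.
val pairs = lines.map { line =>
  val Array(id, c) = line.split(",").map(_.trim)
  (id, c)
}
```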
My question is: let's say I want to do some computation on this data, and there is a step where I need to group by key, so the key-value pairs would be [k1, [{k1 c1}, {k1 c2}, {k1 c3}]] and so on.
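In other words, with groupByKey() I would write something like the sketch below (pairs is the (id, c) RDD from the previous snippet):

```scala
// groupByKey: all (id, c) pairs with the same id end up on one node after the
// shuffle, which then holds the whole Iterable of values for that key.
val grouped = pairs.groupByKey()

// e.g. ("k1", Iterable("c1", "c2", "c3")), ("k2", Iterable("c4", "c5", "c6")), ...
grouped.collect().foreach { case (id, cs) => println(s"$id -> ${cs.mkString(", ")}") }
```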
There is a function called groupByKey(), which is said to be very expensive, and aggregateByKey() is recommended instead. So I'm wondering: how do groupByKey() and aggregateByKey() work under the hood? Can someone explain using the example I provided above? After shuffling, where do the rows reside on each node?
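For reference, the aggregateByKey() version I've seen recommended (just collecting the values into a list, so it mirrors the grouping above) looks roughly like this; I'm trying to understand what it actually does differently on each node:

```scala
// aggregateByKey: values are first combined into a per-partition list locally
// (seqOp), and only those partial lists are merged across nodes (combOp).
val aggregated = pairs.aggregateByKey(List.empty[String])(
  (acc, c) => c :: acc,           // seqOp: runs on each partition before the shuffle
  (acc1, acc2) => acc1 ::: acc2   // combOp: merges partial lists after the shuffle
)
```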