I have a tab-delimited file:
LOC105758527 1 55001 0.469590
LOC105758527 1 65001 0.067909
LOC105758527 1 75001 0.220712
LOC100218126 1 85001 0.174872
LOC105758529 1 125001 0.023420
NRF1 1 155001 0.242222
NRF1 1 165001 0.202569
NRF1 1 175001 0.327963
UBE2H 1 215001 0.063989
UBE2H 1 225001 0.542340
KLHDC10 1 255001 0.293471
KLHDC10 1 265001 0.231621
KLHDC10 1 275001 0.142917
TMEM209 1 295001 0.273941
CPA2 1 315001 0.181312
I need to calculate the average of column 4 for each distinct value in column 1, i.e. the sum divided by the number of lines in that group. The output should contain columns 1, 2 and 3 from the first line of each group, with the average as column 4.
I started by computing just the sum:
awk 'BEGIN { FS = OFS = " " }
{ y[$1] += $4; $4 = y[$1]; x[$1] = $0; }
END { for (i in x) { print x[i]; } }' file
But I'm getting:
NRF1 1 175001 0.772754
LOC105758529 1 125001 0.02342
LOC100218126 1 85001 0.174872
KLHDC10 1 275001 0.668009
CPA2 1 315001 0.181312
TMEM209 1 295001 0.273941
UBE2H 1 225001 0.606329
LOC105758527 1 75001 0.758211
So for each group it prints columns 1, 2 and 3 from the last line processed rather than from the 1st line of that group in my file (which is fine, but I would prefer the 1st line instead). The output is also out of order compared to the input.
I also don't know how to divide each sum by the number of lines in its group to actually get the average.
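For reference, here is a minimal sketch of what I think the target looks like, not a confirmed solution. It assumes the fields can be split on any whitespace (awk's default FS), that tab-separated output is wanted, and that six decimals are enough. The function name avg_by_key and the temporary sample file are made up for illustration. It keeps a per-key line count next to the sum, stores columns 1-3 from the first line seen for each key, and records the keys in input order so the END block does not depend on the unspecified order of for (i in x).

```shell
# avg_by_key is a hypothetical helper name; the awk program inside is the point.
avg_by_key() {
  awk 'BEGIN { OFS = "\t" }                 # default FS splits on tabs or spaces
  !($1 in cnt) {                            # first line seen for this key
      order[++n] = $1                       # remember input order of keys
      first[$1] = $1 OFS $2 OFS $3          # keep cols 1-3 of that first line
  }
  { sum[$1] += $4; cnt[$1]++ }              # running sum and line count per key
  END {
      for (i = 1; i <= n; i++) {            # indexed loop keeps first-seen order
          k = order[i]
          printf "%s%s%.6f\n", first[k], OFS, sum[k] / cnt[k]
      }
  }' "$1"
}

# Sample input: the first eight rows of the file above, written to a temp file.
sample=$(mktemp)
printf '%s\t%s\t%s\t%s\n' \
  LOC105758527 1 55001 0.469590 \
  LOC105758527 1 65001 0.067909 \
  LOC105758527 1 75001 0.220712 \
  LOC100218126 1 85001 0.174872 \
  LOC105758529 1 125001 0.023420 \
  NRF1 1 155001 0.242222 \
  NRF1 1 165001 0.202569 \
  NRF1 1 175001 0.327963 \
  > "$sample"

result=$(avg_by_key "$sample")
rm -f "$sample"
printf '%s\n' "$result"
```

With these rows, the first group should come out as LOC105758527 1 55001 0.252737 (0.758211 / 3), and the groups appear in the same order as in the input. Using printf with %.6f also keeps trailing zeros, which print drops after $4 is reassigned.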