Welcome to ShenZhenJia Knowledge Sharing Community for programmer and developer-Open, Learning and Share

I have a tab-delimited file:

LOC105758527    1       55001   0.469590
LOC105758527    1       65001   0.067909
LOC105758527    1       75001   0.220712
LOC100218126    1       85001   0.174872
LOC105758529    1       125001  0.023420
NRF1    1       155001  0.242222
NRF1    1       165001  0.202569
NRF1    1       175001  0.327963
UBE2H   1       215001  0.063989
UBE2H   1       225001  0.542340
KLHDC10 1       255001  0.293471
KLHDC10 1       265001  0.231621
KLHDC10 1       275001  0.142917
TMEM209 1       295001  0.273941
CPA2    1       315001  0.181312

I need to calculate the average of column 4 for each element in column 1 (i.e. the sum divided by the line count), and print columns 1-3 from the first line of each group with the average as column 4.

I started with just computing the sum:

awk 'BEGIN { FS = OFS = "\t" }
{ y[$1] += $4; $4 = y[$1]; x[$1] = $0 }
END { for (i in x) print x[i] }' file

But I'm getting:

NRF1    1       175001  0.772754
LOC105758529    1       125001  0.02342
LOC100218126    1       85001   0.174872
KLHDC10 1       275001  0.668009
CPA2    1       315001  0.181312
TMEM209 1       295001  0.273941
UBE2H   1       225001  0.606329
LOC105758527    1       75001   0.758211

So it prints columns 1-3 from the last line of each group rather than the first (I would prefer the first line), and the output is out of order, because awk's for (i in x) loop iterates in an unspecified order.

I also don't know how to divide each sum by its line count to actually get the average.

1 Answer

It can be done entirely in awk by using arrays to store the line ordering and the intermediate computation steps:

# set fields delimiters
BEGIN { FS = OFS = "\t" }

# print the header
NR==1 { print; next }

# the first time col1 value occurs, store col1..col3
!h[$1] {
    h[$1] = ++n  # save ordering
    d[n] = $1 OFS $2 OFS $3  # save first 3 columns
}

# store sum and quantity of col4
{
    i = h[$1]  # recover ordering
    s[i] += $4
    q[i]++
}

# output col1..col3 and the average value
END {
    for (i=1; i<=n; i++) print d[i], s[i]/q[i]
}

I see you have edited the question since I wrote the above. If your data has no header then the NR==1 line will not be required.
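For example, with the sample data above (headerless, so the NR==1 rule is dropped), the whole script can be run inline; the input file is assumed to be named file:

```shell
# Group by col1 in first-seen order and average col4; headerless input assumed.
awk 'BEGIN { FS = OFS = "\t" }
!h[$1] {
    h[$1] = ++n                  # remember first-seen order of this group
    d[n] = $1 OFS $2 OFS $3      # keep cols 1-3 from the group'\''s first line
}
{
    i = h[$1]
    s[i] += $4                   # running sum of col4 per group
    q[i]++                       # line count per group
}
END {
    for (i = 1; i <= n; i++) print d[i], s[i]/q[i]
}' file
```

The first output line should be `LOC105758527 1 55001 0.252737`, since (0.469590 + 0.067909 + 0.220712) / 3 = 0.252737, and groups appear in the order they first occur in the file.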

If your data file is really big, the script above may consume too much memory: it uses memory proportional to the number of unique values in col1. If that is a problem and the order of the output lines is not important, memory usage can be reduced drastically by pre-sorting the data (perhaps with sort -k1,1 -s) and producing the output incrementally:

BEGIN { FS = OFS = "\t" }

$1 != c1 {
    if (c1) print d, s/q
    d = $1 OFS $2 OFS $3
    s = q = 0
    c1 = $1
}

{
    s += $4
    q++
}

END { if (q) print d, s/q }  # guard against empty input
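A sketch of the full pipeline, assuming the data lives in a tab-delimited file named file: sort groups the rows by col1 (the -s stable flag keeps each group's original line order, so each group's first line stays first), and awk flushes one averaged row per group as soon as col1 changes.

```shell
sort -k1,1 -s file |
awk 'BEGIN { FS = OFS = "\t" }
$1 != c1 {                        # col1 changed: a new group starts
    if (c1) print d, s/q          # flush the previous group'\''s average
    d = $1 OFS $2 OFS $3          # cols 1-3 from the group'\''s first line
    s = q = 0
    c1 = $1
}
{
    s += $4                       # running sum of col4
    q++                           # running line count
}
END { if (q) print d, s/q }'      # flush the last group; skip if input was empty
```

Note that the output is now in lexical order of col1 (CPA2 first for the sample data) rather than in the file's original order.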
