2014-07-07 75 views
0

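Printing a search pattern in awk

I want to print the lines that match a search pattern and then compute an average over those lines. An example is probably best.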

Input file:

chr17 41275978 41276294 BRCA1_ex02_01 278 
chr17 41275978 41276294 BRCA1_ex02_01 279 
chr17 41275978 41276294 BRCA1_ex02_01 280 
chr17 41275978 41276294 BRCA1_ex02_02 281 
chr17 41275978 41276294 BRCA1_ex02_02 282 
chr17 41275978 41276294 BRCA1_ex02_03 283 
chr17 41275978 41276294 BRCA1_ex02_03 284 
chr17 41275978 41276294 BRCA1_ex02_03 285 
chr17 41275978 41276294 BRCA1_ex02_04 286 
chr17 41275978 41276294 BRCA1_ex02_04 287 
chr17 41275978 41276294 BRCA1_ex02_04 288 

I want to extract the lines that share the same fourth column, for example in a bash loop:

OUTPUT1:

chr17 41275978 41276294 BRCA1_ex02_01 278 
chr17 41275978 41276294 BRCA1_ex02_01 279 
chr17 41275978 41276294 BRCA1_ex02_01 280 

OUTPUT2 :

chr17 41275978 41276294 BRCA1_ex02_02 281 
chr17 41275978 41276294 BRCA1_ex02_02 282 

OUTPUT3:

chr17 41275978 41276294 BRCA1_ex02_03 283 
chr17 41275978 41276294 BRCA1_ex02_03 284 
chr17 41275978 41276294 BRCA1_ex02_03 285 

and so on. Computing the average of the fifth column is then easy:

awk '{ sum += $5 } END { if (NR > 0) print sum / NR }' in_file.txt

In my case there are thousands of BRCA1_exXX_XX lines, so any idea how to split the file?

Paul。

Answers

1

Assuming the entries are sorted by column 4, as in the given data, you can do this:

awk '
    $4 != prev {                 # this line's 4th column differs from the previous line's
        if (cnt > 0)             # a previous group exists
            print prev, sum/cnt  # print that group's average
        prev = $4                # remember the current 4th column
        sum = $5                 # start the sum with this line's 5th column
        cnt = 1                  # start the line count at 1
        next                     # go to the next line
    }

    {
        sum += $5                # accumulate the 5th column
        ++cnt                    # increment the line count
    }

    END {
        if (cnt > 0)             # avoid dividing by 0 on an empty file
            print prev, sum/cnt  # print the average for the last group
    }
' file
+0

This assumes that the entries are always in order. –

+0

Wow, it looks like it works :-) Thanks! Could you explain it? And can I add the standard deviation as a third column? – Geroge

+0

@EtanReisner Yes, it assumes the entries are sorted by column 4, as in the given data. – ooga
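
On the standard deviation asked about in the comments: one possible sketch is to keep a running sum of squares next to the sum and count for each fourth-column value. The function name `group_stats` is made up for illustration, and this assumes the population standard deviation (dividing by n) is wanted; for the sample version the variance would be scaled by n/(n-1) instead. The sum-of-squares formula can lose precision when the values are very large relative to their spread.

```shell
# group_stats: hypothetical helper, not part of the original answers.
# Prints "<column 4> <mean> <population std dev>" for each group, sorted by name.
group_stats() {
    awk '
    {
        n[$4]++                  # number of lines for this 4th-column value
        sum[$4]   += $5          # running sum of column 5
        sumsq[$4] += $5 * $5     # running sum of squares of column 5
    }
    END {
        for (key in n) {
            mean = sum[key] / n[key]
            var  = sumsq[key] / n[key] - mean * mean   # population variance
            if (var < 0) var = 0                       # guard against rounding error
            printf "%s %.2f %.2f\n", key, mean, sqrt(var)
        }
    }' "$@" | sort               # "for (key in n)" has no defined order, so sort
}
```

Called as `group_stats in_file.txt`, this does not require the input to be sorted by column 4, because the totals are kept per key in arrays.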

2

I think this will do what you want.

awk '{
    # Keep a running sum of the fifth column for each value of the fourth column.
    v[$4] += $5
    # Keep a count of lines for each fourth-column value.
    n[$4]++
}

END {
    # Loop over all the values we saw and print each fourth-column value with the average of its fifth column.
    for (val in n) {
        print val ": " v[val]/n[val]
    }
}' "$file"
+0

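Never use the letter 'l' as a variable name, because it looks too much like the number '1'. In some fonts the two are completely indistinguishable. –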

+1

@EdMorton Fair enough. I was using it to stand for "line", but that didn't make much sense here either. Edited. –

+0

Yes, this is great, it works very well. Thanks for the explanation! – Geroge
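
If separate files per group are really wanted, as in the OUTPUT1/OUTPUT2/OUTPUT3 listing in the question, awk can write them directly without a bash loop. This is only a sketch: the function name `split_by_col4` and the `<column 4>.txt` naming are invented here, and it assumes the output directory exists and starts out empty, since lines are appended.

```shell
# split_by_col4: hypothetical helper, not part of the original answers.
# Usage: split_by_col4 OUTDIR [INPUT_FILE...]
# Writes every input line to OUTDIR/<column 4>.txt.
split_by_col4() {
    outdir=$1
    shift
    awk -v dir="$outdir" '
    {
        file = dir "/" $4 ".txt"
        if (file != prev) {              # the group changed
            if (prev != "") close(prev)  # keep only one output file open at a time
            prev = file
        }
        print >> file                    # append, so unsorted input also works
    }' "$@"
}
```

Closing the previous file whenever the group changes keeps the number of open files low, which matters with thousands of distinct BRCA1_exXX_XX values. Each resulting file can then be averaged on its own.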