Print search pattern in awk

I want to print the lines that match a search pattern and then compute an average over them. An example explains it best.

Input file:

chr17 41275978 41276294 BRCA1_ex02_01 278 
chr17 41275978 41276294 BRCA1_ex02_01 279 
chr17 41275978 41276294 BRCA1_ex02_01 280 
chr17 41275978 41276294 BRCA1_ex02_02 281 
chr17 41275978 41276294 BRCA1_ex02_02 282 
chr17 41275978 41276294 BRCA1_ex02_03 283 
chr17 41275978 41276294 BRCA1_ex02_03 284 
chr17 41275978 41276294 BRCA1_ex02_03 285 
chr17 41275978 41276294 BRCA1_ex02_04 286 
chr17 41275978 41276294 BRCA1_ex02_04 287 
chr17 41275978 41276294 BRCA1_ex02_04 288

I want to extract (in a bash loop, for example) the lines that share the same fourth column:

OUTPUT1:

chr17 41275978 41276294 BRCA1_ex02_01 278 
chr17 41275978 41276294 BRCA1_ex02_01 279 
chr17 41275978 41276294 BRCA1_ex02_01 280

OUTPUT2:

chr17 41275978 41276294 BRCA1_ex02_02 281 
chr17 41275978 41276294 BRCA1_ex02_02 282

OUTPUT3:

chr17 41275978 41276294 BRCA1_ex02_03 283 
chr17 41275978 41276294 BRCA1_ex02_03 284 
chr17 41275978 41276294 BRCA1_ex02_03 285

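and so on. Computing the average of the fifth column is then easy:

awk '{ sum += $5 } END { if (NR > 0) print sum / NR }' in_file.txt

In my case there are thousands of BRCA1_exXX_XX lines, so any idea how to split it?

Paul.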

2014-07-07 Geroge

Assuming the entries are sorted by the 4th column, as in the given data, you can do it like this:

awk '
    $4 != prev {                # if this line's 4th column differs from the previous line's
        if (cnt > 0)            # if at least one line has been counted
            print prev, sum/cnt # print the average for the finished group
        prev = $4               # remember the current 4th column
        sum = $5                # initialize sum to column 5
        cnt = 1                 # initialize count to 1
        next                    # go to next line
    }

    {
        sum += $5               # accumulate total of 5th column
        ++cnt                   # increment count of lines
    }

    END {
        if (cnt > 0)            # if count > 0 (avoid divide by 0 on empty file)
            print prev, sum/cnt # print the average for the last group
    }
' file

2014-07-07 14:44:31 ooga

This assumes the entries are always in sorted order.

Wow, it looks like it works :-) Thanks! Could you explain it? And can I add the standard deviation as a third column? – Geroge

@EtanReisner Yes, it assumes the entries are sorted by the 4th column, as shown in the given data. – ooga
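On the standard deviation question in the comment above: here is a minimal sketch, not part of ooga's answer, that extends the same "sorted by 4th column" approach with a running sum of squares. It assumes the population standard deviation is wanted (divide by n, not n-1); the names `sample_rows`, `group_stats`, `report` and `sumsq` are made up for illustration, and the sample rows are the first five lines of the question's input.

```shell
# Hypothetical helper: emits the first five rows from the question.
sample_rows() {
    cat <<'EOF'
chr17 41275978 41276294 BRCA1_ex02_01 278
chr17 41275978 41276294 BRCA1_ex02_01 279
chr17 41275978 41276294 BRCA1_ex02_01 280
chr17 41275978 41276294 BRCA1_ex02_02 281
chr17 41275978 41276294 BRCA1_ex02_02 282
EOF
}

# Hypothetical helper: reads rows on stdin (assumed sorted by column 4) and
# prints "group mean stddev" for each group.
group_stats() {
    awk '
        # Print name, mean and population standard deviation of the finished group.
        function report(   mean, var) {
            mean = sum / cnt
            var = sumsq / cnt - mean * mean
            if (var < 0) var = 0            # guard against rounding noise
            printf "%s %.2f %.2f\n", prev, mean, sqrt(var)
        }
        $4 != prev {                        # a new group starts on this line
            if (cnt > 0) report()
            prev = $4; sum = 0; sumsq = 0; cnt = 0
        }
        { sum += $5; sumsq += $5 * $5; ++cnt }
        END { if (cnt > 0) report() }
    '
}

sample_rows | group_stats
```

For the sample rows this should give a mean of 279.00 for BRCA1_ex02_01 and 281.50 for BRCA1_ex02_02. The sum-of-squares formula is compact but can lose precision when the values are large and close together, so treat it as a starting point only.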

I think this will do what you want.

awk '{
    # Keep a running sum of the fifth column, keyed by the value of the fourth column.
    v[$4] += $5
    # Keep a count of the lines that share each fourth-column value.
    n[$4]++
}

END {
    # Loop over all the values we saw and print each fourth-column value
    # with the average of its fifth-column values.
    for (val in n) {
        print val ": " v[val]/n[val]
    }
}' "$file"
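One caveat with the array approach: `for (val in n)` visits the keys in an unspecified order, so the groups may not come out in the order they appear in the file. A small sketch of one way around that, assuming plain lexical ordering of the group names is acceptable, is to pipe the result through `sort`. The helper names `shuffled_rows` and `group_means` are hypothetical, and the sample rows are taken from the question but deliberately interleaved to show that this variant does not need sorted input.

```shell
# Hypothetical helper: rows from the question, deliberately out of order.
shuffled_rows() {
    cat <<'EOF'
chr17 41275978 41276294 BRCA1_ex02_02 281
chr17 41275978 41276294 BRCA1_ex02_01 278
chr17 41275978 41276294 BRCA1_ex02_02 282
chr17 41275978 41276294 BRCA1_ex02_01 279
chr17 41275978 41276294 BRCA1_ex02_01 280
EOF
}

# Hypothetical helper: per-group average of column 5, output sorted by group name.
group_means() {
    awk '
        { v[$4] += $5; n[$4]++ }    # running sum and count per group
        END {
            for (val in n)          # key order is unspecified here
                print val ": " v[val] / n[val]
        }
    ' | sort                        # impose a stable, lexical order
}

shuffled_rows | group_means
```

Sorting after the fact keeps the awk part unchanged and only costs one extra pass over the (much smaller) summary.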

2014-07-07 14:51:35

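Never use the letter 'l' as a variable name, because it looks too much like the digit '1'. In some fonts the two are completely indistinguishable.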

@EdMorton Fair enough. I was using it to stand for "line", but that did not make much sense here either. Edited.

Yes, this is great, it works very well. Thanks for the explanation! – Geroge
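Neither answer writes out the per-group blocks (OUTPUT1, OUTPUT2, ...) that the question shows. If those are needed as well, here is a rough sketch, assuming the input is sorted by the 4th column and that one file per group named after that column is acceptable. The function `split_by_group`, the `groups` directory and the `.txt` suffix are all made-up names for illustration.

```shell
# Work in a scratch directory so the sketch does not touch real files.
workdir=$(mktemp -d)

# Sample input: the first five rows from the question.
cat > "$workdir/in_file.txt" <<'EOF'
chr17 41275978 41276294 BRCA1_ex02_01 278
chr17 41275978 41276294 BRCA1_ex02_01 279
chr17 41275978 41276294 BRCA1_ex02_01 280
chr17 41275978 41276294 BRCA1_ex02_02 281
chr17 41275978 41276294 BRCA1_ex02_02 282
EOF

# Hypothetical helper: $1 = input file (assumed sorted by column 4),
# $2 = directory that receives one file per group.
split_by_group() {
    mkdir -p "$2"
    awk -v dir="$2" '
        $4 != prev {                    # a new group starts on this line
            if (out != "") close(out)   # close the previous file to stay under open-file limits
            prev = $4
            out = dir "/" $4 ".txt"
        }
        { print > out }                 # write the whole line to the group file
    ' "$1"
}

split_by_group "$workdir/in_file.txt" "$workdir/groups"
ls "$workdir/groups"
```

Closing each file when the group changes matters with thousands of groups, since some awk implementations limit how many files can be open at once. If the input is not sorted, a group that reappears later would overwrite its own file, so sort on column 4 first in that case.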
