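Count individual pattern matches in a file using grep and a pattern file
I am using grep with a file that contains multiple search patterns. As output, I would like to get each matched pattern together with the number of occurrences of that particular pattern.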
cat pattern.txt
AT3G09260.1
AT5G50920.1
The input file looks like this:
>AT2G44750.1 | Symbols: TPK2 | thiamin pyrophosphokinase 2 | chr2:18451510-18452754 FORWARD LENGTH=265
>AT2G47140.1 | Symbols: | NAD(P)-binding Rossmann-fold superfamily protein | chr2:19350970-19352059 REVERSE LENGTH=257
>AT2G47120.1 | Symbols: | NAD(P)-binding Rossmann-fold superfamily protein
>AT1G21470.1 | Symbols: | BEST Arabidopsis thaliana protein match is: CLPC homologue 1 (TAIR:AT5G50920.1); Has 326 Blast hits to 324 proteins in 95 species: Archae - 0; Bacteria - 130; Metazoa - 0; Fungi - 0; Plants - 67; Viruses - 0; Other Eukaryotes - 129 (source: NCBI BLink). | chr1:7516709-7517179 REVERSE LENGTH=118
>AT3G09260.1 | Symbols: PYK10, PSR3.1, BGLU23, LEB | Glycosyl hydrolase superfamily protein | chr3:2840657-2843730 REVERSE LENGTH=524
>AT5G48175.1 | Symbols: | FUNCTIONS IN: molecular_function unknown; INVOLVED IN: biological_process unknown; LOCATED IN: endomembrane system; EXPRESSED IN: hypocotyl, male gametophyte, root; BEST Arabidopsis thaliana protein match is: Glycosyl hydrolase superfamily protein (TAIR:AT3G09260.1); Has 30201 Blast hits to 17322 proteins in 780 species: Archae - 12; Bacteria - 1396; Metazoa - 17338; Fungi - 3422; Plants - 5037; Viruses - 0; Other Eukaryotes - 2996 (source: NCBI BLink). | chr5:19539208-19539676 FORWARD LENGTH=115
>AT5G50920.1 | Symbols: CLPC, ATHSP93-V, HSP93-V, DCA1, CLPC1 | CLPC homologue 1 | chr5:20715710-20719800 REVERSE LENGTH=929
I would like to get output like:
AT3G09260.1 2
AT5G50920.1 2
I have tried:
grep -f pattern.txt -c inputfile.txt
4
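But this only gives me the total number of matching lines (for all patterns combined). I believe this question has already been asked here, but it was never resolved: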
how to loop over pattern from a file with grep
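Thanks.
Why do you write *but it was never resolved*? That question has been answered. – RomanPerekhrest
The awk script provided there does not give the desired output. – marie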
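One possible way to get a count per pattern, offered as a sketch rather than a definitive answer: `grep -f pattern.txt -c` prints a single number because `-c` counts the lines that match any of the patterns. Running grep once per pattern avoids that. In the example below, `count_patterns` is a hypothetical helper name, and the temporary files hold shortened versions of the records from the question so the block runs on its own. It assumes the patterns are literal strings (hence `-F`) and that the pattern file has Unix line endings.

```shell
#!/bin/sh
# Sketch: print "<pattern> <count>" for each line of a pattern file.
# count_patterns is a hypothetical helper name, not part of grep.
count_patterns() {
    patterns=$1
    input=$2
    # The "|| [ -n ... ]" part also handles a last line without a newline.
    while IFS= read -r pat || [ -n "$pat" ]; do
        # Skip empty lines; an empty pattern would match every line.
        [ -n "$pat" ] || continue
        # -F: treat the pattern as a fixed string, so "." is a literal dot.
        # -c: count the lines that match this one pattern.
        # -e: pass the pattern explicitly, safe even if it starts with "-".
        printf '%s %s\n' "$pat" "$(grep -cF -e "$pat" "$input")"
    done < "$patterns"
}

# Self-contained demo using shortened records from the question.
tmpdir=$(mktemp -d)
printf '%s\n' 'AT3G09260.1' 'AT5G50920.1' > "$tmpdir/pattern.txt"
printf '%s\n' \
    '>AT2G44750.1 | Symbols: TPK2 | thiamin pyrophosphokinase 2' \
    '>AT1G21470.1 | Symbols: | match is: CLPC homologue 1 (TAIR:AT5G50920.1)' \
    '>AT3G09260.1 | Symbols: PYK10 | Glycosyl hydrolase superfamily protein' \
    '>AT5G48175.1 | Symbols: | match is: Glycosyl hydrolase (TAIR:AT3G09260.1)' \
    '>AT5G50920.1 | Symbols: CLPC | CLPC homologue 1' \
    > "$tmpdir/inputfile.txt"

count_patterns "$tmpdir/pattern.txt" "$tmpdir/inputfile.txt"
# Prints:
# AT3G09260.1 2
# AT5G50920.1 2

rm -rf "$tmpdir"
```

Note that `-c` counts matching lines, not individual occurrences, so a pattern that appears twice on one line is counted once. If occurrences are what matters, something like `grep -oF -f pattern.txt inputfile.txt | sort | uniq -c` may be closer with GNU grep, although it prints the count before the pattern and omits patterns that never match.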