2014-02-12 36 views
0

我想根据第一列的相等性折叠行。然后将第二列的内容添加到新的折叠表中,以逗号分隔并添加额外空间。另外,如果第二列的内容相同,则折叠它们,也就是说,如果输出文件中出现两次“非剧毒”,则只显示一次。基于相同的键折叠行

我在这里很新,请解释如何运行它。希望任何人都可以帮助我!

输入(制表符分隔):

HS372_01446 non-virulent 
HS372_01446 non-virulent 
HS372_01446 lung 
HS372_00498 non-virulent 
HS372_00498 non-virulent 
HS372_00498 non-virulent 
HS372_00498 lung 
HS372_00498 lung 
HS372_00954 jointlungCNS 
HS372_00954 non-virulent 
HS372_00954 non-virulent 
HS372_00954 moderadamentevirulenta(nose) 
HS372_00954 lung 

希望的输出(制表符分隔):

HS372_01446 non-virulent, lung 
HS372_00498 non-virulent, lung 
HS372_00954 jointlungCNS, non-virulent, moderadamentevirulenta(nose), lung 
+1

为什么有些你的输出行(最后1)在逗号和其他字符(前2)之后是否有空格? –

+0

嗨,埃德,这是一个错误。逗号后加空格。 – biotech

回答

2
的Perl

从命令行,

perl -lane' 
    ($n, $p) [email protected]; 
    $s{$n}++ or push @r, $n; 
    $c{$n}{$p}++ or push @{$h{$n}}, $p; 
    END { 
    $" = ",\t"; 
    print "$_\[email protected]{$h{$_}}" for @r; 
    } 
' file 

输出

HS372_01446  non-virulent, lung 
HS372_00498  non-virulent, lung 
HS372_00954  jointlungCNS, non-virulent, moderadamentevirulenta(nose), lung 
+2

这是我见过的最多的一行代码! – Borodin

1
from collections import defaultdict 

a = """HS372_01446 non-virulent 
HS372_01446 non-virulent 
HS372_01446 lung 
HS372_00498 non-virulent 
HS372_00498 non-virulent 
HS372_00498 non-virulent 
HS372_00498 lung 
HS372_00498 lung 
HS372_00954 jointlungCNS 
HS372_00954 non-virulent 
HS372_00954 non-virulent 
HS372_00954 moderadamentevirulenta(nose) 
HS372_00954 lung""".split("\n") 

stuff = defaultdict(set) 

for line in a: 
    uid, symp = line.split(" ") 
    stuff[uid].add(symp) 

for uid, symps in stuff.iteritems(): 
    print "%s %s" % (uid, ", ".join(list(symps))) 
+0

如何运行? – biotech

+0

@popnard:Python – Matthias

+0

回溯(最近呼叫最后): 文件“脚本。PY “22行,在 UID,SYMP = line.split(”“) ValueError异常:需要比1点的值更解压 – biotech

0

在perl的:

use warnings; 
use strict; 

open my $input, '<', 'in.txt'; 

my %hash; 
while (<$input>){ 
    chomp; 
    my @split = split(' '); 
    $hash{$split[0]}{$split[1]} = 1; 
} 

for my $key (keys %hash){ 
    print "$key\t"; 
     for my $info (keys $hash{$key}){ 
      print "$info\t"; 
     } 
    print "\n"; 
} 

哪个打印:

HS372_01446 non-virulent lung  
HS372_00954 non-virulent moderadamentevirulenta(nose) jointlungCNS lung  
HS372_00498 non-virulent lung 
+0

Bernardos-的MacBook-PRO:2014_02_12_membrane_genes_PHOBIUS贝尔纳$ ./ script.pl 键值的参数1的类型必须是./script.pl第16行中的散列或数组(不是散列元素),在“})” 附近执行./script.pl由于编译错误而中止 – biotech

+0

@ popnard - 复制并粘贴更新 – fugu

2

另一个Perl的溶液:

#!/usr/bin/perl 
use strict; 
use warnings; 
use List::MoreUtils qw/uniq/; 

my %hash; 
while (<DATA>) 
{ 
    chomp; 
    my ($key, $value) = split; 
    push @{$hash{$key}}, $value; 
} 

while (my ($key, $values) = each %hash) 
{ 
    print "$key\t", join ', ', uniq @$values, "\n"; 
} 

__DATA__ 
HS372_01446 non-virulent 
HS372_01446 non-virulent 
HS372_01446 lung 
HS372_00498 non-virulent 
HS372_00498 non-virulent 
HS372_00498 non-virulent 
HS372_00498 lung 
HS372_00498 lung 
HS372_00954 jointlungCNS 
HS372_00954 non-virulent 
HS372_00954 non-virulent 
HS372_00954 moderadamentevirulenta(nose) 
HS372_00954 lung 
+0

令人讨厌的变量名称:'%hash'与'$ scalar'一样有用''List :: MoreUtils'不是核心模块,可能需要安装,'chomp'没有任何意义,因为'split'会忽略任何空格。'“\ t”'很少用,因为它只对最最小的一组数据。但+1,因为这接近于未分类输出的最佳解决方案。 – Borodin

1

爪哇:

javac的Collapse.java

的Java收起input.txt中

import java.io.*; 
import java.util.*; 

public class Collapse { 

    public static void main(String[] args) throws Exception { 
     BufferedReader br = new BufferedReader(new InputStreamReader(new FileInputStream(args[0]))); 

     Map<String, Set<String>> output = new HashMap<String, Set<String>>(); 
     String line; 
     while ((line = br.readLine()) != null) { 
      StringTokenizer st = new StringTokenizer(line, "\t"); 
      String key = st.nextToken(); 
      Set<String> set = output.get(key); 
      if (set == null) { 
       output.put(key, set = new LinkedHashSet<String>()); 
      } 
      set.add(st.nextToken()); 
     } 

     for (String key : output.keySet()) { 
      StringBuilder sb = new StringBuilder(); 
      for (String value : output.get(key)) { 
       if (sb.length() != 0) sb.append(", "); 
       sb.append(value); 
      } 
      System.out.println(key + "\t" + sb); 
     } 
    } 
} 
0

如果您的数据来自一个MySQL数据库(你可以将它导入一个),你可以使用group_concat操作。

看到这个答案 Can I concatenate multiple MySQL rows into one field?

这目前标有431个upvotes,所以你的问题是一个非常普遍的问题,答案显示出非常优雅的解决方案。

2

这确实你问什么,除了保持在相同的顺序,它们出现在文件中的ID和描述中,如果该事项:

use strict; 
use warnings; 

open my $fh, '<', 'diseases.txt'; 

my %diseases; 
my @ids; 

while (<$fh>) { 
    my ($id, $desc) = split; 
    if (not $diseases{$id}) { 
    $diseases{$id}{list} = [$desc]; 
    $diseases{$id}{seen}{$desc} = 1; 
    push @ids, $id; 
    } 
    elsif (not $diseases{$id}{seen}{$desc}) { 
    push @{ $diseases{$id}{list} }, $desc; 
    $diseases{$id}{seen}{$desc} = 1; 
    } 
} 

for my $id (@ids) { 
    printf "%s %s\n", $id, join ', ', @{ $diseases{$id}{list} }; 
} 

输出

HS372_01446 non-virulent, lung 
HS372_00498 non-virulent, lung 
HS372_00954 jointlungCNS, non-virulent, moderadamentevirulenta(nose), lung 
1

解析文本文件的标准UNIX工具AWK:

$ awk '!seen[$1,$2]++{a[$1]=(a[$1] ? a[$1]", " : "\t") $2} END{for (i in a) print i a[i]}' file 
HS372_00498  non-virulent, lung 
HS372_00954  jointlungCNS, non-virulent, moderadamentevirulenta(nose), lung 
HS372_01446  non-virulent, lung