基于相同的键折叠行

我想根据第一列的相等性折叠行。然后将第二列的内容添加到新的折叠表中，以逗号分隔并添加额外空间。另外，如果第二列的内容相同，则折叠它们，也就是说，如果输出文件中出现两次“非剧毒”，则只显示一次。基于相同的键折叠行

我在这里很新，请解释如何运行它。希望任何人都可以帮助我！

输入（制表符分隔）：

HS372_01446 non-virulent 
HS372_01446 non-virulent 
HS372_01446 lung 
HS372_00498 non-virulent 
HS372_00498 non-virulent 
HS372_00498 non-virulent 
HS372_00498 lung 
HS372_00498 lung 
HS372_00954 jointlungCNS 
HS372_00954 non-virulent 
HS372_00954 non-virulent 
HS372_00954 moderadamentevirulenta(nose) 
HS372_00954 lung

希望的输出（制表符分隔）：

HS372_01446 non-virulent, lung 
HS372_00498 non-virulent, lung 
HS372_00954 jointlungCNS, non-virulent, moderadamentevirulenta(nose), lung

来源

2014-02-12 biotech

为什么有些你的输出行（最后1）在逗号和其他字符（前2）之后是否有空格？ –

嗨，埃德，这是一个错误。逗号后加空格。 – biotech

的Perl

从命令行，

perl -lane' 
    ($n, $p) [email protected]; 
    $s{$n}++ or push @r, $n; 
    $c{$n}{$p}++ or push @{$h{$n}}, $p; 
    END { 
    $" = ",\t"; 
    print "$_\[email protected]{$h{$_}}" for @r; 
    } 
' file

输出

HS372_01446  non-virulent, lung 
HS372_00498  non-virulent, lung 
HS372_00954  jointlungCNS, non-virulent, moderadamentevirulenta(nose), lung

来源

2014-02-12 10:26:17

这是我见过的最多的一行代码！ – Borodin

from collections import defaultdict 

a = """HS372_01446 non-virulent 
HS372_01446 non-virulent 
HS372_01446 lung 
HS372_00498 non-virulent 
HS372_00498 non-virulent 
HS372_00498 non-virulent 
HS372_00498 lung 
HS372_00498 lung 
HS372_00954 jointlungCNS 
HS372_00954 non-virulent 
HS372_00954 non-virulent 
HS372_00954 moderadamentevirulenta(nose) 
HS372_00954 lung""".split("\n") 

stuff = defaultdict(set) 

for line in a: 
    uid, symp = line.split(" ") 
    stuff[uid].add(symp) 

for uid, symps in stuff.iteritems(): 
    print "%s %s" % (uid, ", ".join(list(symps)))

来源

2014-02-12 10:05:13

如何运行？ – biotech

@popnard：Python – Matthias

回溯（最近呼叫最后）：文件“脚本。PY “22行，在 UID，SYMP = line.split（”“） ValueError异常：需要比1点的值更解压 – biotech

在perl的：

use warnings; 
use strict; 

open my $input, '<', 'in.txt'; 

my %hash; 
while (<$input>){ 
    chomp; 
    my @split = split(' '); 
    $hash{$split[0]}{$split[1]} = 1; 
} 

for my $key (keys %hash){ 
    print "$key\t"; 
     for my $info (keys $hash{$key}){ 
      print "$info\t"; 
     } 
    print "\n"; 
}

哪个打印：

HS372_01446 non-virulent lung  
HS372_00954 non-virulent moderadamentevirulenta(nose) jointlungCNS lung  
HS372_00498 non-virulent lung

来源

2014-02-12 10:14:01 fugu

Bernardos-的MacBook-PRO：2014_02_12_membrane_genes_PHOBIUS贝尔纳$ ./ script.pl 键值的参数1的类型必须是./script.pl第16行中的散列或数组（不是散列元素），在“}）” 附近执行./script.pl由于编译错误而中止 – biotech

@ popnard - 复制并粘贴更新 – fugu

另一个Perl的溶液：

#!/usr/bin/perl 
use strict; 
use warnings; 
use List::MoreUtils qw/uniq/; 

my %hash; 
while (<DATA>) 
{ 
    chomp; 
    my ($key, $value) = split; 
    push @{$hash{$key}}, $value; 
} 

while (my ($key, $values) = each %hash) 
{ 
    print "$key\t", join ', ', uniq @$values, "\n"; 
} 

__DATA__ 
HS372_01446 non-virulent 
HS372_01446 non-virulent 
HS372_01446 lung 
HS372_00498 non-virulent 
HS372_00498 non-virulent 
HS372_00498 non-virulent 
HS372_00498 lung 
HS372_00498 lung 
HS372_00954 jointlungCNS 
HS372_00954 non-virulent 
HS372_00954 non-virulent 
HS372_00954 moderadamentevirulenta(nose) 
HS372_00954 lung

来源

2014-02-12 10:27:07 Chris

令人讨厌的变量名称：'％hash'与'$ scalar'一样有用''List :: MoreUtils'不是核心模块，可能需要安装，'chomp'没有任何意义，因为'split'会忽略任何空格。'“\ t”'很少用，因为它只对最最小的一组数据。但+1，因为这接近于未分类输出的最佳解决方案。 – Borodin

爪哇：

javac的Collapse.java

的Java收起input.txt中

import java.io.*; 
import java.util.*; 

public class Collapse { 

    public static void main(String[] args) throws Exception { 
     BufferedReader br = new BufferedReader(new InputStreamReader(new FileInputStream(args[0]))); 

     Map<String, Set<String>> output = new HashMap<String, Set<String>>(); 
     String line; 
     while ((line = br.readLine()) != null) { 
      StringTokenizer st = new StringTokenizer(line, "\t"); 
      String key = st.nextToken(); 
      Set<String> set = output.get(key); 
      if (set == null) { 
       output.put(key, set = new LinkedHashSet<String>()); 
      } 
      set.add(st.nextToken()); 
     } 

     for (String key : output.keySet()) { 
      StringBuilder sb = new StringBuilder(); 
      for (String value : output.get(key)) { 
       if (sb.length() != 0) sb.append(", "); 
       sb.append(value); 
      } 
      System.out.println(key + "\t" + sb); 
     } 
    } 
}

来源

2014-02-12 10:37:37 gabor

如果您的数据来自一个MySQL数据库（你可以将它导入一个），你可以使用group_concat操作。

看到这个答案 Can I concatenate multiple MySQL rows into one field?

这目前标有431个upvotes，所以你的问题是一个非常普遍的问题，答案显示出非常优雅的解决方案。

来源

2014-02-12 10:39:50 knb

这确实你问什么，除了保持在相同的顺序，它们出现在文件中的ID和描述中，如果该事项：

use strict; 
use warnings; 

open my $fh, '<', 'diseases.txt'; 

my %diseases; 
my @ids; 

while (<$fh>) { 
    my ($id, $desc) = split; 
    if (not $diseases{$id}) { 
    $diseases{$id}{list} = [$desc]; 
    $diseases{$id}{seen}{$desc} = 1; 
    push @ids, $id; 
    } 
    elsif (not $diseases{$id}{seen}{$desc}) { 
    push @{ $diseases{$id}{list} }, $desc; 
    $diseases{$id}{seen}{$desc} = 1; 
    } 
} 

for my $id (@ids) { 
    printf "%s %s\n", $id, join ', ', @{ $diseases{$id}{list} }; 
}

输出

HS372_01446 non-virulent, lung 
HS372_00498 non-virulent, lung 
HS372_00954 jointlungCNS, non-virulent, moderadamentevirulenta(nose), lung

来源

2014-02-12 11:29:37 Borodin

解析文本文件的标准UNIX工具AWK：

$ awk '!seen[$1,$2]++{a[$1]=(a[$1] ? a[$1]", " : "\t") $2} END{for (i in a) print i a[i]}' file 
HS372_00498  non-virulent, lung 
HS372_00954  jointlungCNS, non-virulent, moderadamentevirulenta(nose), lung 
HS372_01446  non-virulent, lung

来源

2014-02-12 12:22:25

基于相同的键折叠行

回答

相关问题