2017-06-16 120 views

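Modifying a pandas DataFrame by row name

First of all, I don't think the question title explains the problem very well. Feel free to change the title or suggest a better one.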

I am reading a CSV file in the following format:

"sample","module","status","tot.seq","seq.length","pct.gc","pct.dup" 
"ERR435952_cleaned_1","Basic Statistics","PASS","15529112","62",47,41.66 
"ERR435952_cleaned_1","Per base sequence quality","FAIL","15529112","62",47,41.66 
"ERR435952_cleaned_1","Per tile sequence quality","FAIL","15529112","62",47,41.66 
"ERR435952_cleaned_1","Per sequence quality scores","PASS","15529112","62",47,41.66 
"ERR435952_cleaned_1","Per base sequence content","PASS","15529112","62",47,41.66 
"ERR435952_cleaned_1","Per sequence GC content","PASS","15529112","62",47,41.66 
"ERR435952_cleaned_1","Per base N content","PASS","15529112","62",47,41.66 
"ERR435952_cleaned_1","Sequence Length Distribution","PASS","15529112","62",47,41.66 
"ERR435952_cleaned_1","Sequence Duplication Levels","WARN","15529112","62",47,41.66 
"ERR435952_cleaned_1","Overrepresented sequences","WARN","15529112","62",47,41.66 
"ERR435952_cleaned_1","Adapter Content","PASS","15529112","62",47,41.66 
"ERR435952_cleaned_1","Kmer Content","FAIL","15529112","62",47,41.66 
"ERR435952_cleaned_2","Basic Statistics","PASS","15529112","62",48,42.44 
"ERR435952_cleaned_2","Per base sequence quality","PASS","15529112","62",48,42.44 
"ERR435952_cleaned_2","Per tile sequence quality","WARN","15529112","62",48,42.44 
"ERR435952_cleaned_2","Per sequence quality scores","PASS","15529112","62",48,42.44 
"ERR435952_cleaned_2","Per base sequence content","PASS","15529112","62",48,42.44 
"ERR435952_cleaned_2","Per sequence GC content","WARN","15529112","62",48,42.44 
"ERR435952_cleaned_2","Per base N content","PASS","15529112","62",48,42.44 
"ERR435952_cleaned_2","Sequence Length Distribution","PASS","15529112","62",48,42.44 
"ERR435952_cleaned_2","Sequence Duplication Levels","WARN","15529112","62",48,42.44 
"ERR435952_cleaned_2","Overrepresented sequences","WARN","15529112","62",48,42.44 
"ERR435952_cleaned_2","Adapter Content","PASS","15529112","62",48,42.44 
"ERR435952_cleaned_2","Kmer Content","FAIL","15529112","62",48,42.44 

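I want to convert it into a table with one row per sample and one column per module, so that I can create a simple heatmap based on the PASS/FAIL/WARN values (including the total number of reads, tot.seq).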

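I know this can be done by counting rows (the rows for each module/feature are a fixed interval apart), but that is not very clean, and I am not sure it would work for large datasets. Is there a way to do this by name rather than by row interval (i.e. i, i + n, and so on)?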

Answer


Use set_index + unstack, then add reset_index to turn the index levels back into columns, and rename_axis to remove the columns name (module):

# df is the DataFrame read from the CSV shown in the question
df = (df.set_index(['sample', 'tot.seq', 'module'])['status']
        .unstack()
        .reset_index().rename_axis(None, axis=1))
print(df)
                sample   tot.seq  Adapter Content  Basic Statistics  \
0  ERR435952_cleaned_1  15529112             PASS              PASS
1  ERR435952_cleaned_2  15529112             PASS              PASS

   Kmer Content  Overrepresented sequences  Per base N content  \
0          FAIL                       WARN                PASS
1          FAIL                       WARN                PASS

   Per base sequence content  Per base sequence quality  Per sequence GC content  \
0                       PASS                       FAIL                     PASS
1                       PASS                       PASS                     WARN

   Per sequence quality scores  Per tile sequence quality  \
0                         PASS                       FAIL
1                         PASS                       WARN

   Sequence Duplication Levels  Sequence Length Distribution
0                         WARN                          PASS
1                         WARN                          PASS

But if you get:

ValueError: Index contains duplicate entries, cannot reshape

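then the data contains duplicates (the same sample, tot.seq and module combination appears more than once), and you need to aggregate it: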

print (df) 
                sample                       module  status   tot.seq  \
0  ERR435952_cleaned_1             Basic Statistics    PASS  15529112
1  ERR435952_cleaned_1    Per base sequence quality    FAIL  15529112
2  ERR435952_cleaned_1    Per base sequence quality    FAIL  15529112
3  ERR435952_cleaned_1  Per sequence quality scores    PASS  15529112

   seq.length  pct.gc  pct.dup
0          62      47    41.66
1          62      47    41.66
2          62      47    41.66
3          62      47    41.66
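
To see which rows are responsible for the error before aggregating, a quick check along these lines may help. This is only a sketch: the small frame is rebuilt by hand from the printed example above, and `key` and `dupes` are names introduced here for illustration.

```python
import pandas as pd

# Hand-built copy of the four-row example above (only the columns needed here).
df = pd.DataFrame({
    'sample': ['ERR435952_cleaned_1'] * 4,
    'module': ['Basic Statistics', 'Per base sequence quality',
               'Per base sequence quality', 'Per sequence quality scores'],
    'status': ['PASS', 'FAIL', 'FAIL', 'PASS'],
    'tot.seq': [15529112] * 4,
})

# Columns that end up as the index and column labels of the reshaped table.
key = ['sample', 'tot.seq', 'module']

# keep=False flags every row of a duplicated group, not only the later repeats.
dupes = df[df.duplicated(subset=key, keep=False)]
print(dupes)
```

If the repeated rows turn out to be exact copies, dropping them with df.drop_duplicates(subset=key) before set_index + unstack is probably enough; aggregation is only needed when the statuses within a group differ.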

# join the duplicated statuses into one string per sample/module
df = (df.pivot_table(index=['sample', 'tot.seq'], columns='module',
                     values='status', aggfunc=', '.join)
        .reset_index().rename_axis(None, axis=1))
print(df)
                sample   tot.seq  Basic Statistics  Per base sequence quality  \
0  ERR435952_cleaned_1  15529112              PASS                 FAIL, FAIL

   Per sequence quality scores
0                         PASS

# alternative with groupby (run on the original long-format df, not the pivoted result)
df = (df.groupby(['sample', 'tot.seq', 'module'])['status']
        .apply(', '.join).unstack()
        .reset_index().rename_axis(None, axis=1))
print(df)

                sample   tot.seq  Basic Statistics  Per base sequence quality  \
0  ERR435952_cleaned_1  15529112              PASS                 FAIL, FAIL

   Per sequence quality scores
0                         PASS
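
Since the goal in the question is a heatmap, the wide table still has to be turned into numbers. A minimal sketch, assuming an arbitrary encoding (PASS=0, WARN=1, FAIL=2) and a small hand-made table with only two modules; `wide`, `codes` and `heat` are illustrative names, and the plotting step itself is left out:

```python
import pandas as pd

# Small hand-made wide table in the shape produced above (two modules only).
wide = pd.DataFrame({
    'sample': ['ERR435952_cleaned_1', 'ERR435952_cleaned_2'],
    'tot.seq': [15529112, 15529112],
    'Basic Statistics': ['PASS', 'PASS'],
    'Per base sequence quality': ['FAIL', 'PASS'],
})

# Assumed encoding; any ordered mapping would do.
codes = {'PASS': 0, 'WARN': 1, 'FAIL': 2}

# Use sample as the row label, set tot.seq aside, and map each status column to its code.
heat = (wide.set_index('sample')
            .drop(columns='tot.seq')
            .apply(lambda col: col.map(codes)))
print(heat)
```

The resulting numeric frame can then be handed to whatever plotting routine is preferred, for example matplotlib's imshow on heat.values.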

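Thanks! I forgot to include the read count (tot.seq) in my original question. Since it is a repeated value for each sample (repeated for every module), how can I add it only once? – Siddharth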