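Modify a pandas DataFrame by row name

First of all, I don't think the question title explains the problem very well. Please feel free to change it or suggest a better one. My data looks like this: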
"sample","module","status","tot.seq","seq.length","pct.gc","pct.dup"
"ERR435952_cleaned_1","Basic Statistics","PASS","15529112","62",47,41.66
"ERR435952_cleaned_1","Per base sequence quality","FAIL","15529112","62",47,41.66
"ERR435952_cleaned_1","Per tile sequence quality","FAIL","15529112","62",47,41.66
"ERR435952_cleaned_1","Per sequence quality scores","PASS","15529112","62",47,41.66
"ERR435952_cleaned_1","Per base sequence content","PASS","15529112","62",47,41.66
"ERR435952_cleaned_1","Per sequence GC content","PASS","15529112","62",47,41.66
"ERR435952_cleaned_1","Per base N content","PASS","15529112","62",47,41.66
"ERR435952_cleaned_1","Sequence Length Distribution","PASS","15529112","62",47,41.66
"ERR435952_cleaned_1","Sequence Duplication Levels","WARN","15529112","62",47,41.66
"ERR435952_cleaned_1","Overrepresented sequences","WARN","15529112","62",47,41.66
"ERR435952_cleaned_1","Adapter Content","PASS","15529112","62",47,41.66
"ERR435952_cleaned_1","Kmer Content","FAIL","15529112","62",47,41.66
"ERR435952_cleaned_2","Basic Statistics","PASS","15529112","62",48,42.44
"ERR435952_cleaned_2","Per base sequence quality","PASS","15529112","62",48,42.44
"ERR435952_cleaned_2","Per tile sequence quality","WARN","15529112","62",48,42.44
"ERR435952_cleaned_2","Per sequence quality scores","PASS","15529112","62",48,42.44
"ERR435952_cleaned_2","Per base sequence content","PASS","15529112","62",48,42.44
"ERR435952_cleaned_2","Per sequence GC content","WARN","15529112","62",48,42.44
"ERR435952_cleaned_2","Per base N content","PASS","15529112","62",48,42.44
"ERR435952_cleaned_2","Sequence Length Distribution","PASS","15529112","62",48,42.44
"ERR435952_cleaned_2","Sequence Duplication Levels","WARN","15529112","62",48,42.44
"ERR435952_cleaned_2","Overrepresented sequences","WARN","15529112","62",48,42.44
"ERR435952_cleaned_2","Adapter Content","PASS","15529112","62",48,42.44
"ERR435952_cleaned_2","Kmer Content","FAIL","15529112","62",48,42.44
I would like to reshape this so that I can create a simple heatmap based on the PASS/FAIL/WARN values (also including the total number of reads, tot.seq).
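I know this can be done by counting rows (the modules repeat at a fixed interval for every sample), but that is not very clean, and I am not sure it would hold up on large datasets. Is there a way to do it by name rather than by interval (i.e. i, i + n, and so on)?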
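One possible way to do this by name is a pivot on the `sample` and `module` columns. The sketch below is only an illustration under assumptions: it uses a four-row excerpt of the data above, the variable names (`csv_text`, `wide`, `codes`) are made up, and the 0/1/2 encoding for the heatmap is an arbitrary choice, not something from the question.

```python
import io
import pandas as pd

# A short excerpt of the data from the question (assumption: the full
# file has the same columns and one row per sample/module pair).
csv_text = '''"sample","module","status","tot.seq","seq.length","pct.gc","pct.dup"
"ERR435952_cleaned_1","Basic Statistics","PASS","15529112","62",47,41.66
"ERR435952_cleaned_1","Per base sequence quality","FAIL","15529112","62",47,41.66
"ERR435952_cleaned_2","Basic Statistics","PASS","15529112","62",48,42.44
"ERR435952_cleaned_2","Per base sequence quality","PASS","15529112","62",48,42.44
'''
df = pd.read_csv(io.StringIO(csv_text))

# One row per sample, one column per module, matched by name
# rather than by row position.
wide = df.pivot(index="sample", columns="module", values="status")

# Hypothetical numeric encoding so the table can be drawn as a heatmap.
status_codes = {"PASS": 0, "WARN": 1, "FAIL": 2}
codes = wide.apply(lambda col: col.map(status_codes))

print(wide)
print(codes)
```

Because `pivot` matches on the values in `sample` and `module`, it does not depend on the rows being in any particular order. It raises an error if the same sample/module pair appears more than once, which may be a useful check on larger datasets.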
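Thanks! I forgot to include the number of reads (tot.seq) in my original question. Since it is the same value repeated for every module of a sample, how can I add it only once? – Siddharth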
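One way this might be handled, assuming `tot.seq` really is constant within each sample, is to take the first value per sample and join it onto the pivoted table. This is a sketch on a four-row excerpt of the data, and the names `per_sample` and `result` are illustrative only.

```python
import io
import pandas as pd

# Excerpt of the question's data (assumption: tot.seq is identical
# for all rows of the same sample).
csv_text = '''"sample","module","status","tot.seq","seq.length","pct.gc","pct.dup"
"ERR435952_cleaned_1","Basic Statistics","PASS","15529112","62",47,41.66
"ERR435952_cleaned_1","Per base sequence quality","FAIL","15529112","62",47,41.66
"ERR435952_cleaned_2","Basic Statistics","PASS","15529112","62",48,42.44
"ERR435952_cleaned_2","Per base sequence quality","PASS","15529112","62",48,42.44
'''
df = pd.read_csv(io.StringIO(csv_text))

# Wide table of statuses: one row per sample, one column per module.
wide = df.pivot(index="sample", columns="module", values="status")

# Collapse the repeated tot.seq values to a single value per sample.
per_sample = df.groupby("sample")["tot.seq"].first()

# Attach it as one extra column, aligned on the sample name.
result = wide.join(per_sample)
print(result)
```

An alternative that may also work is passing `index=["sample", "tot.seq"]` to `pivot`, which keeps the read count in the row index instead of adding it as a column.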