的范围在下面的数据合并两只大熊猫dataframes(或转让值):如何通过比较值
data01 =
contig start end haplotype_block
2 5207 5867 1856
2 155667 155670 2816
2 67910 68022 2
2 68464 68483 3
2 525 775 132
2 118938 119559 1157
data02 =
contig start last feature gene_id gene_name transcript_id
2 5262 5496 exon scaffold_200003.1 CP5 scaffold_200003.1
2 5579 5750 exon scaffold_200003.1 CP5 scaffold_200003.1
2 5856 6032 exon scaffold_200003.1 CP5 scaffold_200003.1
2 6115 6198 exon scaffold_200003.1 CP5 scaffold_200003.1
2 916 1201 exon scaffold_200001.1 NA scaffold_200001.1
2 614 789 exon scaffold_200001.1 NA scaffold_200001.1
2 171 435 exon scaffold_200001.1 NA scaffold_200001.1
2 2677 2806 exon scaffold_200002.1 NA scaffold_200002.1
2 2899 3125 exon scaffold_200002.1 NA scaffold_200002.1
问题:
- 我要比较的范围(开始 - 结束)这两个数据帧。
- 如果范围重叠,我想将数据02中的
gene_id
和gene_name
值传送到data01中的新列。
我尝试(使用熊猫):
data01['gene_id'] = ""
data01['gene_name'] = ""
data01['gene_id'] = data01['gene_id'].\
apply(lambda x: data02['gene_id']\
if range(data01['start'], data01['end'])\
<= range(data02['start'], data02['last']) else 'NA')
我怎样才能改善这种代码?我目前坚持熊猫,但如果问题更好地解决使用字典我打开它。但是,请解释这个过程,我愿意学习,而不是仅仅获得答案。
感谢,
所需的输出:
contig start end haplotype_block gene_id gene_name
2 5207 5867 1856 scaffold_200003.1,scaffold_200003.1,scaffold_200003.1 CP5,CP5,CP5
# the gene_id and gene_name are repeated 3 times because three intervals (i.e 5262-5496, 5579-5750, 5856-6032) from data02 overlap(or touch) the interval ranges from data01 (5207-5867)
# So, whenever there is overlap of the ranges between two dataframe, copy the gene_id and gene_name.
# and simply NA on gene_id and gene_name for non overlapping ranges
2 155667 155670 2816 NA NA
2 67910 68022 2 NA NA
2 68464 68483 3 NA NA
2 525 775 132 scaffold_200001.1 NA
2 118938 119559 1157 NA NA
我感到困惑的目标是什么。如果你可以拼出你的输出结果,我很乐意再看一看。 – piRSquared
@piRSquared:只需添加所需的输出。我希望这是有道理的。解决这个问题的更好方法是首先对数据进行排序,然后开始**(以防止大量的for-loop),这是我已经完成的;但不在上面的data01和data02中。 – everestial007