2013-12-10 36 views

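I want to calculate the Tanimoto coefficient (intersection of the sets / union of the sets) for pairs of diseases. Sample data for just one disease pair is below, where disease 1 is NK cell defects and disease 2 is Adenylosuccinate lyase deficiency.

Set 1 is disease 1 (NK cell defects), which has all the genes from the Gene1 column.

Set 2 is disease 2 (Adenylosuccinate lyase deficiency), which has all the genes from the Gene2 column.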

| Gene1 | Gene2 | Disease1 | Disease2 |
|---|---|---|---|
| IMPDH1 | XDH | NK cell defects | Adenylosuccinate lyase deficiency |
| PPP3R2 | ADA | NK cell defects | Adenylosuccinate lyase deficiency |
| PPP3R2 | NPR1 | NK cell defects | Adenylosuccinate lyase deficiency |
| PPP3R2 | IMPDH1 | NK cell defects | Adenylosuccinate lyase deficiency |
| PPP3R2 | IMPDH2 | NK cell defects | Adenylosuccinate lyase deficiency |
| PPP3R2 | PPP3R2 | NK cell defects | Adenylosuccinate lyase deficiency |
| PPP3R2 | RRM1 | NK cell defects | Adenylosuccinate lyase deficiency |
| NPR1 | POLA1 | NK cell defects | Adenylosuccinate lyase deficiency |
| PPP3R2 | ITGAL | NK cell defects | Adenylosuccinate lyase deficiency |
| ITGAL | NPR1 | NK cell defects | Adenylosuccinate lyase deficiency |
| CASP3 | NPR1 | NK cell defects | Adenylosuccinate lyase deficiency |
| PTK2B | NPR1 | NK cell defects | Adenylosuccinate lyase deficiency |
| TNF | GUCY1A2 | NK cell defects | Adenylosuccinate lyase deficiency |
| PTK2B | GUCY1A2 | NK cell defects | Adenylosuccinate lyase deficiency |

Any suggestions on how to do this in MySQL or R?

Thanks,

Rohan

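Can you define what intersection and union mean in this case? Reproducible data would go a long way toward helping people answer. Try using `dput` on your data.frame. – TheComeOnMan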

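Set 1 is Disease1, which contains all the genes in the Gene1 column, and set 2 is Disease2, which contains all the genes in the Gene2 column. The intersection is the number of genes common to Gene1 and Gene2 (here IMPDH1, PPP3R2, ITGAL, NPR1). The union is the total number of genes across the Gene1 and Gene2 columns. – Rgeek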

Answers

0

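Made-up input data -

library(data.table) 

# Toy data: one disease pair (A, B) with gene IDs 1..5 and 3..7
DT = data.table(
    G1 = 1:5, 
    G2 = 3:7, 
    D1 = "A", 
    D2 = "B" 
) 

# Per disease pair: intersection size, union size, and intersection / union
DT[, 
    list(
    intersectG = length(intersect(G1, G2)), 
    unionG = length(union(G1, G2)), 
    Tanimoto = length(intersect(G1, G2)) / length(union(G1, G2)) 
    ), 
    by = c('D1', 'D2')] 

Output -

   D1 D2 intersectG unionG  Tanimoto 
1:  A  B          3      7 0.4285714 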
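
Applied to the sample table from the question, the same calculation can be sketched in base R. This is only an illustration: the `tanimoto` helper is a hypothetical name, not a function from any package, and returning `NA` for two empty sets is an assumption. The gene vectors are typed in by hand from the Gene1 and Gene2 columns.

```r
# Gene columns copied from the question's sample table (one disease pair).
gene1 <- c("IMPDH1", "PPP3R2", "PPP3R2", "PPP3R2", "PPP3R2", "PPP3R2", "PPP3R2",
           "NPR1", "PPP3R2", "ITGAL", "CASP3", "PTK2B", "TNF", "PTK2B")
gene2 <- c("XDH", "ADA", "NPR1", "IMPDH1", "IMPDH2", "PPP3R2", "RRM1",
           "POLA1", "ITGAL", "NPR1", "NPR1", "NPR1", "GUCY1A2", "GUCY1A2")

# Hypothetical helper (not from a package): |A intersect B| / |A union B|.
# intersect() and union() drop duplicates, so repeated rows do not inflate counts.
tanimoto <- function(a, b) {
  u <- union(a, b)
  if (length(u) == 0) return(NA_real_)  # assumption: undefined for two empty sets
  length(intersect(a, b)) / length(u)
}

common <- intersect(gene1, gene2)      # genes shared by both diseases
print(sort(common))
print(length(union(gene1, gene2)))     # number of distinct genes overall
print(tanimoto(gene1, gene2))
```

With these vectors the shared genes are IMPDH1, ITGAL, NPR1 and PPP3R2 (4 genes), the union holds 13 distinct genes, and the coefficient is 4/13, roughly 0.31.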

Thanks a lot @Codoremifa, great answer! – Rgeek

Great. Consider accepting the answer by clicking the check mark next to it. – TheComeOnMan

0

Learn to search:

install.packages("sos") 
library("sos") 
findFn("Tanimoto") 

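getGeneSim {GOSim} [R Documentation]

Compute functional similarity for genes

Description

Calculate the pairwise functional similarities for a list of genes using different strategies.

Usage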

getGeneSim(genelist1, genelist2=NULL, similarity="funSimMax", similarityTerm="relevance", 
      normalization="Tanimoto", method="sqrt", avg=(similarity=="OA"), verbose=FALSE)