R矢量化findInterval大型数据集

我有两个数据框，我正在使用findInterval。井眼数据是产生油的井眼的x，y和z的数据（VSS =垂直海底深度，md =测量深度，也就是钻头沿井下的实际距离）。 Perfs数据是已经穿过井眼以允许流动的数据（top_perf = md，bot_perf = md）。R矢量化findInterval大型数据集

Perfs：

Well_ID top_perf bot_perf well_name surface ID x y VSS 
056-W  2808  2958  056-W  Ranger 2 0 0 0 
056-W  3150  3250  056-W  Ranger 1 0 0 0 
056-W  3150  3250  056-W  Ranger 2 0 0 0 
056-W  3559  3664  056-W  UT 1 1 0 0 0 
056-W  3559  3664  056-W  UT 2 2 0 0 0 
057-W  2471  2952  057-W  Tar 1 0 0 0 
057-W  2471  2952  057-W  Tar 2 0 0 0 
058-W  2615  2896  058-W  Ranger 1 0 0 0 
058-W  2615  2896  058-W  Ranger 2 0 0 0

井筒：

well_name well_id  md  vss   x  y   
056-W  056-W  3260 -3251.46 4221436 4030454 
056-W  056-W  3280 -3271.45 4221436 4030454 
056-W  056-W  3300 -3291.45 4221435 4030453 
056-W  056-W  3320 -3311.44 4221435 4030453 
056-W  056-W  3340 -3331.44 4221434 4030453 
056-W  056-W  3360 -3351.43 4221434 4030453 
056-W  056-W  3380 -3371.43 4221433 4030453 
056-W  056-W  3400 -3391.42 4221433 4030453

的目标是找到Perfs $ top_perf和Perfs $ bot_perf是价值最接近于井筒$ MD其中Perfs $ Well_ID =井筒$ well_id，然后从Wellbore中提取vss，x和y并将其添加到Perfs。（我不介意插入它们之间，只需要一些接近的东西）。

这里是我的代码来做到这一点：

for(i in 1:dim(Perfs)[1]){ 
    if(Perfs$ID[i] == 1){ 
    Wellbore_temp <- Wellbore[which(Wellbore$well_id == Perfs[i,"Well_ID"]),] 
    interval <- findInterval(Perfs[i,"top_perf"], Wellbore_temp$md) 
    Perfs[i,c("x","y","VSS")] <- Wellbore_temp[interval, c("x","y","vss")] 
    }else{ 
    Wellbore_temp <- Wellbore[which(Wellbore$well_id == Perfs[i,"Well_ID"]),] 
    interval <- findInterval(Perfs[i,"bot_perf"], Wellbore_temp$md) 
    Perfs[i,c("x","y","VSS")] <- Wellbore_temp[interval, c("x","y","vss")] 
    } 
}

此代码的工作，它只是迄今为止，这将在所使用的应用程序，我怎样才能摆脱循环的和做太慢。更加矢量化的方式来加快速度？也打开findInterval之外的建议。

来源

2016-11-29 Andrew Pruet

找到了问题的答案在这里：基于由@ ds440

这里提供一个data.table的想法Join R data.tables where key values are not exactly equal--combine rows with closest times

是我使用的代码，它运行速度非常快：

Perf.Data <- Perfs 


Wellbore.Perfs <- data.table(Wellbore[,c("well_id","md","vss")]) 
Spotfire.Top.Perf <- data.table(Perf.Data[,c("Well_ID","top_perf", "bot_perf")]) 
Spotfire.Bot.Perf <- data.table(Perf.Data[,c("Well_ID","bot_perf", "top_perf")]) 

#Change the column names to match up with Wellbore.Perfs 
#Add in the bot_perf to .top.perf and the top_perf to the .bot.perf is done to make these unique and ensure everything is captured from the perfs table 
colnames(Spotfire.Top.Perf) <- c("well_id","md", "bot_perf") 
colnames(Spotfire.Bot.Perf) <- c("well_id","md","top_perf") 

#set key to join on 
setkey(Wellbore.Perfs, "well_id","md") 

#roll = "nearest" will take the nearest value of md in .top.perf or .bot.perf and match it to the md in wellbore.perfs where Well_ID = Well_ID 
Perfs.Wellbore.Top <- Wellbore.Perfs[Spotfire.Top.Perf, roll = "nearest"] 
Perfs.Wellbore.Bot <- Wellbore.Perfs[Spotfire.Bot.Perf, roll = "nearest"]

来源

2016-11-29 20:44:38

如果只是在显示的示例数据上运行，则某些“最近的”匹配相当糟糕，但除非您计算距离，否则您不会看到。 – ds440

我应该在原始文章中提到这一点，但是Wellbore表格的MD范围从0到井底，每个井以20为增量。好点子。 –

下面我介绍一个data.table解决方案。我只在你显示的小部分数据上进行了测试，在小数据集上它的运行速度比解决方案慢，但我认为它可能更好。如果不是，请考虑并行化。

如果您还没有使用data.table，我认为它通常很快，但语法可能有点复杂。 .SD指的是连接到perfs数据的第i行的井筒数据的子集（迭代通过.EACHI）。这节省了一切的一切庞然大物。我没有使用findInterval函数，而是计算一个错误（top_perf - md或bot_perf - md），并将绝对误差最小化。这种方法在滚动连接（'最接近'）上的优点是您可以查看错误是什么，并在必要时进行过滤。

library(data.table) 

Perfs <- fread(input = 'Well_ID top_perf bot_perf well_name surface ID x y VSS 
056-W  2808  2958  056-W  Ranger 2 0 0 0 
056-W  3150  3250  056-W  Ranger 1 0 0 0 
056-W  3150  3250  056-W  Ranger 2 0 0 0 
056-W  3559  3664  056-W  UT_1 1 0 0 0 
056-W  3559  3664  056-W  UT_2 2 0 0 0 
057-W  2471  2952  057-W  Tar 1 0 0 0 
057-W  2471  2952  057-W  Tar 2 0 0 0 
058-W  2615  2896  058-W  Ranger 1 0 0 0 
058-W  2615  2896  058-W  Ranger 2 0 0 0') 

Wellbore <- fread(input = 'well_name well_id  md  vss   x  y   
056-W  056-W  3260 -3251.46 4221436 4030454 
056-W  056-W  3280 -3271.45 4221436 4030454 
056-W  056-W  3300 -3291.45 4221435 4030453 
056-W  056-W  3320 -3311.44 4221435 4030453 
056-W  056-W  3340 -3331.44 4221434 4030453 
056-W  056-W  3360 -3351.43 4221434 4030453 
056-W  056-W  3380 -3371.43 4221433 4030453 
056-W  056-W  3400 -3391.42 4221433 4030453') 


#top 
setkey(Wellbore, 'well_id') 
setkey(Perfs, 'Well_ID', 'top_perf') 
top_matched <- Wellbore[unique(Perfs), .SD[which.min(abs(top_perf-md)),.(md, top_perf, err=top_perf-md, x,y,vss)],nomatch=0, by=.EACHI] 
setkey(top_matched, 'well_id', 'top_perf') 
top_joined <- top_matched[Perfs] 
top_joined[,`:=`(i.x=NULL, i.y=NULL,VSS=NULL)] 
setnames(top_joined, old=c('err', 'x', 'y', 'vss'), new=paste0('top_', c('err', 'x', 'y', 'vss'))) 

#bottom 
setkey(Perfs, 'Well_ID', 'bot_perf') 
bot_matched <- Wellbore[unique(Perfs), .SD[which.min(abs(bot_perf-md)),.(md, bot_perf, err=bot_perf-md, x,y,vss)],nomatch=0, by=.EACHI] 
setkey(bot_matched, 'well_id', 'bot_perf') 
bot_joined <- bot_matched[Perfs] 
bot_joined[,`:=`(i.x=NULL, i.y=NULL,VSS=NULL)] 
setnames(bot_joined, old=c('err', 'x', 'y', 'vss'), new=paste0('bot_', c('err', 'x', 'y', 'vss'))) 


answer <- cbind(top_joined[,c(1:2,9:11,3:7), with=F], bot_joined[,3:7,with=F]) 

# well_id md well_name surface ID top_perf top_err top_x top_y top_vss bot_perf bot_err 
# 1: 056-W 3260  056-W Ranger 2  2808 -452 4221436 4030454 -3251.46  2958 -302 
# 2: 056-W 3260  056-W Ranger 1  3150 -110 4221436 4030454 -3251.46  3250  -10 
# 3: 056-W 3260  056-W Ranger 2  3150 -110 4221436 4030454 -3251.46  3250  -10 
# 4: 056-W 3400  056-W UT_1 1  3559  159 4221433 4030453 -3391.42  3664  264 
# 5: 056-W 3400  056-W UT_2 2  3559  159 4221433 4030453 -3391.42  3664  264 
# 6: 057-W NA  057-W  Tar 1  2471  NA  NA  NA  NA  2952  NA 
# 7: 057-W NA  057-W  Tar 2  2471  NA  NA  NA  NA  2952  NA 
# 8: 058-W NA  058-W Ranger 1  2615  NA  NA  NA  NA  2896  NA 
# 9: 058-W NA  058-W Ranger 2  2615  NA  NA  NA  NA  2896  NA 
# bot_x bot_y bot_vss 
# 1: 4221436 4030454 -3251.46 
# 2: 4221436 4030454 -3251.46 
# 3: 4221436 4030454 -3251.46 
# 4: 4221433 4030453 -3391.42 
# 5: 4221433 4030453 -3391.42 
# 6:  NA  NA  NA 
# 7:  NA  NA  NA 
# 8:  NA  NA  NA 
# 9:  NA  NA  NA

来源

2016-11-29 04:03:03 ds440

谢谢，它肯定比较好，但使用中点和错误函数会产生一些问题，如top_perf - bot_perf = 300，中点不是非常接近那些perfs实际上是沿着井眼/ top_perf和bot_perf之间没有区别。这样做的目的是绘制地图，我无法展示距离实际150英尺的地点。 –

您可以分别使用'top_perf'和'bot_perf'（而不是'midpt'）创建'top_matched'和'bottom_matched'。如果您计算错误，则可以过滤掉不相关的匹配项。 – ds440

我编辑了现在摆脱中点，现在有md和顶部或底部之间的直接比较。 – ds440

R矢量化findInterval大型数据集

回答

相关问题