2015-11-30 69 views
0

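I apologize in advance for the length of this post and for the data dumps. The question is how to replace values in R using another dataset.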

My dataset contains MID dyads (1800 to 2001), for example:

id cc1 year cap1 
4114 2 1994 . 
4113 2 1994 . 
4113 2 1996 . 

The other dataset has one row per country code per year, for example:

cc1 year CINC 
2 1816 0.039 
2 1817 0.035 
2 1818 0.036 

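It holds the CINC scores (from 1800 to 2001). I want to use the CINC values from the second dataset to fill in the missing values of the cap1 variable in my dataset.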

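To be clear, I should add that a country can initiate more than one MID in a given year, or none at all. For example, my dataset has 2680 observations for the cap1 variable, whereas the second dataset has 14199 observations.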

I searched the forums and, after consulting a friend, came up with the following:

mydata$cap1=mydata1[mydata1$ccode==mydata$cc1 & mydata1$year==mydata$year,]$cinc 

where mydata is my dataset and mydata1 is the second dataset.

It returns this error:

Error in $<-.data.frame (*tmp* , "cap1", value = c(0.0396975, 0.0358166, : replacement has 3 rows, data has 2680

In addition: Warning messages:

1: In mydata1$ccode == mydata$cc1 : longer object length is not a multiple of shorter object length

2: In mydata1$year == mydata$year : longer object length is not a multiple of shorter object length

Edit

My dataset:

structure(list(idnum = c(4054L, 4186L, 4206L, 4273L, 2589L, 2587L 
), cc1 = c(365L, 2L, 640L, 2L, 541L, 630L), cap1 = c(NA_real_, 
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_)), .Names = c("idnum", 
"cc1", "cap1"), datalabel = "", time.stamp = "29 Nov 2015 23:36", formats = c("%8.0g", 
"%8.0g", "%9.0g"), types = c(65529L, 65529L, 65527L), val.labels = c("", 
"", ""), var.labels = c("MID ID number", "Initiator country code", 
"Initiator's capabilities"), version = 117L, label.table = list(), expansion.fields = list(), strl = list(), byteorder = "LSF", row.names = 2675:2680, class = "data.frame") 

The second dataset:

structure(list(stateabb = structure(c(208L, 208L, 208L, 208L, 
208L, 208L), .Label = c("AAB", "AFG", "ALB", "ALG", "AND", "ANG", 
"ARG", "ARM", "AUH", "AUL", "AUS", "AZE", "BAD", "BAH", "BAR", 
"BAV", "BEL", "BEN", "BFO", "BHM", "BHU", "BLR", "BLZ", "BNG", 
"BOL", "BOS", "BOT", "BRA", "BRU", "BUI", "BUL", "CAM", "CAN", 
"CAO", "CAP", "CDI", "CEN", "CHA", "CHL", "CHN", "COL", "COM", 
"CON", "COS", "CRO", "CUB", "CYP", "CZE", "CZR", "DEN", "DJI", 
"DMA", "DOM", "DRC", "DRV", "ECU", "EGY", "EQG", "ERI", "EST", 
"ETH", "ETM", "FIJ", "FIN", "FRN", "FSM", "GAB", "GAM", "GDR", 
"GFR", "GHA", "GMY", "GNB", "GRC", "GRG", "GRN", "GUA", "GUI", 
"GUY", "HAI", "HAN", "HON", "HSE", "HSG", "HUN", "ICE", "IND", 
"INS", "IRE", "IRN", "IRQ", "ISR", "ITA", "JAM", "JOR", "JPN", 
"KEN", "KIR", "KOR", "KUW", "KYR", "KZK", "LAO", "LAT", "LBR", 
"LEB", "LES", "LIB", "LIE", "LIT", "LUX", "MAA", "MAC", "MAD", 
"MAG", "MAL", "MAS", "MAW", "MEC", "MEX", "MLD", "MLI", "MLT", 
"MNC", "MNG", "MOD", "MON", "MOR", "MSI", "MYA", "MZM", "NAM", 
"NAU", "NEP", "NEW", "NIC", "NIG", "NIR", "NOR", "NTH", "OMA", 
"PAK", "PAL", "PAN", "PAP", "PAR", "PER", "PHI", "PMA", "PNG", 
"POL", "POR", "PRK", "QAT", "ROK", "ROM", "RUS", "RVN", "RWA", 
"SAF", "SAL", "SAU", "SAX", "SEN", "SEY", "SIC", "SIE", "SIN", 
"SKN", "SLO", "SLU", "SLV", "SNM", "SOL", "SOM", "SPN", "SRI", 
"STP", "SUD", "SUR", "SVG", "SWA", "SWD", "SWZ", "SYR", "TAJ", 
"TAW", "TAZ", "THI", "TKM", "TOG", "TON", "TRI", "TUN", "TUR", 
"TUS", "TUV", "UAE", "UGA", "UKG", "UKR", "URU", "USA", "UZB", 
"VAN", "VEN", "WRT", "WSM", "YAR", "YEM", "YPR", "YUG", "ZAM", 
"ZAN", "ZIM"), class = "factor"), ccode = c(990L, 990L, 990L, 
990L, 990L, 990L), year = 2002:2007, irst = c(0L, 0L, 0L, 0L, 
0L, 0L), milex = c(0L, 0L, 0L, 0L, 0L, 0L), milper = c(0L, 0L, 
0L, 0L, 0L, 0L), pec = c(47.0876, 44.54492, 42.7761, 43.50082, 
44.62415, 44.93124), tpop = c(178L, 180L, 182L, 183L, 185L, 187L 
), upop = c(0L, 0L, 0L, 0L, 0L, 0L), cinc = c(5.12e-06, 5.08e-06, 
5.05e-06, 5.01e-06, 5.01e-06, 4.99e-06), version = c(4L, 4L, 
4L, 4L, 4L, 4L)), .Names = c("stateabb", "ccode", "year", "irst", 
"milex", "milper", "pec", "tpop", "upop", "cinc", "version"), row.names = 14194:14199, class = "data.frame") 

I tried:

require(foreign) 
require(readstata13) 
mydata=read.dta13("Schultz-Geddes.dta") 
mydata1=read.csv("NMC_v4_0-3.csv") 
mydata$cap1=mydata1[mydata1$ccode==mydata$cc1 & mydata1$year==mydata$year,]$cinc 
install.packages("dplyr") 
library(dplyr) 
joined <- mydata %>% 
    left_join(mydata1, c("cc1", "year")) 
+1

Please show us the expected output after changing the cap1 column using CINC. –

+0

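@TimBiegeleisen I'm not sure what you need. I don't really know what the output would look like, other than the cap1 observations for each MID number being filled with the CINC scores from the second dataset. – darwinjoe1831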

+0

Your first dataset does not contain a 'year' field, and your second dataset does not contain a 'cc1' field, so you cannot join on those two columns. –

Answer

1

You want to join the two datasets together.

library(dplyr) 
# The second dataset names the country code column ccode, not cc1
joined <- your_dataset %>% 
    inner_join(the_other_dataset, by = c("cc1" = "ccode", "year" = "year")) 

(It is not entirely clear from your question what the datasets contain, so you may need a left_join instead of an inner_join; you could also use merge from base R.)
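For the base R route, a minimal sketch with merge might look like the following. The data frames `mids` and `nmc` are small made-up stand-ins for the two datasets (the values are illustrative, not real CINC scores), and the sketch assumes the first dataset really has a year column, as in the excerpt at the top of the question.

```r
# Toy stand-ins for the two datasets (hypothetical values)
mids <- data.frame(
  id   = c(4114L, 4113L, 4113L),
  cc1  = c(2L, 2L, 2L),
  year = c(1994L, 1994L, 1996L),
  cap1 = NA_real_
)
nmc <- data.frame(
  ccode = c(2L, 2L, 2L),
  year  = c(1994L, 1995L, 1996L),
  cinc  = c(0.14, 0.15, 0.16)
)

# Left join: keep every MID row and attach cinc where country code and year match.
# by.x / by.y pair up the differently named key columns.
merged <- merge(mids, nmc[, c("ccode", "year", "cinc")],
                by.x = c("cc1", "year"), by.y = c("ccode", "year"),
                all.x = TRUE)
```

Note that merge sorts the result by the key columns by default, so the row order can differ from the original dataset.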

Then you can replace cap1 wherever it is missing:

joined %>% 
    mutate(cap1 = ifelse(is.na(cap1), cinc, cap1)) 
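As an alternative that fills cap1 in place without joining, here is a hedged sketch using match from base R. The attempt in the question compares two vectors of different lengths element by element, which is why R warns about object lengths and returns only 3 rows; match instead looks up each MID row's (country code, year) pair in the second dataset. The data frames `disputes` and `capabilities` and their values are hypothetical, and the sketch assumes each country-year pair appears at most once in the second dataset.

```r
# Toy stand-ins for the two datasets (hypothetical values)
disputes <- data.frame(
  cc1  = c(2L, 2L, 365L, 2L),
  year = c(1994L, 1994L, 1994L, 1700L),
  cap1 = c(NA, NA, 0.5, NA)
)
capabilities <- data.frame(
  ccode = c(2L, 2L, 365L),
  year  = c(1994L, 1995L, 1994L),
  cinc  = c(0.14, 0.15, 0.07)
)

# Build one lookup key per row from country code and year
key_disputes     <- paste(disputes$cc1, disputes$year)
key_capabilities <- paste(capabilities$ccode, capabilities$year)

# For each dispute row, the row index of its match in capabilities (NA if none)
idx <- match(key_disputes, key_capabilities)

# Overwrite only the missing cap1 values; unmatched rows stay NA
missing <- is.na(disputes$cap1)
disputes$cap1[missing] <- capabilities$cinc[idx[missing]]
```

This keeps the number and order of rows in the first dataset unchanged, so several MIDs in the same country-year all receive the same score.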
+0

This doesn't seem to work. joined <- mydata %>% left_join(mydata1, c("cc1", "year")) gave me this error: "Error: cannot join on columns 'cc1' x 'cc1': index out of bounds". Thank you – darwinjoe1831

+0

@darwinjoe1831 Time to make your question reproducible. –

+0

@darwinjoe1831 Please [edit](http://stackoverflow.com/posts/33992015/edit) your question. –