2017-10-13 29 views
-1

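Okay, so I'm having a bit of trouble, and I know there has to be a solution. I have a data table with 13 columns, but we only care about two of them (Fare and pClass). There are 1309 rows, 1308 of which have a Fare value, and I want to fill in the missing value using the average for its class (pClass). So what I want is to tell R to find a row where Fare = NA, read the value in pClass (1, 2 or 3), compute the average Fare for that class, and then replace the missing Fare with that average.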

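So, to sum up your mission, for whoever is brave and kind enough to help me: I want to find a missing value, figure out which class it belongs to, average the fares for that specific class, and replace the missing value with the correct average.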

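Doing it this way, rather than just finding the one missing row and reading it by hand, is the better route: when I have multiple missing values in R, each one gets replaced with the correct average no matter what the deciding column says.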

Thank you for your time,

-Dylan

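Okay, since that was too specific, here's a new plan, boys (and girls, and whatever else you want to be, IDRC as long as you know what you're talking about). So! The new plan is to make 3 variables corresponding to the three different pClass values (1, 2 and 3). Each one holds the average Fare for its pClass (I'll call 'em pClassAVG.x, where x = 1, 2 or 3). Then R should find the missing Fare values and replace each one with the pClassAVG variable of the corresponding pClass. R's thought process should go like this: "Okay, a missing value. What's the pClass? Okay, it's 2, so we should replace the missing value with pClassAVG.2."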

Last time I got a -1 for not including my code, so here it is:

setwd("C:/Users/Maker/Desktop/Data Science/Data/Dylan T/Titanic data") 
titanic.train <- read.csv(file = "train.csv", stringsAsFactors = FALSE, header = TRUE) 
titanic.test <- read.csv(file = "test.csv", stringsAsFactors = FALSE, header = TRUE) 
# line 1 tells R where to look for the data. lines 2 & 3 read the two CSV files in so we can manipulate them
# stringsAsFactors = FALSE keeps text columns as plain character strings instead of factors,
# which makes it easier to clean the two data sheets together
# header = TRUE tells R that the file has headers, so the first line is read as column
# names and not as data
# currently reads incorrectly

titanic.train$IsTrainSet <- TRUE 
titanic.test$IsTrainSet <- FALSE 
# makes a new column that tells us whether a row comes from the train set or the test set

titanic.test$Survived <- NA 
# makes a new column and fills it with NA so that the columns line up and have the same names

titanic.full <- rbind(titanic.train, titanic.test) 
titanic.full[titanic.full$Embarked=='', "Embarked"] <- 'S' 
#ended day 1 at 12 minutes 

age.median <- median(titanic.full$Age, na.rm = TRUE) 
# creates a variable called age.median and assigns it the median of the Age column, excluding the missing values
# (if we included the missing values the result would itself be NA)
# computing the median in code is better for data that can change, for example real-time data that is updated
# over the course of the day: rerunning this line always gives the median of the current data.

titanic.full[is.na(titanic.full$Age), "Age"] <- age.median 
# table(is.na(titanic.full$Age)) counts how many values in the Age column of titanic.full are missing (TRUE) and how many are not (FALSE)

pClassAVG.1 <- median(titanic.full$Fare, na.rm = TRUE, titanic.full$Pclass == 1) 
pClassAVG.2 <- median(titanic.full$Fare, na.rm = TRUE, titanic.full$Pclass == 2) 

The last two lines are my attempt at creating pClassAVG.1 and pClassAVG.2 the way I described above.
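For what it's worth, passing `titanic.full$Pclass == 1` as an extra argument to `median()` does not filter anything; the class condition has to go inside the square brackets so that Fare is subset before the median is taken. A minimal sketch of that idea, assuming a small made-up data frame `toy` standing in for `titanic.full` (the values are hypothetical, not the real Titanic data):

```r
# Toy stand-in for titanic.full (hypothetical values, not the real data)
toy <- data.frame(
  Pclass = c(1, 1, 1, 2, 2, 2),
  Fare   = c(80, 100, NA, 10, 20, 30)
)

# Subset Fare to one class first, then take the median of that subset
pClassAVG.1 <- median(toy$Fare[toy$Pclass == 1], na.rm = TRUE)
pClassAVG.2 <- median(toy$Fare[toy$Pclass == 2], na.rm = TRUE)

# Fill only the rows that are both missing and in the matching class
toy$Fare[is.na(toy$Fare) & toy$Pclass == 1] <- pClassAVG.1
toy$Fare[is.na(toy$Fare) & toy$Pclass == 2] <- pClassAVG.2
```

The same pattern would extend to a third variable for class 3. Swap `median` for `mean` if the mean is what you actually want.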


A reproducible example w/ your data would be helpful (https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) – Tung


Dylan, for your next question, please take a look at the link @thecatalyst just provided – Thai

Answers

0
df <- data.frame(Fare=c(10,20,30,40,50,60,NA,70,80), pClass=c(1,2,3,1,2,3,1,2,3))

a <- df$pClass[which(is.na(df$Fare))] # find the pClass where Fare is missing

df$Fare[which(is.na(df$Fare))] <- mean(df$Fare[df$pClass==a], na.rm=TRUE) # replace the missing Fare with the mean of the corresponding pClass

This only works if a single Fare value is missing.
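If more than one Fare is missing, `a` holds several classes and `df$pClass==a` no longer picks out a single class. Here is a rough sketch that handles any number of missing values, using base R's `ave()` to build a vector of per-class means that lines up with the rows. The data and the helper names `class_mean` and `missing` are made up for illustration:

```r
# Made-up data with two missing fares in different classes
df <- data.frame(Fare   = c(10, 20, 30, 40, NA, 60, NA, 70, 80),
                 pClass = c(1, 2, 3, 1, 2, 3, 1, 2, 3))

# For every row, the mean Fare of that row's pClass (NAs ignored)
class_mean <- ave(df$Fare, df$pClass, FUN = function(v) mean(v, na.rm = TRUE))

# Overwrite only the missing entries with their class mean
missing <- is.na(df$Fare)
df$Fare[missing] <- class_mean[missing]
```

Because `class_mean` has one entry per row, no lookup of the class is needed when filling in the gaps.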


What do Fare = c and pClass = c do? – Dylan


@Dylan c() creates a vector, which is then assigned to the variables Fare and pClass. These variables are then used as columns to create df – Swapnil
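To make that concrete, here is a tiny sketch with made-up values. It builds the two vectors as standalone variables first and then passes them in as columns. When `Fare = c(...)` is written directly inside the data frame call, it names the column rather than creating a separate variable:

```r
# c() combines its arguments into a single vector
Fare   <- c(10, 20, NA)   # numeric vector of length 3
pClass <- c(1, 2, 1)

# Each vector becomes one column; the name before "=" is the column name
df <- data.frame(Fare = Fare, pClass = pClass)
```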

0

This should work... There may be a more elegant solution using apply, but this works as well.

#Creating a data frame named df 
fare<- c(6,8,3,NA,5,1,0,7,NA,4,1,8,6,NA,2) 
pclass<- c(1,1,1,1,1,2,2,2,2,2,3,3,3,3,3) 
df<-as.data.frame(cbind(fare,pclass)) 

#Creating a loop to look at each row 
for(i in seq_along(df$fare)){ 

#And if the value for fare is missing 
if(is.na(df$fare[i])){ 

#then, replace with the mean according to the group defined in pclass 
df$fare[i]<- mean(df$fare[df$pclass==df$pclass[i]],na.rm = TRUE) 

} 
}
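On the question of something more apply-like: one loop-free option is `tapply()`, which computes one mean per class, followed by a lookup by class for the missing rows. This is only a sketch on the same toy data; `group_means` and `missing` are names made up for illustration:

```r
fare   <- c(6, 8, 3, NA, 5, 1, 0, 7, NA, 4, 1, 8, 6, NA, 2)
pclass <- c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3)
df <- data.frame(fare, pclass)

# One mean per class, named "1", "2", "3"
group_means <- tapply(df$fare, df$pclass, mean, na.rm = TRUE)

# Look up each missing row's class mean by name and fill it in
missing <- is.na(df$fare)
df$fare[missing] <- group_means[as.character(df$pclass[missing])]
```

The lookup goes through `as.character()` so that the class is matched against the names of `group_means` rather than used as a position.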