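Okay, so I'm having a bit of trouble, and I know there has to be a solution. I have a data table with 13 columns, but we only care about two (Fare and pClass). There are 1309 rows, 1308 of which have a Fare value, and I want to fill in the missing value with an average based on class (pClass). So what I want is to tell R to find a row where Fare is NA, read the value in pClass (1, 2, or 3), compute the average Fare for that class, and then replace the missing Fare with that average.
So, to sum up your mission, for whoever is brave and kind enough to help me: I want to find a missing value, work out which class it belongs to, average the Fare for that specific class, and replace the missing value with the correct average.
Using this approach, rather than just finding the one row with the missing value and reading it off, is the better route, because when I have multiple missing values in R I can replace each one with the correct average regardless of the deciding column.
Thank you for your time,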
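For illustration, here is a minimal sketch of that idea on made-up toy data (the `toy` data frame and its values are assumptions, not the real Titanic data). It relies on base R's `ave()`, which returns a per-group statistic aligned with the original rows, so any number of missing fares can be filled in one step.

```r
# Toy data standing in for titanic.full (values are invented for illustration)
toy <- data.frame(
  Pclass = c(1, 1, 2, 2, 3, 3, 3),
  Fare   = c(80, 100, 20, NA, 8, 10, NA)
)

# Mean Fare within each Pclass, repeated on every row of that class
class.mean <- ave(toy$Fare, toy$Pclass,
                  FUN = function(x) mean(x, na.rm = TRUE))

# Fill every missing Fare with the mean of its own class
missing.fare <- is.na(toy$Fare)
toy$Fare[missing.fare] <- class.mean[missing.fare]

print(toy)
```

Because the replacement is indexed by `missing.fare`, rows that already have a fare are left untouched.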
-Dylan
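Okay, since that was too specific, here's a new plan in place of the original question, boys (and girls, and whatever else you want to be; IDRC as long as you know what you're talking about). So! The new plan is to make 3 variables corresponding to the three different pClass values (1, 2, and 3). Each one holds the average Fare for its pClass (I'll call 'em pClassAVG.x, where x = 1, 2, or 3). Then have R find the missing Fare values and replace them with the pClass variable (average) for the corresponding pClass. R's thought process should go like this: "Okay, a missing value. What's the pClass? Okay, it's 2, so we should replace the missing value with pClassAVG.2."
Last time I got a -1 for not including my code, so here it is: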
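A rough sketch of that "look up the class, then use its average" process, again on invented toy data. One assumption here: instead of three separate variables, the averages are kept in a single named vector called `pClassAVG` (a name made up for this example), so the class of each missing row can be used directly as the lookup key.

```r
# Toy data standing in for titanic.full (values are invented for illustration)
toy <- data.frame(
  Pclass = c(1, 1, 2, 2, 3, 3, 3),
  Fare   = c(80, 100, 20, NA, 8, 10, NA)
)

# One average per class; the result is named "1", "2", "3"
pClassAVG <- tapply(toy$Fare, toy$Pclass, mean, na.rm = TRUE)

# Rows where Fare is missing
missing.fare <- which(is.na(toy$Fare))

# "What's the pClass? It's 2, so use the average for class 2."
lookup.key <- as.character(toy$Pclass[missing.fare])
toy$Fare[missing.fare] <- as.numeric(pClassAVG[lookup.key])

print(pClassAVG)
print(toy)
```

Swapping `mean` for `median` in the `tapply()` call would give per-class medians instead, if that is the preferred notion of "average".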
setwd("C:/Users/Maker/Desktop/Data Science/Data/Dylan T/Titanic data")
titanic.train <- read.csv(file = "train.csv", stringsAsFactors = FALSE, header = TRUE)
titanic.test <- read.csv(file = "test.csv", stringsAsFactors = FALSE, header = TRUE)
# Line 1 tells R where to look for the data; lines 2 and 3 read in the files we want to manipulate.
# stringsAsFactors = FALSE keeps text columns as character vectors instead of factors,
# which makes the two data sets easier to combine, since we are going to clean them together.
# header = TRUE tells R that the file has headers, so the first line is read as
# column names and not counted as data.
# currently reads incorrectly
titanic.train$IsTrainSet <- TRUE
titanic.test$IsTrainSet <- FALSE
# makes a new column that tells us whether a row comes from the train set or the test set
titanic.test$Survived <- NA
# makes a new column and fills it with NA so that both data sets have the same column names and line up
titanic.full <- rbind(titanic.train, titanic.test)
titanic.full[titanic.full$Embarked=='', "Embarked"] <- 'S'
#ended day 1 at 12 minutes
age.median <- median(titanic.full$Age, na.rm = TRUE)
# creates a variable called age.median and assigns it the median of the Age column, excluding the missing values
# (if we included the missing values, the result would be NA)
# this method is better for replacing values in data that can change, for example real-time data that is updated over the course of the day:
# the median is recomputed whenever the data is refreshed, which avoids both the missing values and an out-of-date median.
titanic.full[is.na(titanic.full$Age), "Age"] <- age.median
# table(is.na(titanic.full$Age)) counts how many values in the Age column of titanic.full are missing (TRUE) and how many are present (FALSE)
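# subset Fare by class first; a condition passed as an extra argument to median() is not used as a filter
pClassAVG.1 <- median(titanic.full$Fare[titanic.full$Pclass == 1], na.rm = TRUE)
pClassAVG.2 <- median(titanic.full$Fare[titanic.full$Pclass == 2], na.rm = TRUE)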
The last two lines are me trying to tell it to make pClassAVG.1 and pClassAVG.2 the way I described above.
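One detail that is easy to miss when computing a per-class median: as far as I can tell, `median()` does not treat an extra logical argument as a filter, so the class condition has to be used to subset Fare before the call. A small sketch on invented toy data (the `toy` data frame and the names `overall` and `class1` are assumptions for this example):

```r
# Toy data standing in for titanic.full (values are invented for illustration)
toy <- data.frame(
  Pclass = c(1, 1, 2, 2, 3, 3, 3),
  Fare   = c(80, 100, 20, NA, 8, 10, NA)
)

# Condition passed as an extra argument: it is not used as a filter,
# so this is the median of every non-missing Fare
overall <- median(toy$Fare, na.rm = TRUE, toy$Pclass == 1)

# Condition used to subset Fare first: median of class 1 only
class1 <- median(toy$Fare[toy$Pclass == 1], na.rm = TRUE)

print(c(overall = overall, class1 = class1))
```

If the two numbers differ on the real data as well, that is a sign the unfiltered form was returning the overall median rather than the class median.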
[A reproducible example w/ your data would be helpful](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) – Tung
Dylan, for your next question, please take a look at the link @thecatalyst just provided – Thai