2014-05-03 32 views
-2

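Standardizing raw data in R - missing values

I want to convert raw data from a text file into a matrix. I have read the data with readLines() and then split each line into a list of fields with strsplit() (i.e. male;20;30.5 => "male" "20" "30.5").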

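The only problem is that some values are missing: sex, age, or weight was not recorded for some rows, and in some rows a comma replaces the decimal point. In those cases the list contains rows that look like this: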

##"male" "20" "55.3" 
##"male" "45" 

##"" "55" "55" 

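I would like to write a function that corrects these cases by filling in NA, and then apply it with lapply(data.dataList, function). R functions are not my strongest point, but here is my first attempt: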

# function to correct column order for weight data
f.assignFields <- function(x) {
  # create a blank character vector of length 3
  out <- character(3)
  sex <- grepl("[[:alpha:]]", x)
  out[1] <- x[sex]
  age.num <- which(as.numeric(x) < 0)
  out[2] <- ifelse(length(age.num) > 0, x[age.num], NA)
  weight.num <- which(as.numeric(x) > 0)
  out[3] <- ifelse(length(weight.num) > 0, x[weight.num], NA)
  out
}

data.standardFields <- lapply(data.dataList, f.assignFields)

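I know I want to put the string that starts with a letter in the first column and the other values in the second and third positions. I should also replace "," with "." in the weights. Should I do that before or after lapply()? Just a slight push in the right direction would be appreciated.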
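On the before-or-after question, a minimal sketch suggests doing the replacement before splitting, because gsub() is vectorized over all lines at once. The vector `raw` below is an assumption: three sample lines copied from the raw data in this post.

```r
# Assumed sample lines, copied from the raw data in this post
raw <- c("female; 17 ;57,2", "male;;50,1", "male;28;81.3")

# as.numeric() cannot parse a decimal comma: this yields NA (with a warning)
bad <- suppressWarnings(as.numeric("57,2"))

# gsub() is vectorized, so one pass cleans every line before strsplit()
clean <- gsub(",", ".", gsub(" ", "", raw, fixed = TRUE), fixed = TRUE)
fields <- strsplit(clean, ";", fixed = TRUE)

# the third field of every record now parses as a number
weights <- as.numeric(sapply(fields, `[`, 3))
```

Cleaning first keeps the per-record function small: it only has to deal with missing fields, not with number formats.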

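Edit: The data taken from the text file is very small. Only 9 people recorded their sex, age, and weight. The point of the exercise is to work with raw data by modifying and converting it yourself, to see how useful that is, rather than using read.table().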

male;28;81.3 
male;45; 
female; 17 ;57,2 
female;64;62.8 
male;16;55.3 
male;;50,1 
female;20.4;55 
female;; 
;55;55 

Here is what I have done:

# read text file
weight.data <- readLines("text.txt")

# remove white space
weight.data <- gsub(" ", "", weight.data)
weight.data 

[1] "male;28;81.3"  
[2] "male;45;"  
[3] "female;17;57,2" 
[4] "female;64;62.8" 
[5] "male;16;55.3" 
[6] "male;;50,1"  
[7] "female;20.4;55"  
[8] "female;;"   
[9] ";55;55" 

#split strings by semicolon 
weight.dataList <-strsplit(weight.data, split = ";") 
weight.dataList 

[[1]] 
[1] "male" "28" "81.3" 

[[2]] 
[1] "male" "45" 

[[3]] 
[1] "female" "17"  "57,2" 

[[4]] 
[1] "female" "64" "62.8" 

[[5]] 
[1] "male" "16" "55.3" 

[[6]] 
[1] "male" ""  "50,1" 

[[7]] 
[1] "female" "20.4" "55" 

[[8]] 
[1] "female" "" 

[[9]] 
[1] "" "55" "55" 
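The short rows in this output come from strsplit() itself: it drops a trailing empty field, so "male;45;" yields two elements, while a leading empty field is kept as "". A minimal sketch of padding each record back to three fields follows; `pad.fields` is a hypothetical helper name, not something from the original post.

```r
# Assumed input: three of the cleaned lines shown above
weight.data <- c("male;45;", "female;;", ";55;55")
weight.dataList <- strsplit(weight.data, split = ";", fixed = TRUE)

# pad.fields is a hypothetical helper, named here for illustration only
pad.fields <- function(x, n = 3) {
  length(x) <- n                  # extending a vector fills new slots with NA
  x[x %in% ""] <- NA_character_   # empty strings count as missing too
  x
}

padded <- lapply(weight.dataList, pad.fields)
```

Because the fields are positional (sex, age, weight), padding on the right is enough; nothing has to be guessed from the content of each value.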

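I want to add NA to the rows with missing values. I am trying to write a function that corrects the fields row by row. For example, the weight of the second entry should be NA.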

# function to correct column order and size for weight data
f.assignFields <- function(x) {
  # create a blank character vector of length 3
  out <- character(3)
  sex <- grepl("[[:alpha:]]", x)
  # puts sex in first column
  out[1] <- x[sex]
  # assigns NA if age missing
  age.num <- which(as.numeric(x) < 0)
  out[2] <- ifelse(length(age.num) > 0, x[age.num], NA)
  # assigns NA if weight missing
  weight.num <- which(as.numeric(x) > 0)
  out[3] <- ifelse(length(weight.num) > 0, x[weight.num], NA)
  out
}

data.standardFields <- lapply(weight.dataList, f.assignFields)

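Finally, I will use unlist() and matrix() to convert the data to row-column format. I want to replace the missing values with NA, put the data in the order "sex, age, weight", and fix the weights so that 55,1 is displayed as 55.1.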
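As a minimal sketch of that plan, the version below treats the fields as positional rather than testing their content with as.numeric(). This is a swapped-in simplification, not the function from the post: it assumes every record is already in the order sex, age, weight and only needs padding and empty-string handling. The name `weight.matrix` is an assumption.

```r
# Assumed input: the nine raw lines listed in the post
weight.data <- c("male;28;81.3", "male;45;", "female; 17 ;57,2",
                 "female;64;62.8", "male;16;55.3", "male;;50,1",
                 "female;20.4;55", "female;;", ";55;55")

# remove spaces and convert decimal commas before splitting
weight.data <- gsub(",", ".", gsub(" ", "", weight.data, fixed = TRUE),
                    fixed = TRUE)
weight.dataList <- strsplit(weight.data, split = ";", fixed = TRUE)

# positional variant of f.assignFields: copy what is there, leave the rest NA
f.assignFields <- function(x) {
  out <- rep(NA_character_, 3)
  n <- min(length(x), 3)
  out[seq_len(n)] <- x[seq_len(n)]
  out[out %in% ""] <- NA_character_   # empty fields become NA
  out
}

data.standardFields <- lapply(weight.dataList, f.assignFields)

# unlist() flattens record by record, so the matrix must be filled by row
weight.matrix <- matrix(unlist(data.standardFields), ncol = 3, byrow = TRUE,
                        dimnames = list(NULL, c("sex", "age", "weight")))
```

The result is a character matrix, since a matrix holds a single type; age and weight still need as.numeric() if they are to be used as numbers.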

+0

A reproducible example and the desired output would come in handy. –

+1

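It seems like you could just use read.table(..., sep=";", fill=T) or something. Could you provide a larger sample of the raw data? Also, where does data.dataList come from, and what input does the function expect? – MrFlick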

+0

TL;DR: make it more concise and, at the same time, more specific. –

Answer

0

The easiest way is to use read.table, but it looks like your professor is trying to torture you. No data anywhere would list an age as 20.4.
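For comparison, here is a rough sketch of the read.table route. The vector `raw` and the column names are assumptions taken from the question's sample, and the decimal commas are unified first because the file mixes both decimal marks.

```r
# Assumed input: the nine raw lines from the question
raw <- c("male;28;81.3", "male;45;", "female; 17 ;57,2",
         "female;64;62.8", "male;16;55.3", "male;;50,1",
         "female;20.4;55", "female;;", ";55;55")

# unify the decimal mark so age and weight can be read as numbers
clean <- gsub(",", ".", raw, fixed = TRUE)

DF <- read.table(text = clean, sep = ";", header = FALSE, fill = TRUE,
                 col.names = c("sex", "age", "weight"),
                 na.strings = "",        # empty fields become NA
                 strip.white = TRUE,     # drop spaces around " 17 "
                 stringsAsFactors = FALSE)
```

Without read.table(), the same cleanup takes a few more steps, as in the code that follows.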

txt <- "male;28;81.3
male;45;
female; 17 ;57,2
female;64;62.8
male;16;55.3
male;;50,1
female;20.4;55
female;;
;55;55"
# strip all white space, then turn decimal commas into decimal points
x <- gsub("\\s+", "", readLines(textConnection(txt)))
rpl.comma <- gsub(",", ".", x)
spl <- strsplit(rpl.comma, ";")
# one row per record, one column per field
M <- matrix(NA_character_, nrow = length(x), ncol = 3)
for (j in 1:3) {
  M[, j] <- sapply(spl, function(s) {
    # s[j] is NA when strsplit() dropped the field, "" when it was left empty
    ifelse(is.na(s[j]) | s[j] == "", NA_character_, s[j])
  })
}
DF <- data.frame(M)
names(DF) <- c("sex", "age", "weight")
DF
##      sex  age weight
## 1   male   28   81.3
## 2   male   45   <NA>
## 3 female   17   57.2
## 4 female   64   62.8
## 5   male   16   55.3
## 6   male <NA>   50.1
## 7 female 20.4     55
## 8 female <NA>   <NA>
## 9   <NA>   55     55
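All three columns of a data frame built from a character matrix hold text, so age and weight cannot be averaged or compared as numbers yet. A minimal sketch of the conversion, assuming a character matrix with real NA values like the one above (only three rows are reproduced here):

```r
# Assumed input: three rows of a cleaned character matrix
M <- matrix(c("male", "28", "81.3",
              "male", NA,   "50.1",
              NA,     "55", "55"),
            ncol = 3, byrow = TRUE)

# build the data frame column by column so each gets the right type
DF <- data.frame(sex = M[, 1],
                 age = as.numeric(M[, 2]),
                 weight = as.numeric(M[, 3]),
                 stringsAsFactors = FALSE)

# numeric columns now support arithmetic that skips the missing values
col.means <- colMeans(DF[, c("age", "weight")], na.rm = TRUE)
```

as.numeric() carries NA through unchanged, so no special handling of the missing entries is needed at this step.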