2013-09-22 62 views
3

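Importing raw data into R

Can anyone help me import this data from a text or .dat file into R? The fields are separated by spaces, but a city name should not be split into two fields. NEW YORK, for example, is one name.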

1 NEW YORK 7,262,700 
2 LOS ANGELES 3,259,340 
3 CHICAGO 3,009,530 
4 HOUSTON 1,728,910 
5 PHILADELPHIA 1,642,900 
6 DETROIT 1,086,220 
7 SAN DIEGO 1,015,190 
8 DALLAS 1,003,520 
9 SAN ANTONIO 914,350 
10 PHOENIX 894,070 

Answers

4

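For your particular data, where a genuine (non-delimiter) space only occurs between capital letters, consider using a regular expression:

# Replace a space that sits between two capital letters with a hyphen
gsub("([A-Z]) ([A-Z]+)", "\\1-\\2", "1 NEW YORK 7,262,700") 
# [1] "1 NEW-YORK 7,262,700" 
gsub("([A-Z]) ([A-Z]+)", "\\1-\\2", "3 CHICAGO 3,009,530") 
# [1] "3 CHICAGO 3,009,530" 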

You can then treat the remaining spaces as field separators.
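As a minimal sketch of that last step, the hyphenated lines can be passed to read.table through its text argument and the hyphens turned back into spaces afterwards. The column names num, city and population are illustrative assumptions, and the sketch assumes no city name contains a hyphen of its own.

```r
# Three sample lines taken from the question
raw <- c("1 NEW YORK 7,262,700",
         "3 CHICAGO 3,009,530",
         "9 SAN ANTONIO 914,350")

# Join multi-word city names so that every remaining space is a delimiter
hyphenated <- gsub("([A-Z]) ([A-Z]+)", "\\1-\\2", raw)

# read.table splits on whitespace by default; column names are assumed
cities <- read.table(text = hyphenated, header = FALSE,
                     col.names = c("num", "city", "population"),
                     stringsAsFactors = FALSE)

# Restore the spaces and convert the population to a number
cities$city <- gsub("-", " ", cities$city, fixed = TRUE)
cities$population <- as.numeric(gsub(",", "", cities$population, fixed = TRUE))
```

A name that already contains a hyphen would come back with a space instead, so a separator character that never occurs in the data would be a safer choice in that case.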

+2

The second '[A-Z]' should be followed by a '+' rather than a '*'; otherwise a one-word city such as "CHICAGO" ends up with a hyphen attached. –

+0

Thanks, Hugh! – Mike

1

Expanding on @Hugh's answer, I would try the following, although it is not particularly efficient.

# Read each line of the file as a single string
lines <- scan("cities.txt", sep = "\n", what = "character") 
# Join multi-word city names with a hyphen (gsub is vectorised over lines)
lines <- gsub("([a-zA-Z]) ([a-zA-Z]+)", "\\1-\\2", lines) 

# Pre-allocate the result
citiesDF <- data.frame(num = rep(0, length(lines)), 
                       city = rep("", length(lines)), 
                       population = rep(0, length(lines)), 
                       stringsAsFactors = FALSE) 

for (i in seq_along(lines)) { 
    splitted <- strsplit(lines[i], " +")[[1]] 
    citiesDF[i, "num"] <- as.numeric(splitted[1]) 
    citiesDF[i, "city"] <- gsub("-", " ", splitted[2])  # restore the spaces
    citiesDF[i, "population"] <- as.numeric(gsub(",", "", splitted[3])) 
} 
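Since the loop is described as not particularly efficient, here is a hedged sketch of a vectorised alternative that avoids both the hyphen round trip and the row-by-row assignment. It uses regexec and regmatches to capture the three fields in one pass. The pattern and the two sample lines are assumptions for illustration, not part of the answer above, and every line is assumed to match the pattern.

```r
# Sample lines (with and without trailing whitespace)
lines <- c("1 NEW YORK 7,262,700 ", "10 PHOENIX 894,070")

# number, then the city (anything ending in a non-space), then digits/commas
pattern <- "^ *([0-9]+) +(.*[^ ]) +([0-9,]+) *$"

# Each list element is c(full match, num, city, population)
parts <- regmatches(lines, regexec(pattern, lines))
m <- do.call(rbind, parts)

citiesDF <- data.frame(num = as.numeric(m[, 2]),
                       city = m[, 3],
                       population = as.numeric(gsub(",", "", m[, 4])),
                       stringsAsFactors = FALSE)
```

A line that does not match yields an empty element in parts, so checking lengths(parts) before binding would be prudent on real data.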
+0

Thanks, Manetheran – Mike

4

A variation on the theme... but first, some sample data:

cat("1 NEW YORK 7,262,700", 
    "2 LOS ANGELES 3,259,340", 
    "3 CHICAGO 3,009,530", 
    "4 HOUSTON 1,728,910", 
    "5 PHILADELPHIA 1,642,900", 
    "6 DETROIT 1,086,220", 
    "7 SAN DIEGO 1,015,190", 
    "8 DALLAS 1,003,520", 
    "9 SAN ANTONIO 914,350", 
    "10 PHOENIX 894,070", sep = "\n", file = "test.txt") 

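Step 1: Read the data with readLines:

x <- readLines("test.txt") 

Step 2: Work out a regular expression that can be used to insert delimiters. Here, the pattern seems to be (looking from the end of the line) a group of digits and commas, preceded by a space, preceded by some words in ALL CAPS. We can capture these groups and insert tab delimiters (\t) around them. The extra backslashes escape the tabs properly.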

gsub("([A-Z ]+)(\\s?[0-9,]+$)", "\\\t\\1\\\t\\2", x) 
# [1] "1\t NEW YORK \t7,262,700"  "2\t LOS ANGELES \t3,259,340" 
# [3] "3\t CHICAGO \t3,009,530"  "4\t HOUSTON \t1,728,910"  
# [5] "5\t PHILADELPHIA \t1,642,900" "6\t DETROIT \t1,086,220"  
# [7] "7\t SAN DIEGO \t1,015,190" "8\t DALLAS \t1,003,520"  
# [9] "9\t SAN ANTONIO \t914,350" "10\t PHOENIX \t894,070" 

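Step 3: Since we know that our gsub works, and that read.delim has a "text" argument that can be used in place of the "file" argument, we can call read.delim directly on the result of gsub: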

out <- read.delim(text = gsub("([A-Z ]+)(\\s?[0-9,]+$)", "\\\t\\1\\\t\\2", x), 
        header = FALSE, strip.white = TRUE) 
out 
#    V1           V2        V3 
# 1   1     NEW YORK 7,262,700 
# 2   2  LOS ANGELES 3,259,340 
# 3   3      CHICAGO 3,009,530 
# 4   4      HOUSTON 1,728,910 
# 5   5 PHILADELPHIA 1,642,900 
# 6   6      DETROIT 1,086,220 
# 7   7    SAN DIEGO 1,015,190 
# 8   8       DALLAS 1,003,520 
# 9   9  SAN ANTONIO   914,350 
# 10 10      PHOENIX   894,070 

A possible final step is to convert the third column to numeric:

out$V3 <- as.numeric(gsub(",", "", out$V3)) 
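The three steps and the final conversion could be wrapped in one helper, sketched below. The function name read_cities, the column names and the temporary file are hypothetical. The trimws call is an added assumption to cope with trailing whitespace, which would otherwise stop the `$`-anchored pattern from matching.

```r
# Hypothetical helper combining the steps above
read_cities <- function(path) {
  # Step 1: read the raw lines (trimming stray whitespace is an assumption)
  x <- trimws(readLines(path))
  # Step 2: put a tab before the city name and before the population
  tabbed <- gsub("([A-Z ]+)(\\s?[0-9,]+$)", "\t\\1\t\\2", x)
  # Step 3: parse the tab-delimited text; column names are illustrative
  out <- read.delim(text = tabbed, header = FALSE, strip.white = TRUE,
                    col.names = c("num", "city", "population"),
                    stringsAsFactors = FALSE)
  # Final step: drop the thousands separators and convert to numeric
  out$population <- as.numeric(gsub(",", "", out$population))
  out
}

# Usage with a small temporary file
tmp <- tempfile(fileext = ".txt")
writeLines(c("1 NEW YORK 7,262,700", "10 PHOENIX 894,070 "), tmp)
cities <- read_cities(tmp)
```

Setting stringsAsFactors = FALSE keeps the city column as character regardless of the R version's default.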
+0

Thanks, Mahto – Mike