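Is there a faster way to split a string into substrings of given lengths?

I have weather data in fixed-width columns, but the width depends on the variable (see below; the data come from GHCN, http://www1.ncdc.noaa.gov/pub/data/ghcn/daily/readme.txt).

I want to split the records into a data.frame and wrote some code following @GSee's suggestion (How to split a string into substrings of a given length?). However, processing 6000 rows took about 4.3 seconds.

Is there a faster way to process this dataset?

Thanks for any suggestions.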

temp <- readLines(textConnection("NO000050550193801TMAX 53 I 51 I 10 I 22 I 56 I 31 I 30 I 24 I 38 I 25 I 2 I 32 I 75 I 71 I 98 I 96 I 57 I 55 I 54 I 60 I 91 I 75 I 94 I 82 I 89 I 46 I 26 I 68 I 62 I 46 I 37 I 
NO000050550193801TMIN 25 I -6 I -27 I 0 I 3 I -14 I -8 I 11 I 10 I -11 I -30 I -23 I 22 I 38 I 47 I 33 I 13 I 5 I 10 I 29 I 42 I 45 I 51 I 44 I 35 I 5 I -16 I -20 I 5 I 2 I 5 I 
NO000050550193802TMAX 69 I 58 I 71 I 90 I 77 I 70 I 56 I 46 I 58 I 32 I 32 I 22 I 25 I 30 I 29 I 29 I 34 I 88 I 58 I 50 I 45 I 62 I 38 I 40 I 59 I 112 I 92 I 77 I-9999 -9999 -9999 
NO000050550193802TMIN 11 I 26 I 16 I 35 I 44 I 21 I 19 I 22 I 20 I 6 I 6 I -16 I -22 I -39 I -28 I -35 I -33 I -21 I -13 I 15 I 26 I 17 I -1 I 9 I 18 I 38 I 58 I 28 I-9999 -9999 -9999 
NO000050550193803TMAX 81 I 84 I 89 I 86 I 86 I 74 I 54 I 74 I 83 I 64 I 75 I 77 I 66 I 91 I 82 I 84 I 89 I 84 I 94 I 85 I 82 I 89 I 74 I 84 I 81 I 58 I 72 I 58 I 86 I 84 I 89 I 
NO000050550193803TMIN 31 I 25 I 29 I 45 I 61 I 20 I 9 I 8 I 38 I 31 I 9 I 39 I 27 I 56 I 48 I 65 I 45 I 54 I 46 I 42 I 43 I 36 I 56 I 61 I 15 I -2 I -11 I -2 I 12 I 30 I 24 I")) 

temp <- rep(temp, 1000) 
system.time({ 

out <- strsplit(temp, '') 
out <- as.matrix(do.call(rbind, out)) 
pos_matrix <- matrix(c(12, 16, 18, seq(0, 30) * 8 + 22, 
    15, 17, 21, seq(0, 30) * 8 + 26), ncol = 2) 
out <- apply(out, 1, function(x) 
    { 
     apply(pos_matrix, 1, function(y) 
      paste(x[y[1]:y[2]], collapse = '')) 
    }) 
}) 

user system elapsed 
4.46 0.01 4.52
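
For reference, the position matrix above follows the record layout described in the GHCN readme: an 11-character station ID, then year (columns 12-15), month (16-17), element (18-21), and 31 daily values of 5 characters, each followed by 3 flag characters. The sketch below builds one synthetic record in that layout and slices it with `substring`. `make_record` is a hypothetical helper introduced only for illustration, and the daily values are made up.

```r
# Hypothetical helper: build one GHCN-style fixed-width record.
# Each daily value is right-justified in 5 characters and followed by
# 3 flag characters (here two blanks and "I").
make_record <- function(id, year, month, element, values) {
  stopifnot(nchar(id) == 11, length(values) == 31)
  paste0(id, sprintf("%04d%02d", year, month), element,
         paste0(sprintf("%5d", as.integer(values)), "  I", collapse = ""))
}

rec <- make_record("NO000050550", 1938, 1, "TMAX", 1:31)
nchar(rec)  # 269 = 11 + 4 + 2 + 4 + 31 * 8

# Start and end positions, the same numbers as the two columns of pos_matrix
starts <- c(12, 16, 18, seq(0, 30) * 8 + 22)
ends   <- c(15, 17, 21, seq(0, 30) * 8 + 26)

# substring() is vectorized over the positions, so one call yields all 34 fields
fields <- substring(rec, starts, ends)
fields[1:4]
```
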

Edit, following Ananda Mahto's comment:

library(sqldf)

system.time({
pos_matrix <- matrix(c(12, 16, 18, seq(0, 30) * 8 + 22,
    15, 17, 21, seq(0, 30) * 8 + 26), ncol = 2)
pos_matrix <- lapply(seq_len(nrow(pos_matrix)), function(x)
    {
     # SQLite's substr(X, Y, Z) takes a start and a length, so pass the field width
     sprintf('substr(V1, %s, %s) f%s',
      pos_matrix[x, 1], pos_matrix[x, 2] - pos_matrix[x, 1] + 1, x)
    })
pos_matrix <- paste(pos_matrix, collapse = ', ')
out <- data.frame(V1 = temp, stringsAsFactors = FALSE)

out <- sqldf(sprintf('select %s from out', pos_matrix))
})

user system elapsed 
0.4  0.0  0.4
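
One detail worth checking with the SQL route: sqldf uses SQLite by default, and SQLite's `substr(X, Y, Z)` takes a start position and a length rather than a start and an end. The sketch below only builds the query string, so it runs without sqldf installed; the table name `out` and column name `V1` are taken from the code above, and the aliases `f1`, `f2`, ... are the same ones used there.

```r
# Start and end positions of the 34 fields (same numbers as pos_matrix)
starts <- c(12, 16, 18, seq(0, 30) * 8 + 22)
ends   <- c(15, 17, 21, seq(0, 30) * 8 + 26)

# Third argument is the field width (end - start + 1), not the end position
cols <- sprintf("substr(V1, %d, %d) AS f%d",
                starts, ends - starts + 1, seq_along(starts))

query <- sprintf("SELECT %s FROM out", paste(cols, collapse = ", "))
cols[1:4]
```

The resulting `query` string could then be passed to `sqldf()` with a data frame named `out` in scope.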

Edit, following jlhoward's suggestion:

system.time({ 
pos_matrix <- matrix(c(12, 16, 18, seq(0, 30) * 8 + 22, 
    15, 17, 21, seq(0, 30) * 8 + 26), ncol = 2) 
out <- apply(pos_matrix, 1, function(x) 
    { 
     substr(temp, x[1], x[2]) 
    }) 
}) 
user system elapsed 
0.04 0.00 0.04
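
Since the goal is a data.frame, the character matrix returned by `apply` still needs column names and types. The following is a minimal sketch under some assumptions: the column names (`year`, `month`, `element`, `day1` to `day31`) are made up for illustration, `make_record` is a hypothetical helper that generates synthetic records, and -9999 is treated as the missing-value code, as it appears in the sample data.

```r
# Hypothetical helper: one GHCN-style record (5-character value + 3 flag characters)
make_record <- function(id, year, month, element, values) {
  paste0(id, sprintf("%04d%02d", year, month), element,
         paste0(sprintf("%5d", as.integer(values)), "  I", collapse = ""))
}
temp_demo <- c(
  make_record("NO000050550", 1938, 2, "TMAX", c(1:28, -9999, -9999, -9999)),
  make_record("NO000050550", 1938, 2, "TMIN", c(-(1:28), -9999, -9999, -9999))
)

pos_matrix <- matrix(c(12, 16, 18, seq(0, 30) * 8 + 22,
                       15, 17, 21, seq(0, 30) * 8 + 26), ncol = 2)

# One row per record, one column per field
out <- apply(pos_matrix, 1, function(x) substr(temp_demo, x[1], x[2]))

# Convert the 31 daily values to integers and mark -9999 as missing
vals <- matrix(as.integer(out[, 4:34]), nrow = nrow(out))
vals[vals == -9999L] <- NA
colnames(vals) <- paste0("day", 1:31)

df <- cbind(
  data.frame(year = as.integer(out[, 1]),
             month = as.integer(out[, 2]),
             element = out[, 3],
             stringsAsFactors = FALSE),
  vals
)
str(df[, 1:5])
```

Note that with a single input record `apply` returns a plain vector rather than a matrix, so the indexing above assumes at least two records.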

2013-12-18 Bangyou

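How about using `sqldf` and `substr`, as described in [Example 6f here](http://code.google.com/p/sqldf/)? – A5C1D2H2I1M1N2O1R2T1

sqldf with substr is much faster. The same dataset takes only 0.4 seconds. Could you add your comment as an answer so that I can accept it? – Bangyou

Profiling your code (?Rprof) shows that about 2/3 of the execution time is spent in paste(...), which is not surprising. It looks like you are breaking the input into individual characters and then reassembling them according to pos_matrix. Using substr(...) with a matrix of start and end positions, which is what pos_matrix already holds, is likely to be more efficient.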

Edit: added code implementing the above

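vec <- as.vector(temp)
pos_matrix <- matrix(c(12, 16, 18, seq(0, 30) * 8 + 22,
         15, 17, 21, seq(0, 30) * 8 + 26), ncol = 2)
pos <- t(pos_matrix)   # one column per field: row 1 = start, row 2 = end
system.time(
# substr() is vectorized over vec, so each call extracts one field from every record
out <- apply(pos, 2, function(x) substr(vec, x[1], x[2]))
)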
# user system elapsed 
# 0.09 0.00 0.09
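
To check that the vectorized `substr` call returns the same fields as the character-by-character version from the question, the two can be run side by side on synthetic data. This is a rough sketch: `make_record`, `slow`, and `fast` are hypothetical names, the records are randomly generated, and timings will vary by machine. One thing it illustrates is that the two approaches return transposed results (fields in rows versus fields in columns).

```r
# Hypothetical helper: one GHCN-style record (5-character value + 3 flag characters)
make_record <- function(id, year, month, element, values) {
  paste0(id, sprintf("%04d%02d", year, month), element,
         paste0(sprintf("%5d", as.integer(values)), "  I", collapse = ""))
}
set.seed(1)
recs <- vapply(1:200, function(i)
  make_record("NO000050550", 1938, 1, "TMAX", sample(-300:300, 31)),
  character(1))

pos_matrix <- matrix(c(12, 16, 18, seq(0, 30) * 8 + 22,
                       15, 17, 21, seq(0, 30) * 8 + 26), ncol = 2)

# Character-by-character version: split, then paste the pieces back together
slow <- function(x) {
  chars <- do.call(rbind, strsplit(x, ""))
  apply(chars, 1, function(r)
    apply(pos_matrix, 1, function(p) paste(r[p[1]:p[2]], collapse = "")))
}

# Vectorized version: one substr() call per field
fast <- function(x) apply(pos_matrix, 1, function(p) substr(x, p[1], p[2]))

system.time(out_slow <- slow(recs))
system.time(out_fast <- fast(recs))

dim(out_slow)  # fields x records
dim(out_fast)  # records x fields
all(t(out_slow) == out_fast)
```
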

2013-12-18 20:40:39 jlhoward

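Thanks for your suggestion. The same dataset takes only 0.4 seconds, which is similar to the speed of sqldf (but without needing to load the sqldf package). – Bangyou

Glad to help. I've added the code above, but it looks like you had already figured it out. – jlhoward

There is a function for reading fixed-width data in the utils package (loaded by default):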

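m <- matrix(c(12, 16, 18, seq(0, 30) * 8 + 22,
    15, 17, 21, seq(0, 30) * 8 + 26), ncol = 2)
# Widths come from the start/end positions; the 3 flag characters after each
# daily value are skipped with negative widths so the later fields stay aligned.
widths <- c(11,    # station ID, which you are apparently ignoring
            m[1:3, 2] - m[1:3, 1] + 1,
            rep(c(5, -3), 31))
read.fwf(textConnection(temp), widths)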

But, at least for me, 6000 such records take 9 seconds.
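
As a small illustration of how `read.fwf` treats widths: a negative width skips that many characters, which is one way to drop the station ID and the flag characters between daily values. The sketch below uses synthetic records; `make_record` is a hypothetical helper and the column names are assumptions made for the example.

```r
# Hypothetical helper: one GHCN-style record (5-character value + 3 flag characters)
make_record <- function(id, year, month, element, values) {
  paste0(id, sprintf("%04d%02d", year, month), element,
         paste0(sprintf("%5d", as.integer(values)), "  I", collapse = ""))
}
recs <- c(make_record("NO000050550", 1938, 1, "TMAX", 1:31),
          make_record("NO000050550", 1938, 1, "TMIN", 31:1))

# Skip the 11-character ID, read year/month/element,
# then read each 5-character value and skip its 3 flag characters
widths <- c(-11, 4, 2, 4, rep(c(5, -3), 31))

con <- textConnection(recs)
df <- read.fwf(con, widths = widths,
               col.names = c("year", "month", "element", paste0("day", 1:31)),
               stringsAsFactors = FALSE)
close(con)

df[, 1:6]
```
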

2013-12-18 20:02:31

I would suggest scan, which works with a file or a connection. The code above can be adapted so that it works conveniently with temp:

writeLines(temp, "temp.txt")
x <- scan("temp.txt", what = "")
# and now convert it to a matrix of appropriate size

I don't know whether this is faster than the sqldf-based solution, but it looks more straightforward to me.
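
One caveat with whitespace-based splitting is that it relies on every field being separated by at least one blank. The sketch below uses two fragments taken from the sample records in the question and `scan(text = ...)` instead of a file; it suggests that a flag immediately followed by a -9999 value is read as a single token, so the token count can differ between records.

```r
# Fragments of the sample records shown in the question
line_ok  <- "NO000050550193801TMAX 53 I 51 I 10 I"
line_bad <- "NO000050550193802TMAX 77 I-9999 -9999"

# scan() splits on whitespace when what = "" (character tokens)
tok_ok  <- scan(text = line_ok,  what = "", quiet = TRUE)
tok_bad <- scan(text = line_bad, what = "", quiet = TRUE)

tok_ok
tok_bad   # "I-9999" stays fused because no blank separates flag and value
```

For records like these, position-based extraction (substr or read.fwf) may be the safer choice.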

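Note: you asked about "substrings of a given length", so technically my answer is about something else. But it looks like it might actually help with the files in this example.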

2013-12-18 21:27:00 lebatsnok
