2013-01-24 191 views
1

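R: extracting subexpression matches from a regular expression

I want to extract several pieces of data from a string with a single regular expression. I wrote a pattern that contains these pieces as parenthesised subexpressions. In a Perl-like environment I would simply assign them to variables with code such as myvar1=$1; myvar2=$2; but how do I do this in R? So far, the only way I have found to access these matches is through regexec. That is not very convenient, because regexec does not support Perl syntax, among other reasons. This is what I do now: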

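getoccurence <- function(text, rex, n) { # rex is the result of regexec()
    # element 1 of rex[[1]] is the whole match, so group n sits at index n + 1
    occstart <- rex[[1]][n + 1]
    occstop <- occstart + attr(rex[[1]], 'match.length')[n + 1] - 1
    # the original indexed occstart with an undefined variable i
    occtext <- substr(text, occstart, occstop)
    return(occtext)
}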
mytext <- "junk text, 12.3456, -01.234, valuable text before comma, all the rest" 
mypattern <- "([0-9]+\\.[0-9]+), (-?[0-9]+\\.[0-9]+), (.*)," 
rez <- regexec(mypattern, mytext) 
var1 <- getoccurence(mytext, rez, 1) 
var2 <- getoccurence(mytext, rez, 2) 
var3 <- getoccurence(mytext, rez, 3) 

Obviously this is a rather clumsy solution, and there should be something better. I would appreciate any suggestions.

Answers

2

Have you looked at regmatches?

> regmatches(mytext, rez) 
[[1]] 
[1] "12.3456, -01.234, valuable text before comma," "12.3456"          
[3] "-01.234"      "valuable text before comma"     

> sapply(regmatches(mytext, rez), function(x) x[4]) 
[1] "valuable text before comma" 
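
To mirror the Perl-style `$1`, `$2` assignment from the question, the result above can be indexed directly. This is a minimal sketch using only base R; the name `m` is introduced here for illustration, while `var1` to `var3` follow the question.

```r
mytext <- "junk text, 12.3456, -01.234, valuable text before comma, all the rest"
mypattern <- "([0-9]+\\.[0-9]+), (-?[0-9]+\\.[0-9]+), (.*),"

# regmatches() on a regexec() result returns one character vector per input string:
# element 1 is the whole match, elements 2 and up are the parenthesised groups
m <- regmatches(mytext, regexec(mypattern, mytext))[[1]]

var1 <- m[2]
var2 <- m[3]
var3 <- m[4]
```

Because the whole match occupies the first position, group n is always found at index n + 1.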

Ouch, indeed! I did read the regmatches documentation, of course, but somehow overlooked this :( Thank you very much!


P.S. Now I see why: I was trying to use regmatches only after regexpr, not after regexec...
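
The difference described in this comment can be seen in a short base R sketch. It also shows one possible workaround for the Perl-syntax concern in the question: with `perl = TRUE`, `regexpr` attaches `capture.start` and `capture.length` attributes that can be sliced by hand. The names `whole`, `rp`, `starts`, `lens` and `groups` are illustrative only.

```r
mytext <- "junk text, 12.3456, -01.234, valuable text before comma, all the rest"
mypattern <- "([0-9]+\\.[0-9]+), (-?[0-9]+\\.[0-9]+), (.*),"

# regexpr() records only the overall match, so regmatches() has no groups to return
whole <- regmatches(mytext, regexpr(mypattern, mytext))

# With perl = TRUE, regexpr() also reports where each capture group starts and how long it is
rp <- regexpr(mypattern, mytext, perl = TRUE)
starts <- as.vector(attr(rp, "capture.start"))
lens <- as.vector(attr(rp, "capture.length"))
groups <- substring(mytext, starts, starts + lens - 1)
```

Here `whole` holds the full matched text, while `groups` holds the three subexpressions in order.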

1

In stringr, this is str_match, or str_match_all if you want to match every occurrence of the pattern in the string. str_match returns a matrix; str_match_all returns a list of matrices.

library(stringr) 
str_match(mytext, mypattern) 
str_match_all(mytext, mypattern) 
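
As a rough sketch of how the returned matrix might be used, assuming the stringr package is installed: column 1 holds the whole match and the remaining columns hold the groups. The names `m` and `all_m` are introduced here for illustration.

```r
library(stringr)

mytext <- "junk text, 12.3456, -01.234, valuable text before comma, all the rest"
mypattern <- "([0-9]+\\.[0-9]+), (-?[0-9]+\\.[0-9]+), (.*),"

# One row per input string; column 1 is the whole match, columns 2-4 are the groups
m <- str_match(mytext, mypattern)
var1 <- m[1, 2]
var2 <- as.numeric(m[1, 3])  # convert the captured text to a number if needed
var3 <- m[1, 4]

# str_match_all() returns a list with one matrix per input string,
# and one row per occurrence of the pattern in that string
all_m <- str_match_all(mytext, mypattern)[[1]]
```
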
1

strapply and strapplyc in the gsubfn package can do this in one step:

> strapplyc(mytext, mypattern) 
[[1]] 
[1] "12.3456"     "-01.234"     
[3] "valuable text before comma" 

> # with simplify = c argument 
> strapplyc(mytext, mypattern, simplify = c) 
[1] "12.3456"     "-01.234"     
[3] "valuable text before comma" 

> # extract second element only 
> strapply(mytext, mypattern, ... ~ ..2) 
[[1]] 
[1] "-01.234" 

> # specify function slightly differently and use simplify = c 
> strapply(mytext, mypattern, ... ~ list(...)[2], simplify = c) 
[1] "-01.234" 

> # same 
> strapply(mytext, mypattern, x + y + z ~ y, simplify = c) 
[1] "-01.234" 

> # same but also convert to numeric - also can use with other variations above 
> strapply(mytext, mypattern, ... ~ as.numeric(..2), simplify = c) 
[1] -1.234 

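In the examples above, the third argument can be either a function or a formula that is converted to a function (the LHS represents the arguments and the RHS is the body).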
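
To make the function/formula equivalence concrete, here is a small sketch assuming the gsubfn package is installed. The helper name `second_group` is hypothetical; it takes one argument per capture group, matching the `x + y + z ~ y` formula used above.

```r
library(gsubfn)

mytext <- "junk text, 12.3456, -01.234, valuable text before comma, all the rest"
mypattern <- "([0-9]+\\.[0-9]+), (-?[0-9]+\\.[0-9]+), (.*),"

# An ordinary function: the three capture groups arrive as three arguments
second_group <- function(x, y, z) y
from_function <- strapply(mytext, mypattern, second_group, simplify = c)

# The formula form: LHS names the arguments, RHS is the function body
from_formula <- strapply(mytext, mypattern, x + y + z ~ y, simplify = c)
```

Both calls should yield the second captured value; the formula is simply a more compact way of writing the same function.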