2013-12-19
1

Extract strings between brackets in R

I have to extract values between some very peculiar delimiters in R.

a <- "{1:0987617820}{2:q312132498s7yd09f8sydf987s6df8797yds9f87098} 
{3:{112:123123214321}}{4:20:asdasd3214213}" 

This is my example string. I want to extract the text between {[0-9]: and }, so that the output for the string above looks like

## output should be 
"0987617820" "q312132498s7yd09f8sydf987s6df8797yds9f87098" "{112:123123214321}" "20:asdasd3214213" 
+0

It looks almost like JSON; maybe the "rjson" package would help you? –

+2

This is hard to do with a regular expression because you have a nested structure. –

+0

The nested structure is exactly what makes this problem harder. – Shreyes

Answers

1

Use Perl-style regular expressions (perl=TRUE). This way is a bit more robust.

a = "{1:0987617820}{2:q312132498s7yd09f8sydf987s6df8797yds9f87098}{3:{112:123123214321}}{4:20:asdasd3214213}" 

foohacky = function(str){ 
    #remove opening bracket 
    pt1 = gsub('\\{+[0-9]:', '@@',str) 
    #remove a closing bracket that is preceded by any alphanumeric character 
    pt2 = gsub('([0-9a-zA-Z])(\\})', '\\1',pt1, perl=TRUE) 
    #split up and hack together the result 
    pt3 = strsplit(pt2, "@@")[[1]][-1] 
    pt3 
} 

For example:

> foohacky(a) 
[1] "0987617820"         
[2] "q312132498s7yd09f8sydf987s6df8797yds9f87098" 
[3] "{112:123123214321}"       
[4] "20:asdasd3214213" 

It also works with nesting:

> a = "{1:0987617820}{{3:{112:123123214321}}{4:{20:asdasd3214213}}" 
> foohacky(a) 
[1] "0987617820"   "{112:123123214321}" "{20:asdasd3214213}" 
3

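This is a horrible hack and will probably break on your real data. Ideally you would just use a parser, but if you are stuck with regex... well... it's not pretty.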

a <- "{1:0987617820}{2:q312132498s7yd09f8sydf987s6df8797yds9f87098} 
{3:{112:123123214321}}{4:20:asdasd3214213}" 

# split based on }{ allowing for newlines and spaces 
out <- strsplit(a, "\\}[[:space:]]*\\{") 
# Make a single vector 
out <- unlist(out) 
# Have an excess open bracket in first 
out[1] <- substring(out[1], 2) 
# Have an excess closing bracket in last 
n <- length(out) 
out[n] <- substring(out[n], 1, nchar(out[n]) - 1) 
# Remove the number colon at the beginning of the string 
answer <- gsub("^[0-9]*\\:", "", out) 

This gives

> answer 
[1] "0987617820"         
[2] "q312132498s7yd09f8sydf987s6df8797yds9f87098" 
[3] "{112:123123214321}"       
[4] "20:asdasd3214213" 

You could wrap something like this in a function if you need to do it for multiple strings.
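
A minimal sketch of such a wrapper, repeating the same steps as above. The function name extract_items and the sample strings are only illustrative, and it carries the same assumption that items are separated by "}{" at the top level:

```r
# Hypothetical helper; the name is illustrative and not part of the answer.
extract_items <- function(x) {
  # split on "}{" boundaries, allowing spaces or newlines in between
  out <- unlist(strsplit(x, "\\}[[:space:]]*\\{"))
  n <- length(out)
  out[1] <- substring(out[1], 2)                     # drop the leading "{"
  out[n] <- substring(out[n], 1, nchar(out[n]) - 1)  # drop the trailing "}"
  sub("^[0-9]*:", "", out)                           # drop the "n:" prefix
}

# apply it to several strings at once
strings <- c("{1:abc}{2:{7:def}}", "{1:xyz}")
lapply(strings, extract_items)
```

lapply returns a list with one character vector per input string, which is convenient because each string may contain a different number of items.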

+1

Note that this will break if you have multiple items nested inside a single item. – Dason
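
One way around that limitation, sketched here as an alternative rather than as anything proposed in the thread, is to track brace depth instead of splitting on "}{". The helper name split_top_level is hypothetical, and the sketch assumes the braces in the input are balanced:

```r
# Hypothetical helper: return the contents of each top-level {n:...} item.
# Assumes balanced braces; unbalanced input is not handled.
split_top_level <- function(x) {
  chars <- strsplit(x, "")[[1]]
  # nesting depth after each character: +1 for "{", -1 for "}"
  depth <- cumsum((chars == "{") - (chars == "}"))
  starts <- which(chars == "{" & depth == 1)  # opening brace of a top-level item
  ends <- which(chars == "}" & depth == 0)    # its matching closing brace
  items <- substring(x, starts + 1, ends - 1) # strip the outer braces
  sub("^[0-9]+:", "", items)                  # strip the leading "n:" key
}

split_top_level("{1:{2:a}{3:b}}{4:c}")
# [1] "{2:a}{3:b}" "c"
```

Because only depth-0 closing braces end an item, several nested items inside one outer item stay together.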

1

Here is a more general way. It returns whatever sits between {[0-9]: and }, allowing a single level of nested {} inside the match.

regPattern <- gregexpr("(?<=\\{[0-9]\\:)(\\{.*\\}|.*?)(?=\\})", a, perl=TRUE) 
a_parse <- regmatches(a, regPattern) 
a <- unlist(a_parse) 
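
One caveat: the greedy .* in that pattern can run across later items when several nested items sit on the same line. A possible variation, assuming a nested group never contains further braces, swaps .* for the negated class [^{}]*. The variable names below are illustrative:

```r
# Variation on the pattern above; assumes at most one level of nesting.
nested <- "{1:0987617820}{{3:{112:123123214321}}{4:{20:asdasd3214213}}"

# [^{}]* cannot cross a brace, so a match stays inside a single item
pattern <- "(?<=\\{[0-9]:)(\\{[^{}]*\\}|[^{}]*)(?=\\})"
res <- unlist(regmatches(nested, gregexpr(pattern, nested, perl = TRUE)))
res
# [1] "0987617820"         "{112:123123214321}" "{20:asdasd3214213}"
```

As in the original pattern, the lookbehind only accepts a single-digit key, so keys such as {12: would need a different approach.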