R: split a string & assign variables based on the split

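I have a field of semantic tags & semantic tag types. Each tag type/tag pair is separated by a comma, and within each pair the tag type & tag are separated by a colon (see below).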

ID | Semantic Tags 

1 | Person:mitch mcconnell, Person:ashley judd, Position:senator 

2 | Person:mitch mcconnell, Position:senator, ProvinceOrState:kentucky, topicname:politics 

3 | Person:mitch mcconnell, Person:ashley judd, Organization:senate, Organization:republican 

4 | Person:ashley judd, topicname:politics 

5 | URL:www.huffingtonpost.com, Company:usa today, Person:chuck todd, Company:msnbc

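I want to split each tag type (the term before the colon) & tag (the term after the colon) into two separate fields: "Tag Type" & "Tag". The final file should look like this: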

ID | Tag Type | Tag 

1 | Person | mitch McConnell 

1 | Person | ashley judd 

1 | Position | senator 

2 | Person | mitch McConnell 

2 | Position | senator 

2 | State | kentucky

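Here is my code so far...

But after that, I'm lost! I believe I need to use lapply or sapply for this, but I don't know where they come into play...

My apologies if this has already been answered elsewhere on the site in some other form. I'm new to R & this is still a bit complicated for me.

Thanks in advance for anyone's help.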

Asked 2013-04-09 by NiuBiBang

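Could you please provide a reproducible example using 'dput(emtable)' (or 'dput(head(emtable))' if that is too much data)? – 2013-04-09 15:03:42

I have reformatted the data so that it looks like its table layout. – NiuBiBang 2013-04-09 15:18:27

Why don't you use 'dput'? It makes things easier for the people answering. – 2013-04-09 15:21:40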

Here is another (slightly different) approach:

# the five tag strings, one per row of the original data
dat <- c(
    "Person:mitch mcconnell, Person:ashley judd, Position:senator",
    "Person:mitch mcconnell, Position:senator, ProvinceOrState:kentucky, topicname:politics",
    "Person:mitch mcconnell, Person:ashley judd, Organization:senate, Organization:republican",
    "Person:ashley judd, topicname:politics",
    "URL:www.huffingtonpost.com, URL:http://www.regular-expressions.info"
)

# split each row on commas, then trim leading/trailing whitespace
dat3 <- lapply(strsplit(dat, ","), function(x) gsub("^\\s+|\\s+$", "", x))
# optional: keep only certain tag types
# dat3 <- lapply(dat3, function(x) x[grepl("Person|Position", x)])
dat3 <- lapply(dat3, strsplit, ":(?!/)", perl=TRUE) # break on ":" not followed by "/"
# one ID per tag, next to the stacked (type, tag) pairs
dat3 <- data.frame(ID=rep(seq_along(dat3), sapply(dat3, length)),
    do.call(rbind, lapply(dat3, function(x) do.call(rbind, x)))
)

colnames(dat3)[-1] <- c("Tag Type", "Tag")

##    ID        Tag Type                                 Tag
## 1   1          Person                     mitch mcconnell
## 2   1          Person                         ashley judd
## 3   1        Position                             senator
## 4   2          Person                     mitch mcconnell
## 5   2        Position                             senator
## 6   2 ProvinceOrState                            kentucky
## 7   2       topicname                            politics
## 8   3          Person                     mitch mcconnell
## 9   3          Person                         ashley judd
## 10  3    Organization                              senate
## 11  3    Organization                          republican
## 12  4          Person                         ashley judd
## 13  4       topicname                            politics
## 14  5             URL              www.huffingtonpost.com
## 15  5             URL http://www.regular-expressions.info
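
The pattern ":(?!/)" in the second strsplit call is a Perl-style negative lookahead: it matches a colon only when the next character is not a slash, so the colon in "http://" is left alone. The sketch below contrasts it with a plain split on ":"; the two input strings are just illustrative picks, and the variable names plain and guarded are made up for this example.

```r
# Two illustrative tag strings; the second contains a URL with its own colon
tags <- c("Person:ashley judd", "URL:http://www.regular-expressions.info")

# A plain split on ":" cuts the URL into three pieces
plain <- strsplit(tags, ":")
sapply(plain, length)

# Split only on a colon that is NOT followed by "/" (needs perl = TRUE)
guarded <- strsplit(tags, ":(?!/)", perl = TRUE)
sapply(guarded, length)
guarded[[2]]
```

With the plain split the second element has three parts, which is what later makes rbind complain about mismatched column counts; with the lookahead every element has exactly two.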

A detailed explanation:

# the five tag strings, one per row of the original data
dat <- c(
    "Person:mitch mcconnell, Person:ashley judd, Position:senator",
    "Person:mitch mcconnell, Position:senator, ProvinceOrState:kentucky, topicname:politics",
    "Person:mitch mcconnell, Person:ashley judd, Organization:senate, Organization:republican",
    "Person:ashley judd, topicname:politics",
    "URL:www.huffingtonpost.com, URL:http://www.regular-expressions.info"
)

# split each row on commas, then trim leading/trailing whitespace
dat3 <- lapply(strsplit(dat, ","), function(x) gsub("^\\s+|\\s+$", "", x))
# optional: keep only certain tag types
# dat3 <- lapply(dat3, function(x) x[grepl("Person|Position", x)])
dat3 <- lapply(dat3, strsplit, ":(?!/)", perl=TRUE) # break on ":" not followed by "/"

# Let the explanation begin... 

# Here I have a short list of the variables from the rows 
# of the original dataframe; in this case the row numbers: 

seq_along(dat3)  #row variables 

# then I use sapply and length to figure out how long the
# split variables in each row (now a list) are

sapply(dat3, length) #n times

# This tells me how many times to repeat each row number.
# It is needed because the dat3 list is about to be stretched
# into one long table. Here I rep the row variables n times

rep(seq_along(dat3), sapply(dat3, length)) 

# better assign that for later: 

ID <- rep(seq_along(dat3), sapply(dat3, length)) 

#============================================ 
# Now to explain the next chunk... 
# I take dat3 

dat3 

# Each of the 5 elements in the list is itself a list of
# character vectors of length 2 (a Tag Type and a Tag).
# For instance, here's element 5: a list of two
# character vectors of length 2

## [[5]] 
## [[5]][[1]] 
## [1] "URL" "www.huffingtonpost.com" 
## 
## [[5]][[2]] 
## [1] "URL" "http://www.regular-expressions.info" 

# Use str to look at this structure: 

dat3[[5]] 
str(dat3[[5]]) 

## List of 2 
## $ : chr [1:2] "URL" "www.huffingtonpost.com" 
## $ : chr [1:2] "URL" "http://www.regular-expressions.info" 

# I use lapply (list apply) to apply an anonymous function:
# function(x) do.call(rbind, x)
#
# to each of the 5 elements. This basically glues the list
# of vectors together to make a matrix. Observe just on element 5:

do.call(rbind, dat3[[5]]) 

##      [,1]  [,2]
## [1,] "URL" "www.huffingtonpost.com"
## [2,] "URL" "http://www.regular-expressions.info"

# We use lapply to do that to all elements: 

lapply(dat3, function(x) do.call(rbind, x)) 

# We then call do.call(rbind, ...) on this list of matrices
# and we have a single matrix

do.call(rbind, lapply(dat3, function(x) do.call(rbind, x))) 

# Let's assign that for later: 

the_mat <- do.call(rbind, lapply(dat3, function(x) do.call(rbind, x))) 

#============================================  
# Now we put it all together with data.frame: 

data.frame(ID, the_mat)
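
The two ideas in this walkthrough (repeating each row number once per tag, and stacking the pairs with a nested do.call(rbind, ...)) can be seen on a much smaller input. This is only a sketch: the two-row list pairs and the result name out are made up for illustration, and stringsAsFactors = FALSE is added so the result holds plain character columns on any R version.

```r
# Made-up input: two rows, already split into (type, tag) pairs
pairs <- list(
  list(c("Person", "ashley judd"), c("Position", "senator")),
  list(c("topicname", "politics"))
)

# Row 1 contributes 2 pairs and row 2 contributes 1, so IDs are 1, 1, 2
ID <- rep(seq_along(pairs), sapply(pairs, length))

# Inner rbind: one matrix per row; outer rbind: stack those matrices
the_mat <- do.call(rbind, lapply(pairs, function(x) do.call(rbind, x)))

out <- data.frame(ID, the_mat, stringsAsFactors = FALSE)
colnames(out)[-1] <- c("Tag Type", "Tag")
out
```

The length of ID always equals the number of rows in the_mat, because both are derived from the same per-row pair counts.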

Answered 2013-04-09 15:29:14

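This seems to be doing the trick. However, when I run the third command, 'do.call(rbind, lapply(dat3, function(x) do.call(rbind, x)))', I get the following message: Error in function (..., deparse.level = 1) : number of columns of matrices must match (see arg 2) In addition: There were 50 or more warnings (use warnings() to see the first 50) – NiuBiBang 2013-04-10 18:23:54

This problem is specific to your data, which is not like the data you have shown here. You can use a debugging tool such as debug to track down the first problem; for the second I would do what it says and use 'warnings()' to see more specifically why you are getting those warnings. – 2013-04-10 18:58:11

Yes, I see that one of my tag types is URL, which often contains "http:". Splitting on ":" then ends up giving the matrices a non-uniform number of columns. So I just added a line of code to remove "http:", b/n the first and second strsplit calls. – NiuBiBang 2013-04-14 01:36:55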
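
Removing "http:" works, but it also alters the tag. An alternative that keeps the URL intact is to split each pair at its first colon only, so every pair yields exactly two pieces no matter how many colons the tag contains. The helper split_first_colon below is a hypothetical name for this sketch, not something from the thread, and the input strings are illustrative.

```r
# Hypothetical helper: split "type:tag" strings at the FIRST colon only
split_first_colon <- function(x) {
  pos <- regexpr(":", x, fixed = TRUE)  # position of the first colon (-1 if none)
  cbind(type = substr(x, 1, pos - 1),
        tag  = substr(x, pos + 1, nchar(x)))
}

tags <- c("URL:http://www.huffingtonpost.com", "Person:chuck todd")
m <- split_first_colon(tags)
m
```

Because the result always has two columns, the later rbind step cannot fail on mismatched column counts. A string with no colon at all would end up with an empty type and the whole string as the tag, so such rows may need separate handling.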

DF 
## ID                     Semantic.Tags 
## 1 1         Person:mitch mcconnell, Person:ashley judd, Position:senator 
## 2 2  Person:mitch mcconnell, Position:senator, ProvinceOrState:kentucky, topicname:politics 
## 3 3  Person:mitch mcconnell, Person:ashley judd, Organization:senate, Organization:republican 
## 4 4               Person:ashley judd, topicname:politics 
## 5 5    URL:www.huffingtonpost.com, Company:usa today, Person:chuck todd, Company:msnbc 


# as.character() guards against Semantic.Tags having been read in as a factor
ll <- lapply(strsplit(as.character(DF$Semantic.Tags), ","), strsplit, split = ":")

f <- function(x) do.call(rbind, x) 

f(lapply(ll, f)) 
##       [,1]               [,2]
##  [1,] "  Person"         "mitch mcconnell"
##  [2,] " Person"          "ashley judd"
##  [3,] " Position"        "senator"
##  [4,] "  Person"         "mitch mcconnell"
##  [5,] " Position"        "senator"
##  [6,] " ProvinceOrState" "kentucky"
##  [7,] " topicname"       "politics "
##  [8,] "  Person"         "mitch mcconnell"
##  [9,] " Person"          "ashley judd"
## [10,] " Organization"    "senate"
## [11,] " Organization"    "republican "
## [12,] "  Person"         "ashley judd"
## [13,] " topicname"       "politics"
## [14,] "  URL"            "www.huffingtonpost.com"
## [15,] " Company"         "usa today"
## [16,] " Person"          "chuck todd"
## [17,] " Company"         "msnbc"
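
The tag types in this output still carry the whitespace that surrounded the commas in the input. One way to drop it is to trim each piece before splitting on the colon. The sketch below uses trimws(), which is in base R from version 3.2.0 onward (on older versions the gsub("^\\s+|\\s+$", "", x) idiom does the same job); the single input string x is just the first row of the example data.

```r
# First row of the example data
x <- "Person:mitch mcconnell, Person:ashley judd, Position:senator"

# Split on commas, then strip leading/trailing whitespace from each piece
pieces <- trimws(strsplit(x, ",", fixed = TRUE)[[1]])

# Split each trimmed piece on ":" and stack the pairs into a matrix
m <- do.call(rbind, strsplit(pieces, ":", fixed = TRUE))
m
```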

Answered 2013-04-09 15:18:51

(+1) Or 'matrix(rapply(ll, rbind), ncol = 2, byrow = TRUE)' for the last two steps. – Henrik 2013-04-09 15:25:44

Or, more transparently: 'matrix(rapply(ll, identity), ncol = 2, byrow = TRUE)' – Henrik 2013-04-09 15:31:24
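
To see why the rapply version can stand in for the nested rbind, it helps to compare the two on a small list. rapply(ll, identity) visits every leaf of the nested list and, with its default how = "unlist", returns one flat character vector; matrix(..., ncol = 2, byrow = TRUE) then refolds that vector into type/tag rows. The list ll below is a made-up stand-in for the real one, and nested and flat are names chosen for this sketch.

```r
# Made-up stand-in for ll: a list (rows) of lists (pairs) of length-2 vectors
ll <- list(
  list(c("Person", "ashley judd"), c("Position", "senator")),
  list(c("topicname", "politics"))
)

# Nested rbind, as in the answer
f <- function(x) do.call(rbind, x)
nested <- f(lapply(ll, f))

# Flatten every leaf, then refold into two columns row by row
flat <- matrix(rapply(ll, identity), ncol = 2, byrow = TRUE)

# Same shape and same contents
same <- identical(dim(nested), dim(flat)) && all(nested == flat)
same
```

The refolding only lines up when every pair has exactly two elements; a tag containing an extra colon would shift all later cells, whereas the nested rbind would instead stop with a column-count error.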

Thanks guys, I actually ended up using a combination of code from the three approaches above. It worked in the end. – NiuBiBang 2013-04-14 01:33:46
