2013-04-09 158 views
0

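R: Split a string & assign variables based on the split

I have a field of semantic tags & semantic tag types. Each tag type/tag pair is separated by a comma, and each tag type & tag are separated by a colon (see below).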

ID | Semantic Tags 

1 | Person:mitch mcconnell, Person:ashley judd, Position:senator 

2 | Person:mitch mcconnell, Position:senator, ProvinceOrState:kentucky, topicname:politics 

3 | Person:mitch mcconnell, Person:ashley judd, Organization:senate, Organization:republican 

4 | Person:ashley judd, topicname:politics 

5 | URL:www.huffingtonpost.com, Company:usa today, Person:chuck todd, Company:msnbc 

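I want to split each tag type (the term before the colon) & tag (the term after the colon) into two separate fields: "Tag Type" & "Tag". The final file should look like this: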

ID | Tag Type | Tag 

1 | Person | mitch McConnell 

1 | Person | ashley judd 

1 | Position | senator 

2 | Person | mitch McConnell 

2 | Position | senator 

2 | State | kentucky 

Here is the code I have so far...

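But after that, I'm lost! I believe I need to use lapply or sapply for this, but I don't know where either one comes into play...

My apologies if this has already been answered in some other form on the site - I'm new to R & this is still a bit complicated for me.

Thanks in advance for anyone's help.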

+1

Could you please provide a reproducible example using 'dput(emtable)' (or 'dput(head(emtable))' if that is too much data)? – 2013-04-09 15:03:42

+0

I have reformatted the data so that it looks the way it is laid out in the table. – NiuBiBang 2013-04-09 15:18:27

+0

Why don't you use 'dput'? It makes things easier for the people answering – 2013-04-09 15:21:40
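For readers unfamiliar with the request in the comments above, here is a minimal sketch of what 'dput' provides. The name `emtable` comes from the comment; its column names and contents are assumptions reconstructed from the first two rows of the table in the question, not the asker's actual data.

```r
# Assumed reconstruction of the first two rows of the question's table.
emtable <- data.frame(
  ID = 1:2,
  Semantic.Tags = c(
    "Person:mitch mcconnell, Person:ashley judd, Position:senator",
    "Person:mitch mcconnell, Position:senator, ProvinceOrState:kentucky, topicname:politics"
  ),
  stringsAsFactors = FALSE
)

# dput() prints R code that rebuilds the object;
# that printed text is what gets pasted into a question.
dput(head(emtable))

# Round trip: capture the text dput() writes, then evaluate it again.
rebuilt <- eval(parse(text = capture.output(dput(emtable))))
all.equal(rebuilt, emtable)
```

Anyone answering can paste the printed text into their own session and get the same object, column types included, which a hand-formatted table cannot guarantee.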

Answers

4

Here is another (slightly different) approach:

## dat <- readLines(n=5) 
## Person:mitch mcconnell, Person:ashley judd, Position:senator 
## Person:mitch mcconnell, Position:senator, ProvinceOrState:kentucky, topicname:politics 
## Person:mitch mcconnell, Person:ashley judd, Organization:senate, Organization:republican 
## Person:ashley judd, topicname:politics 
## URL:www.huffingtonpost.com, URL:http://www.regular-expressions.info 

dat3 <- lapply(strsplit(dat, ","), function(x) gsub("^\\s+|\\s+$", "", x)) 
#dat3 <- lapply(dat2, function(x) x[grepl("Person|Position", x)]) 
dat3 <- lapply(dat3, strsplit, ":(?!/)", perl=TRUE) # split on ":" not followed by "/"
dat3 <- data.frame(ID=rep(seq_along(dat3), sapply(dat3, length)), 
    do.call(rbind, lapply(dat3, function(x) do.call(rbind, x))) 
) 

colnames(dat3)[-1] <- c("Tag Type", "Tag") 

## ID  Tag Type     Tag 
## 1 1   Person  mitch mcconnell 
## 2 1   Person   ashley judd 
## 3 1  Position    senator 
## 4 2   Person  mitch mcconnell 
## 5 2  Position    senator 
## 6 2 ProvinceOrState    kentucky 
## 7 2  topicname    politics 
## 8 3   Person  mitch mcconnell 
## 9 3   Person   ashley judd 
## 10 3 Organization     senate 
## 11 3 Organization    republican 
## 12 4   Person   ashley judd 
## 13 4  topicname    politics 
## 14 5    URL www.huffingtonpost.com 
## 15 5    URL http://www.regular-expressions.info 
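The pattern ":(?!/)" in the code above is what keeps URLs intact. A minimal sketch of the difference, using two sample tags taken from the data in this answer:

```r
# Two sample tags: a plain one and one whose value itself contains "://".
tags <- c("Person:ashley judd", "URL:http://www.regular-expressions.info")

# A plain split on ":" cuts the URL in two, giving 3 pieces instead of 2.
plain <- strsplit(tags, ":")
lengths(plain)    # 2 3

# ":(?!/)" matches a colon only when the next character is not "/".
# Lookahead needs the PCRE engine, hence perl = TRUE.
guarded <- strsplit(tags, ":(?!/)", perl = TRUE)
lengths(guarded)  # 2 2
guarded[[2]]
```

Every piece now has exactly two parts, which is what the later 'rbind' calls rely on. A tag whose value contains a colon that is not followed by "/" would still be split into more than two parts.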

Detailed explanation:

## dat <- readLines(n=5) 
## Person:mitch mcconnell, Person:ashley judd, Position:senator 
## Person:mitch mcconnell, Position:senator, ProvinceOrState:kentucky, topicname:politics 
## Person:mitch mcconnell, Person:ashley judd, Organization:senate, Organization:republican 
## Person:ashley judd, topicname:politics 
## URL:www.huffingtonpost.com, URL:http://www.regular-expressions.info 

dat3 <- lapply(strsplit(dat, ","), function(x) gsub("^\\s+|\\s+$", "", x)) 
#dat3 <- lapply(dat2, function(x) x[grepl("Person|Position", x)]) 
dat3 <- lapply(dat3, strsplit, ":(?!/)", perl=TRUE) # split on ":" not followed by "/"

# Let the explanation begin... 

# Here I have a short list of the variables from the rows 
# of the original dataframe; in this case the row numbers: 

seq_along(dat3)  #row variables 

# then I use sapply and length to figure out how long the 
# split variables in each row (now a list) are 

sapply(dat3, length) #n times 

# this tells me how many times to repeat the short list of 
# variables. This is because I stretch the dat3 list to a vector 
# Here I rep the row variables n times 

rep(seq_along(dat3), sapply(dat3, length)) 

# better assign that for later: 

ID <- rep(seq_along(dat3), sapply(dat3, length)) 

#============================================ 
# Now to explain the next chunk... 
# I take dat3 

dat3 

# Each of the list elements 1-5 is itself a list of 
# vectors of length 2 holding a Tag_Type and a Tag. 
# For instance, here is element 5: a list of two 
# character vectors of length 2 

## [[5]] 
## [[5]][[1]] 
## [1] "URL" "www.huffingtonpost.com" 
## 
## [[5]][[2]] 
## [1] "URL" "http://www.regular-expressions.info" 

# Use str to look at this structure: 

dat3[[5]] 
str(dat3[[5]]) 

## List of 2 
## $ : chr [1:2] "URL" "www.huffingtonpost.com" 
## $ : chr [1:2] "URL" "http://www.regular-expressions.info" 

# I use lapply (list apply) to apply an anonymous function: 
# function(x) do.call(rbind, x) 
# 
# to each of the 5 elements. This basically glues the list 
# of vectors together to make a matrix. Observe just on element 5: 

do.call(rbind, dat3[[5]]) 

##  [,1] [,2]         
## [1,] "URL" "www.huffingtonpost.com"    
## [2,] "URL" "http://www.regular-expressions.info" 

# We use lapply to do that to all elements: 

lapply(dat3, function(x) do.call(rbind, x)) 

# We then call do.call(rbind, ...) on this list and we have a 
# matrix 

do.call(rbind, lapply(dat3, function(x) do.call(rbind, x))) 

# Let's assign that for later: 

the_mat <- do.call(rbind, lapply(dat3, function(x) do.call(rbind, x))) 

#============================================  
# Now we put it all together with data.frame: 

data.frame(ID, the_mat) 
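The walkthrough above can be collected into one function. This is only a sketch: `split_tags` is a hypothetical name, and it swaps in 'trimws' and 'lengths' (base R 3.2.0 and later) for the 'gsub' and 'sapply(..., length)' calls used in the answer.

```r
# split_tags() is a hypothetical helper name, not part of the answer.
split_tags <- function(x) {
  # Split each row on commas and trim surrounding whitespace.
  pieces <- lapply(strsplit(x, ","), trimws)
  # Split every "type:tag" piece on a colon that is not followed by "/".
  pairs <- lapply(pieces, strsplit, ":(?!/)", perl = TRUE)
  # One ID per pair, repeated as often as the row has pairs.
  ID <- rep(seq_along(pairs), lengths(pairs))
  # Glue the pairs of each row into a matrix, then stack the matrices.
  the_mat <- do.call(rbind, lapply(pairs, function(p) do.call(rbind, p)))
  data.frame(ID = ID, Tag.Type = the_mat[, 1], Tag = the_mat[, 2],
             stringsAsFactors = FALSE)
}

# Rows 4 and 5 of the sample data used in this answer.
dat <- c(
  "Person:ashley judd, topicname:politics",
  "URL:www.huffingtonpost.com, URL:http://www.regular-expressions.info"
)
res <- split_tags(dat)
res
```

The sketch assumes every row holds at least one well-formed "type:tag" pair; empty rows or tags with extra colons would need handling before the 'rbind' steps.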
+0

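This seems to be doing the trick. However, when I run the third command, the one containing 'do.call(rbind, lapply(dat3, function(x) do.call(rbind, x)))', I get the following message: Error in function (..., deparse.level = 1) : number of columns of matrices must match (see arg 2) In addition: There were 50 or more warnings (use warnings() to see the first 50) – NiuBiBang 2013-04-10 18:23:54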

+0

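This problem is specific to your data, which are not like the data you show here. You can use a debugging tool such as debug to track down the first problem; for the second, I would do what the message says and use 'warnings()' to see more specifically why you are getting those warnings. – 2013-04-10 18:58:11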

+0

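Yes, I see that one of my tag types is URL, which often contains "http:". Splitting on ":" then breaks the matrix into a non-uniform number of columns. So I just added a line of code that removes "http:", between the first and second strsplit calls. – NiuBiBang 2013-04-14 01:36:55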
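An alternative to deleting "http:" is to split each pair at its first colon only, so that the tag keeps any later colons. This is a sketch of that idea, not the line the commenter added; the "Time:10:30" value is made up for illustration.

```r
# Sample pairs; "Time:10:30" is a made-up value with a second colon.
tags <- c("Person:chuck todd", "URL:http://www.huffingtonpost.com", "Time:10:30")

# Everything before the first colon is the tag type ...
tag_type <- sub(":.*$", "", tags)
# ... and everything after the first colon is the tag.
tag <- sub("^[^:]*:", "", tags)

# Always exactly two columns, however many colons the tag contains.
the_mat <- cbind(tag_type, tag)
the_mat
```

Because 'sub' replaces only the first match, the URL scheme and any other colon inside the tag survive unchanged.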

3
DF 
## ID                     Semantic.Tags 
## 1 1         Person:mitch mcconnell, Person:ashley judd, Position:senator 
## 2 2  Person:mitch mcconnell, Position:senator, ProvinceOrState:kentucky, topicname:politics 
## 3 3  Person:mitch mcconnell, Person:ashley judd, Organization:senate, Organization:republican 
## 4 4               Person:ashley judd, topicname:politics 
## 5 5    URL:www.huffingtonpost.com, Company:usa today, Person:chuck todd, Company:msnbc 


ll <- lapply(strsplit(DF$Semantic.Tags, ","), strsplit, split = ":") 

f <- function(x) do.call(rbind, x) 

f(lapply(ll, f)) 
##  [,1]    [,2]      
## [1,] "  Person"  "mitch mcconnell"  
## [2,] " Person"   "ashley judd"   
## [3,] " Position"  "senator"    
## [4,] "  Person"  "mitch mcconnell"  
## [5,] " Position"  "senator"    
## [6,] " ProvinceOrState" "kentucky"    
## [7,] " topicname"  "politics "    
## [8,] "  Person"  "mitch mcconnell"  
## [9,] " Person"   "ashley judd"   
## [10,] " Organization" "senate"     
## [11,] " Organization" "republican "   
## [12,] "  Person"  "ashley judd"   
## [13,] " topicname"  "politics"    
## [14,] "  URL"   "www.huffingtonpost.com" 
## [15,] " Company"   "usa today"    
## [16,] " Person"   "chuck todd"    
## [17,] " Company"   "msnbc"     
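The padding visible in the matrix above is left over from splitting on "," without trimming. A minimal sketch of cleaning it up afterwards, assuming base R 3.2.0 or later for 'trimws' (the 'gsub' call from the other answer does the same job on older versions):

```r
# A few cells copied from the matrix above, padding included.
m <- matrix(c("  Person", " Position", "mitch mcconnell", "politics "),
            ncol = 2)

# Assigning into m[] keeps the matrix shape while replacing every cell.
m[] <- trimws(m)
m
```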
+0

(+1) Or 'matrix(rapply(ll, rbind), ncol = 2, byrow = TRUE)' for the last two steps. – Henrik 2013-04-09 15:25:44

+1

Or, more transparently: 'matrix(rapply(ll, identity), ncol = 2, byrow = TRUE)' – Henrik 2013-04-09 15:31:24
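To see why the 'rapply' one-liner works, here is a small sketch on a hand-built nested list shaped like `ll` in the answer (the contents are a shortened sample, not the full data):

```r
# Same shape as ll: one list per row, one character pair per tag.
ll <- list(
  list(c("Person", "ashley judd"), c("topicname", "politics")),
  list(c("Company", "msnbc"))
)

# rapply() visits every leaf of the nested list; with the default
# how = "unlist" the results come back as one flat character vector.
flat <- rapply(ll, identity)
flat

# Filling row by row pairs each tag type with its tag again.
mat <- matrix(flat, ncol = 2, byrow = TRUE)
mat
```

This relies on every leaf holding exactly two strings; a tag that split into three pieces would shift all following rows.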

+0

Thanks guys, I actually used a combination of code from the three approaches above. It ended up working. – NiuBiBang 2013-04-14 01:33:46