Reshaping a data.frame based on multiple items regexed from a single column

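I have a data frame containing a list of tweets, grabbed using the twitteR library, and I want to reshape it so that each tweet is annotated with the hashtags it contains, one row per hashtag.

So, for example, I start with: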

tmp=data.frame(tweets=c("this tweet with #onehashtag","#two hashtags #here","no hashtags"),dummy=c('random','other','column')) 
> tmp
                       tweets  dummy
1 this tweet with #onehashtag random
2         #two hashtags #here  other
3                 no hashtags column

and would like to produce:

result=data.frame(tweets=c("this tweet with #onehashtag","#two hashtags #here","#two hashtags #here","no hashtags"),dummy=c('random','other','other','column'),tag=c('#onehashtag','#two','#here',NA)) 
> result
                       tweets  dummy         tag
1 this tweet with #onehashtag random #onehashtag
2         #two hashtags #here  other        #two
3         #two hashtags #here  other       #here
4                 no hashtags column        <NA>

I can use a regular expression:

library(stringr) 
str_extract_all("#two hashtags #here","#[a-zA-Z0-9]+")

to extract the hashtags from a tweet into a list, perhaps using something like:

tmp$tags=sapply(tmp$tweets,function(x) str_extract_all(x,'#[a-zA-Z0-9]+')) 
> tmp
                       tweets  dummy        tags
1 this tweet with #onehashtag random #onehashtag
2         #two hashtags #here  other #two, #here
3                 no hashtags column

But I'm missing a trick somewhere and can't see how to use this as the basis for creating the duplicated rows...

Source

2012-02-20 psychemedia

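Rows with and without hashtags behave differently, so your code will be easier to understand if you handle these cases separately.

Use str_extract_all as before to get the tags.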

tags <- str_extract_all(tmp$tweets, '#[a-zA-Z0-9]+')

(You can also use the regex shortcut alnum to match all alphanumeric characters: '#[[:alnum:]]+'.)
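As a rough illustration of the two patterns, the sketch below compares them on a made-up string. It uses base R's gregexpr and regmatches in place of str_extract_all, a substitution made here only so the snippet has no package dependency; extract_tags is a hypothetical helper name. Neither pattern matches an underscore, and R's documentation notes that the interpretation of [[:alnum:]] depends on the locale, so the two can differ on non-ASCII text.

```r
# Made-up example string; the underscore ends the first hashtag early
x <- "#tag_1 and #R2"

# Hypothetical helper: all matches of `pattern` in a single string (base R only)
extract_tags <- function(text, pattern) {
  regmatches(text, gregexpr(pattern, text))[[1]]
}

explicit <- extract_tags(x, "#[a-zA-Z0-9]+")
posix <- extract_tags(x, "#[[:alnum:]]+")

print(explicit)
print(identical(explicit, posix))
```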

Use rep to figure out how many times to repeat each row.

index <- rep.int(seq_len(nrow(tmp)), sapply(tags, length))
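To make the index concrete, here is a small worked sketch. The tags list is written out by hand to mirror what the extraction step yields for the three example tweets, so it runs without stringr.

```r
# Hashtags per tweet, typed in by hand for the three example tweets
tags <- list("#onehashtag", c("#two", "#here"), character(0))

# Number of hashtags found in each tweet
counts <- sapply(tags, length)

# Repeat each row number once per hashtag
index <- rep.int(seq_len(length(tags)), counts)

print(counts)
print(index)
```

Row 3 has a count of zero, so it does not appear in index at all, which is why rows without hashtags need separate handling.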

Expand tmp using this index, and add a tags column.

tagged <- tmp[index, ] 
tagged$tags <- unlist(tags)

Rows with no tags should appear once (not zero times) and have NA in the tags column.

has_no_tag <- sapply(tags, function(x) length(x) == 0L) 
not_tagged <- tmp[has_no_tag, ] 
not_tagged$tags <- NA

Combine the two.

all_data <- rbind(tagged, not_tagged)
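Note that rbind(tagged, not_tagged) places all untagged rows at the end rather than in their original positions. The sketch below is a variation on the steps above, not the exact method of this answer: it replaces empty matches with a single NA before repeating, so no rbind is needed and the original row order is kept. The function name expand_by_tag and its arguments are hypothetical, and base R's regmatches and gregexpr stand in for str_extract_all.

```r
# Hypothetical helper: one output row per hashtag, NA for tweets without any
expand_by_tag <- function(df, text_col = "tweets", pattern = "#[a-zA-Z0-9]+") {
  text <- as.character(df[[text_col]])  # also copes with factor columns
  tags <- regmatches(text, gregexpr(pattern, text))
  # Give tweets without hashtags a single NA so they are repeated once, not zero times
  tags[lengths(tags) == 0L] <- list(NA_character_)
  out <- df[rep.int(seq_len(nrow(df)), lengths(tags)), , drop = FALSE]
  out$tag <- unlist(tags)
  rownames(out) <- NULL  # replace row names such as "2.1" with 1..n
  out
}

tmp <- data.frame(
  tweets = c("this tweet with #onehashtag", "#two hashtags #here", "no hashtags"),
  dummy = c("random", "other", "column"),
  stringsAsFactors = FALSE
)

result <- expand_by_tag(tmp)
print(result)
```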

Source

2012-02-20 11:34:13

First, let's get the matches:

matches <- gregexpr("#[a-zA-Z0-9]+",tmp$tweets) 
matches 
[[1]] 
[1] 17 
attr(,"match.length") 
[1] 11 

[[2]] 
[1] 1 15 
attr(,"match.length") 
[1] 4 5 

[[3]] 
[1] -1 
attr(,"match.length") 
[1] -1

Now we can use this to get the right number of rows from the original:

rep(seq(matches),times=sapply(matches,length)) 
[1] 1 2 2 3 
tmp2 <- tmp[rep(seq(matches),times=sapply(matches,length)),]

Now use the matches to get the start and end positions:

starts <- unlist(matches) 
ends <- starts + unlist(sapply(matches,function(x) attr(x,"match.length"))) - 1
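As a worked check of the arithmetic, the sketch below recomputes the positions for the three example tweets. The variable lens is introduced here for readability and is not part of the original answer.

```r
tweets <- c("this tweet with #onehashtag", "#two hashtags #here", "no hashtags")
matches <- gregexpr("#[a-zA-Z0-9]+", tweets)

starts <- unlist(matches)                              # 17  1 15 -1
lens <- unlist(lapply(matches, attr, "match.length"))  # 11  4  5 -1
ends <- starts + lens - 1                              # 27  4 19 -3

print(rbind(starts, lens, ends))
```

The tweet without a hashtag reports a start and length of -1, so its computed end is -3.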

And use substr to extract:

tmp2$tag <- substr(tmp2$tweets,starts,ends) 
tmp2
                         tweets  dummy         tag
1   this tweet with #onehashtag random #onehashtag
2           #two hashtags #here  other        #two
2.1         #two hashtags #here  other       #here
3                   no hashtags column
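In the output above, the tag for the tweet without hashtags is an empty string rather than NA, because substr returns "" when given the no-match position -1. If NA is wanted instead, one possible follow-up is sketched here on a shortened, made-up pair of tweets; it relies only on gregexpr reporting -1 when nothing matches.

```r
tweets <- c("#two hashtags #here", "no hashtags")
matches <- gregexpr("#[a-zA-Z0-9]+", tweets)

# One entry per match; a tweet without matches still contributes one entry (-1)
rows <- rep(seq_along(matches), times = lengths(matches))
starts <- unlist(matches)
ends <- starts + unlist(lapply(matches, attr, "match.length")) - 1

tag <- substr(tweets[rows], starts, ends)
tag[starts == -1] <- NA  # turn the empty string from a non-match into NA

print(tag)
```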

Source

2012-02-20 11:20:49 James
