2013-08-28 65 views
1

Please help me with my small project: splitting text vectors with strsplit(...) in R

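I have a large list of text elements. Each element should be split into a small list of sentences. Each small list should be stored as a single element, like the original text element, in a new column of the initial large list at the same position ('row').

The split criteria are "/$", "und/KON" and "oder/KON". Each of these should be kept at the head of the new small element it starts.

I tried regular expressions such as "/$|und/KON|oder/KON" and many combinations of escaping "$", "|" and "/". I also tried changing the arguments perl = TRUE and fixed = TRUE/FALSE. Every time I try, nothing happens. It seems that | is not interpreted correctly. How would you suggest solving this?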

library(stringr) # don't know if it's required 

# Input list to be split at each 
#  "/$", "und/KON", "oder/KON" 
#  but should keep the expression at the start of the next list element 
#  
#  Would be nice but not necessary: The small-list to be named after the ID in the first column 

r <- list(ID=c(01, 02, 03), 
      elements=c("This should become my first small-list :/$. the first element ,/$, the second element ,/$, and the third element ./$.", 
         "This should become my second small-list :/$. Element eins und/KON Element zwei oder/KON Element drei ./$.", 
         "This should become my third small-list :/$. Element Alpha und/KON Element Beta oder/KON Element Gamma ./$.")) 

# Would look something like (my attempt; it returns the strings unsplit) 
r$small_lists <- sapply(r$elements, function(x) as.list(strsplit(x, "/$|und/KON|oder/KON", fixed=TRUE))) 
> r$small_lists 

$01 
[1] "This should become my first small-list " 
[2] ":/$. the first element " 
[3] ",/$, the second element " 
[4] ",/$, and the third element " 
[5] "./$." 

$02 
[1] "This should become my second small-list " 
[2] ":/$. Element eins " 
[3] "und/KON Element zwei " 
[4] "oder/KON Element drei" 
[5] "./$." 

$03 
[1] "This should become my third small-list " 
[2] ":/$. Element Alpha " 
[3] "und/KON Element Beta " 
[4] "oder/KON Element Gamma " 
[5] "./$." 

> class(r) 
[1] "list" 
> class(r$small_lists) 
[1] "list" 
+1

I don't see a question in here. – A5C1D2H2I1M1N2O1R2T1

+0

@AnandaMahto: Sorry, thanks, done :) – alex

+0

Thanks! :) For my better understanding, could you explain what `"^&*\\1"` and `"^&*"` do, respectively? – alex

Answer

3

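Actually, if this is your desired output, you have more splitting patterns than you indicated. Note that my pattern is different from yours: all special characters have been escaped with \\.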
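To see why the escaping matters, here is a minimal sketch. The test string `x` is made up for illustration and is not part of your data.

```r
# Hypothetical one-line input, made up for this illustration
x <- "first :/$. second und/KON third"

# Unescaped: "$" is the end-of-string anchor, so "/$" only matches a
# slash at the very end of the string, and there is none here
unescaped <- grepl("/$", x)                              # FALSE

# Escaped: "\\$" matches a literal dollar sign
escaped <- grepl("/\\$", x)                              # TRUE

# fixed = TRUE treats the whole pattern as one literal string, so "|"
# is no longer an alternation operator and nothing matches
literal <- grepl("/$|und/KON", x, fixed = TRUE)          # FALSE
```

That last call is probably why your attempts seemed to do nothing: with fixed = TRUE the pipe is just a character, not "or".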

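To keep things manageable, I'll create a separate vector of the patterns to split on, paste them together into one master pattern, search for that pattern and mark each match with some text that you know won't occur in your data, and then split on that marker.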

Here are the "patterns" I've identified:

Pattern <- c(":/\\$", ",/\\$", "\\./\\$", 
      "und/KON", "oder/KON") 

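We can paste these patterns together to get the master pattern. The inner paste joins them with the pipe symbol (collapse = "|"), so the result matches any one of the patterns. The whole pattern is wrapped in parentheses (()) so that we can refer to the match later.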

Pattern <- paste("(", paste(Pattern, collapse = "|"), ")", sep = "") 
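If you want to check what the regex engine actually receives, a quick sketch is to print the pattern with writeLines, which shows the string without R's extra layer of backslash escaping:

```r
# Rebuild the master pattern exactly as above
Pattern <- c(":/\\$", ",/\\$", "\\./\\$",
             "und/KON", "oder/KON")
Pattern <- paste("(", paste(Pattern, collapse = "|"), ")", sep = "")

# writeLines shows the raw characters handed to the regex engine
writeLines(Pattern)
# (:/\$|,/\$|\./\$|und/KON|oder/KON)
```

Each "\\" typed in R source is a single backslash in the actual string.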

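We can now use gsub to put a marker in front of every match. The \\1 in the replacement refers to the text matched by the parenthesised pattern, so the match itself is written back after the marker. We insert a marker instead of splitting on the pattern directly because you want to keep the matched expressions.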

## Insert some text pattern you know doesn't occur in your text 
## Here, I've prepended the matched patterns with "^&*" 
## You now have something on which you can split 
strsplit(gsub(Pattern, "^&*\\1", r$elements), "^&*", fixed = TRUE) 
# [[1]] 
# [1] "This should become my first small-list " 
# [2] ":/$. the first element "     
# [3] ",/$, the second element "    
# [4] ",/$, and the third element "    
# [5] "./$."         
# 
# [[2]] 
# [1] "This should become my second small-list " 
# [2] ":/$. Element eins "      
# [3] "und/KON Element zwei "     
# [4] "oder/KON Element drei "     
# [5] "./$."          
# 
# [[3]] 
# [1] "This should become my third small-list " 
# [2] ":/$. Element Alpha "      
# [3] "und/KON Element Beta "     
# [4] "oder/KON Element Gamma "     
# [5] "./$." 
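To make the two steps easier to follow, here is a sketch that runs them separately on a single shortened string. The input `x` is made up for illustration.

```r
# Master pattern as built above
Pattern <- "(:/\\$|,/\\$|\\./\\$|und/KON|oder/KON)"

# Hypothetical shortened input
x <- "eins und/KON zwei oder/KON drei ./$."

# Step 1: put the marker "^&*" in front of every match
marked <- gsub(Pattern, "^&*\\1", x)
# marked is "eins ^&*und/KON zwei ^&*oder/KON drei ^&*./$."

# Step 2: split on the marker as a literal string (fixed = TRUE),
# so "^", "&" and "*" are not read as regex syntax
pieces <- strsplit(marked, "^&*", fixed = TRUE)[[1]]
# pieces is c("eins ", "und/KON zwei ", "oder/KON drei ", "./$.")
```

The marker is consumed by the split, while the matched expression stays at the head of each piece.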

Continuing from above, to get the named list you describe:

out <- strsplit(gsub(Pattern, "^&*\\1", r$elements), "^&*", fixed = TRUE) 
setNames(lapply(out, `[`, -1), lapply(out, `[`, 1)) 
# $`This should become my first small-list ` 
# [1] ":/$. the first element "  
# [2] ",/$, the second element " 
# [3] ",/$, and the third element " 
# [4] "./$."      
# 
# $`This should become my second small-list ` 
# [1] ":/$. Element eins "  
# [2] "und/KON Element zwei " 
# [3] "oder/KON Element drei " 
# [4] "./$."     
# 
# $`This should become my third small-list ` 
# [1] ":/$. Element Alpha "  
# [2] "und/KON Element Beta " 
# [3] "oder/KON Element Gamma " 
# [4] "./$." 
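If you would rather name the small lists after the ID column and store them back in `r`, as the question mentions, something along these lines should work. This is only a sketch: the two input strings are shortened stand-ins for your data, and the zero-padded names via sprintf are my assumption about the format you want.

```r
# Shortened, hypothetical stand-in for the list in the question
r <- list(ID = c(1, 2),
          elements = c("Head one :/$. a ,/$, b ./$.",
                       "Head two :/$. c und/KON d oder/KON e ./$."))

Pattern <- "(:/\\$|,/\\$|\\./\\$|und/KON|oder/KON)"
out <- strsplit(gsub(Pattern, "^&*\\1", r$elements), "^&*", fixed = TRUE)

# Name each small list after its ID, zero-padded to two digits
# (assumed format, mirroring the "$01" in the desired output)
r$small_lists <- setNames(out, sprintf("%02d", r$ID))

second <- r$small_lists[["02"]]
# second is c("Head two ", ":/$. c ", "und/KON d ", "oder/KON e ", "./$.")
```

Here the header text stays as the first element of each small list, as in your desired output.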
+0

Thank you very much. For better understanding, could you explain what the `\\1` part means? Is it a random sequence of characters, or does it matter? – alex

+0

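@alex, those are backreferences. Matches in a regular expression can be grouped in parentheses (`()`). The first group is backreferenced as `\\1`, the second as `\\2`, and so on. Here we have only one group, so it is `\\1`, and it should stay that way. – A5C1D2H2I1M1N2O1R2T1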
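A small sketch of backreferences in isolation may help. The strings and the angle-bracket replacement are made up for illustration.

```r
# Two groups: \\1 is the first parenthesised match, \\2 the second
swapped <- sub("(und|oder)/(KON)", "\\2/\\1", "eins und/KON zwei")
# swapped is "eins KON/und zwei"

# One group, as in the answer: \\1 writes back whatever was matched,
# here wrapped in angle brackets instead of prefixed with "^&*"
wrapped <- gsub("(und/KON|oder/KON)", "<\\1>", "a und/KON b oder/KON c")
# wrapped is "a <und/KON> b <oder/KON> c"
```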

+0

Thank you very much! :) – alex