2013-03-20
1

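How do I split multi-line text in R?

I have an input file that contains one paragraph. I need to split the paragraph into two sub-paragraphs according to a pattern.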

paragraph.xml

<Text> 
     This is first line. 
     This is second line. 
     \delemiter\new\one 
     This is third line. 
     This is fourth line. 
</Text> 

In my R code:

library(XML)
doc <- xmlTreeParse("paragraph.xml")
top <- xmlRoot(doc)   # the <Text> node
text <- top[[1]]      # its first child, the text node

I need to split this paragraph into two paragraphs.

Paragraph 1

This is first line. 
This is second line. 

Paragraph 2

This is third line.
This is fourth line.

I found the strsplit function very useful, but it never splits the multi-line text.

+0

Is this a 'character' vector of length one with embedded newlines, a list, or a text file you have not read in yet? – 2013-03-20 04:34:59

+0

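Please edit your question to show the exact structure of your data (or some sample data). For example, paste the result of 'dput(head(yourdata))'. It is currently unclear how the new lines are determined. – Ben 2013-03-20 04:36:07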

Answers

2

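Since you have an XML file, it is better to use the facilities of the XML package. I see you have already started using it, so this continues from where you left off.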

library(XML)
doc <- xmlParse('paragraph.xml') ## equivalent to xmlTreeParse(..., useInternalNodes = TRUE)
## extract the text of the Text node
mytext <- xpathSApply(doc, '//Text/text()', xmlValue)
## convert it to a vector of lines using scan; strip.white drops the indentation
lines <- scan(text = mytext, sep = '\n', what = 'character', strip.white = TRUE)
## get the delimiter index
delim <- which(lines == "\\delemiter\\new\\one")
## get the 2 paragraphs
p1 <- lines[seq_len(delim - 1)]
p2 <- lines[seq(delim + 1, length(lines))]

Then you can use paste or write to get the paragraph structure. For example, using write:

write(p1,"",sep='\n') 

This is first line. 
This is second line. 
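
If the text is already held in a single string with embedded newlines, the same split can probably be sketched in base R alone with strsplit. The variable raw_text below is a hypothetical stand-in for what xmlValue would return for the sample file, and the delimiter is spelled exactly as in the question; treat this as an illustration under those assumptions, not as a replacement for the XML-based approach.

```r
# Hypothetical input: one string with embedded newlines (assumed shape of the node text)
raw_text <- "\n     This is first line. \n     This is second line. \n     \\delemiter\\new\\one \n     This is third line. \n     This is fourth line. \n"

# strsplit() breaks the string at each newline; trimws() removes the indentation
lines <- trimws(strsplit(raw_text, "\n", fixed = TRUE)[[1]])
lines <- lines[nzchar(lines)]   # drop the empty leading line

# locate the delimiter line and take everything before and after it
delim <- which(lines == "\\delemiter\\new\\one")
p1 <- lines[seq_len(delim - 1)]
p2 <- lines[seq(delim + 1, length(lines))]
```

The key point is that strsplit works on the newline character itself, so it only helps once the whole paragraph is one string rather than a node object.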
+0

Can I use cat instead of the write function to get the paragraph structure? – Manish 2013-03-20 06:35:39

+0

@user15662 Of course. Just replace 'write' with 'cat'. – agstudy 2013-03-20 06:37:37
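
A minimal sketch of the cat variant mentioned in the comment above, with the vector p1 re-created by hand so the block is self-contained (in the answer it comes from the parsed file). The name para is hypothetical.

```r
# p1 re-created here for illustration; in the answer it is built from the XML file
p1 <- c("This is first line.", "This is second line.")

# cat() prints one element per line when sep is a newline
cat(p1, sep = "\n")
cat("\n")   # cat() does not add a trailing newline on its own

# paste() with collapse builds a single string holding the whole paragraph
para <- paste(p1, collapse = "\n")
```

Unlike write, cat does not end its output with a newline, which is why the extra call is there.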

1

Here is a roundabout possibility, using split, grepl, and cumsum.

Some sample data:

temp <- c("This is first line.", "This is second line.", 
      "\\delimiter\\new\\one", "This is third line.", 
      "This is fourth line.", "\\delimiter\\new\\one", 
      "This is fifth line") 
# [1] "This is first line." "This is second line." "\\delimiter\\new\\one" 
# [4] "This is third line." "This is fourth line." "\\delimiter\\new\\one" 
# [7] "This is fifth line" 

Use split after generating "groups" with cumsum and grepl:

temp1 <- split(temp, cumsum(grepl("delimiter", temp))) 
temp1 
# $`0` 
# [1] "This is first line." "This is second line." 
# 
# $`1` 
# [1] "\\delimiter\\new\\one" "This is third line." "This is fourth line." 
# 
# $`2` 
# [1] "\\delimiter\\new\\one" "This is fifth line" 

If further cleanup is needed, here is one option:

lapply(temp1, function(x) { 
    x[grep("delimiter", x)] <- NA 
    x[complete.cases(x)] 
}) 
# $`0` 
# [1] "This is first line." "This is second line." 
# 
# $`1` 
# [1] "This is third line." "This is fourth line." 
# 
# $`2` 
# [1] "This is fifth line"
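
The grouping and the cleanup can likely be folded into one step by dropping the delimiter lines before calling split. The helper name split_paragraphs is hypothetical and not part of any package; this is only a sketch, assuming the delimiter can be matched as a fixed substring.

```r
# Hypothetical helper: split a character vector into paragraphs at delimiter lines
split_paragraphs <- function(lines, pattern) {
  is_delim <- grepl(pattern, lines, fixed = TRUE)  # TRUE on delimiter lines
  groups <- cumsum(is_delim)                       # running paragraph index
  # keep only the non-delimiter lines, grouped by paragraph index
  unname(split(lines[!is_delim], groups[!is_delim]))
}

temp <- c("This is first line.", "This is second line.",
          "\\delimiter\\new\\one", "This is third line.",
          "This is fourth line.", "\\delimiter\\new\\one",
          "This is fifth line")
paras <- split_paragraphs(temp, "delimiter")
```

Because the delimiter lines are removed before grouping, two delimiters in a row do not produce an empty paragraph.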