How do I split multi-line text in R?

I have an input file that contains a single paragraph. I need to split that paragraph into two sub-paragraphs at a given pattern.

paragraph.xml

<Text> 
     This is first line. 
     This is second line. 
     \delemiter\new\one 
     This is third line. 
     This is fourth line. 
</Text>

In my R code:

library(XML)
doc <- xmlTreeParse("paragraph.xml")
top <- xmlRoot(doc)
text <- top[[1]]

I need this paragraph split into 2 paragraphs.

Paragraph 1

This is first line.
This is second line.

Paragraph 2

This is third line.
This is fourth line.

I found the strsplit function very useful, but it never splits the multi-line text.

2013-03-20 Manish

Is this a length-one 'character' vector with embedded newlines, a list, or a text file you have not read in yet? – 2013-03-20 04:34:59

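Please edit your question to show the exact structure of your data (or some sample data). For example, paste the result of 'dput(head(yourdata))'. It is currently unclear how the new lines are determined. – Ben 2013-03-20 04:36:07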

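Since you have an XML file, it is better to use the facilities of the XML package. I see you have already started using it, so here is a continuation of what you started.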

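library(XML)
doc <- xmlParse('paragraph.xml') ## equivalent to xmlTreeParse(..., useInternalNodes = TRUE)
## extract the text of the Text node
mytext <- xpathSApply(doc, '//Text/text()', xmlValue)
## convert it to a vector of lines using scan;
## strip.white = TRUE trims the indentation and trailing spaces around each line
lines <- scan(text = mytext, sep = '\n', what = 'character', strip.white = TRUE)
## drop any lines left empty after trimming
lines <- lines[nzchar(lines)]
## get the delimiter index (spelled "delemiter", as in paragraph.xml)
delim <- which(lines == "\\delemiter\\new\\one")
## get the 2 paragraphs
p1 <- lines[seq_len(delim - 1)]
p2 <- lines[seq(delim + 1, length(lines))]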

Then you can use paste or write to get the paragraph structure back. For example, using write:

write(p1,"",sep='\n') 

This is first line. 
This is second line.
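
For the paste and cat alternatives, a minimal sketch might look like the following. It assumes p1 holds the two lines printed above (the vector is rebuilt by hand here so the snippet runs on its own), and para1 is just an illustrative name.

```r
# Assumed value of p1, copied from the output shown above
p1 <- c("This is first line.", "This is second line.")

# paste() with collapse joins the lines into one string with embedded newlines
para1 <- paste(p1, collapse = "\n")

# cat() prints one line per element, much like write(p1, "", sep = "\n"),
# except that it does not add a newline after the last element
cat(p1, sep = "\n")
```

paste is the one to use if the paragraph is needed as a single string for later processing; cat and write only print it.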

2013-03-20 06:26:23 agstudy

Can I use cat instead of the write function to get the paragraph structure? – Manish 2013-03-20 06:35:39

@user15662 Of course. Just replace 'write' with 'cat'. – agstudy 2013-03-20 06:37:37

Here is a roundabout possibility, using split, grepl, and cumsum.

Some sample data:

temp <- c("This is first line.", "This is second line.", 
      "\\delimiter\\new\\one", "This is third line.", 
      "This is fourth line.", "\\delimiter\\new\\one", 
      "This is fifth line") 
# [1] "This is first line." "This is second line." "\\delimiter\\new\\one" 
# [4] "This is third line." "This is fourth line." "\\delimiter\\new\\one" 
# [7] "This is fifth line"

Use split after creating "groups" with cumsum on the result of grepl:

temp1 <- split(temp, cumsum(grepl("delimiter", temp))) 
temp1 
# $`0` 
# [1] "This is first line." "This is second line." 
# 
# $`1` 
# [1] "\\delimiter\\new\\one" "This is third line." "This is fourth line." 
# 
# $`2` 
# [1] "\\delimiter\\new\\one" "This is fifth line"
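
To see why the grouping works, it may help to look at the intermediate values. The sketch below repeats the sample data so it runs standalone; is_delim and group_id are illustrative names, not part of the code above.

```r
temp <- c("This is first line.", "This is second line.",
          "\\delimiter\\new\\one", "This is third line.",
          "This is fourth line.", "\\delimiter\\new\\one",
          "This is fifth line")

# TRUE wherever a line contains the text "delimiter"
is_delim <- grepl("delimiter", temp)
# FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE

# The running count goes up by one at each delimiter line, so every line
# is labelled with the number of delimiters seen so far
group_id <- cumsum(is_delim)
# 0 0 1 1 1 2 2
```

split then collects the lines that share a group id, which is why each delimiter line ends up as the first element of the group it starts.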

If further cleanup is needed, here is one option:

lapply(temp1, function(x) { 
    x[grep("delimiter", x)] <- NA 
    x[complete.cases(x)] 
}) 
# $`0` 
# [1] "This is first line." "This is second line." 
# 
# $`1` 
# [1] "This is third line." "This is fourth line." 
# 
# $`2` 
# [1] "This is fifth line"
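
Another option, sketched here under the same sample data, is to drop the delimiter lines before splitting so that no cleanup pass is needed. The names is_delim and paragraphs are illustrative only.

```r
temp <- c("This is first line.", "This is second line.",
          "\\delimiter\\new\\one", "This is third line.",
          "This is fourth line.", "\\delimiter\\new\\one",
          "This is fifth line")

# Flag the delimiter lines (fixed = TRUE matches the text literally)
is_delim <- grepl("delimiter", temp, fixed = TRUE)

# Compute the group ids on the full vector, then keep only the
# non-delimiter lines and their ids when splitting
paragraphs <- split(temp[!is_delim], cumsum(is_delim)[!is_delim])
```

For the file in the question, the pattern would need to be "delemiter", since that is how the delimiter is spelled in paragraph.xml.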

2013-03-20 04:58:52 A5C1D2H2I1M1N2O1R2T1
