2014-01-17 119 views
1

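Split a string into sections based on headings

I have a string made up of several sections named "Section 1" ... "Section 20", and I want to split it into those individual sections. Here is an example: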

Stuff we don't care about 

Section 1 
Text within this section, may contain the word section. 

And go on for quite a bit. 

Section 15 
Another section 

I want to split this into:

["Section 1\n Text within this section, may contain the word section.\n\nAnd go on for quite a bit.", 
"Section 15 Another section"] 

I feel rather stupid for not getting this right. My attempts always capture everything. Right now I have:

/(Section.+\d+$[\s\S]+)/ 

But I can't get the greediness out of it.

+0

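Once "Section 1" is encountered, do you want to capture everything after it? Or do you want to ignore the text after Section 20? Does the section text *always* start on the line right after the heading, or can there be paragraphs/blank lines between sections? –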

+0

The example is clear. He wants each section (heading + text) as an element of the array. – robertodecurnex

+0

Did this help? –

Answers

0

As I see it, the Regexp to split the text on is the following:

/(?:\n\n|^)Section/ 

So the code is:

str = " 
Stuff we don't care about 

Section 1 
Text within this section, may contain the word section. 

And go on for quite a bit. 

Section 15 
Another section 
" 

newstr = str.split(/(?:\n\n|^)Section/, -1)[1..-1].map {|l| "Section " + l.strip } 
# => ["Section 1\nText within this section, may contain the word section.\n\nAnd go on for quite a bit.", "Section 15\nAnother section"] 
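As a variation on the split above, here is a minimal sketch that splits on a zero-width lookahead, so each heading stays attached to its own piece and does not need to be glued back on. It assumes a heading is a whole line of the form "Section <number>"; the names `text`, `pieces` and `sections` are only for illustration.

```ruby
# Sample text modelled on the question's example (built with join so no
# stray trailing whitespace sneaks in).
text = [
  "Stuff we don't care about",
  "",
  "Section 1",
  "Text within this section, may contain the word section.",
  "",
  "And go on for quite a bit.",
  "",
  "Section 15",
  "Another section"
].join("\n")

# Split at every line start that is followed by "Section <number>".
# The lookahead consumes nothing, so the heading remains in the piece.
pieces = text.split(/^(?=Section \d+$)/)

# Drop the preamble before the first heading and trim surrounding whitespace.
sections = pieces.select { |piece| piece.start_with?("Section ") }.map(&:strip)

puts sections.inspect
```

Because the heading pattern requires a number and must fill the whole line, the word "section" inside the body text does not cause a split.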
+0

Sorry, the text within each section is more complex and may contain newlines, etc. I'll update it. –

+0

@MattW. I've updated the answer –

0

You can use this expression:

(?m)(Section\s*\d+)(.*?\1)$ 

Live demo

+0

It doesn't work for me. It ignores the "Another section" at the end and gives weird matches – robertodecurnex

+0

@robertodecurnex You're wrong. "Another section" stands for, say, "Section 16". It does work, though. – revo

+0

No, I just took the sample and used your link -> http://www.rubular.com/r/euxXwqo03d – robertodecurnex

0

You can use scan with this regex: /Section\s\d+\n(?:.(?!Section\s\d+\n))*/m

string.scan(/Section\s\d+\n(?:.(?!Section\s\d+\n))*/m) 

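Section\s\d+\n will match any section header.

(?:.(?!Section\s\d+\n))* will match any run of characters, stopping just before the character that is immediately followed by another section header.

m makes the dot match newlines too.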

sample = <<SAMPLE
Stuff we don't care about

Section 1
Text within this section, may contain the word section.

And go on for quite a bit.

Section 15
Another section
SAMPLE

sample.scan(/Section\s\d+\n(?:.(?!Section\s\d+\n))*/m) 
#=> ["Section 1\nText within this section, may contain the word section.\n\nAnd go on for quite a bit.\n", "Section 15\nAnother section\n"] 
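For comparison, here is a rough sketch of the same scan idea written with a lazy quantifier and a lookahead, which also shows where the greediness in the question comes from. It assumes each heading sits alone on its line; `sample`, `greedy` and `sections` are just illustrative names.

```ruby
sample = [
  "Stuff we don't care about",
  "",
  "Section 1",
  "Text within this section, may contain the word section.",
  "",
  "And go on for quite a bit.",
  "",
  "Section 15",
  "Another section"
].join("\n")

# Greedy: [\s\S]+ swallows everything up to the end of the string,
# so the first heading's match also contains every later section.
greedy = sample.scan(/^Section \d+$[\s\S]+/)

# Lazy: .*? grows one character at a time and stops as soon as the
# lookahead sees the next heading line or the end of the string (\z).
sections = sample.scan(/^Section \d+$.*?(?=^Section \d+$|\z)/m).map(&:strip)

puts greedy.length
puts sections.inspect
```

The lookahead is zero-width, so the next heading is not consumed and remains available as the start of the following match.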
0

I think the simplest approach is:

str = "Stuff we don't care about 

Section 1 
Text within this section, may contain the word section. 

And go on for quite a bit. 

Section 15 
Another section" 

str[/^Section 1.+/m] # => "Section 1\nText within this section, may contain the word section.\n\nAnd go on for quite a bit.\n\nSection 15\nAnother section" 

If you want to break the text at the Section headings, start the same way, then take advantage of Enumerable's slice_before:

str = "Stuff we don't care about 

Section 1 
Text within this section, may contain the word section. 

And go on for quite a bit. 

Section 15 
Another section" 

str[/^Section 1.+/m].split("\n").slice_before(/^Section \d+/m).map{ |a| a.join("\n") } 
# => ["Section 1\nText within this section, may contain the word section.\n\nAnd go on for quite a bit.\n", 
#  "Section 15\nAnother section"] 

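The slice_before documentation says:

Creates an enumerator for each chunked elements. The beginnings of chunks are defined by pattern and the block.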
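To make that description concrete, here is a small sketch of slice_before on an array of lines. The `lines` data is made up for illustration; the point is only that every element matching the pattern begins a new chunk, and anything before the first match ends up in a chunk of its own.

```ruby
# Hypothetical input: a preamble line followed by two sections.
lines = [
  "preamble",
  "Section 1",
  "first body line",
  "",
  "second body line",
  "Section 15",
  "Another section"
]

# Each element for which `pattern === element` holds starts a new chunk.
chunks = lines.slice_before(/\ASection \d+\z/).to_a

# Keep only chunks that open with a heading, then rebuild each section's text.
sections = chunks
  .select { |chunk| chunk.first =~ /\ASection \d+\z/ }
  .map { |chunk| chunk.join("\n").strip }

puts chunks.inspect
puts sections.inspect
```

Filtering on the first element of each chunk is one way to discard the preamble without first cutting the string down with a regex.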

+0

Note that there is a comma at the end of the first line. There are 2 elements in the example. – robertodecurnex

+0

That only makes it easier. Thanks. –