2011-11-12 46 views
-7

Possible duplicate:
My regex is not working properly

How to avoid the text between {{ }}

Suppose I have a very long text. From the text below I need only the abstract part. How can I skip the text between {{ }}? Thanks `

{{ Info extra text}} 
{{Infobox film 
| name   = Papori 
| released  = 1986 
| runtime  = 144 minutes 
| country  = Assam, {{IND}} 
| budget   = [[a]] 
| followed by = free 
}} 
Albert Einstein (/'ælb?rt 'a?nsta?n/; German: ['alb?t 'a?n?ta?n] (listen); 14 March 1879 – 18 April 1955) 
was a German-born theoretical physicist who developed the theory of general relativity, effecting a 
revolution in physics. For this achievement, Einstein is often regarded as the father of modern physics 
and one of the most prolific intellects in human history.` 

OUTPUT:

Albert Einstein (/'ælb?rt 'a?nsta?n/; German: ['alb?t 'a?n?ta?n] (listen); 14 March 1879 – 18 April 1955) 
was a German-born theoretical physicist who developed the theory of general relativity, effecting a 
revolution in physics. For this achievement, Einstein is often regarded as the father of modern physics 
and one of the most prolific intellects in human history. 
+0

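If you are really *just* asking how to get the abstract of a Wikipedia article, note that the fine folks at [DBpedia](http://dbpedia.org/page/Albert_Einstein) make Wikipedia articles available in a structured way (and also handle the wiki markup).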

+0

@John Flatness Does DBpedia provide an API?

+1

Duplicate of [My regex is not working properly](http://stackoverflow.com/questions/8029633/my-regex-is-not-working-properly) and [Regarding regex (Python)](http://stackoverflow.com/questions/8028729/rearding-regex-python) – agf

Answers

1

Here is what I did:

>>> text 
"{{ Info extra text}}\n{{Infobox film\n| name   = Papori\n| released  = 1986\n| runtime  = 144 minutes\n| country  = Assam, {{IND}}\n| budget   = [[a]]\n| followed by = free\n}}\nAlbert Einstein (/'ælb?rt 'a?nsta?n/; German: ['alb?t 'a?n?ta?n] (listen); 14 March 1879 – 18 April 1955)\n was a German-born theoretical physicist who developed the theory of general relativity, effecting a\n revolution in physics. For this achievement, Einstein is often regarded as the father of modern physics \n and one of the most prolific intellects in human history.`" 
>>> import re
>>> re.sub(r"\{\{[\w\W\n\s]*\}\}", "", text) 
"\nAlbert Einstein (/'ælb?rt 'a?nsta?n/; German: ['alb?t 'a?n?ta?n] (listen); 14 March 1879 – 18 April 1955)\n was a German-born theoretical physicist who developed the theory of general relativity, effecting a\n revolution in physics. For this achievement, Einstein is often regarded as the father of modern physics \n and one of the most prolific intellects in human history.`" 

Edit: Bart's comment is correct.

You might consider this alternative instead:

>>> re.sub(r"\{\{[^\}]*\}\}", "", "{{a\n oaheduh}} b {{c}} d") 
' b  d' 
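
The `[^\}]*` pattern above stops at the first `}}` it meets, so on the question's sample the nested `{{IND}}` ends the match early and the rest of the infobox is left behind. One possible way around that, offered only as a sketch and not as something proposed in this thread, is to remove innermost `{{...}}` blocks repeatedly until none remain. The names `INNERMOST` and `strip_templates` are hypothetical, and the sketch assumes templates contain no stray single `{` or `}`.

```python
import re

# Matches an innermost block: "{{ ... }}" with no braces inside it.
INNERMOST = re.compile(r"\{\{[^{}]*\}\}")

def strip_templates(text):
    """Remove {{...}} blocks, nested ones included, innermost first."""
    while True:
        text, count = INNERMOST.subn("", text)
        if count == 0:  # nothing left to remove
            return text

# Shortened version of the sample from the question.
sample = (
    "{{ Info extra text}}\n"
    "{{Infobox film\n"
    "| country = Assam, {{IND}}\n"
    "| budget = [[a]]\n"
    "}}\n"
    "Albert Einstein was a German-born theoretical physicist."
)

print(strip_templates(sample).strip())
```

The first pass removes `{{ Info extra text}}` and `{{IND}}`; the second pass can then remove the now brace-free infobox. Real wiki markup has more corner cases than this handles.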
+1

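That matches the first `{{` and then consumes everything up to the last `}}`. It may work for the (single) example the OP posted, but it will also remove the `b` from `"{{a}} b {{c}}"`.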

+0

Also, you can drop `\n\s` from `[\w\W\n\s]`: those characters are already matched by `\W`.
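
Both comments can be checked with a short script. This is only an illustrative sketch: the lazy quantifier `*?` in the last pattern is an addition of mine, not something suggested in the comments, and it still does not cope with nested `{{...}}` blocks.

```python
import re

text = "{{a}} b {{c}}"

# Original pattern: greedy, so it runs from the first "{{" to the last "}}".
greedy = re.sub(r"\{\{[\w\W\n\s]*\}\}", "", text)

# Same behaviour without the redundant \n and \s in the character class.
simplified = re.sub(r"\{\{[\w\W]*\}\}", "", text)

# Lazy variant: stops at the nearest "}}", so the "b" survives.
lazy = re.sub(r"\{\{[\w\W]*?\}\}", "", text)

print(repr(greedy))      # ''
print(repr(simplified))  # ''
print(repr(lazy))        # ' b '
```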
