2009-06-20 39 views

Answers

4

If you mean "I just want the wikitext", then look at the wikipedia.Page class and its get method.

import wikipedia 

site = wikipedia.getSite('en', 'wikipedia') 
page = wikipedia.Page(site, 'Test') 

print(page.get())  # '''Test''', '''TEST''' or '''Tester''' may refer to:
#==Science and technology== 
#* [[Concept inventory]] - an assessment to reveal student thinking on a topic. 
# ... 

This gives you the complete raw wikitext of the article.

If you want to strip the wiki syntax, for example turning [[Concept inventory]] into Concept inventory, it gets considerably more painful.

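The main reason is that the MediaWiki wiki syntax has no defined grammar, which makes it very hard to parse and strip. I currently don't know of any software that lets you do this accurately. There is the MediaWiki Parser class, of course, but it is PHP, a bit hard to grasp, and its purpose is quite different.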

But if you only want to strip links, or very simple wiki constructs, use regular expressions:

import re
text = re.sub(r'\[\[([^\]\|]*)\]\]', r'\1', 'Lorem ipsum [[dolor]] sit amet, consectetur adipiscing elit.')
print(text)  # Lorem ipsum dolor sit amet, consectetur adipiscing elit.

and then for piped links:

text = re.sub(r'\[\[(?:[^\]\|]*)\|([^\]\|]*)\]\]', r'\1', 'Lorem ipsum [[dolor|DOLOR]] sit amet, consectetur adipiscing elit.')
print(text)  # Lorem ipsum DOLOR sit amet, consectetur adipiscing elit.

etc.
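The two substitutions can also be folded into one pass. This is only a minimal sketch: the names `WIKILINK` and `strip_simple_links` are made up for illustration, and it assumes links that are not nested and contain no brackets inside the target or label.

```python
import re

# Matches [[target]] and [[target|label]]; group 2 is the label when present.
# Assumption: no nested brackets inside the link (image captions will not match).
WIKILINK = re.compile(r'\[\[([^\[\]|]*)(?:\|([^\[\]|]*))?\]\]')


def strip_simple_links(text):
    """Replace simple wikilinks with the text a reader would see."""
    def visible(match):
        target, label = match.group(1), match.group(2)
        # A piped link shows its label; a plain link shows its target.
        return label if label is not None else target
    return WIKILINK.sub(visible, text)


sample = 'Lorem [[ipsum]] dolor [[sit|SIT]] amet.'
print(strip_simple_links(sample))  # Lorem ipsum dolor SIT amet.
```

Anything more complex than this (images, templates, category links) falls through untouched, which is usually what you want from a conservative stripper.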

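But, for example, there is no reliable, easy way to strip nested templates from a page. The same goes for images that have links in their captions. It is quite hard: it involves recursively removing the innermost link, replacing it with a marker, and starting over. If you want, have a look at the templateWithParams function in wikipedia.py, but it is not pretty.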
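To give a rough idea of the "innermost first, then start over" approach, here is a simplified sketch for templates. It is not what wikipedia.py does: it simply deletes each innermost template instead of substituting a marker, and the names `INNERMOST_TEMPLATE`, `strip_templates` and `max_passes` are hypothetical. It also ignores `{{{parameter}}}` syntax, comments and nowiki sections, so treat it as an illustration only.

```python
import re

# An "innermost" template is {{ ... }} with no braces inside it.
INNERMOST_TEMPLATE = re.compile(r'\{\{[^{}]*\}\}')


def strip_templates(text, max_passes=50):
    """Delete innermost templates repeatedly until none are left."""
    for _ in range(max_passes):
        # subn returns the new text and how many templates were removed.
        text, count = INNERMOST_TEMPLATE.subn('', text)
        if count == 0:
            break  # nothing left to peel off
    return text


sample = 'Before {{outer|{{inner|x}}|y}} after'
print(repr(strip_templates(sample)))  # 'Before  after'
```

Each pass peels one nesting level, so the loop runs as many times as the deepest template is nested; `max_passes` is just a safety cap.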

+0

Apparently I misunderstood the scope of the question. Given that there were no other answers, I did my best. :-) – cdleary 2009-06-21 20:10:42

0

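There is a module called mwparserfromhell that can get you quite close to what you want, depending on what you need. It has a method called strip_code() that strips a lot of the markup.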

import pywikibot 
import mwparserfromhell 

test_wikipedia = pywikibot.Site('en', 'test') 
text = pywikibot.Page(test_wikipedia, 'Lestat_de_Lioncourt').get() 

full = mwparserfromhell.parse(text) 
stripped = full.strip_code() 

print(full)
print('*******************')
print(stripped)

Compare the snippets:

{{db-foreign}} 
<!-- Commented out because image was deleted: [[Image:lestat_tom_cruise.jpg|thumb|right|[[Tom Cruise]] as Lestat in the film ''[[Interview With The Vampire: The Vampire Chronicles]]''|{{deletable image-caption|1=Friday, 11 April 2008}}]] --> 

[[Image:lestat.jpg|thumb|right|[[Stuart Townsend]] as Lestat in the film ''[[Queen of the Damned (film)|Queen of the Damned]]'']] 

[[Image:Lestat IWTV.jpg|thumb|right|[[Tom Cruise]] as Lestat in the 1994 film ''[[Interview with the Vampire (film)|Interview with the Vampire]]'']] 

'''Lestat de Lioncourt''' is a [[fictional character]] appearing in several [[novel]]s by [[Anne Rice]], including ''[[The Vampire Lestat]]''. He is a [[vampire]] and the main character in the majority of ''[[The Vampire Chronicles]]'', narrated in first person. 

==Publication history== 
Lestat de Lioncourt is the narrator and main character of the majority of the novels in Anne Rice's ''The Vampire Chronicles'' series. ''[[The Vampire Lestat]]'', the second book in the series, is presented as Lestat's autobiography, and follows his exploits from his youth in France to his early years as a vampire. Many of the other books in the series are also credited as being written by Lestat. 


******************* 

thumb|right|Stuart Townsend as Lestat in the film ''Queen of the Damned'' 

'''Lestat de Lioncourt''' is a fictional character appearing in several novels by Anne Rice, including ''The Vampire Lestat''. He is a vampire and the main character in the majority of ''The Vampire Chronicles'', narrated in first person. 

Publication history 
Lestat de Lioncourt is the narrator and main character of the majority of the novels in Anne Rice's ''The Vampire Chronicles'' series. ''The Vampire Lestat'', the second book in the series, is presented as Lestat's autobiography, and follows his exploits from his youth in France to his early years as a vampire. Many of the other books in the series are also credited as being written by Lestat. 