Extracting loosely structured Wikipedia text (html)

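The HTML on some Wikipedia disambiguation pages is, shall we say, ambiguous: the links to specific people named Corzine are hard to capture with jsoup because they have no explicit structure and don't live in a particular section, as in this example. See the Corzine page here.

How can I get them? Is jsoup the right tool for this task?

Perhaps I should use a regular expression, but I'm wary of that because I want the solution to be generalizable.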

</b> may refer to:</p> 
<ul> 
    <li><a href

^ This part is standard; maybe I could match it with a regular expression?

<p><b>Corzine</b> may refer to:</p> 
<ul> 
    <li><a href="/wiki/Dave_Corzine" title="Dave Corzine">Dave Corzine</a> (born 1956), basketball player</li> 
    <li><a href="/wiki/Jon_Corzine" title="Jon Corzine">Jon Corzine</a> (born 1947), former CEO of <a href="/wiki/MF_Global" title="MF Global">MF Global</a>, former Governor on New Jersey, former CEO of <a href="/wiki/Goldman_Sachs" title="Goldman Sachs">Goldman Sachs</a></li> 
</ul> 
<table id="setindexbox" class="metadata plainlinks dmbox dmbox-setindex" style="" role="presentation">

The ideal output would be

Dave Corzine 
Jon Corzine

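Perhaps it would be possible to match the parts </b> may refer to:</p> and <table id="setindexbox" and extract everything in between. I guess <table id="setindexbox" would be easy to match in jsoup, but </b> may refer to:</p> should be more difficult, because <b> and <p> are not very distinctive.

I tried this: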

 Elements table = docx.select("ul"); 
     Elements links = table.select("li"); 



    Pattern ppp = Pattern.compile("table id=\"setindexbox\" "); 
    Matcher mmm = ppp.matcher(inputLine); 

    Pattern pp = Pattern.compile("</b> may refer to:</p>"); 
    Matcher mm = pp.matcher(inputLine); 
    if (mm.matches()) 
    { 
    while(!mmm.matches()) 
     for (Element link: links) 
     { 
      String url = link.attr("href"); 
      String text = link.text(); 
      System.out.println(text + ", " + url); 
     } 
    }

but it didn't work.

Source

2015-04-23 s.matthew.english

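Did you try escaping the regex character '/' as '\\/' in Pattern.compile? –

Yes, but now I have a different idea. Maybe, as on [this page](http://try.jsoup.org/~84yabFwFgJK4VojAbVK3TG-L2oA), I could use 'ul a' but make sure the link's title contains 'Corzine', or whatever name I'm looking up? What do you think? Maybe that would generalize better? But I don't know how to make it fully work –

Generally, you should use the [Wikipedia API](http://www.mediawiki.org/wiki/API:Main_page) rather than scraping. – JonasCz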

This selector works:

Elements els = doc.select("p ~ ul a:eq(0)");

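See: http://try.jsoup.org/~yPvgR0pxvA3oWQSJte4Rfm-lS2Y

This looks for an a element that is the first child of its parent (a:eq(0)), inside a ul that is a following sibling of a p. Because :eq(0) tests the element's index among its siblings, only the leading link of each list item is selected, and later links such as MF Global are skipped. If there are other conflicts, you could also use p:contains(corzine) ~ ul a:eq(0).

Or, more generally: :contains(may refer to) ~ ul a:eq(0)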
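To make the selectors above concrete, here is a minimal sketch that runs one of them against the snippet from the question. It assumes jsoup is on the classpath; the class name DisambiguationLinks and the helper extractNames are made up for illustration, and the selector combines the p and "may refer to" variants discussed above rather than being the only possible choice.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

// Hypothetical helper class, named for illustration only.
class DisambiguationLinks {

    // Returns the text of the leading link of each list item in a <ul>
    // that follows a <p> containing "may refer to".
    static List<String> extractNames(String html) {
        Document doc = Jsoup.parse(html);
        // a:eq(0) keeps only <a> elements that are the first child of their
        // parent, so trailing links inside the same <li> are skipped.
        Elements links = doc.select("p:contains(may refer to) ~ ul a:eq(0)");
        List<String> names = new ArrayList<>();
        for (Element link : links) {
            names.add(link.text());
        }
        return names;
    }

    public static void main(String[] args) {
        // Markup copied from the question, shortened slightly.
        String html = "<p><b>Corzine</b> may refer to:</p>"
                + "<ul>"
                + "<li><a href=\"/wiki/Dave_Corzine\" title=\"Dave Corzine\">Dave Corzine</a>"
                + " (born 1956), basketball player</li>"
                + "<li><a href=\"/wiki/Jon_Corzine\" title=\"Jon Corzine\">Jon Corzine</a>"
                + " (born 1947), former CEO of"
                + " <a href=\"/wiki/MF_Global\" title=\"MF Global\">MF Global</a></li>"
                + "</ul>"
                + "<table id=\"setindexbox\"></table>";
        for (String name : extractNames(html)) {
            System.out.println(name);
        }
    }
}
```

On the snippet above this should print the two names from the desired output, Dave Corzine and Jon Corzine, without the MF Global link. Real disambiguation pages vary, so the selector will likely need adjusting per page layout.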

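It's hard to generalize across Wikipedia because it is unstructured. But IMHO it's easier to use a parser and CSS selectors than regex, especially over time as templates change, etc.

Source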

2015-04-23 03:03:59
