First paragraph of a Wikipedia article

I am writing some Java code to perform NLP tasks on text taken from Wikipedia. How can I use JSoup to extract the first paragraph of a Wikipedia article?

Thanks a lot.

2011-11-27 Lida

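This is very simple, and the process is much the same for every semi-structured page you want to extract information from.

First, you have to uniquely identify the DOM element that contains the information you need. The easiest way to do this is with a web development tool, such as Firebug in Firefox or the tools bundled with IE (> 6, I think) and Chrome.

Using the article Potato as an example, you will find that the paragraph (<p>) you are interested in sits in the following block: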

<div class="mw-content-ltr" lang="en" dir="ltr"> 
    <div class="metadata topicon" id="protected-icon" style="display: none; right: 55px;">[...]</div> 
    <div class="dablink">[...]</div> 
    <div class="dablink">[...]</div> 
    <div>[...]</div> 
    <p>The potato [...]</p> 
    <p>[...]</p> 
    <p>[...]</p>

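In other words, you want to find the first <p> element inside the div whose class is mw-content-ltr.

Then you just need to select that element with jsoup, for example using its selector syntax (which is very similar to jQuery's):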

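import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class WikipediaParser {
    private final String baseUrl;

    public WikipediaParser(String lang) {
        // Build the base URL for the given language edition, e.g. "en".
        this.baseUrl = String.format("http://%s.wikipedia.org/wiki/", lang);
    }

    public String fetchFirstParagraph(String article) throws IOException {
        String url = baseUrl + article;
        Document doc = Jsoup.connect(url).get();
        // All <p> elements nested inside the element with class mw-content-ltr.
        Elements paragraphs = doc.select(".mw-content-ltr p");

        // first() returns null when nothing matched, so guard against it.
        Element firstParagraph = paragraphs.first();
        return firstParagraph != null ? firstParagraph.text() : null;
    }

    public static void main(String[] args) throws IOException {
        WikipediaParser parser = new WikipediaParser("en");
        String firstParagraph = parser.fetchFirstParagraph("Potato");
        System.out.println(firstParagraph); // prints "The potato is a starchy [...]."
    }
}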


2011-11-27 16:41:50

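Hi, thank you very much indeed. The suggested solution works perfectly. – Lida

It seems that the first paragraph is also the first <p> block in the document. So this might work: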

Document doc = Jsoup.connect("http://en.wikipedia.org/wiki/B-tree").get(); 
Elements paragraphs = doc.select("p"); 
Element firstParagraph = paragraphs.first();

Now you can read the content of this element, for example with firstParagraph.text().
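
Taking the very first <p> can go wrong when a page starts with an empty paragraph, and first() returns null when nothing matches. The following is a minimal sketch of a more defensive variant, not part of the original answer: the class name FirstParagraph, the helper firstNonEmptyParagraph, and the sample HTML (including the mw-empty-elt class) are made up for illustration, and the HTML is parsed from a string so the example runs without a network connection.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

final class FirstParagraph {
    /** Returns the text of the first <p> that has non-whitespace text, or null if none exists. */
    static String firstNonEmptyParagraph(String html) {
        Document doc = Jsoup.parse(html);
        for (Element p : doc.select("p")) {
            // hasText() is false for paragraphs that are empty or whitespace-only.
            if (p.hasText()) {
                return p.text();
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // Hypothetical sample markup: an empty placeholder paragraph comes first.
        String html = "<div><p class=\"mw-empty-elt\"> </p>"
                + "<p>The <b>potato</b> is a starchy tuber.</p>"
                + "<p>Second paragraph.</p></div>";
        System.out.println(firstNonEmptyParagraph(html)); // prints "The potato is a starchy tuber."
    }
}
```

For a live page, the same loop can be run over the Document returned by Jsoup.connect(url).get() instead of Jsoup.parse(html).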


2011-11-27 16:42:49 hage

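'getElementsByClass()' returns elements by class name, not by tag name. – BalusC

@BalusC Oh yes, you're right. I've updated my answer. – hage

The solution proposed by Silva works in most cases, but not for articles such as "JavaScript" and "United States". The paragraphs should instead be selected with doc.select(".mw-body-content p");

Check this GitHub code for more details. You can also remove some metadata information from the HTML to improve accuracy.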
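
As a rough illustration of removing metadata before selecting the paragraph, the sketch below strips a few kinds of elements and then applies the .mw-body-content p selector. It is not the GitHub code referred to above. The class name CleanFirstParagraph, the method extract, the sample HTML, and the list of elements treated as noise (tables, sup.reference, .hatnote, .dablink) are all assumptions; check them against the actual page markup before relying on them.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

final class CleanFirstParagraph {
    /** Strips assumed noise elements, then returns the first non-empty paragraph text or null. */
    static String extract(String html) {
        Document doc = Jsoup.parse(html);
        // Assumption: these selectors cover infoboxes, citation markers, and disambiguation notes.
        doc.select("table, sup.reference, .hatnote, .dablink").remove();
        for (Element p : doc.select(".mw-body-content p")) {
            if (p.hasText()) {
                return p.text();
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // Hypothetical sample markup: an infobox table containing a <p> precedes the lead paragraph.
        String html = "<div class=\"mw-body-content\">"
                + "<div class=\"dablink\">For other uses, see elsewhere.</div>"
                + "<table><tr><td><p>Infobox text</p></td></tr></table>"
                + "<p>JavaScript is a language.<sup class=\"reference\">[1]</sup></p>"
                + "</div>";
        System.out.println(extract(html)); // prints "JavaScript is a language."
    }
}
```

Without the remove() call, the paragraph inside the table would be returned first in this sample, which is the kind of mismatch the more specific selector and the cleanup step are meant to avoid.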


2016-07-13 22:13:49
