2017-08-18 42 views

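Scraping data from a webpage between specific tags using jsoup

I am currently working on a program that collects the 5 most recent fan fiction stories added to my fandom on AO3 (Archive of Our Own). These stories are then added to an ArrayList that holds the works from the past week. At the end of each week, I plan to dump the contents of the ArrayList into a text file so that I can paste it into a Reddit post on my subreddit. To prevent duplicates, I want to compare the newly parsed stories with the ones currently held in the ArrayList.

(Additional info: the bot will check the webpage every 30 minutes.)

The part I have gotten stuck on is the actual parsing of the webpage and getting at the content between the HTML tags.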
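One possible way to handle the duplicate check described above is to key each story by its numeric work ID instead of comparing whole entries. The following is only a minimal sketch under that assumption: the `WeeklyArchive` and `Story` classes and the `addNew` method are hypothetical names made up for illustration, and the second and third titles in the demo are placeholders.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical holder for one week's stories, keyed by work ID. */
class WeeklyArchive {

    /** Hypothetical value type for one parsed story. */
    public static final class Story {
        public final String id;
        public final String title;

        public Story(String id, String title) {
            this.id = id;
            this.title = title;
        }
    }

    // LinkedHashMap keeps insertion order, so a later dump lists the
    // stories in the order they were first seen.
    private final Map<String, Story> byId = new LinkedHashMap<>();

    /**
     * Looks at the first `limit` entries of `parsed` and stores those
     * whose ID is not already known. Returns the stories actually added.
     */
    public List<Story> addNew(List<Story> parsed, int limit) {
        List<Story> added = new ArrayList<>();
        int end = Math.max(0, Math.min(limit, parsed.size()));
        for (Story s : parsed.subList(0, end)) {
            // putIfAbsent returns null only when the key was not present.
            if (byId.putIfAbsent(s.id, s) == null) {
                added.add(s);
            }
        }
        return added;
    }

    public int size() {
        return byId.size();
    }

    public static void main(String[] args) {
        WeeklyArchive archive = new WeeklyArchive();
        List<Story> firstPoll = Arrays.asList(
                new Story("10504812", "Pocket Healer"),
                new Story("9656693", "Placeholder title A"));
        List<Story> secondPoll = Arrays.asList(
                new Story("11814486", "Placeholder title B"),
                new Story("10504812", "Pocket Healer"));

        System.out.println(archive.addNew(firstPoll, 5).size());  // prints 2
        System.out.println(archive.addNew(secondPoll, 5).size()); // prints 1
        System.out.println(archive.size());                       // prints 3
    }
}
```

Keying on the ID means a story whose title or summary was edited between two polls is still recognised as the same work.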

I looked up CSS selectors, but I am still quite confused, because almost every example uses a simple website like IMDb.

From some basic research, it looks like all the stories in the body of the page I am looking at are inside an ordered list tag.

<ol class="work index group">
    <li class="work blurb group" id="work_10504812" role="article">...</li>
    <li class="work blurb group" id="work_9656693" role="article">...</li>
    <li class="work blurb group" id="work_11814486" role="article">...</li>
    <!-- Goes on for ~20 more stories -->
    <li class="work blurb group" id="work_11687247" role="article">...</li>
</ol>

So, to be clear, each list item is a single story inside the ordered list. The contents of one list item are shown below (with the ordered list tag included).

<ol class="work index group"> 
    <li class="work blurb group" id="work_10504812" role="article"> 
    <!--title, author, fandom--> 
    <div class="header module"> 
    <h4 class="heading"> 
     <a href="/works/10504812">Pocket Healer</a> 
     by 

     <!-- do not cache --> 
     <a rel="author" href="https://stackoverflow.com/users/OverNoot/pseuds/OverNoot">OverNoot</a> 
    </h4> 
    <h5 class="fandoms heading"> 
     <span class="landmark">Fandoms:</span> 
     <a class="tag" href="/tags/Overwatch%20(Video%20Game)/works">Overwatch (Video Game)</a> 
     &nbsp; 
    </h5> 
    <!--required tags--> 
    <ul class="required-tags"> 
<li> <a class="help symbol question modal modal-attached" title="Symbols key" aria-controls="#modal" href="/help/symbols-key.html"><span class="rating-general-audience rating" title="General Audiences"><span class="text">General Audiences</span></span></a></li> 
<li> <a class="help symbol question modal modal-attached" title="Symbols key" aria-controls="#modal" href="/help/symbols-key.html"><span class="warning-no warnings" title="No Archive Warnings Apply"><span class="text">No Archive Warnings Apply</span></span></a></li> 
<li> <a class="help symbol question modal modal-attached" title="Symbols key" aria-controls="#modal" href="/help/symbols-key.html"><span class="category-femslash category" title="F/F"><span class="text">F/F</span></span></a></li> 
<li> <a class="help symbol question modal modal-attached" title="Symbols key" aria-controls="#modal" href="/help/symbols-key.html"><span class="complete-no iswip" title="Work in Progress"><span class="text">Work in Progress</span></span></a></li> 
</ul> 
    <p class="datetime">17 Aug 2017</p> 
    </div> 
    <!--warnings again, cast, freeform tags--> 
    <h6 class="landmark heading">Tags</h6> 
    <ul class="tags commas"> 
    <li class="warnings"><strong><a class="tag" href="/tags/No%20Archive%20Warnings%20Apply/works">No Archive Warnings Apply</a></strong></li><li class="relationships"><a class="tag" href="/tags/Fareeha%20%22Pharah%22%20Amari*s*Angela%20%22Mercy%22%20Ziegler/works">Fareeha "Pharah" Amari/Angela "Mercy" Ziegler</a></li><li class="characters"><a class="tag" href="/tags/Fareeha%20%22Pharah%22%20Amari/works">Fareeha "Pharah" Amari</a></li> <li class="characters"><a class="tag" href="/tags/Angela%20%22Mercy%22%20Ziegler/works">Angela "Mercy" Ziegler</a></li> <li class="characters"><a class="tag" href="/tags/Winston%20(Overwatch)/works">Winston (Overwatch)</a></li> <li class="characters"><a class="tag" href="/tags/Lena%20%22Tracer%22%20Oxton/works">Lena "Tracer" Oxton</a></li><li class="freeforms"><a class="tag" href="/tags/Tiny%20Pharah%20and%20Tiny%20Mercy/works">Tiny Pharah and Tiny Mercy</a></li> <li class="freeforms"><a class="tag" href="/tags/Fluff/works">Fluff</a></li> <li class="freeforms last"><a class="tag" href="/tags/Cute/works">Cute</a></li> 
    </ul> 
    <!--summary--> 
    <h6 class="landmark heading">Summary</h6> 
    <blockquote class="userstuff summary"> 
     <p>Angela and Fareeha wake up to find tiny alternate versions of themselves have appeared and are now imprinted on them. How will these tiny Pharahs and Mercies impact their work at Overwatch and more importantly how will it impact the feelings they have for each other.</p> 
    </blockquote> 
    <!--stats--> 

    <dl class="stats"> 
     <dt class="language">Language:</dt> 
     <dd class="language">English</dd> 
    <dt class="words">Words:</dt> 
    <dd class="words">35,143</dd> 
    <dt class="chapters">Chapters:</dt> 
    <dd class="chapters">10/11</dd> 
    <dt class="comments">Comments:</dt> 
    <dd class="comments"><a href="/works/10504812?show_comments=true&amp;view_full_work=true#comments">168</a></dd> 
    <dt class="kudos">Kudos:</dt> 
    <dd class="kudos"><a href="/works/10504812?view_full_work=true#comments">438</a></dd> 
    <dt class="bookmarks">Bookmarks:</dt> 
    <dd class="bookmarks"><a href="/works/10504812/bookmarks">35</a></dd> 
    <dt class="hits">Hits:</dt> 
    <dd class="hits">5890</dd> 
    </dl> 
    </li>
</ol>

Basically, I want to extract the title, author, URL, summary, and rating.

So far I have worked out where the items I want to extract are located, but I have no real idea how to go about extracting them.

Title:

<a href="/works/10504812">Pocket Healer</a> 

Author:

<a rel="author" href="https://stackoverflow.com/users/OverNoot/pseuds/OverNoot">OverNoot</a> 

URL:

<li class="work blurb group" id="work_10504812" role="article"> 
<!--(http://archiveofourown.com/works/<the number after 'work_'>)--> 

Summary:

<blockquote class="userstuff summary"> 
    <p> (SUMMARY GOES HERE) </p> 
</blockquote> 

Rating:

<li> <a class="help symbol question modal modal-attached" title="Symbols key" aria-controls="#modal" href="/help/symbols-key.html"><span class="rating-general-audience rating" title="General Audiences"><span class="text">General Audiences</span></span></a></li> 

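Additional question: is it possible to iterate over the contents of the ordered list, as in a for loop?

The code I currently have for opening the webpage is as follows.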

while (true) {
    try {
        String url = "http://archiveofourown.org/tags/Fareeha%20%22Pharah%22%20Amari*s*Angela%20%22Mercy%22%20Ziegler/works";
        Document doc = Jsoup.connect(url).get();

        // Returns element of webpage
        doc.select("<Narrow down to ordered list>");

        // Run for loop to run through first 5 items of
        Thread.sleep(THIRTY_MINUTES);
    }
    catch (Exception ex) {
        ex.printStackTrace();
    }
}
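One thing to note about the loop above: the sleep sits inside the try block, so if the request throws, the loop retries immediately without pausing. A possible alternative, sketched here under the assumption that a scheduler from `java.util.concurrent` is acceptable, is to run the scrape as a scheduled task. The `Poller` class and its `guarded` and `start` methods are hypothetical names, and the demo task only simulates a failure instead of calling jsoup. A real task would have to catch the checked IOException from `Jsoup.connect(url).get()` itself, because a Runnable cannot throw it.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical helper that runs a scrape task with a fixed delay. */
class Poller {

    /** Wraps a task so that a failure is reported instead of propagating. */
    public static Runnable guarded(Runnable task) {
        return () -> {
            try {
                task.run();
            } catch (RuntimeException ex) {
                // An exception escaping a scheduled task suppresses all
                // later runs, so it is caught and reported here.
                System.err.println("Poll failed: " + ex);
            }
        };
    }

    /**
     * Runs the task immediately, then waits `delay` between the end of
     * one run and the start of the next.
     */
    public static ScheduledExecutorService start(Runnable task, long delay, TimeUnit unit) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleWithFixedDelay(guarded(task), 0, delay, unit);
        return scheduler;
    }

    public static void main(String[] args) throws InterruptedException {
        AtomicInteger runs = new AtomicInteger();
        // Short delay for the demo; the bot in the question would use
        // start(task, 30, TimeUnit.MINUTES).
        ScheduledExecutorService scheduler = start(() -> {
            if (runs.incrementAndGet() == 1) {
                throw new IllegalStateException("simulated network error");
            }
        }, 20, TimeUnit.MILLISECONDS);

        Thread.sleep(300);
        scheduler.shutdownNow();
        System.out.println("Ran more than once: " + (runs.get() > 1));
    }
}
```

With this arrangement a failed poll is logged and the next attempt still happens one full delay later.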

Answer


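You can use the Document.select(String cssSelector) method, which returns an Elements collection that you can iterate over. For example, ol.work > li returns all li elements that are direct children of the ol.work element. You can use it to loop over all the stories.

Consider the following code: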

// Needs: import org.jsoup.nodes.Element; import org.jsoup.select.Elements;
// Every <li> that is a direct child of the ol.work list is one story.
Elements stories = doc.select("ol.work > li");

for (Element li : stories) {
    // The first link in the heading is the title; the author link has rel="author".
    String title = li.select("h4.heading a").first().text();
    String author = li.select("h4.heading a[rel=author]").text();
    // The id attribute looks like "work_10504812"; strip the prefix.
    String id = li.attr("id").replace("work_", "");
    String url = "http://archiveofourown.com/works/" + id;
    String summary = li.select("blockquote.summary").text();
    String rating = li.select("span.rating").text();

    System.out.println("Title: " + title);
    System.out.println("Author: " + author);
    System.out.println("ID: " + id);
    System.out.println("URL: " + url);
    System.out.println("Summary: " + summary);
    System.out.println("Rating: " + rating);
}

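In this example, we loop over all the li elements and extract the expected content from each one. As you can see, every select call is made on the current li element, which limits the extraction to that story. The Element.text() method returns the body of an element as plain text, with any tags removed.

Running it against the HTML you posted in your question produces the following output: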

Title: Pocket Healer 
Author: OverNoot 
ID: 10504812 
URL: http://archiveofourown.com/works/10504812 
Summary: Angela and Fareeha wake up to find tiny alternate versions of themselves have appeared and are now imprinted on them. How will these tiny Pharahs and Mercies impact their work at Overwatch and more importantly how will it impact the feelings they have for each other. 
Rating: General Audiences 
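Since the question mentions dumping the week's stories into a text file for a Reddit post, the fields printed above could also be turned into one line per story. This is only a rough sketch: the `WeeklyDump` class, its method names, and the line layout (a Markdown bullet with a linked title) are assumptions chosen for illustration, not something the question specifies.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical formatter for the weekly text dump. */
class WeeklyDump {

    /** Formats one story as a Markdown bullet with a linked title. */
    public static String formatLine(String title, String author, String url, String rating) {
        return "* [" + title + "](" + url + ") by " + author + " (" + rating + ")";
    }

    /** Writes the lines to `target` as UTF-8, replacing any existing file. */
    public static void write(Path target, List<String> lines) throws IOException {
        Files.write(target, lines, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        List<String> lines = new ArrayList<>();
        lines.add(formatLine("Pocket Healer", "OverNoot",
                "http://archiveofourown.com/works/10504812", "General Audiences"));

        // A temp file keeps the demo from touching any real path.
        Path target = Files.createTempFile("weekly-dump", ".txt");
        write(target, lines);
        System.out.println(Files.readAllLines(target, StandardCharsets.UTF_8).get(0));
    }
}
```

Titles that contain square brackets would need escaping before being placed inside a Markdown link; the sketch does not handle that.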

I hope it helps.

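Thank you so much for your help! That kind of thing was giving me a real headache, but it works flawlessly in my code. Much appreciated! – Jayps

@Jayps I'm glad I could help :)

Now I want to try doing the same thing in C#. I assume it would work the same way? Do I just need to find another library similar to jsoup (for C#)? – Jayps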