Web scraping every forum post (Python, BeautifulSoup)

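Hello again, fellow Stack users. A short description: I am using Python to scrape some data from an automotive forum and save it all to a CSV file. With the help of some other Stack Overflow members, I have got as far as crawling all the pages of a particular topic and collecting the date, title and link of every thread. (For each link it finds, Python creates a new soup for it, scrapes through all the posts, and then goes back to the previous link.) I also have a separate script.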
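For the CSV step described above, a minimal sketch using the standard csv module may help. The snippet further down builds rows by string concatenation, which would split a thread title containing a comma into two columns. The function name write_threads_csv and the sample row are hypothetical and only for illustration.

```python
import csv
import io


def write_threads_csv(rows, out_file):
    """Write (date, title, link) tuples as CSV, quoting fields where needed."""
    writer = csv.writer(out_file)
    writer.writerow(["date", "title", "link"])
    writer.writerows(rows)


# Hypothetical sample row; real values would come from the scraper.
rows = [
    ("03-02-2017", "Brakes, rotors and pads", "showthread.php?t=1"),
]
buffer = io.StringIO()
write_threads_csv(rows, buffer)
print(buffer.getvalue())
```

When writing to a real file instead of a StringIO buffer, opening it with open(path, "w", newline="") is what the csv documentation recommends, so that no extra blank lines appear on Windows.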

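I would really appreciate any further tips or suggestions, since this is my first time using Python. I think my nested loop logic may be what is messed up, but after checking it several times it seems right to me.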

Here's the code snippet:

        link += (div.get('href'))
        savedData += "\n" + title + ", " + link
        tempSoup = make_soup('http://www.automotiveforums.com/vbulletin/' + link)
        while tempNumber < 3:
            for tempRow in tempSoup.find_all(id=re.compile("^td_post_")):
                for tempNext in tempSoup.find_all(title=re.compile("^Next Page -")):
                    tempNextPage = ""
                    tempNextPage += (tempNext.get('href'))
                post = ""
                post += tempRow.get_text(strip=True)
                postData += post + "\n"
            tempNumber += 1
            tempNewUrl = "http://www.automotiveforums.com/vbulletin/" + tempNextPage
            tempSoup = make_soup(tempNewUrl)
            print(tempNewUrl)
    tempNumber = 1
    number += 1
    print(number)
    newUrl = "http://www.automotiveforums.com/vbulletin/" + nextPage
    soup = make_soup(newUrl)

My main problem so far is that tempSoup = make_soup('http://www.automotiveforums.com/vbulletin/' + link) does not seem to create a new soup after it has finished scraping all the posts of a forum thread.
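The post never shows make_soup itself. As a rough sketch only, assuming it simply downloads a URL and parses the HTML, it could look like the following. The fetch parameter is a hypothetical hook added here so the function can be tried without network access; the User-Agent header and the timeout are also assumptions.

```python
from urllib.request import Request, urlopen

from bs4 import BeautifulSoup


def make_soup(url, fetch=None):
    """Download url and return a freshly parsed BeautifulSoup object.

    fetch is a hypothetical callable (url -> HTML text) used instead of
    the network when supplied.
    """
    if fetch is None:
        request = Request(url, headers={"User-Agent": "Mozilla/5.0"})
        with urlopen(request, timeout=30) as response:
            html = response.read()
    else:
        html = fetch(url)
    # Each call parses its own document, so nothing is shared between soups.
    return BeautifulSoup(html, "html.parser")
```

Written this way, every call returns a new, independent object. If the scraped data repeats, the cause is more likely the URL being passed in than the soup itself, so printing the URL just before each call is a cheap check.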

This is the output I get:

    http://www.automotiveforums.com/vbulletin/showthread.php?s=6a2caa2b46531be10e8b1c4acb848776&t=1139532&page=2
    http://www.automotiveforums.com/vbulletin/showthread.php?s=6a2caa2b46531be10e8b1c4acb848776&t=1139532&page=3
    1

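So it seems to find the correct links for the new pages and scrape them, but on the next iteration it prints the new dates and exactly the same pages. There is also a weird 10-12 second delay after the last link is printed; only then does it jump down to print the number 1 and spit out all the new dates.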

But after moving on to the next forum thread link, it scrapes exactly the same data every time.

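Sorry if this looks messy. This is a side project and my first attempt at making something useful, so I am very new at this, and any advice or tips would be much appreciated. I am not asking you to solve the code for me; even a few pointers on where my logic might be wrong would be appreciated!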

Kind regards, and thanks for reading such an annoying post!

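EDIT: I cut out most of the post and code snippets, because I believe people were getting overwhelmed. I have left just the basic bit I am struggling with. Any help would be much appreciated!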

2017-03-02 Norbis

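So after spending a bit of time on it, I have managed to almost crack it. Python now finds every thread and its link on the forum, then goes into each link, reads all the pages, and continues on to the next link.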

Here is the fixed code, in case anyone wants to use it.

        link += (div.get('href'))
        savedData += "\n" + title + ", " + link
        soup3 = make_soup('http://www.automotiveforums.com/vbulletin/' + link)
        while tempNumber < 4:
            for postScrape in soup3.find_all(id=re.compile("^td_post_")):
                post = ""
                post += postScrape.get_text(strip=True)
                postData += post + "\n"
                print(post)
            for tempNext in soup3.find_all(title=re.compile("^Next Page -")):
                tempNextPage = ""
                tempNextPage += (tempNext.get('href'))
                print(tempNextPage)
            soup3 = ""
            soup3 = make_soup('http://www.automotiveforums.com/vbulletin/' + tempNextPage)
            tempNumber += 1
        tempNumber = 1
    number += 1
    print(number)
    newUrl = "http://www.automotiveforums.com/vbulletin/" + nextPage
    soup = make_soup(newUrl)

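All I had to do was un-nest the two loops that were inside each other, so that each one runs as its own loop. It is still not a perfect solution, but hey, it almost works.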

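The non-working bit: the first 2 threads at the link provided have multiple pages of posts. The following 10+ threads do not. I cannot find a way to check the value outside the loop to see whether it is empty, because if it does not find a next-page element/href, it just uses the last one. But if I reset the value after each run, it no longer crawls every page. One solution just created another problem :D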
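One possible way around the stale next-page value, offered as a sketch rather than a tested fix for this forum: look up the "Next Page" link with find() instead of a for loop over find_all(). find() returns None when nothing matches, so the value is recomputed for every page and a single-page thread ends the loop by itself. The names scrape_thread, fetch_html and max_pages are hypothetical, and fetch_html stands in for whatever downloads the HTML.

```python
import re
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE_URL = "http://www.automotiveforums.com/vbulletin/"


def scrape_thread(first_url, fetch_html, max_pages=50):
    """Collect the text of every post in a thread, following "Next Page" links.

    fetch_html is a hypothetical callable that takes a URL and returns HTML.
    max_pages is a safety limit so a bad link cannot loop forever.
    """
    posts = []
    url = first_url
    pages_read = 0
    while url is not None and pages_read < max_pages:
        soup = BeautifulSoup(fetch_html(url), "html.parser")
        for cell in soup.find_all(id=re.compile(r"^td_post_")):
            posts.append(cell.get_text(strip=True))
        # find() gives None when the page has no "Next Page" link, so nothing
        # is carried over from an earlier page or an earlier thread.
        next_link = soup.find(title=re.compile(r"^Next Page -"), href=True)
        if next_link is not None:
            url = urljoin(BASE_URL, next_link["href"])
        else:
            url = None
        pages_read += 1
    return posts
```

Because the next URL is a local variable inside the function, calling scrape_thread once per thread link also removes the need to reset tempNextPage or tempNumber by hand between threads.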

2017-03-05 02:57:49 Norbis
