Unable to scrape Reddit's NBA page

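I'm new to web scraping and want to learn how to use BeautifulSoup so I can integrate it into a mini project. I was following a Beautiful Soup tutorial on a YouTube channel and then tried to scrape Reddit. I want to scrape the title and link of each NBA news post on Reddit's /r/nba, but I haven't had any success. The only thing returned in the terminal is "Process finished with exit code 0". I have a feeling it has to do with my selection? Any guidance and help would be greatly appreciated.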

This is the original code, which did not work:

import requests
from bs4 import BeautifulSoup

def spider(max_pages):
    page = 1
    while page <= max_pages:
        url = 'https://reddit.com/r/nba' + str(page)
        source_code = requests.get(url)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text, "html.parser")
        for link in soup.find_all('a', {'class': 'title'}):
            href = link.get('href')
            print(href)
        page += 1

spider(1)

I also tried the following, but it did not fix the problem:

import requests
from bs4 import BeautifulSoup

def spider(max_pages):
    page = 1
    while page <= max_pages:
        url = 'https://www.reddit.com/r/nba/' + str(page)
        source_code = requests.get(url)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text, "html.parser")
        for link in soup.findAll('a', {'class': 'title'}):
            href = "https://www.reddit.com/" + link.get('href')
            title = link.string
            print(href)
            print(title)
        page += 1

spider(1)


2017-10-18 Vincent

Have you checked what the request actually returns? You may need to change your User-Agent string to avoid being blocked as a bot. –

When I run the application, it just says "Process finished with exit code 0" – Vincent

Check the value of plain_text. The URL pattern is also wrong. –
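The two suggestions in the comments above (inspect what the server returned, and send an explicit User-Agent) can be sketched as follows. This is a minimal sketch using only the standard library rather than requests; the USER_AGENT value and the helper names build_request, fetch, and summarize are hypothetical and not from the thread. Exit code 0 with no output only means the script ran without an error and the loop found no matching links, so looking at the status code and the start of the body is a reasonable first check.

```python
import urllib.error
import urllib.request

# Hypothetical User-Agent string (an assumption); use something that describes your script.
USER_AGENT = "nba-title-scraper/0.1 (learning project)"


def build_request(url, user_agent=USER_AGENT):
    """Return a Request that carries an explicit User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch(url):
    """Fetch url and return (status_code, body_text), including for HTTP error statuses."""
    try:
        with urllib.request.urlopen(build_request(url), timeout=10) as response:
            return response.status, response.read().decode("utf-8", errors="replace")
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses (for example 429 Too Many Requests) are raised as HTTPError.
        return err.code, err.read().decode("utf-8", errors="replace")


def summarize(status, body, preview_chars=200):
    """Build a short, printable summary of what the server sent back."""
    return "status={} length={} preview={!r}".format(status, len(body), body[:preview_chars])
```

Calling fetch("https://www.reddit.com/r/nba/") and printing summarize(status, body) on the result should show whether the response is the listing page or a block/error page.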

To get the titles and links from the main page:

from bs4 import BeautifulSoup 
from urllib.request import urlopen 

html = urlopen("https://www.reddit.com/r/nba/") 
soup = BeautifulSoup(html, 'lxml') 
for link in soup.find('div', {'class':'content'}).find_all('a', {'class':'title may-blank outbound'}): 
    print(link.attrs['href'], link.get_text())


2017-10-18 07:29:22 komito

This did not work for me – Vincent
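Since neither HTML approach in this thread worked for the asker, one alternative worth trying is Reddit's JSON listing, which avoids depending on CSS class names. This is a different technique from the one in the answer above, and it is only a sketch: the field names (data, children, title, permalink, after) reflect the listing format as I understand it, the helper names are hypothetical, and SAMPLE is made-up data so the parsing can be tried offline. It also illustrates the URL problem raised in the comments: pages are addressed with an `after` cursor taken from the previous page, not by appending a page number to the path.

```python
import json
import urllib.parse


def build_listing_url(subreddit, after=None):
    """Build the JSON listing URL; `after` is the cursor returned by the previous page."""
    url = "https://www.reddit.com/r/{}/.json".format(subreddit)
    if after:
        url += "?" + urllib.parse.urlencode({"after": after})
    return url


def extract_posts(listing):
    """Return ([(title, link), ...], next_after) from a decoded listing dict."""
    data = listing["data"]
    posts = []
    for child in data["children"]:
        post = child["data"]
        # permalink is site-relative, so prefix the host to get a full link.
        posts.append((post["title"], "https://www.reddit.com" + post["permalink"]))
    return posts, data.get("after")


# Made-up sample shaped like a listing response (not real Reddit data).
SAMPLE = json.loads("""
{"kind": "Listing",
 "data": {"after": "t3_abc123",
          "children": [{"kind": "t3",
                        "data": {"title": "Example game thread",
                                 "permalink": "/r/nba/comments/abc123/example_game_thread/"}}]}}
""")

for title, link in extract_posts(SAMPLE)[0]:
    print(title, link)
```

In a real run, the body fetched from build_listing_url("nba") would be decoded with json.loads and passed to extract_posts, and the returned cursor fed back into build_listing_url for the next page. The request would likely still need an explicit User-Agent header.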
