Scraping a link from a button on a page

I want to scrape the link from the "Box Score" button on this page. The link behind the button should look like this:

http://www.espn.com/nfl/boxscore?gameId=400874795

I tried the following code to see whether I could reach the button, but I couldn't.

from bs4 import BeautifulSoup
import requests

url = 'http://www.espn.com/nfl/scoreboard/_/year/2016/seasontype/1/week/2'

# Fetch the raw HTML and parse it
r = requests.get(url)
data = r.text
soup = BeautifulSoup(data, "html.parser")

# Print every anchor tag found in the static HTML
for link in soup.find_all('a'):
    print(link)

2017-08-02 jhaywoo8

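1) Download and inspect the page's raw HTML; 2) find the elements you want to scrape; 3) write Python code that searches for those elements; 4) ??? 5) Profit! – ForceBru

The problem here is that the HTML you get from the URL is not actually the page you see when you view it in a browser. Many Ajax calls populate the page, so the data is not there yet when you make the initial request. – wpercy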
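
One way to check this for yourself is to count how many links in the raw response mention "boxscore". The sketch below uses only the standard library; the names HrefCollector and count_links_containing are made up for illustration, and the needle "boxscore" is an assumption based on the URL in the question.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collect the href of every <a> tag in a document."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def count_links_containing(html, needle):
    """Return how many <a href=...> values in html contain needle."""
    collector = HrefCollector()
    collector.feed(html)
    return sum(1 for href in collector.hrefs if needle in href)
```

Calling count_links_containing(requests.get(url).text, "boxscore") on the scoreboard URL and getting 0 would suggest that the buttons are added later by JavaScript.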

Here is the solution I came up with. It scrapes every link on the URL you provided. You can check it out:

from bs4 import BeautifulSoup
from urllib.request import urlopen  # Python 3; in Python 2 this was urllib.urlopen

url = 'http://www.espn.com/nfl/scoreboard/_/year/2016/seasontype/1/week/2'

# Download the raw HTML and parse it
html = urlopen(url).read()
soup = BeautifulSoup(html, "html.parser")

# Calling the soup object is shorthand for soup.find_all('a')
tags = soup('a')

# Print each link's index and href (None if the tag has no href)
for i, tag in enumerate(tags):
    print(i)
    print(tag.get('href', None))
    print("\n")

2017-08-02 18:15:34

This doesn't get the link from the "box score" button. That is what I need. – jhaywoo8

As wpercy mentioned in his comment, you can't do this with requests alone. As a suggestion, you should use selenium together with Chromedriver/PhantomJS to handle the JavaScript.
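
A minimal sketch of that suggestion, assuming Selenium, Chrome and a matching chromedriver are installed. The helper names fetch_rendered_html and find_boxscore_anchors are hypothetical, and the page may need an explicit wait before its Ajax content shows up in page_source. The name value is the one quoted in this answer.

```python
from bs4 import BeautifulSoup

# Attribute value quoted in this answer for the box score buttons
BOXSCORE_NAME = "&lpos=nfl:scoreboard:boxscore"


def fetch_rendered_html(url):
    """Load the page in a real browser so its JavaScript runs, then return the HTML."""
    # Imported here so the parsing helper below works without Selenium installed.
    from selenium import webdriver

    browser = webdriver.Chrome()  # assumption: chromedriver is available
    try:
        browser.get(url)
        return browser.page_source
    finally:
        browser.quit()


def find_boxscore_anchors(html):
    """Return every <a> tag whose name attribute marks a box score button."""
    soup = BeautifulSoup(html, "html.parser")
    # 'name' goes through attrs= because find_all's first parameter is also called name.
    return soup.find_all("a", attrs={"name": BOXSCORE_NAME})
```

With these helpers, boxList = find_boxscore_anchors(fetch_rendered_html(url)) would give the list of tags used in the rest of this answer.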

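Every box score button is an a tag with the attribute name = &lpos=nfl:scoreboard:boxscore, so we first collect those tags with .findAll into boxList. A simple list comprehension can then extract each href attribute: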

>>> links = [box['href'] for box in boxList] 
>>> links 
['/nfl/boxscore?gameId=400874795', '/nfl/boxscore?gameId=400874854', '/nfl/boxscore?gameId=400874753', '/nfl/boxscore?gameId=400874757', '/nfl/boxscore?gameId=400874772', '/nfl/boxscore?gameId=400874777', '/nfl/boxscore?gameId=400874767', '/nfl/boxscore?gameId=400874812', '/nfl/boxscore?gameId=400874761', '/nfl/boxscore?gameId=400874764', '/nfl/boxscore?gameId=400874781', '/nfl/boxscore?gameId=400874796', '/nfl/boxscore?gameId=400874750', '/nfl/boxscore?gameId=400873867', '/nfl/boxscore?gameId=400874775', '/nfl/boxscore?gameId=400874798']
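
The extracted hrefs are relative, so they need the site root in front before they can be requested. A small sketch using the standard library's urljoin; BASE_URL and to_absolute are illustrative names, and the root is assumed from the URLs in the question.

```python
from urllib.parse import urljoin

# Assumed site root, taken from the URLs in the question
BASE_URL = "http://www.espn.com"


def to_absolute(links, base=BASE_URL):
    """Resolve each relative href against the site root."""
    return [urljoin(base, link) for link in links]


links = ["/nfl/boxscore?gameId=400874795", "/nfl/boxscore?gameId=400874854"]
absolute_links = to_absolute(links)
print(absolute_links[0])  # http://www.espn.com/nfl/boxscore?gameId=400874795
```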

2017-08-02 23:31:24
