Python/BeautifulSoup: Retrieving the 'href' attribute

I want to get the href attribute from a website I am scraping. My script:

from bs4 import BeautifulSoup
import requests
import csv


# range(1, 2) only fetches page 1; widen it to crawl more pages.
for i in range(1, 2):
    baseurl = "https://www.quandoo.nl/amsterdam?page=" + str(i)
    r1 = requests.get(baseurl)
    data = r1.text
    soup = BeautifulSoup(data, "html.parser")
    # attrs is a dict mapping attribute names to the values to match.
    for link in soup.find_all('span', {'class': "merchant-title", 'itemprop': "name"}):
        print(link)

This returns the following:

<span class="merchant-title" itemprop="name"><a href="https://www.quandoo.nl/place/ristorante-due-napoletani-5644" itemprop="url">Ristorante Due Napoletani</a></span> 
<span class="merchant-title" itemprop="name"><a href="https://www.quandoo.nl/place/yamyam-4850" itemprop="url">YamYam</a></span> 
<span class="merchant-title" itemprop="name"><a href="https://www.quandoo.nl/place/the-golden-temple-5278" itemprop="url">The Golden Temple</a></span> 
<span class="merchant-title" itemprop="name"><a href="https://www.quandoo.nl/place/sampurna-4609" itemprop="url">Sampurna</a></span> 
<span class="merchant-title" itemprop="name"><a href="https://www.quandoo.nl/place/motto-sushi-25471" itemprop="url">Motto Sushi</a></span> 
<span class="merchant-title" itemprop="name"><a href="https://www.quandoo.nl/place/takumi-ya-8171" itemprop="url">Takumi-Ya</a></span> 
<span class="merchant-title" itemprop="name"><a href="https://www.quandoo.nl/place/casa-di-david-19167" itemprop="url">Casa di David</a></span>

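(This is only part of it; I don't want to bombard you with the whole output.) I have no trouble pulling out the string with the restaurant's name, but I can't find a configuration that gives me just the href attribute. With my current setup, the .strip() method doesn't seem workable. Any help would be great.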

Source

2016-11-22 dtrinh

This might help: http://stackoverflow.com/a/5815888/5811078 – zipa

I get this error: TypeError: expected string or buffer – dtrinh

Have you tried converting it with 'str()'? – zipa
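To illustrate the exchange above: re.search only accepts strings (or bytes), so handing it a BeautifulSoup Tag raises a TypeError, while str(tag) yields the tag's markup, which can be searched. The following is a minimal sketch under that reading; the HTML snippet is copied from the question's output, and the simplified pattern and the variable names are assumptions made for illustration. The exact wording of the error message depends on the Python version.

```python
import re
from bs4 import BeautifulSoup

# One line of the output shown in the question, used as static test input.
html = ('<span class="merchant-title" itemprop="name">'
        '<a href="https://www.quandoo.nl/place/yamyam-4850" itemprop="url">YamYam</a></span>')
span = BeautifulSoup(html, "html.parser").find("span")

# Passing the Tag object itself fails: re only works on str or bytes.
try:
    re.search(r'href="([^"]+)"', span)
    raised = False
except TypeError:
    raised = True

# Converting the Tag to its markup string first makes the search work.
match = re.search(r'href="([^"]+)"', str(span))
url = match.group(1)  # group(1) is the captured URL only

print(raised)
print(url)
```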

Try this code; it worked for me:

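from bs4 import BeautifulSoup
import requests
import re


# range(1, 2) only fetches page 1; widen it to crawl more pages.
for i in range(1, 2):
    baseurl = "https://www.quandoo.nl/amsterdam?page=" + str(i)
    r1 = requests.get(baseurl)
    data = r1.text
    soup = BeautifulSoup(data, "html.parser")
    # attrs is a dict mapping attribute names to the values to match.
    for link in soup.find_all('span', {'class': "merchant-title", 'itemprop': "name"}):
        # str(link) turns the Tag into markup text so the regex can search it.
        found = re.search(r'href=[\'"]?([^\'" >]+)', str(link))
        if found:
            # group(1) is the URL alone; group(0) would also include the 'href="' prefix.
            print(found.group(1))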
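As an alternative to the regex, BeautifulSoup can hand back the attribute directly, since a Tag supports dictionary-style access to its attributes. This is only a sketch, not part of the answer above: it parses two lines of the question's output as static HTML instead of fetching the live page, and the variable names are made up for the example.

```python
from bs4 import BeautifulSoup

# Two lines of the output shown in the question, used as static test input.
html = """
<span class="merchant-title" itemprop="name"><a href="https://www.quandoo.nl/place/yamyam-4850" itemprop="url">YamYam</a></span>
<span class="merchant-title" itemprop="name"><a href="https://www.quandoo.nl/place/sampurna-4609" itemprop="url">Sampurna</a></span>
"""
soup = BeautifulSoup(html, "html.parser")

hrefs = []
for span in soup.find_all("span", {"class": "merchant-title", "itemprop": "name"}):
    anchor = span.find("a")  # the <a> nested inside the span, or None
    # .get() returns None instead of raising if the attribute is missing.
    if anchor is not None and anchor.get("href"):
        hrefs.append(anchor["href"])

print(hrefs)
```

Using .get("href") as the guard keeps the loop from failing on a span that has no link or a link without an href.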

Source

2016-11-22 17:40:06 zipa

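Thanks! I tried this configuration before; however, I am trying to isolate the restaurant links on the page. Those are the only ones I need to look into further. Any ideas on how to restrict the hrefs to just the restaurants on the page? – dtrinh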
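One possible way to narrow the result to restaurant links, sketched here as an assumption rather than as what was eventually used in this thread, is a CSS selector that only matches anchors nested inside the merchant-title spans. The "Log in" link in the sample HTML is hypothetical and stands in for any unrelated link on the page; the restaurant lines are taken from the question's output.

```python
from bs4 import BeautifulSoup

html = """
<a href="https://www.quandoo.nl/login">Log in</a>
<span class="merchant-title" itemprop="name"><a href="https://www.quandoo.nl/place/motto-sushi-25471" itemprop="url">Motto Sushi</a></span>
<span class="merchant-title" itemprop="name"><a href="https://www.quandoo.nl/place/takumi-ya-8171" itemprop="url">Takumi-Ya</a></span>
"""
soup = BeautifulSoup(html, "html.parser")

# Match only <a> tags that carry an href and sit inside span.merchant-title,
# so links elsewhere on the page (such as the hypothetical login link) are skipped.
restaurants = {
    a.get_text(strip=True): a["href"]
    for a in soup.select("span.merchant-title a[href]")
}

for name, href in restaurants.items():
    print(name, href)
```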

You have been very helpful! Thank you! – dtrinh

You're welcome :) – zipa
