
I'm very new to programming and Python, and I'm trying to write this simple scraper to extract the profile URLs of all the therapists on this page, but it can't identify the link class:

http://www.therapy-directory.org.uk/search.php?search=Sheffield&services[23]=1&business_type[individual]=1&distance=40&uqs=626693

import requests 
from bs4 import BeautifulSoup 

def tru_crawler(max_pages): 
    p = '&page=' 
    page = 1 
    while page <= max_pages: 
        url = 'http://www.therapy-directory.org.uk/search.php?search=Sheffield&distance=40&services[23]=on&services=23&business_type[individual]=on&uqs=626693' + p + str(page)
        code = requests.get(url)
        text = code.text
        soup = BeautifulSoup(text)
        for link in soup.findAll('a', {'member-summary': 'h2'}):
            href = 'http://www.therapy-directory.org.uk' + link.get('href')
            yield href + '\n'
            print(href)
        page += 1

Now, when I run this code, I get nothing, mainly because soup.findAll comes back empty.

The HTML for a profile link looks like this:

<div class="member-summary"> 
<h2 class=""> 
<a href="/therapists/julia-church?uqs=626693">Julia Church</a> 
</h2> 

So I don't know what to pass to soup.findAll, besides 'a', to get the profile URLs.

Please help with what extra parameters are needed.

Thanks

Update -

I ran the modified code, and OK, this time it scraped page 1 and then returned a bunch of errors:

Traceback (most recent call last):
  File "C:/Users/PB/PycharmProjects/crawler/crawler-revised.py", line 19, in <module>
    tru_crawler(3)
  File "C:/Users/PB/PycharmProjects/crawler/crawler-revised.py", line 9, in tru_crawler
    code = requests.get(url)
  File "C:\Python27\lib\requests\api.py", line 68, in get
    return request('get', url, **kwargs)
  File "C:\Python27\lib\requests\api.py", line 50, in request
    response = session.request(method=method, url=url, **kwargs)
  File "C:\Python27\lib\requests\sessions.py", line 464, in request
    resp = self.send(prep, **send_kwargs)
  File "C:\Python27\lib\requests\sessions.py", line 602, in send
    history = [resp for resp in gen] if allow_redirects else []
  File "C:\Python27\lib\requests\sessions.py", line 195, in resolve_redirects
    allow_redirects=False,
  File "C:\Python27\lib\requests\sessions.py", line 576, in send
    r = adapter.send(request, **kwargs)
  File "C:\Python27\lib\requests\adapters.py", line 415, in send
    raise ConnectionError(err, request=request)
requests.exceptions.ConnectionError: ('Connection aborted.', BadStatusLine("''",))

What is going wrong here that it returns this bunch of errors?
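(As an aside, not from the original thread: BadStatusLine with 'Connection aborted.' usually means the server closed the connection before sending a valid HTTP response. One common mitigation, assuming the server is dropping non-browser clients or rapid repeated requests, is to send a browser-like User-Agent and retry with a pause; the fetch helper below is a hypothetical sketch, not the asker's code:)

import time
import requests

headers = {'User-Agent': 'Mozilla/5.0'}  # browser-like UA; assumption: some servers drop unknown clients

def fetch(url, retries=3):
    # Hypothetical helper: retry the GET a few times, pausing between attempts.
    for attempt in range(retries):
        try:
            return requests.get(url, headers=headers)
        except requests.exceptions.ConnectionError:
            if attempt == retries - 1:
                raise  # give up after the final attempt
            time.sleep(2)  # brief pause before retrying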

Answer


Currently, the arguments you pass to findAll() don't make sense. They read: find all <a> tags that have a member-summary attribute equal to "h2".

One possible way is to use the select() method, passing a CSS selector as the argument:

for link in soup.select('div.member-summary h2 a'): 
    href = 'http://www.therapy-directory.org.uk' + link.get('href') 
    yield href + '\n' 
    print(href) 

The CSS selector above reads: find the <div> tags whose class equals "member-summary"; within each such <div>, find the <h2> tag; and within that <h2>, find the <a> tag.
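For comparison (a sketch, not part of the original answer), the same traversal can be written with find_all, matching on the class attribute rather than on a made-up attribute name:

# Find each summary <div> by its class, then walk down to the <a> tag.
for div in soup.find_all('div', class_='member-summary'):
    h2 = div.find('h2')
    a = h2.find('a') if h2 else None
    if a and a.get('href'):
        print('http://www.therapy-directory.org.uk' + a['href'])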

Working example:

import requests 
from bs4 import BeautifulSoup 

p = '&page=' 
page = 1 
url = 'http://www.therapy-directory.org.uk/search.php?search=Sheffield&distance=40&services[23]=on&services=23&business_type[individual]=on&uqs=626693' + p + str(page) 
code = requests.get(url) 
text = code.text 
soup = BeautifulSoup(text) 
for link in soup.select('div.member-summary h2 a'): 
    href = 'http://www.therapy-directory.org.uk' + link.get('href') 
    print(href) 

Output (trimmed; 26 links in total):

http://www.therapy-directory.org.uk/therapists/lesley-lister?uqs=626693 
http://www.therapy-directory.org.uk/therapists/fiona-jeffrey?uqs=626693 
http://www.therapy-directory.org.uk/therapists/ann-grant?uqs=626693 
..... 
..... 
http://www.therapy-directory.org.uk/therapists/jan-garbutt?uqs=626693 

Thanks for this, but it still doesn't return anything :( –


@pb_ng Hmm.. works for me (it prints a series of links). See the updated answer for how I tried it – har07


Thank you, so removing "yield href + '\n'" made it work. If you don't mind me asking, why is it that when yield is used, it doesn't return anything? –
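The likely explanation for that last comment (a sketch based on the question's tru_crawler, assuming the generator was never iterated): a function that contains yield is a generator function, so calling tru_crawler(3) merely creates a generator object; none of its body runs, not even the print calls, until something iterates it.

gen = tru_crawler(3)   # creates a generator object; no code inside has run yet
for href in gen:       # iteration executes the body up to each yield
    print(href)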