How to retrieve data inside an href tag

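Hey, I've run into some difficulty while web scraping. I'm trying to get the 70 that is embedded in a block of code in the middle of some HTML, and my question is how I would go about doing that. I've tried various approaches, but none of them seem to work. I'm using the BeautifulSoup module and writing in Python 3. The link to the site I'm scraping is included in case anyone needs it. Thanks in advance.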

<a href="http://www.accuweather.com/en/gb/london/ec4a-2/weather-forecast/328328">London, United Kingdom<span class="temp">70&deg;</span><span class="icon i-33-s"></span></a>

from bs4 import BeautifulSoup
import requests
data = requests.get("http://www.accuweather.com/en/gb/london/ec4a-2/weather-forecast/328328")
soup = BeautifulSoup(data.text, "html.parser")

Source

2016-08-11 goimpress

-1

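from bs4 import BeautifulSoup
import re
import requests

# Download the page HTML
text = requests.get("http://www.accuweather.com/en/gb/london/ec4a-2/weather-forecast/328328").text
soup = BeautifulSoup(text, "html.parser")
for link in soup.find_all("a"):
    temp = link.find("span", {"class": "temp"})
    if temp is not None:  # skip links that have no temperature span
        print(re.findall(r"[0-9]{1,2}", temp.text))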

I hope this helps.

Source

2016-08-11 22:11:06 ChE

Thanks for your comment! But that prints out all the links; I'm trying to get the "70" inside the tag – goimpress

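Assuming BeautifulSoup is not a strict requirement, you can do this with the html.parser module. The following is tailored to the use case you mentioned. It receives the text fields inside the tag and keeps only the one that is a number.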

from html.parser import HTMLParser

class MyHTMLParser(HTMLParser):
    def handle_data(self, data):
        # Print only text chunks that consist entirely of digits
        if data.isdigit():
            print(data)

# convert_charrefs=False keeps "&deg;" out of the text chunk, so "70" arrives on its own
parser = MyHTMLParser(convert_charrefs=False)

parser.feed('<a href="http://www.accuweather.com/en/gb/london/ec4a-2/weather-forecast/328328">London, United Kingdom<span class="temp">70&deg;</span><span class="icon i-33-s"></span></a>')

This will output 70.

It can also be done using regular expressions.
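A minimal sketch of the regular-expression route, assuming the markup always looks like the snippet in the question (a span with class "temp" whose text starts with digits). The `html` variable and the pattern are illustrative only, and a regex like this is fragile if the markup changes.

```python
import re

# Sample markup copied from the question
html = ('<a href="http://www.accuweather.com/en/gb/london/ec4a-2/weather-forecast/328328">'
        'London, United Kingdom<span class="temp">70&deg;</span>'
        '<span class="icon i-33-s"></span></a>')

# Capture the digits that directly follow the opening <span class="temp"> tag
match = re.search(r'<span class="temp">(-?\d+)', html)
temperature = int(match.group(1)) if match else None
print(temperature)
```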

Source

2016-08-11 22:26:25 v2b

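It can be done that way too, but I'm trying to scrape a weather website, where 70 is the temperature and the tag I posted sits in the middle of some HTML – goimpress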

This will get you every span that contains a temperature:

temps = soup.find_all('span',{'class':'temp'})

Then iterate over them:

for span in temps:
    temp = span.decode_contents()
    # temp looks like "70\xb0" (digits plus a degree sign), so drop the last character
    print(int(temp[:-1]))

The hard part may be converting from Unicode to ASCII, if you are on Python 2.

But the AccuWeather page has no span with the class temp:

In [12]: soup.select('[class~=temp]') 
Out[12]: 
[<strong class="temp">19<span>\xb0</span></strong>, 
<strong class="temp">14<span>\xb0</span></strong>, 
<strong class="temp">24<span>\xb0</span></strong>, 
<strong class="temp">23<span>\xb0</span></strong>, 
<h2 class="temp">19\xb0</h2>, 
<h2 class="temp">19\xb0</h2>, 
<h2 class="temp">17\xb0</h2>, 
<h2 class="temp">19\xb0</h2>, 
<h2 class="temp">19\xb0</h2>, 
<h2 class="temp">19\xb0</h2>, 
<h2 class="temp">20\xb0</h2>, 
<h2 class="temp">19\xb0</h2>, 
<h2 class="temp">17\xb0</h2>, 
<h2 class="temp">19\xb0</h2>, 
<h2 class="temp">19\xb0</h2>]

So it's hard to give you an answer.

Source

2016-08-11 22:31:58 kdopen

At first it looked like it was going to work, but it didn't – goimpress

What went wrong? It works for me – kdopen

Of course, that page on AccuWeather no longer uses a span with the class temp. It uses 'h2' and 'strong' instead – kdopen
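Since the tag carrying the temperature can change, one option is to select on the class alone. This is a minimal sketch, assuming the class name stays "temp"; the sample HTML is made up to resemble the output shown above and is not fetched from the site.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical sample modelled on the output shown above
html = """
<strong class="temp">19<span>&deg;</span></strong>
<h2 class="temp">17&deg;</h2>
<span class="temp">70&deg;</span>
"""

soup = BeautifulSoup(html, "html.parser")

temps = []
# ".temp" matches any element with that class, whatever its tag name
for tag in soup.select(".temp"):
    digits = re.search(r"-?\d+", tag.get_text())
    if digits:
        temps.append(int(digits.group()))

print(temps)
```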

You need to add a user-agent to get the correct source, and then select whichever tag/class names you want to use:

from bs4 import BeautifulSoup
import requests

# Send a browser-like user agent so the site returns the full page source
headers = {"user-agent":"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.82 Safari/537.36"}
data = requests.get("http://www.accuweather.com/en/gb/london/ec4a-2/weather-forecast/328328", headers=headers)
soup = BeautifulSoup(data.content, "html.parser")
print(soup.select_one("span.local-temp").text)
print([span.text for span in soup.select("span.temp")])

If we run the code, you can see that we get what we need:

In [17]: headers = { 
    ....:  "user-agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.82 Safari/537.36"} 

In [18]: data = requests.get("http://www.accuweather.com/en/gb/london/ec4a-2/weather-forecast/328328", headers=headers) 

In [19]: soup = BeautifulSoup(data.content, "html.parser") 

In [20]: print(soup.find("span", "local-temp").text) 
18°C 

In [21]: print("\n".join([span.text for span in soup.select("span.temp")])) 
18° 
31° 
30° 
25°

Source

2016-08-11 22:49:06

Bro! This works. Thank you so much – goimpress

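No worries. It's always a good idea to compare the source returned by requests with the actual source in your browser, which you can see by right-clicking and choosing View Source. –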
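One way to run that kind of check from code, as a rough sketch: parse whatever HTML came back and test whether the selector you rely on is present. `has_selector` and `fetch_html` are hypothetical helper names, and the two sample responses are made up so the example runs offline.

```python
import requests
from bs4 import BeautifulSoup

def has_selector(html, css):
    """Return True if the HTML contains at least one element matching css."""
    return BeautifulSoup(html, "html.parser").select_one(css) is not None

def fetch_html(url, user_agent=None):
    """Download a page, optionally sending a browser-like User-Agent header."""
    headers = {"user-agent": user_agent} if user_agent else {}
    return requests.get(url, headers=headers, timeout=10).text

# Offline illustration with two made-up responses
blocked = "<html><body>Access denied</body></html>"
served = '<html><body><span class="temp">18&deg;</span></body></html>'
print(has_selector(blocked, "span.temp"))
print(has_selector(served, "span.temp"))
```

Calling `has_selector(fetch_html(url), "span.temp")` with and without a user agent is one way to see whether the header changes what the server returns.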

I have a couple of questions: what is a user agent, and why do you need it? And what do these lines of code do: soup.select_one("span.local-temp").text and print([span.text for span in soup.select("span.temp")]) – goimpress
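On the first question, the user agent is an HTTP request header that identifies the client software, and some sites serve different content depending on it. On the second, here is a small sketch of what the two calls do, using made-up HTML so it runs offline: `select_one` returns the first element matching a CSS selector (or `None` if there is none), while `select` returns a list of all matches.

```python
from bs4 import BeautifulSoup

# Made-up markup; the class names mirror the ones used in the answer
html = """
<span class="local-temp">18&deg;C</span>
<span class="temp">18&deg;</span>
<span class="temp">31&deg;</span>
"""
soup = BeautifulSoup(html, "html.parser")

# First <span> whose class list includes "local-temp"
first = soup.select_one("span.local-temp").text
# Text of every <span> whose class list includes "temp"
all_temps = [span.text for span in soup.select("span.temp")]

print(first)
print(all_temps)
```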
