Web scraping: all hrefs

2017-05-11 150 views 1 likes

I wrote a small Python script to read all the hrefs from a web page, but it has a problem. For example, it does not pick up href="pages.php?ef=fa&page=n_fullstory.php&NewsIDn=1648".

Code:

import urllib 
import re 

urls = ["http://something.com"] 

regex='href=\"(.+?)\"' 
pattern = re.compile(regex) 

htmlfile = urllib.urlopen(urls[0]) 
htmltext = htmlfile.read() 
hrefs = re.findall(pattern,htmltext) 
print hrefs

Can anyone help me? Thanks.

Source

2017-05-11 Karim Pazoki

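General advice: don't parse HTML with regular expressions. You may get your specific case working, but it can get very messy very quickly once you need anything more. Use a proper parsing library instead. Take a look at [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) or [`lxml.html`](http://lxml.de/lxmlhtml.html). Or maybe even [Scrapy](https://scrapy.org/). – drdaeman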
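To make the comment's point concrete, here is a minimal sketch that runs the question's regex and a real HTML parser over the same markup. The sample HTML, `HrefCollector`, `hrefs_by_regex` and `hrefs_by_parser` are all hypothetical names made up for illustration, and the standard library's `html.parser` stands in for the libraries named above so that nothing needs installing. The question does not say how the missing link is written in the page source, so single quotes and an uppercase attribute name are only guesses at the kind of thing the pattern can miss.

```python
import re
from html.parser import HTMLParser

# Hypothetical sample markup, NOT taken from the asker's site.
SAMPLE_HTML = """
<a href="index.html">Home</a>
<a href='pages.php?ef=fa&amp;page=n_fullstory.php&amp;NewsIDn=1648'>News</a>
<a HREF="about.html">About</a>
"""


class HrefCollector(HTMLParser):
    """Collects the href value of every start tag that has one."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases attribute names and unescapes entities
        # such as &amp; in attribute values before calling this method.
        for name, value in attrs:
            if name == "href" and value is not None:
                self.hrefs.append(value)


def hrefs_by_regex(html):
    # Same pattern as in the question: lowercase href, double quotes only.
    return re.findall(r'href="(.+?)"', html)


def hrefs_by_parser(html):
    collector = HrefCollector()
    collector.feed(html)
    collector.close()
    return collector.hrefs


if __name__ == "__main__":
    print("regex: ", hrefs_by_regex(SAMPLE_HTML))
    print("parser:", hrefs_by_parser(SAMPLE_HTML))
```

On this sample the regex finds only the first link, while the parser finds all three regardless of quoting style or attribute case.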

Answer

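Use BeautifulSoup together with requests for static websites. BeautifulSoup is a great web scraping module, and with the code below it is easy to get the value of each href attribute. Hope it helps.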

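import requests
from bs4 import BeautifulSoup

url = 'whatever url you want to parse'

# Download the page and parse it with Python's built-in HTML parser
result = requests.get(url)
soup = BeautifulSoup(result.content, 'html.parser')

# href=True keeps only the <a> tags that actually have an href attribute
for a in soup.find_all('a', href=True):
    print("Found the URL:", a['href'])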
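The href in the question (`pages.php?ef=fa&...`) is a relative link, so a possible next step after collecting the values is to resolve them against the page URL. The sketch below is one way to do that with the standard library's `urllib.parse`. The helper name `absolutize` and the choice to skip `javascript:`, `mailto:` and fragment-only links are assumptions for illustration, not part of the answer.

```python
from urllib.parse import urldefrag, urljoin


def absolutize(base_url, hrefs):
    """Resolve hrefs against base_url, dropping fragments and duplicates.

    Hypothetical helper: the skip rules below are an assumption about
    which links are worth following, not a general standard.
    """
    seen = set()
    result = []
    for href in hrefs:
        href = href.strip()
        # Skip empty values, in-page anchors and non-HTTP pseudo links.
        if not href or href.startswith(("#", "javascript:", "mailto:")):
            continue
        # urljoin handles relative paths; urldefrag strips "#fragment".
        absolute, _fragment = urldefrag(urljoin(base_url, href))
        if absolute not in seen:
            seen.add(absolute)
            result.append(absolute)
    return result


if __name__ == "__main__":
    links = absolutize(
        "http://something.com",
        ["pages.php?ef=fa&page=n_fullstory.php&NewsIDn=1648", "/about#team", "#top"],
    )
    for link in links:
        print(link)
```

The list passed in could be the `a['href']` values gathered in the loop above.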

Source

2017-05-11 15:34:02 Exprator
