I am new to Python and I am working on a web crawler. Below is a program that fetches the links from a given URL, but the problem is that I do not want it to visit a URL it has already visited. Please help me: it should not visit the same URL twice.
import re
import urllib.request
import sqlite3

db = sqlite3.connect('test2.db')
db.row_factory = sqlite3.Row
db.execute('drop table if exists test')
db.execute('create table test(id INTEGER PRIMARY KEY, url text)')

# linksList = []

# module to visit the given url and get all the links on that page
def get_links(urlparse):
    try:
        if urlparse.find('.msi') == -1:  # check whether the url contains a .msi extension
            htmlSource = urllib.request.urlopen(urlparse).read().decode("iso-8859-1")
            # parse htmlSource and find all anchor tags
            linksList = re.findall('<a href=(.*?)>.*?</a>', htmlSource)  # returns href and the other attributes of the a tag
            for link in linksList:
                start_quote = link.find('"')  # set the start point in the link
                end_quote = link.find('"', start_quote + 1)  # set the end point in the link
                url = link[start_quote + 1:end_quote]  # get the string between start_quote and end_quote
                def concate(url):  # since some hrefs hold only /contact or /about, prepend the base url
                    if url.find('http://'):
                        url = (urlparse) + url
                        return url
                    else:
                        return url
                url_after_concate = concate(url)
                # linksList.append(url_after_concate)
                try:
                    if url_after_concate.find('.tar.bz') == -1:  # skip links that point to software or download pages
                        db.execute('insert or ignore into test(url) values (?)', [url_after_concate])
                except:
                    print("insertion failed")
        else:
            return True
    except:
        print("failed")

get_links('http://www.python.org')

cursor = db.execute('select * from test')
for row in cursor:  # retrieve the links stored in the database
    print(row['id'], row['url'])
    urlparse = row['url']
    # print(linksList)
    # if urlparse in linksList == -1:
    try:
        get_links(urlparse)  # parse the link from the database again
    except:
        print("url error")
Please tell me how I can solve this problem.
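One straightforward fix is to remember every URL that has already been crawled and skip it on the next encounter. Below is a minimal sketch of that idea, reusing the get_links() function and db connection from the code above; the visited set and the crawl() wrapper are names introduced here for illustration, not part of the original program.

visited = set()                     # every URL that has already been fetched

def crawl(url):
    if url in visited:              # seen before: do not fetch again
        return
    visited.add(url)                # mark as seen *before* fetching, so a self-link cannot loop
    get_links(url)                  # the function from the question: stores hrefs in the 'test' table

crawl('http://www.python.org')
cursor = db.execute('select * from test')
for row in cursor:
    crawl(row['url'])               # duplicates pulled from the table are skipped by the set check

An equivalent check against the database itself (a select on the test table before crawling) would also work and survives restarts, at the cost of one extra query per link.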
A few comments. Your function has too many levels of nesting; move the 'concate' function out of 'get_links'. Also, it is spelled 'concatenate'. Don't use regular expressions to parse HTML; use a library like BeautifulSoup instead. And don't swallow exceptions with a bare 'except:' that prints no diagnostic information. – 2012-04-16 09:45:54
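For illustration, a rough sketch of the approach this comment suggests: BeautifulSoup (the third-party beautifulsoup4 package) for the parsing, plus urllib.parse.urljoin in place of the hand-rolled 'concate' for resolving relative links. The variable names here are illustrative, not from the question.

import urllib.request
from urllib.parse import urljoin
from bs4 import BeautifulSoup                 # third-party: pip install beautifulsoup4

base = 'http://www.python.org'
html = urllib.request.urlopen(base).read()
soup = BeautifulSoup(html, 'html.parser')
for a in soup.find_all('a', href=True):       # only anchors that actually carry an href
    print(urljoin(base, a['href']))           # resolves relative links such as /about against the base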
I will go ahead and ask: have you considered using 'wget', the recursive web downloader, and then processing the content 'wget' retrieves for you? – 2012-04-16 10:01:01
@Li-aungYip Sir, I have not used it, but I think wget just fetches the content of a given URL. Here I only want to get the values of all the hrefs. – Shreedhar 2012-04-16 10:11:32