I have a variable DOMAIN that takes a URL as input. I want to feed it a list of URLs from a txt file, one by one. How do I read input from the txt file into the variable line by line in Python?
My txt file looks like this:
www.yahoo.com
www.google.com
www.bing.com
This is what I do:
with open('list.txt') as f:
    content = f.readlines()
content = [x.strip() for x in content]
DOMAIN = content
But the DOMAIN variable takes all the URLs at once instead of one at a time. It must process one URL in its entirety, then handle the second URL in a separate operation.
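A minimal sketch of reading the file one URL per iteration (assuming `list.txt` holds one domain per line, as shown above; the sample-file write is only there to make the snippet self-contained):

```python
def read_domains(path):
    """Yield one stripped domain per line of the file, skipping blank lines."""
    with open(path) as f:
        for line in f:
            domain = line.strip()
            if domain:
                yield domain

# Create a sample list.txt matching the file shown above.
with open('list.txt', 'w') as f:
    f.write('www.yahoo.com\nwww.google.com\nwww.bing.com\n')

# Each loop pass sees exactly one URL, not the whole list.
for domain in read_domains('list.txt'):
    url = 'http://%s' % domain
    print(url)
```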
Note that this DOMAIN variable is what scrapy crawls. Part of the codebase:
from scrapy.selector import HtmlXPathSelector
from scrapy.spider import BaseSpider
from scrapy.http import Request

with open('list.txt') as f:
    content = f.readlines()
# you may also want to remove whitespace characters like `\n` at the end of each line
content = [x.strip() for x in content]
DOMAIN = content
URL = 'http://%s' % DOMAIN

class MySpider(BaseSpider):
    name = DOMAIN
    allowed_domains = [DOMAIN]
    start_urls = [
        URL
    ]
Error (for what should be a single URL):
[scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET http://['www.google.com', 'www.yahoo.com', 'www.bing.com']>
Executed as: scrapy runspider spider.py
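The `GET http://['www.google.com', ...]` in the log shows what went wrong: `DOMAIN` is the whole list, so `'http://%s' % DOMAIN` formats the list object into a single string. A sketch of the fix (domains hard-coded here rather than read from `list.txt`, just to keep the snippet self-contained): build one URL string per domain and hand scrapy the whole list as `start_urls`.

```python
content = ['www.yahoo.com', 'www.google.com', 'www.bing.com']  # as read from list.txt

# Wrong: %-formatting the list itself reproduces the error above.
broken = 'http://%s' % content
print(broken)  # http://['www.yahoo.com', 'www.google.com', 'www.bing.com']

# Right: one URL string per domain; start_urls accepts a list.
start_urls = ['http://%s' % domain for domain in content]
print(start_urls)
```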
Fully working script ---
from scrapy.selector import HtmlXPathSelector
from scrapy.spider import BaseSpider
from scrapy.http import Request

DOMAIN = 'google.com'
URL = 'http://%s' % DOMAIN

class MySpider(BaseSpider):
    name = DOMAIN
    allowed_domains = [DOMAIN]
    start_urls = [
        URL
    ]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        for url in hxs.select('//a/@href').extract():
            if not (url.startswith('http://') or url.startswith('https://')):
                url = URL + url
            print url
            yield Request(url, callback=self.parse)
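The relative-link handling in `parse` (the `url = URL + url` line) can also be sketched with `urljoin` from the standard library, which additionally copes with paths like `/about` and leaves already-absolute links alone. Note this uses Python 3's `urllib.parse`; the script above is Python 2 style, where the module is `urlparse`.

```python
from urllib.parse import urljoin

base = 'http://google.com'
print(urljoin(base, '/about'))              # relative path resolved against base
print(urljoin(base, 'http://other.com/x'))  # absolute URL passes through unchanged
```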
I am actually getting an error; to be clear, I am uploading the overall script – user7423959