I am very new to Python and am learning how to scrape web pages (1 day in). The task I want to achieve is to loop over a list of 2000 companies and extract their revenue data and number of employees. I started with scrapy, and I have managed to get the workflow to work for a single company (not elegant, but at least I'm trying) - but I cannot figure out how to load the list of companies and loop over it to carry out multiple searches.
So, my main question is - where in the spider class should I define the array of company queries to loop over? I don't know the exact URLs, since each company has a unique ID and belongs to a specific market - so I can't just enter them as start_urls.
Is Scrapy the right tool for this type of task, or should I have used mechanize instead?
Here is my current code.
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.http import FormRequest
from scrapy.http import Request
from tutorial.items import DmozItem
import json

class DmozSpider(BaseSpider):
    name = "dmoz"
    allowed_domains = ["proff.se"]
    start_urls = ["http://www.proff.se"]

    # Search on the website. Currently I have just put in a static search term here,
    # but I would like to loop over a list of companies.
    def parse(self, response):
        return FormRequest.from_response(response, formdata={'q': 'rebtel'}, callback=self.search_result)

    # I fetch the url from the search result and convert it to the correct Financial-url
    # where the information is located.
    def search_result(self, response):
        sel = HtmlXPathSelector(response)
        link = sel.xpath('//ul[@class="company-list two-columns"]/li/a/@href').extract()
        finance_url = str(link[0]).replace("/foretag", "http://www.proff.se/nyckeltal")
        return Request(finance_url, callback=self.parse_finance)

    # I scrape the information of this particular company; this is hardcoded and will not
    # work for other responses. I had some issues with the character encoding
    # initially since the data was Swedish. I also tried to target the JSON element directly via
    # revenue = sel.xpath('#//*[@id="accountTable1"]/tbody/tr[3]/@data-chart').extract()
    # but was not able to parse it (error - expected string or buffer - tried to convert it
    # to a string with str() with no luck; something is off with the formatting, which messes up the data types).
    def parse_finance(self, response):
        sel = HtmlXPathSelector(response)
        datachart = sel.xpath('//tr/@data-chart').extract()
        employees = json.loads(datachart[36])
        revenue = json.loads(datachart[0])
        items = []
        item = DmozItem()
        item['company'] = response.url.split("/")[-5]
        item['market'] = response.url.split("/")[-3]
        item['employees'] = employees
        item['revenue'] = revenue
        items.append(item)
        return item
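One possible way to address the main question (looping over several search terms instead of the hard-coded 'rebtel') would be to yield one FormRequest per company name from parse(). This is only a sketch; the list of names below is a placeholder, not something from the original code.

    def parse(self, response):
        # Placeholder list - in practice this would come from a file or a spider argument.
        queries = ["rebtel", "another company name"]
        for q in queries:
            # dont_filter=True in case the search form posts to the same URL for every query,
            # which would otherwise make the duplicate filter drop all but the first request.
            yield FormRequest.from_response(response,
                                            formdata={'q': q},
                                            callback=self.search_result,
                                            dont_filter=True)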
But can this approach take a list of elements? scrapy crawl dmoz -a query="companies.txt" would be the list of companies, and then def __init__(self, query): companies = [line.strip() for line in open(query)]; self.query = companies - or maybe I'm not really following your suggestion. –
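A minimal sketch of the -a query="companies.txt" idea from this comment, assuming a plain text file with one company name per line; the argument and attribute names are illustrative, not from the original code.

    def __init__(self, query=None, *args, **kwargs):
        super(DmozSpider, self).__init__(*args, **kwargs)
        # Read one company name per line from the file passed via -a query=...
        self.queries = []
        if query:
            self.queries = [line.strip() for line in open(query) if line.strip()]

parse() could then loop over self.queries in the same way as the sketch after the spider code above.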
I figured it out - the start URL for a search query is actually http://www.proff.se/bransch-s%C3%B6k?q="company name" - so I can merge all the names into one file and read it in as start_urls, and even add an __init__ if I want to use a different set of files. Thanks for your answer. –
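A sketch of the approach described in this comment, building start_urls directly from the search URL; the class name, the default file name and the use of urllib.quote are assumptions, not part of the original code.

    import urllib
    from scrapy.spider import BaseSpider

    class ProffSearchSpider(BaseSpider):
        name = "proff-search"
        allowed_domains = ["proff.se"]

        def __init__(self, query="companies.txt", *args, **kwargs):
            super(ProffSearchSpider, self).__init__(*args, **kwargs)
            names = [line.strip() for line in open(query) if line.strip()]
            # One search-result URL per company name.
            self.start_urls = ["http://www.proff.se/bransch-s%C3%B6k?q=" + urllib.quote(name)
                               for name in names]

        def parse(self, response):
            # Each response here is a search-result page, so it can be handled the same way
            # as search_result() in the original spider.
            pass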