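Python scrapy code prints out the file I'm reading
I have written some Python code using scrapy to extract some addresses from a website.
The first part of the code puts together the start_urls by reading latitude and longitude coordinates from a separate file, googlecoords.txt, which then form part of each start URL. (The googlecoords.txt file, which I prepared previously, holds UK postcodes converted into Google coordinates for Google Maps.)
So, for example, the first item in the start_urls list is "https://www.howdens.com/process/searchLocationsNear.php?lat=53.674434&lon=-1.4908923&distance=1000&units=MILES", where "lat=53.674434&lon=-1.4908923" has come from the googlecoords.txt file.
However, when I run the code it works perfectly, except that it prints out the googlecoords.txt file first, which I don't need.
How do I stop this printing? (Though I can live with it.)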
import scrapy
import sys
from scrapy.http import FormRequest, Request
from Howdens.items import HowdensItem

class howdensSpider(scrapy.Spider):
    name = "howdens"
    allowed_domains = ["www.howdens.com"]

    # read the file that has a list of google coordinates that are converted from postcodes
    with open("googlecoords.txt") as f:
        googlecoords = [x.strip('\n') for x in f.readlines()]

    # from the google coordinates build the start URLs
    start_urls = []
    for a in range(len(googlecoords)):
        start_urls.append("https://www.howdens.com/process/searchLocationsNear.php?{}&distance=1000&units=MILES".format(googlecoords[a]))

    # cycle through the first 6 relevant items returned in the text
    def parse(self, response):
        for sel in response.xpath('/html/body'):
            for i in range(0, 6):
                try:
                    item = HowdensItem()
                    item['name'] = sel.xpath('.//text()').re(r'(?<="name":")(.*?)(?=","street")')[i]
                    item['street'] = sel.xpath('.//text()').re(r'(?<="street":")(.*?)(?=","town")')[i]
                    item['town'] = sel.xpath('.//text()').re(r'(?<="town":")(.*?)(?=","pc")')[i]
                    item['pc'] = sel.xpath('.//text()').re(r'(?<="pc":")(.*?)(?=","state")')[i]
                    yield item
                except IndexError:
                    # fewer than 6 matches came back for this URL
                    pass
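The URL-building step in the class body can also be written as a small standalone function, which makes it easy to check without running scrapy. This is only a sketch: the name `build_start_urls` and its `distance` and `units` parameters are made up for illustration, and it assumes every non-blank line of googlecoords.txt has the form `lat=...&lon=...`, as in the example URL from the question.

```python
BASE_URL = "https://www.howdens.com/process/searchLocationsNear.php"


def build_start_urls(lines, distance=1000, units="MILES"):
    """Turn lines such as 'lat=53.674434&lon=-1.4908923' into search URLs.

    `lines` can be any iterable of strings, for example an open file.
    """
    urls = []
    for line in lines:
        coords = line.strip()
        if not coords:
            # Skip blank lines, e.g. a trailing newline at the end of the file.
            continue
        urls.append(
            "{}?{}&distance={}&units={}".format(BASE_URL, coords, distance, units)
        )
    return urls
```

Inside the spider, something like `with open("googlecoords.txt") as f: start_urls = build_start_urls(f)` would then replace the two class-level loops, assuming the file is in the directory the crawl is started from.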
The data is JSON... use a JSON parser to work with it.
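Following the comment above, here is a rough sketch of what parsing the body with the standard `json` module could look like instead of the lookbehind regexes (Python 3). The exact layout of the response is not shown in the question, so this assumes only that somewhere in the decoded JSON there are objects carrying the `name`, `street`, `town` and `pc` keys that the regexes look for. The helper names `iter_records` and `extract_branches` are hypothetical.

```python
import json
from itertools import islice

# Keys taken from the regexes in the question.
FIELDS = ("name", "street", "town", "pc")


def iter_records(node):
    """Walk a decoded JSON tree and yield each dict that has all of FIELDS."""
    if isinstance(node, dict):
        if all(key in node for key in FIELDS):
            yield {key: node[key] for key in FIELDS}
        for value in node.values():
            yield from iter_records(value)
    elif isinstance(node, list):
        for value in node:
            yield from iter_records(value)


def extract_branches(body_text, limit=6):
    """Decode the response body and return at most `limit` address records."""
    data = json.loads(body_text)
    return list(islice(iter_records(data), limit))
```

In `parse`, the loop over `range(0, 6)` and the `IndexError` handling could then probably be replaced by iterating over `extract_branches(response.text)` and filling a `HowdensItem` from each record, assuming the item declares those four fields.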