How can I get my parse_page to output the text and values for my item titles? I can only get the href. (Scrapy: extracting XPath text with lxml)
    def parse_page(self, response):
        self.log("\n\n\n Page for one device \n\n\n")
        self.log('Hi, this is the parse_page page! %s' % response.url)
        root = lxml.etree.fromstring(response.body)
        for row in root.xpath('//row'):
            allcells = row.xpath('./cell')
            # ... populate Items
            for cells in allcells:
                item = CiqdisItem()
                # NOTE: .get() looks up an attribute by name, so this line
                # returns None instead of the cell text (the problem described above)
                item['title'] = cells.get(".//text()")
                item['link'] = cells.get("href")
                yield item
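
The missing titles most likely come from `cells.get(".//text()")`: in the ElementTree API that lxml follows, `Element.get()` reads an attribute by name and does not evaluate XPath. A minimal sketch of one way to read the text instead, using `itertext()`, is shown below. The `SAMPLE` document, the `<table>` wrapper, and the helper names `cell_text` and `extract_cells` are assumptions made for illustration, and the sample escapes `&` as `&amp;` so that it is well-formed XML.

```python
try:
    from lxml import etree
except ImportError:
    # Stdlib fallback; the calls used below exist with the same behaviour there.
    import xml.etree.ElementTree as etree

# Hypothetical, shortened version of the XML in the question.
SAMPLE = """<table>
  <row>
    <cell type="plain">6505550000</cell>
    <cell type="href" href="metriclog.jsp?PKG_GID=AF2C&amp;view=list">
      UPTR
      <input type="hidden" value="AF2C"/>
    </cell>
  </row>
</table>"""


def cell_text(cell):
    """Join every text node under the cell and trim surrounding whitespace."""
    return "".join(cell.itertext()).strip()


def extract_cells(xml_string):
    """Return one dict per <cell>, with its text and its href attribute (or None)."""
    root = etree.fromstring(xml_string)
    results = []
    for row in root.iter("row"):
        for cell in row.findall("cell"):  # direct children of the row only
            results.append({
                "title": cell_text(cell),   # element text, not an attribute
                "link": cell.get("href"),   # attribute lookup; None if absent
            })
    return results


cells = extract_cells(SAMPLE)
for entry in cells:
    print(entry)
```

Inside the spider, the same idea would amount to replacing the `item['title']` line with a text lookup such as `"".join(cells.itertext()).strip()`; with lxml, `cells.xpath("string()")` is another option.
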
My XML file:
    <row>
      <cell type="html">
        <input type="checkbox" name="AF2C4452827CF0935B71FAD58652112D" value="AF2C4452827CF0935B71FAD58652112D" onclick="if(typeof(selectPkg)=='function')selectPkg(this);">
      </cell>
      <cell type="plain" style="width: 50px; white-space: nowrap;" visible="false">http://qvpweb01.ciq.labs.att.com:8080/dis/metriclog.jsp?PKG_GID=AF2C4452827CF0935B71FAD58652112D&view=list</cell>
      <cell type="plain">6505550000</cell>
      <cell type="plain">probe0</cell>
      <cell type="href" style="width: 50px; white-space: nowrap;" href="metriclog.jsp?PKG_GID=AF2C4452827CF0935B71FAD58652112D&view=list">
        UPTR
        <input id="savePage_AF2C4452827CF0935B71FAD58652112D" type="hidden" value="AF2C4452827CF0935B71FAD58652112D">
      </cell>
      <cell type="href" href="/dis/packages.jsp?show=perdevice&device_gid=3651746C4173775343535452414567746D75643855673D3D53564A6151624D41716D534C68395A6337634E2F62413D3D&hwdid=probe0&mdn=6505550000&subscrbid=6505550000&triggerfilter=&maxlength=100&view=timeline&date=20100716T050314876" style="white-space: nowrap;">2010-07-16 05:03:14.876</cell>
      <cell type="plain" style="width: 50px; white-space: nowrap;"></cell>
      <cell type="plain" style="white-space: nowrap;"></cell>
      <cell type="plain" style="white-space: nowrap;">2012-10-22 22:40:15.504</cell>
      <cell type="plain" style="width: 70px; white-space: nowrap;">1 - SMS_PullRequest_CS</cell>
      <cell type="href" style="width: 50px; white-space: nowrap;" href="/dis/profile_download?profileId=4294967295">4294967295</cell>
      <cell type="plain" style="width: 50px; white-space: nowrap;">250</cell>
    </row>
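Edit: here is my latest version, showing both methods. The problem with the first method (parse_device_list) is that it does not follow the links in column A consistently and in order: when column A is empty for a row, it picks up the next link, from column B. How can I read only column A, skip the row when column A is null, and keep going down the same column?
The second method (parse_page) does not iterate over all the rows, so the parse is incomplete. How can I get every row?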
    def parse_device_list(self, response):
        self.log("\n\n\n List of devices \n\n\n")
        self.log('Hi, this is the parse_device_list page! %s' % response.url)
        root = lxml.etree.fromstring(response.body)
        for row in root.xpath('//row'):
            allcells = row.xpath('.//cell')
            # first cell contains the link to follow
            detail_page_link = allcells[0].get("href")
            yield Request(urlparse.urljoin(response.url, detail_page_link), callback=self.parse_page)

    def parse_page(self, response):
        self.log("\n\n\n Page for one device \n\n\n")
        self.log('Hi, this is the parse_page page! %s' % response.url)
        xxs = XmlXPathSelector(response)
        for row in xxs.select('//row'):
            for cell in row.select('.//cell'):
                item = CiqdisItem()
                item['title'] = cell.select("text()").extract()
                item['link'] = cell.select("@href").extract()
                yield item
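
For the column A question, one possible approach is to always index the same cell position in each row and skip the row when that cell has no `href`, rather than searching the row for the first link. The sketch below illustrates this under some assumptions: the `DEVICE_LIST` sample, the `http://example.com` base URL, and the helper name `column_links` are hypothetical, and it uses Python 3's `urllib.parse.urljoin` where the code above uses the Python 2 `urlparse` module.

```python
try:
    from lxml import etree
except ImportError:
    # Stdlib fallback; the calls used below exist with the same behaviour there.
    import xml.etree.ElementTree as etree
from urllib.parse import urljoin

# Hypothetical device list: the second row has no link in column A.
DEVICE_LIST = """<table>
  <row>
    <cell type="href" href="a.jsp?id=1">dev1</cell>
    <cell type="href" href="b.jsp?id=1">other</cell>
  </row>
  <row>
    <cell type="plain"></cell>
    <cell type="href" href="b.jsp?id=2">other</cell>
  </row>
  <row>
    <cell type="href" href="/dis/a.jsp?id=3">dev3</cell>
    <cell type="href" href="b.jsp?id=3">other</cell>
  </row>
</table>"""


def column_links(xml_string, base_url, column=0):
    """Collect absolute links from a single column, skipping rows where it is empty."""
    root = etree.fromstring(xml_string)
    links = []
    for row in root.iter("row"):
        cells = row.findall("cell")  # direct children, so positions match columns
        if column >= len(cells):
            continue  # row is shorter than expected
        href = cells[column].get("href")
        if not href:
            continue  # column is empty here: skip the row, do not fall through to column B
        links.append(urljoin(base_url, href))
    return links


links = column_links(DEVICE_LIST, "http://example.com/dis/list.jsp")
print(links)
```

In the spider, each collected URL would be passed to `Request(..., callback=self.parse_page)` instead of being appended to a list. Which index corresponds to column A depends on the actual device list page, which is not shown in the question.
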
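Thanks @alecxe - I just needed to change from HXS to xxs = XmlXPathSelector(response). I also posted another question, [link to my second question](http://stackoverflow.com/questions/17861781/convert-lxml-to-scrapy-xxs-selector), about converting the lxml code to Scrapy's built-in xxs selector. For this one, I first tried doing it with xxs and failed, until someone suggested I try lxml, and it worked with lxml. – Gio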
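Hi @alecxe - after analyzing my crawler's output, I noticed that this does not parse all the rows of each table; it only picks up a few rows, not all of them. All the rows are on the same page (there is no limit on the number of rows per page). It might help if I edit my question and paste both methods. – Gio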