2015-10-19 48 views

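I want to collect a set of links using XPath, and those links in turn have to be scraped from the next page. However, I keep getting the error "can only parse strings". I checked the type of lk, and after casting it is a string. What is going wrong? The error is: ValueError: can only parse strings (Python)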

import unicodedata
import urllib2
from lxml import etree

def unicode_to_string(types):
    # Strip non-ASCII characters; return the input unchanged if it cannot be normalized.
    try:
        types = unicodedata.normalize("NFKD", types).encode('ascii', 'ignore')
        return types
    except Exception:
        return types

def getData():
    req = "http://analytical360.com/access-points"
    page = urllib2.urlopen(req)
    tree = etree.HTML(page.read())
    i = 0
    for lk in tree.xpath('//a[@class="sabai-file sabai-file-image sabai-file-type-jpg "]//@href'):
        print "Scraping Vendor #" + str(i)
        trees = etree.HTML(urllib2.urlopen(unicode_to_string(lk)))
        for ll in trees.xpath('//table[@id="archived"]//tr//td//a//@href'):
            final = etree.HTML(urllib2.urlopen(unicode_to_string(ll)))

Can you post the full traceback? – jgritty


In one place you have 'page = urllib2.urlopen(req); etree.HTML(page.read())', and in the next you have 'etree.HTML(urllib2.urlopen(unicode_to_string(ll)))', which is missing the '.read()' on the object that urlopen returns. – TessellatingHeckler


You need to pass a string, not a urllib2.urlopen object, to 'unicode_to_string' –

Answer


You should pass a string to etree.HTML, not the object returned by urllib2.urlopen.

You could change the code as follows:

trees = etree.HTML(urllib2.urlopen(unicode_to_string(lk)).read())
for i, ll in enumerate(trees.xpath('//table[@id="archived"]//tr//td//a//@href')):
    final = etree.HTML(urllib2.urlopen(unicode_to_string(ll)).read())
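
To see why the added .read() matters, here is a minimal sketch that runs without network access. It assumes Python 3 and uses io.BytesIO as a stand-in for the file-like response that urlopen returns. The HTML snippet and the helper name is_parseable_input are made up for illustration; the helper is only a rough imitation of the kind of type check that makes etree.HTML reject non-string input, not lxml's actual code.

```python
import io


def is_parseable_input(obj):
    """Hypothetical check: True if obj is text or bytes, the kind of input etree.HTML() expects."""
    return isinstance(obj, (str, bytes))


# Stand-in for the response object returned by urlopen(): file-like, not a string.
response = io.BytesIO(b"<html><body><a href='/next'>next</a></body></html>")

passed_directly = is_parseable_input(response)  # what the question's code passes
body = response.read()                          # what the fix passes instead
passed_after_read = is_parseable_input(body)

print(passed_directly, passed_after_read)  # prints: False True
```

The response object itself is not a string, so it fails the check; the bytes returned by .read() pass it.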

Also, you don't seem to be incrementing i.
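
One way to handle the counter, sketched under the assumption that the number is meant to count vendors in the outer loop, is to let enumerate supply it instead of incrementing by hand. The function name label_vendors and the example.com links are placeholders, not data from the page being scraped.

```python
def label_vendors(links):
    """Pair each vendor link with a running number, starting at 0."""
    labels = []
    for i, lk in enumerate(links):
        labels.append("Scraping Vendor #" + str(i) + ": " + lk)
    return labels


# Placeholder links standing in for the hrefs the outer XPath query returns.
vendor_links = ["http://example.com/vendor-a", "http://example.com/vendor-b"]
for line in label_vendors(vendor_links):
    print(line)
```

With enumerate there is no separate i = 0 to initialise and no increment to forget.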