Scrapy is saving URLs with a triple slash (///)

I don't know why Scrapy is doing this, but it happens twice, in different places.

I think it happens twice because I am trying to add http: to the URL.

item['product_link'] = urljoin(ABS_URL,''.join(item['product_link']).replace('/', '').encode('utf-8').strip())

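ABS_URL is where http: gets added. I also tried adding it there, but I keep ending up with three slashes (///). If I don't add anything, the item has only one /.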


2017-08-12 Ignacio Art

That is how urljoin works. If the base contains only a scheme (and no domain part), the result will contain three slashes:

>>> urlparse.urljoin('http://', 'foo.html') 
'http:///foo.html' 
>>> urlparse.urljoin('http:', 'foo.html') 
'http:///foo.html' 
>>> urlparse.urljoin('http://foo', 'bar.html') 
'http://foo/bar.html'
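
As a rough illustration of the rule above, here is a minimal Python 3 sketch (in Python 3 the function lives in urllib.parse rather than urlparse). The example.com URLs and the helper name join_checked are assumptions made up for this example, not part of the question. The helper simply refuses a base that has no host part, which is the case that leads to the triple slash:

```python
from urllib.parse import urljoin, urlsplit


def join_checked(base, link):
    """Join link onto base, refusing a base that has no host part.

    Hypothetical helper for illustration only.
    """
    # urlsplit('http:') and urlsplit('http://') both have an empty netloc.
    if not urlsplit(base).netloc:
        raise ValueError("base URL needs a host, got %r" % base)
    return urljoin(base, link)


# A base with scheme and host joins cleanly.
print(join_checked("http://example.com", "foo.html"))
# A relative link replaces the last path segment of the base.
print(join_checked("http://example.com/a/b.html", "c.html"))
```

Inside a Scrapy callback, response.urljoin(href) performs the same kind of join against the URL of the response, so a separate base constant is often unnecessary.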

From your code, it looks like you are using it only to add the scheme to form product_link. In that case, simple concatenation is enough:

item['product_link'] = 'http:' + ''.join(item['product_link']).replace('/', '').encode('utf-8').strip()
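
The concatenation idea can also be wrapped in a small function. This is only a sketch in Python 3, and it assumes the scraped link is either scheme-relative (//example.com/item/1) or a bare host and path; the name add_scheme and the example.com values are hypothetical and do not come from the question:

```python
def add_scheme(link, scheme="http"):
    """Prefix a scheme onto a link that lacks one (hypothetical helper)."""
    link = link.strip()
    if "://" in link:
        # Already has a scheme; leave it untouched.
        return link
    # Drop leading slashes so '//host/path' and 'host/path' are treated alike,
    # then add exactly one 'scheme://' prefix.
    return scheme + "://" + link.lstrip("/")


print(add_scheme("//example.com/item/1"))
print(add_scheme("example.com/item/1"))
print(add_scheme("https://example.com/item/1"))
```

Because exactly one "://" is written by the function, the result cannot pick up a third slash regardless of how many leading slashes the scraped value had.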


2017-08-12 06:34:43
