2015-11-16 42 views

I am new to Scrapy, and when I run my spider that crawls Behance, Scrapy cannot import the item into my spider (No module named behance.items).

import scrapy 
from scrapy.selector import Selector 
from behance.items import BehanceItem 
from selenium import webdriver 
from scrapy.http import TextResponse 

from scrapy.crawler import CrawlerProcess 

class DmozSpider(scrapy.Spider): 
    name = "behance" 
    #allowed_domains = ["behance.com"] 
    start_urls = [ 
        "https://www.behance.net/gallery/29535305/Mind-Your-Monsters", 
    ] 

    def __init__(self): 
        self.driver = webdriver.Firefox() 

    def parse(self, response): 
        self.driver.get(response.url) 
        response = TextResponse(url=response.url, body=self.driver.page_source, encoding='utf-8') 
        item = BehanceItem() 
        hxs = Selector(response) 

        item['link'] = response.xpath("//div[@class='js-project-module-image-hd project-module module image project-module-image']/@data-hd-src").extract() 

        yield item 

process = CrawlerProcess({ 
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)' 
}) 

process.crawl(DmozSpider) 
process.start() 

I get the following error when I run my crawler:

Traceback (most recent call last): 
  File "/home/davy/behance/behance/spiders/behance_spider.py", line 3, in <module> 
    from behance.items import BehanceItem 

ImportError: No module named behance.items

My directory structure:

behance/ 
├── behance 
│   ├── __init__.py 
│   ├── items.py 
│   ├── pipelines.py 
│   ├── settings.py 
│   └── spiders 
│       ├── __init__.py 
│       └── behance_spider.py 
└── scrapy.cfg 

What are the contents of your items.py file? – narko


@narko `import scrapy  class BehanceItem(scrapy.Item):  # define the fields for your item here like:  # name = scrapy.Field()  link = scrapy.Field()` – Davy

Answers


Try running your spider with this command (from inside the project directory, where scrapy.cfg lives):

scrapy crawl behance 

Or change your spider file:

import scrapy 
from scrapy.selector import Selector 
from behance.items import BehanceItem 
from selenium import webdriver 
from scrapy.http import TextResponse 

from scrapy.crawler import CrawlerProcess 

class BehanceSpider(scrapy.Spider): 
    name = "behance" 
    allowed_domains = ["behance.com"] 
    start_urls = [ 
        "https://www.behance.net/gallery/29535305/Mind-Your-Monsters", 
    ] 

    def __init__(self): 
        self.driver = webdriver.Firefox() 

    def parse(self, response): 
        self.driver.get(response.url) 
        response = TextResponse(url=response.url, body=self.driver.page_source, encoding='utf-8') 
        item = BehanceItem() 
        hxs = Selector(response) 

        item['link'] = response.xpath("//div[@class='js-project-module-image-hd project-module module image project-module-image']/@data-hd-src").extract() 

        yield item 

Then create another Python file in the directory where your settings.py file resides.

run.py

from scrapy.crawler import CrawlerProcess 
from scrapy.utils.project import get_project_settings 

process = CrawlerProcess(get_project_settings()) 

process.crawl("behance") 
process.start() 

Now run this file as a normal Python script: `python run.py`
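As a side note, if run.py ever needs to work when launched from outside the project directory, a common sketch (hypothetical, assuming run.py sits next to scrapy.cfg) is to prepend the script's own directory to sys.path before the Scrapy imports:

```python
import os
import sys

# Resolve the project root relative to this file, not the current working
# directory, so "python /home/davy/behance/run.py" works from anywhere.
project_root = os.path.dirname(os.path.abspath(__file__))
if project_root not in sys.path:
    sys.path.insert(0, project_root)
```

With the root on sys.path, `from behance.items import BehanceItem` resolves regardless of where the script is invoked from.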


I know that using the command also works, but I want to run the spider inside a Python script rather than from the command line – Davy


Then I suggest you change your spider file and run the spider through `CrawlerProcess`. I am updating the answer, take a look. – Rahul


You can add the project root to your Python path:

export PYTHONPATH=$PYTHONPATH:/home/davy/behance/
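To see why this fixes the error, here is a self-contained illustration (it builds a throwaway stand-in package in a temp directory rather than touching the real project): the import fails until the directory containing the `behance` package is on sys.path, which is exactly what the export above does.

```python
import os
import sys
import tempfile

# Recreate a minimal stand-in for the layout: <root>/behance/items.py
root = tempfile.mkdtemp()
pkg = os.path.join(root, "behance")
os.makedirs(pkg)
open(os.path.join(pkg, "__init__.py"), "w").close()
with open(os.path.join(pkg, "items.py"), "w") as f:
    f.write("LINK = 'link'\n")

# Without the root on sys.path, the import fails...
try:
    import behance.items
    found_before = True
except ImportError:
    found_before = False

# ...and succeeds once the root is prepended, mirroring the PYTHONPATH export.
sys.path.insert(0, root)
import behance.items
print(found_before, behance.items.LINK)  # → False link
```

Running the spider file directly (`python behance_spider.py`) puts the spiders/ directory, not the project root, at the front of sys.path, which is why `behance.items` was unresolvable.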