Input/output for a scrapyd instance hosted on an Amazon EC2 Linux instance

I recently started using scrapy to build web scrapers. Initially I deployed my scrapy projects locally using scrapyd.

The scrapy project I built relies on reading data from a CSV file in order to run:

import csv
import datetime

import scrapy

# Method of the spider class: yields one request per row of data.csv.
def search(self, response):
    # Python 3: open CSV files in text mode with newline=''.
    with open('data.csv', newline='') as fin:
        reader = csv.reader(fin)
        for row in reader:
            subscriberID = row[0]
            newEffDate = datetime.datetime.now()
            counter = 0
            yield scrapy.Request(
                url="https://www.healthnet.com/portal/provider/protected/patient/results.action?__checkbox_viewCCDocs=true&subscriberId=" + subscriberID + "&formulary=formulary",
                callback=self.find_term,
                # Values carried along to the find_term callback.
                meta={
                    'ID': subscriberID,
                    'newDate': newEffDate,
                    'counter': counter,
                },
            )

It writes the scraped data to another CSV file:

for x in data:
    # Append mode; newline='' avoids blank lines between rows on Windows.
    with open('missing.csv', 'a', newline='') as fout:
        csvwriter = csv.writer(fout, delimiter=',')
        csvwriter.writerow([oldEffDate.strftime("%m/%d/%Y"), subscriberID, ipa])
        return

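We are in the early stages of building an application that needs to access and run these scrapy spiders. I decided to host my scrapyd instance on an AWS EC2 Linux instance. Deploying to AWS was straightforward (http://bgrva.github.io/blog/2014/04/13/deploy-crawler-to-ec2-with-scrapyd/).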

How do I pass input data to, and retrieve scraped output from, a scrapyd instance running on an AWS EC2 Linux instance?

Edit: I assume passing a file would look like

curl http://my-ec2.amazonaws.com:6800/schedule.json -d project=projectX -d spider=spider2b -d in=file_path

Is this correct? How would I get the output from this spider run? Are there any security concerns with this approach?
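For reference, the curl call above can be sketched in Python. This is a minimal sketch using only the standard library; the helper names `build_schedule_request` and `schedule` are made up for illustration. It relies on the fact that scrapyd's schedule.json endpoint forwards any extra POST parameter to the spider as a spider argument.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_schedule_request(base_url, project, spider, spider_args=None):
    """Build (but do not send) a POST request for scrapyd's schedule.json."""
    params = {"project": project, "spider": spider}
    # Extra parameters are forwarded by scrapyd to the spider as arguments.
    params.update(spider_args or {})
    url = base_url.rstrip("/") + "/schedule.json"
    return Request(url, data=urlencode(params).encode("ascii"), method="POST")


def schedule(base_url, project, spider, spider_args=None):
    """Send the request and return scrapyd's decoded JSON reply."""
    request = build_schedule_request(base_url, project, spider, spider_args)
    with urlopen(request) as resp:
        return json.load(resp)


if __name__ == "__main__":
    # Only builds the request; call schedule(...) to actually contact scrapyd.
    req = build_schedule_request(
        "http://my-ec2.amazonaws.com:6800", "projectX", "spider2b",
        {"in": "file_path"},
    )
    print(req.full_url, req.data)
```

Two caveats: the argument value is only a string, so `file_path` would have to refer to a file that already exists on the EC2 instance, and `in` is a reserved word in Python, so a differently named argument (for example a hypothetical `input_path`) is easier to read inside the spider.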

Source

2017-02-16 Eitan Seri-Levi

Is S3 an option? I'm asking because you're already using EC2. If so, you can read from and write to S3.

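I'm a bit confused because you mention both CSV and JSON formats. If you are reading CSV, you can use CSVFeedSpider. Either way, you can also use boto to read from S3 in your spider's __init__ or start_requests method.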
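Reading the input CSV from S3 could look roughly like the sketch below. It assumes boto3 (the successor to the boto library mentioned above) is installed and AWS credentials are configured on the instance; the helper names are hypothetical, as are the bucket and key in the usage note.

```python
import csv
import io


def parse_subscriber_ids(csv_bytes):
    """Return the first column of every non-empty row in the CSV payload."""
    text = io.StringIO(csv_bytes.decode("utf-8"), newline="")
    return [row[0] for row in csv.reader(text) if row]


def fetch_s3_object(bucket, key):
    """Download one S3 object and return its raw bytes."""
    # Imported lazily so the parser above has no AWS dependency.
    import boto3  # assumption: boto3 is installed, credentials are configured
    return boto3.client("s3").get_object(Bucket=bucket, Key=key)["Body"].read()
```

Inside start_requests, the spider could then loop over `parse_subscriber_ids(fetch_s3_object("my-example-bucket", "input/data.csv"))` and yield one scrapy.Request per ID instead of opening a local data.csv.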

Regarding output, this page explains how to use feed exports to write the scraped output to S3.
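As a rough illustration of that feed export setup, the project's settings.py might contain something like the following. This sketch assumes Scrapy 2.1 or later, where the `FEEDS` setting is available (older releases used `FEED_URI` and `FEED_FORMAT` instead); the bucket name and path are placeholders, and the credentials are assumed to be supplied through environment variables.

```python
# settings.py (sketch): export scraped items to S3 as CSV.
import os

FEEDS = {
    # %(name)s and %(time)s are filled in by Scrapy with the spider name
    # and the start timestamp; the bucket and prefix are placeholders.
    "s3://my-example-bucket/scrapyd-output/%(name)s/%(time)s.csv": {
        "format": "csv",
    },
}

# Assumption: credentials are provided via the environment on the instance.
AWS_ACCESS_KEY_ID = os.environ.get("AWS_ACCESS_KEY_ID")
AWS_SECRET_ACCESS_KEY = os.environ.get("AWS_SECRET_ACCESS_KEY")
```

With a feed configured this way, the spider yields items and lets Scrapy write them, rather than appending to missing.csv itself.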
