Saving and resuming in ScraperWiki - CPU time

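This is my first time doing this, so I'd better apologize in advance for my rookie mistakes. I'm trying to scrape the first page of results from legacy.com by searching for first and last names within a state. I'm new to programming and am using ScraperWiki to run the code. It works, but I run out of CPU time before the roughly 10,000 queries have had time to process. Now I'm trying to save progress, catch the point where time is running out, and then resume from where it stopped.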

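I can't get the save to work, and any help with the other parts would be appreciated too. So far I'm only grabbing the links, but if there is a way to save the main content of the linked pages, that would be really helpful as well.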

Here is my code:

import scraperwiki 

from urllib import urlopen 
from BeautifulSoup import BeautifulSoup 

f = open('/tmp/workfile', 'w') 
#read database, find last, start from there 

def searchname(fname, lname, id, stateid): 
    url = 'http://www.legacy.com/ns/obitfinder/obituary-search.aspx?daterange=Last1Yrs&firstname= %s &lastname= %s &countryid=1&stateid=%s&affiliateid=all' % (fname, lname, stateid) 
    obits=urlopen(url) 
    soup=BeautifulSoup(obits) 
    obits_links=soup.findAll("div", {"class":"obitName"}) 
    print obits_links 
    s = str(obits_links) 
    id2 = int(id) 
    f.write(s) 
    #save the database here 
    scraperwiki.sqlite.save(unique_keys=['id2'], data=['id2', 'fname', 'lname', 'state_id', 's']) 


# Import Data from CSV 
import scraperwiki 
data = scraperwiki.scrape("https://dl.dropbox.com/u/14390755/legacy.csv") 
import csv 
reader = csv.DictReader(data.splitlines()) 
for row in reader: 
    #scraperwiki.sqlite.save(unique_keys=['id'], 'fname', 'lname', 'state_id', data=row) 
    FNAME = str(row['fname']) 
    LNAME = str(row['lname']) 
    ID = str(row['id']) 
    STATE = str(row['state_id']) 
    print "Person: %s %s" % (FNAME,LNAME) 
    searchname(FNAME, LNAME, ID, STATE) 


f.close() 
f = open('/tmp/workfile', 'r') 
data = f.read() 
print data

Source

2012-06-19 Jon P

ScraperWiki is a lovely concept, but it's not ready for prime time. I'd say your first mistake was choosing a f'd-company contender as a platform. – pguardiario

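At the bottom of the CSV loop, write each fname + lname + state combination with save_var. Then, before that loop, add another loop that walks through the rows without processing them until it passes the saved value.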
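A minimal sketch of that checkpoint idea, kept independent of ScraperWiki so it can be run anywhere. The helper names (`rows_after_checkpoint`, `process_with_resume`), the variable name `last_done`, and the `|` separator are made up for illustration. On ScraperWiki, the `get_var` and `save_var` arguments would presumably be `scraperwiki.sqlite.get_var` and `scraperwiki.sqlite.save_var`.

```python
def row_key(row):
    # Combine the three CSV fields into one checkpoint value.
    return "%s|%s|%s" % (row['fname'], row['lname'], row['state_id'])


def rows_after_checkpoint(rows, last_key):
    """Yield only the rows that come after the row matching last_key.

    If last_key is None (nothing saved yet), every row is yielded.
    Caveat: if last_key never appears in rows, nothing is yielded.
    """
    skipping = last_key is not None
    for row in rows:
        if skipping:
            # Skip rows up to and including the checkpointed one.
            if row_key(row) == last_key:
                skipping = False
            continue
        yield row


def process_with_resume(rows, process, get_var, save_var):
    """Process rows in order, recording a checkpoint after each one."""
    last_key = get_var('last_done', None)
    for row in rows_after_checkpoint(rows, last_key):
        process(row)
        # Only reached if process() finished, so a run that dies
        # mid-row will retry that row next time.
        save_var('last_done', row_key(row))
```

In the question's script, `rows` would be the `csv.DictReader` and `process` a small wrapper around `searchname`. If the same name and state can occur more than once in the CSV, the `id` column is probably a safer checkpoint value than the combined key.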

You should be able to write the entire web page into the datastore, but I haven't tested that.
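As a rough sketch of what such a save could look like (also untested against ScraperWiki itself): `scraperwiki.sqlite.save` expects `data` to be a dict mapping column names to values, whereas the code in the question passes a list of column names, which is likely why the save fails. The helpers `build_record` and `save_result` and the column names `links` and `page_html` are hypothetical. The `save` argument stands in for `scraperwiki.sqlite.save` so the logic can be tried without the library.

```python
def build_record(person_id, fname, lname, state_id, links_html, page_html=None):
    """Return one datastore row as a dict of column name -> value."""
    record = {
        'id': int(person_id),
        'fname': fname,
        'lname': lname,
        'state_id': state_id,
        'links': links_html,  # e.g. str(obits_links) from the question
    }
    if page_html is not None:
        # Whole page source, stored as one text column (assumed column name).
        record['page_html'] = page_html
    return record


def save_result(save, record):
    # On ScraperWiki, `save` would be scraperwiki.sqlite.save.
    # unique_keys must name a key that exists in the record dict.
    save(unique_keys=['id'], data=record)
```

Inside `searchname` this might be called as `save_result(scraperwiki.sqlite.save, build_record(id, fname, lname, stateid, s, page_html))`, where `page_html` is the page source read once before it is handed to BeautifulSoup.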

Source

2012-07-08 03:30:19
