How can I use a simple project together with a Scrapy project?

I have a sample Scrapy project. It is almost the default one. Its folder structure:
craiglist_sample/
├── craiglist_sample
│   ├── __init__.py
│   ├── items.py
│   ├── pipelines.py
│   ├── settings.py
│   └── spiders
│       ├── __init__.py
│       └── test.py
└── scrapy.cfg
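When you run scrapy crawl craigs -o items.csv -t csv at the Windows command prompt, it writes the Craigslist items and links to the console.
I want to create an example.py in the main folder and print them in the Python console.
I tried: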
from scrapy import cmdline
cmdline.execute("scrapy crawl craigs".split())
But it writes the same output as the Windows shell does. How can I make it print only the items and the list?
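One low-tech way to see only the items is to read back the file that the export command produces. The following is a minimal sketch, assuming Python 3, that scrapy crawl craigs -o items.csv -t csv has already written items.csv with a header row, and that the columns are named title and link as in the spider. The function names load_items and print_items are hypothetical helpers, not part of Scrapy.

```python
import csv


def load_items(csv_path="items.csv"):
    """Return (title, link) pairs read from a CSV export of the crawl."""
    pairs = []
    # newline="" lets the csv module handle line endings itself.
    with open(csv_path, newline="", encoding="utf-8") as handle:
        # DictReader uses the header row, so column order does not matter.
        for row in csv.DictReader(handle):
            pairs.append((row.get("title", ""), row.get("link", "")))
    return pairs


def print_items(csv_path="items.csv"):
    """Print one 'title | link' line per exported item."""
    for title, link in load_items(csv_path):
        print("%s | %s" % (title, link))
```

Calling print_items() from example.py in the folder that holds items.csv would then print just the titles and links, without any of the crawl log.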
test.py :
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from craiglist_sample.items import CraiglistSampleItem


class MySpider(CrawlSpider):
    name = "craigs"
    # allowed_domains = ["sfbay.craigslist.org"]
    # start_urls = ["http://sfbay.craigslist.org/npo/"]
    allowed_domains = ["craigslist.org"]
    start_urls = ["http://sfbay.tr.craigslist.org/search/npo?"]

    # search\/npo\?s=
    # Follow the "next" button to each results page and parse it.
    rules = (
        Rule(
            SgmlLinkExtractor(
                allow=(r's=\d00',),
                restrict_xpaths=('//a[@class="button next"]',),
            ),
            callback="parse_items",
            follow=True,
        ),
    )

    def parse_items(self, response):
        hxs = HtmlXPathSelector(response)
        titles = hxs.select('//span[@class="pl"]')
        # titles = hxs.select("//p[@class='row']")
        items = []
        # Use a separate loop variable so the `titles` list is not shadowed.
        for title in titles:
            item = CraiglistSampleItem()
            item["title"] = title.select("a/text()").extract()
            item["link"] = title.select("a/@href").extract()
            items.append(item)
        return items
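For running the spider from a script and keeping only the items, here is a rough sketch of what example.py could look like. It assumes a reasonably recent Scrapy release that provides CrawlerProcess, get_project_settings and the item_scraped signal, that the script sits next to scrapy.cfg, and that the spider's own imports match the installed Scrapy version. The names format_item and run_spider are hypothetical helpers introduced only for this illustration.

```python
def format_item(item):
    """Return one 'title | link' line for a scraped item."""
    # The spider stores each field as a list (the result of .extract()).
    title = " ".join(item.get("title", [])).strip()
    link = " ".join(item.get("link", [])).strip()
    return "%s | %s" % (title, link)


def run_spider(spider_name="craigs"):
    """Run the named project spider in-process and return its items."""
    # Imported lazily so format_item can be used without Scrapy installed.
    from scrapy import signals
    from scrapy.crawler import CrawlerProcess
    from scrapy.utils.project import get_project_settings

    settings = get_project_settings()
    settings.set("LOG_ENABLED", False)  # assumption: hide Scrapy's log output

    items = []

    def on_item_scraped(item, response, spider):
        # Called once for every item that passes through the pipelines.
        items.append(item)

    process = CrawlerProcess(settings)
    crawler = process.create_crawler(spider_name)
    crawler.signals.connect(on_item_scraped, signal=signals.item_scraped)
    process.crawl(crawler)
    process.start()  # blocks until the crawl has finished
    return items


# Usage, from the folder that contains scrapy.cfg:
#     for item in run_spider():
#         print(format_item(item))
```

Collecting the items through a signal handler, rather than parsing console output, keeps the script independent of how Scrapy formats its log.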
Thanks for your answer, but I need to run it from a script. I found this page: http://doc.scrapy.org/en/0.16/topics/practices.html#run-scrapy-from-a-script. If I create a .py file in that directory, testspiders seems to work. Could you adapt this spider, https://github.com/scrapinghub/testspiders/blob/master/testspiders/spiders/followall.py, for my spider "MySpider"? – St3114 2015-01-21 10:32:34
The suggested edit integrates it with your work. Use the script you wrote: "from scrapy import cmdline; cmdline.execute("scrapy crawl craigs".split())" – aberna 2015-01-21 10:53:43
@St3114 Did the suggested solution work for you? – aberna 2015-01-23 09:58:08