Can someone help me with this problem? I'm using Scrapy/Python. I can't seem to stop duplicate data from being inserted into the database. For example, say my database already holds a $4000 Mazda. If the 'car' already exists, or the 'car' + 'price' pair already exists, I don't want the spider to insert the scraped data again. How can Scrapy prevent duplicate data from being inserted into the database?
price | car
----------------
$4000 | Mazda   <----
$3000 | Mazda 3 <----
$4000 | BMW
$4000 | Mazda 3 <---- I also don't want two results like this
$4000 | Mazda   <---- or two results like this

Any help will be greatly appreciated - thanks!
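For reference, one way to enforce this independently of the spider is at the database level: put a UNIQUE key on the car column so MySQL itself rejects repeats. This is only a minimal sketch, not part of the original post; the table and column names (data, price, car) come from the pipeline code below, and the connection parameters are assumptions:

import MySQLdb

# Assumed connection parameters (mirroring the pipeline below).
conn = MySQLdb.connect(db='test', user='root', passwd='test', charset='utf8')
cur = conn.cursor()

# One-time schema change: a second row for the same car becomes impossible.
cur.execute("ALTER TABLE data ADD UNIQUE KEY uniq_car (car)")

# INSERT IGNORE silently skips rows that would violate the UNIQUE key.
cur.execute("INSERT IGNORE INTO data (price, car) VALUES (%s, %s)",
            ('$4000', 'Mazda'))
conn.commit()
conn.close()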
pipelines.py
-------------------
from twisted.enterprise import adbapi
from scrapy.exceptions import DropItem

import MySQLdb
import MySQLdb.cursors
----------------------------------
When I add this piece of code, the crawled data does not get saved; when I remove it, the data does save into the database.
class DuplicatesPipeline(object):

    def __init__(self):
        self.car_seen = set()

    def process_item(self, item, spider):
        if item['car'] in self.car_seen:
            raise DropItem("Duplicate item found: %s" % item)
        else:
            self.car_seen.add(item['car'])
            return item
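One thing worth noting about car_seen: the set lives in memory and starts empty on every crawl, so it can only catch duplicates within a single run; it cannot drop items that a previous run already stored in the database. A sketch of one way around that is to seed the set from the database when the spider opens. The class name is made up for illustration, and the connection parameters are assumed to mirror MySQLStorePipeline below:

import MySQLdb
from scrapy.exceptions import DropItem

class SeededDuplicatesPipeline(object):
    """Sketch only: like DuplicatesPipeline, but pre-loads cars already stored."""

    def __init__(self):
        self.car_seen = set()

    def open_spider(self, spider):
        # Assumed connection parameters; adjust to match your database.
        conn = MySQLdb.connect(db='test', user='root', passwd='test')
        cur = conn.cursor()
        cur.execute("SELECT car FROM data")
        for (car,) in cur.fetchall():
            self.car_seen.add(car)
        conn.close()

    def process_item(self, item, spider):
        if item['car'] in self.car_seen:
            raise DropItem("Duplicate item found: %s" % item)
        self.car_seen.add(item['car'])
        return item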
--------------------------------------
class MySQLStorePipeline(object):

    def __init__(self):
        self.dbpool = adbapi.ConnectionPool('MySQLdb',
            db='test',
            user='root',
            passwd='test',
            cursorclass=MySQLdb.cursors.DictCursor,
            charset='utf8',
            use_unicode=False,
        )

    def _conditional_insert(self, tx, item):
        if item.get('price'):
            tx.execute(
                "INSERT INTO data (price, car) VALUES (%s, %s)",
                (item['price'], item['car']),
            )

    def process_item(self, item, spider):
        self.dbpool.runInteraction(self._conditional_insert, item)
        return item
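Alternatively, the deduplication could live inside the storage pipeline itself: check for an existing row and insert within the same runInteraction, so the database is consulted rather than in-process state. This is only a sketch under the same assumptions as above, and the class name is hypothetical:

class DedupingMySQLStorePipeline(MySQLStorePipeline):
    """Sketch only: check-then-insert inside one database interaction."""

    def _conditional_insert(self, tx, item):
        if not item.get('price'):
            return
        # Both statements run on the same connection, so the check and the
        # insert see a consistent view of the table.
        tx.execute("SELECT 1 FROM data WHERE car = %s", (item['car'],))
        if tx.fetchone():
            return  # this car is already stored; skip the insert
        tx.execute("INSERT INTO data (price, car) VALUES (%s, %s)",
                   (item['price'], item['car']))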
settings.py
------------
SPIDER_MODULES = ['car.spiders']
NEWSPIDER_MODULE = 'car.spiders'
ITEM_PIPELINES = ['car.pipelines.MySQLStorePipeline']
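Note also that DuplicatesPipeline is not listed in ITEM_PIPELINES above, so as shown it would never run; both pipelines need to be registered for both to take effect. A sketch of the settings with both enabled (in this list-style ITEM_PIPELINES the pipelines run in list order, so the filter comes first; newer Scrapy versions use a dict with explicit priorities instead):

ITEM_PIPELINES = [
    'car.pipelines.DuplicatesPipeline',   # drop repeats first...
    'car.pipelines.MySQLStorePipeline',   # ...then store what survives
]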