BSON cannot encode object

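I'm using Scrapy to crawl a website, and I'm generating one very large document: it has 3 properties, one of which is an array of more than 5,000 objects, each with a few properties and small arrays inside. Overall it is not that big: written to a file, it should be over 2MB.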

After scraping the object, I use the scrapy-mongodb pipeline to insert it into the database. Every time, I get errors like the ones in this gist: https://gist.github.com/ranisalt/ac572185e11e5918082b

(There are 6 errors in total, 1 for each object, but the crawler output was too large and got cut off.)

The objects that fail to encode are in the large array I mentioned at the top.

What could make an object impossible for pymongo to encode, and which of those causes might apply to my document?

If you need anything else, please ask in the comments.
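One way to narrow a question like this down is to walk the nested document and list every value whose type falls outside the plain types a BSON encoder accepts. The following is a minimal sketch only: `find_unencodable` and `ENCODABLE_SCALARS` are hypothetical names, the allow-list is an approximation rather than pymongo's real type table, and the sample document is invented for illustration.

```python
import datetime
import decimal

# Assumption: a rough allow-list of plain Python types. pymongo accepts more
# than this (ObjectId, for example), so treat hits as hints, not verdicts.
ENCODABLE_SCALARS = (str, bytes, bool, int, float, type(None), datetime.datetime)


def find_unencodable(value, path="$"):
    """Yield (path, type name) for values that are probably not BSON-encodable."""
    if isinstance(value, dict):
        for key, child in value.items():
            if not isinstance(key, str):
                # Document keys have to be strings.
                yield ("{}.{!r}".format(path, key), "non-string key")
                continue
            yield from find_unencodable(child, "{}.{}".format(path, key))
    elif isinstance(value, (list, tuple)):
        for index, child in enumerate(value):
            yield from find_unencodable(child, "{}[{}]".format(path, index))
    elif not isinstance(value, ENCODABLE_SCALARS):
        yield (path, type(value).__name__)


# Invented sample: one large array of small objects, two of them problematic.
doc = {
    "name": "História",
    "items": [
        {"id": 1, "tags": ["a", "b"]},
        {"id": 2, "price": decimal.Decimal("9.50")},
        {"id": 3, "seen": datetime.date(2015, 1, 6)},
    ],
}

problems = list(find_unencodable(doc))
for problem_path, type_name in problems:
    print(problem_path, type_name)
```

Running a check like this on the item just before the pipeline inserts it shows which nested entries to convert (to `dict`, `str`, `float`, `datetime.datetime`, and so on) before handing the document to pymongo.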

Source

2015-01-06 ranisalt

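I tried inserting one of those documents and got no error at all. Which version of MongoDB are you using, and how are you inserting the documents into the db? –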

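I'm using version 2.4.6. Those samples aren't the documents I'm trying to insert, but objects nested inside the document. I'll upload the whole document. – ranisalt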

Here it is: https://gist.github.com/ranisalt/d7320d6993664e87b7c0 This is one whole document to be inserted. – ranisalt

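The problem you're running into is, I believe, caused by escaped characters not being fully converted to UTF-8 before Python inserts them into MongoDB.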

I haven't checked the MongoDB changelog, but if I remember correctly, full Unicode should be supported from v2.2 onward.

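Either way, you have two options: upgrade to a newer version of MongoDB (2.6), or modify/override your scrapy-mongodb script. To change scrapy_mongodb.py, look at these lines, where `k` is not converted to UTF-8 before being inserted into MongoDB: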

# ... previous code ...
key = {}
if isinstance(self.config['unique_key'], list):
    for k in dict(self.config['unique_key']).keys():
        key[k] = item[k]
else:
    key[self.config['unique_key']] = item[self.config['unique_key']]

self.collection.update(key, item, upsert=True)
# ... and the rest ...
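To make the excerpt easier to follow in isolation, here is a standalone sketch of the filter it builds for the upsert. It assumes, as the `dict(...)` call suggests, that `unique_key` is either a single field name or a list of `(field, direction)` pairs. `build_upsert_filter` and the sample item are hypothetical and not part of scrapy-mongodb.

```python
def build_upsert_filter(unique_key, item):
    """Return the filter dict that identifies an already stored document."""
    if isinstance(unique_key, list):
        # dict() over a list of pairs keeps the field names as its keys.
        return {field: item[field] for field in dict(unique_key)}
    return {unique_key: item[unique_key]}


# Invented item, for illustration only.
item = {"url": "http://example.com/a", "title": "História", "views": 3}

single = build_upsert_filter("url", item)
compound = build_upsert_filter([("url", 1), ("title", 1)], item)

print(single)
print(compound)
```

The point to notice is that the values copied into the filter come straight from the item, so whatever encoding issue affects the item affects the filter as well.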

To fix this, you can add these few lines inside the process_item function:

# ... previous code ...
def process_item(self, item, spider):
    """ Process the item and add it to MongoDB

    :type item: Item object
    :param item: The item to put into MongoDB
    :type spider: BaseSpider object
    :param spider: The spider running the queries
    :returns: Item object
    """
    item = dict(self._get_serialized_fields(item))

    # Add a recursive function to convert all unicode to UTF-8.
    # Python 2 only: it relies on `unicode` and `dict.iteritems`.
    # Snippet taken from this SO answer:
    # http://stackoverflow.com/questions/956867/how-to-get-string-objects-instead-of-unicode-ones-from-json-in-python
    def byteify(input):
        if isinstance(input, dict):
            return {byteify(key): byteify(value) for key, value in input.iteritems()}
        elif isinstance(input, list):
            return [byteify(element) for element in input]
        elif isinstance(input, unicode):
            # If the UTF-8 conversion below still does not work, drop the
            # non-ASCII characters completely instead:
            # return input.encode('ASCII', 'ignore')
            return input.encode('utf-8')
        else:
            return input

    # finally replace the item with the converted version
    item = byteify(item)
    # ... rest of the code ... #

If this still doesn't work, I'd suggest upgrading your MongoDB to a newer version.

Hope this helps.

Source

2015-01-06 22:47:42 Anzel

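I don't believe this is a Mongo problem. I've adapted your byteify function, and it helps to "un-unicode" the strings that were unicode before, but now they come out double escaped. Where 'História' used to be unicode-escaped as 'Hist\xf3ria', it is now 'Hist\xc3\xb3ria', and I still can't insert. – ranisalt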

@ranisalt, have you tried `.encode('ASCII', 'ignore')` to actually strip the unicode? – Anzel

Yes, I tried it just now and again it didn't work. I'll try updating Mongo. – ranisalt
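For readers comparing the two escaped forms quoted in the comments: they can be reproduced with a few lines, and they appear to be the same text shown at two different levels (code point versus UTF-8 bytes) rather than a second round of escaping. This is a small sketch written for Python 3, whereas the thread itself uses Python 2; the variable names are illustrative.

```python
text = "História"

# '\xf3' is the escape for the single code point U+00F3 (ó).
same_text = (text == "Hist\xf3ria")

# UTF-8 stores U+00F3 as the two bytes 0xC3 0xB3.
utf8_bytes = text.encode("utf-8")

# The 'ignore' handler drops anything that is not ASCII.
ascii_bytes = text.encode("ascii", "ignore")

print(same_text)    # True
print(utf8_bytes)   # b'Hist\xc3\xb3ria'
print(ascii_bytes)  # b'Histria'
```

Since both conversions produce valid byte strings and the insert still fails, the values in this document are plausibly not the cause on their own, which is consistent with the asker's doubt that the problem lies in character encoding.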
