
Question:

I'm trying to merge two relatively small datasets together, but the merge raises a MemoryError. I have two datasets of aggregate country trade data that I'm trying to merge on the keys year and country, so the data needs to be placed in a particular way. This unfortunately rules out using concat and its performance benefits, as seen in the answer to this question: MemoryError on large merges with pandas in Python

Here's the setup:

The attempted merge:

df = merge(df, i, left_on=['year', 'ComTrade_CC'], right_on=["Year","Partner Code"]) 

Basic data structures:

i:

Year Reporter_Code Trade_Flow_Code Partner_Code Classification Commodity Code Quantity Unit Code Supplementary Quantity Netweight (kg) Value Estimation Code 
0 2003 381  2 36 H2 070951 8 1274 1274 13810 0 
1 2003 381  2 36 H2 070930 8 17150 17150 30626 0 
2 2003 381  2 36 H2 0709 8 20493 20493 635840 0 
3 2003 381  1 36 H2 0507 8 5200 5200 27619 0 
4 2003 381  1 36 H2 050400 8 56439 56439 683104 0 

df:

Importer code  CC ComTrade_CC Distance_miles 
0 110  215  215  757  428.989 
1 110  215  215  757  428.989 
2 110  215  215  757  428.989 
3 110  215  215  757  428.989 
4 110  215  215  757  428.989 

Error traceback:

MemoryError      Traceback (most recent call last) 
<ipython-input-10-8d6e9fb45de6> in <module>() 
     1 for i in c_list: 
----> 2  df = merge(df, i, left_on=['year', 'ComTrade_CC'], right_on=["Year","Partner Code"]) 

/usr/local/lib/python2.7/dist-packages/pandas-0.12.0rc1_309_g9fc8636-py2.7-linux-x86_64.egg/pandas/tools/merge.pyc in merge(left, right, how, on, left_on, right_on, left_index, right_index, sort, suffixes, copy) 
    36       right_index=right_index, sort=sort, suffixes=suffixes, 
    37       copy=copy) 
---> 38  return op.get_result() 
    39 if __debug__: 
    40  merge.__doc__ = _merge_doc % '\nleft : DataFrame' 

/usr/local/lib/python2.7/dist-packages/pandas-0.12.0rc1_309_g9fc8636-py2.7-linux-x86_64.egg/pandas/tools/merge.pyc in get_result(self) 
    193          copy=self.copy) 
    194 
--> 195   result_data = join_op.get_result() 
    196   result = DataFrame(result_data) 
    197 

/usr/local/lib/python2.7/dist-packages/pandas-0.12.0rc1_309_g9fc8636-py2.7-linux-x86_64.egg/pandas/tools/merge.pyc in get_result(self) 
    693     if klass in mapping: 
    694      klass_blocks.extend((unit, b) for b in mapping[klass]) 
--> 695    res_blk = self._get_merged_block(klass_blocks) 
    696 
    697    # if we have a unique result index, need to clear the _ref_locs 

/usr/local/lib/python2.7/dist-packages/pandas-0.12.0rc1_309_g9fc8636-py2.7-linux-x86_64.egg/pandas/tools/merge.pyc in _get_merged_block(self, to_merge) 
    706  def _get_merged_block(self, to_merge): 
    707   if len(to_merge) > 1: 
--> 708    return self._merge_blocks(to_merge) 
    709   else: 
    710    unit, block = to_merge[0] 

/usr/local/lib/python2.7/dist-packages/pandas-0.12.0rc1_309_g9fc8636-py2.7-linux-x86_64.egg/pandas/tools/merge.pyc in _merge_blocks(self, merge_chunks) 
    728   # Should use Fortran order?? 
    729   block_dtype = _get_block_dtype([x[1] for x in merge_chunks]) 
--> 730   out = np.empty(out_shape, dtype=block_dtype) 
    731 
    732   sofar = 0 

MemoryError: 

Thanks for your thoughts!


You appear to have duplicates in 'df'. What happens when you drop the duplicates and then merge? 'df.drop_duplicates(inplace=True)' – EdChum


They're not actually duplicates. df really contains 93 columns, and each observation is unique on year and trading partner. I only put a small subset of the data on SO to avoid confusion. Thanks for the idea though! Also, the merge doesn't seem to be failing from a lack of memory; I'm not using more than 50% of my memory when I run the merge. – agconti


No worries. Another thing worth checking is whether you have any NaN (null) values in any of the columns you're merging on; I'm not sure what you should do about them, but it's worth finding out if you have any. – EdChum
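A quick sketch of that check, assuming df and i are the frames from the question and the key columns are the ones passed to merge:

# Count missing values in the merge-key columns on each side
print(df[['year', 'ComTrade_CC']].isnull().sum())
print(i[['Year', 'Partner Code']].isnull().sum())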

Answer


In case anyone comes across this question in the future and is still having similar trouble with merge, you may be able to get concat to work by renaming the relevant columns in the two dataframes to the same names, setting them as a MultiIndex (i.e. df = df.set_index(['A','B'])), and then using concat to join them.
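A rough sketch of that approach using the column names from the question; the renamed key names 'year' and 'partner' are arbitrary choices for illustration:

import pandas as pd

# Give both frames identically named key columns, then move the keys
# into a MultiIndex so concat can align on them.
df = df.rename(columns={'ComTrade_CC': 'partner'}).set_index(['year', 'partner'])
i = i.rename(columns={'Year': 'year', 'Partner Code': 'partner'}).set_index(['year', 'partner'])

# axis=1 places the two frames side by side, aligning rows on the shared MultiIndex.
df = pd.concat([df, i], axis=1)

Joining on an index this way may avoid some of the intermediate allocations merge makes when matching on columns, which is presumably why concat can succeed where merge hits a MemoryError.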