
Question:

I'm trying to merge two relatively small datasets together, but the merge raises a MemoryError. I have two datasets of aggregate country trade data that I'm trying to merge on the keys year and country, so the data needs to be placed in a particular way. This unfortunately rules out using concat and its performance benefits, as seen in the answer to this question: MemoryError on large merges with pandas in Python

Here's the setup:

The attempted merge:

df = merge(df, i, left_on=['year', 'ComTrade_CC'], right_on=["Year","Partner Code"]) 

Basic data structures:

i:

Year Reporter_Code Trade_Flow_Code Partner_Code Classification Commodity Code Quantity Unit Code Supplementary Quantity Netweight (kg) Value Estimation Code 
0 2003 381  2 36 H2 070951 8 1274 1274 13810 0 
1 2003 381  2 36 H2 070930 8 17150 17150 30626 0 
2 2003 381  2 36 H2 0709 8 20493 20493 635840 0 
3 2003 381  1 36 H2 0507 8 5200 5200 27619 0 
4 2003 381  1 36 H2 050400 8 56439 56439 683104 0 

df:

Importer code  CC ComTrade_CC Distance_miles 
0 110  215  215  757  428.989 
1 110  215  215  757  428.989 
2 110  215  215  757  428.989 
3 110  215  215  757  428.989 
4 110  215  215  757  428.989 

Error traceback:

MemoryError      Traceback (most recent call last) 
<ipython-input-10-8d6e9fb45de6> in <module>() 
     1 for i in c_list: 
----> 2  df = merge(df, i, left_on=['year', 'ComTrade_CC'], right_on=["Year","Partner Code"]) 

/usr/local/lib/python2.7/dist-packages/pandas-0.12.0rc1_309_g9fc8636-py2.7-linux-x86_64.egg/pandas/tools/merge.pyc in merge(left, right, how, on, left_on, right_on, left_index, right_index, sort, suffixes, copy) 
    36       right_index=right_index, sort=sort, suffixes=suffixes, 
    37       copy=copy) 
---> 38  return op.get_result() 
    39 if __debug__: 
    40  merge.__doc__ = _merge_doc % '\nleft : DataFrame' 

/usr/local/lib/python2.7/dist-packages/pandas-0.12.0rc1_309_g9fc8636-py2.7-linux-x86_64.egg/pandas/tools/merge.pyc in get_result(self) 
    193          copy=self.copy) 
    194 
--> 195   result_data = join_op.get_result() 
    196   result = DataFrame(result_data) 
    197 

/usr/local/lib/python2.7/dist-packages/pandas-0.12.0rc1_309_g9fc8636-py2.7-linux-x86_64.egg/pandas/tools/merge.pyc in get_result(self) 
    693     if klass in mapping: 
    694      klass_blocks.extend((unit, b) for b in mapping[klass]) 
--> 695    res_blk = self._get_merged_block(klass_blocks) 
    696 
    697    # if we have a unique result index, need to clear the _ref_locs 

/usr/local/lib/python2.7/dist-packages/pandas-0.12.0rc1_309_g9fc8636-py2.7-linux-x86_64.egg/pandas/tools/merge.pyc in _get_merged_block(self, to_merge) 
    706  def _get_merged_block(self, to_merge): 
    707   if len(to_merge) > 1: 
--> 708    return self._merge_blocks(to_merge) 
    709   else: 
    710    unit, block = to_merge[0] 

/usr/local/lib/python2.7/dist-packages/pandas-0.12.0rc1_309_g9fc8636-py2.7-linux-x86_64.egg/pandas/tools/merge.pyc in _merge_blocks(self, merge_chunks) 
    728   # Should use Fortran order?? 
    729   block_dtype = _get_block_dtype([x[1] for x in merge_chunks]) 
--> 730   out = np.empty(out_shape, dtype=block_dtype) 
    731 
    732   sofar = 0 

MemoryError: 

Thanks for your thoughts!


You appear to have duplicates in 'df'. What happens when you drop the duplicates and then merge? 'df.drop_duplicates(inplace=True)' – EdChum


They're not actually duplicates. df really contains 93 columns, and each observation is unique on year and trading partner. I only put a small subset of the data on SO to avoid confusion. Thanks for the idea though! Also, the merge doesn't seem to be failing from a lack of memory; I'm not using more than 50% of my memory when I run the merge. – agconti


No worries. Another thing worth checking is whether you have any NaN (null) values in any of the columns you're merging on; I'm not sure what you should do about them, but it's worth finding out if you have any. – EdChum
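A quick sketch of that check, assuming df and i are the frames from the question and the key columns are the ones passed to merge:

# Count missing values in the merge-key columns on each side
print(df[['year', 'ComTrade_CC']].isnull().sum())
print(i[['Year', 'Partner Code']].isnull().sum())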

Answer


In case anyone comes across this question in the future and is still having similar trouble with merge, you may be able to get concat to work by renaming the relevant columns in the two dataframes to the same names, setting them as a MultiIndex (i.e. df = df.set_index(['A','B'])), and then using concat to join them.
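A rough sketch of that approach using the column names from the question; the renamed key names 'year' and 'partner' are arbitrary choices for illustration:

import pandas as pd

# Give both frames identically named key columns, then move the keys
# into a MultiIndex so concat can align on them.
df = df.rename(columns={'ComTrade_CC': 'partner'}).set_index(['year', 'partner'])
i = i.rename(columns={'Year': 'year', 'Partner Code': 'partner'}).set_index(['year', 'partner'])

# axis=1 places the two frames side by side, aligning rows on the shared MultiIndex.
df = pd.concat([df, i], axis=1)

Joining on an index this way may avoid some of the intermediate allocations merge makes when matching on columns, which is presumably why concat can succeed where merge hits a MemoryError.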