MemoryError when performing large merges with pandas in Python

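I'm using pandas to perform an outer merge on a set of about 1000-2000 CSV files. Each CSV file has an identifier column id that is shared across all the CSV files, but each file has its own unique set of 3-5 columns. There are roughly 20,000 unique id rows in each file. All I want to do is merge these together, bringing all the new columns together and using the id column as the merge index.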

I do it with a simple merge call:

merged_df = first_df # first csv file dataframe 
for next_filename in filenames: 
    # load up the next df 
    # ... 
    merged_df = merged_df.merge(next_df, on=["id"], how="outer")

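The problem is that with nearly 2000 CSV files, the merge operation in pandas throws a MemoryError. I'm not sure whether this is a limitation caused by a problem in the merge operation itself?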

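The final dataframe would have 20,000 rows and roughly (2000 x 3) = 6000 columns. This is large, but not large enough to consume all the memory on the computer I am using, which has over 20 GB of RAM. Is this size too much for pandas to handle? Should I be using something like sqlite instead? Is there something I can change in the merge operation to make it work at this scale?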
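For a rough sense of scale, the dense result can be sized with back-of-the-envelope arithmetic. This is only a sketch: it assumes every cell is stored as an 8-byte float64 (the column dtypes are not stated here), and it ignores the intermediate copies that each merge call allocates along the way.

```python
# Rough size of the final frame, assuming float64 (8 bytes per cell).
rows = 20_000
cols = 2_000 * 3          # about 3 new columns per file
bytes_per_cell = 8        # assumption: every column is float64

total_bytes = rows * cols * bytes_per_cell
total_gb = total_bytes / 1e9
print(f"{total_gb:.2f} GB")  # prints "0.96 GB"
```

Under that assumption the finished frame is under 1 GB, so any memory pressure would have to come from temporary copies made during the repeated merges rather than from the final result.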

Thanks.

Source

2013-06-19 user248237dfsf

I think you'll get better performance using concat (which acts like an outer join):

dfs = (pd.read_csv(filename).set_index('id') for filename in filenames) 
merged_df = pd.concat(dfs, axis=1)

This means you are doing only one merge operation, rather than one per file.
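To see the outer-join behaviour on something small, here is a minimal sketch. The three in-memory CSV strings and their column names (a1, a2, b1, c1) are made up for illustration and stand in for the real files.

```python
import io

import pandas as pd

# Three tiny in-memory "files" stand in for the real CSVs (hypothetical data).
csv_texts = [
    "id,a1,a2\n1,10,11\n2,20,21\n",
    "id,b1\n2,200\n3,300\n",
    "id,c1\n1,1000\n3,3000\n",
]

# Generator of frames indexed by id, mirroring the snippet above.
dfs = (pd.read_csv(io.StringIO(text)).set_index("id") for text in csv_texts)

# One alignment over all frames; join="outer" is the default for pd.concat.
merged_df = pd.concat(dfs, axis=1)
print(merged_df)
```

Every id that appears in any file gets a row, and cells for files that lack that id are filled with NaN. This relies on the ids being unique within each file, as described in the question.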

Source

2013-06-19 19:04:37

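As for memory, you should be able to use a generator expression rather than a list comprehension... (although I don't know the inner workings of 'concat') – root

@root Well, a generator can only be better, I think (worst case, it just converts it to a list) :) –

@root Good point btw! (tbh I didn't know concat would accept a generator!) –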

I ran into the same error in 32-bit Python when using read_csv with a 1 GB file. Try the 64-bit version; hopefully that will solve the memory error.
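A quick way to check which build is running, using only the standard library (a minimal sketch; the variable names are just for illustration):

```python
import struct
import sys

# Pointer size in bits: 32 on a 32-bit interpreter, 64 on a 64-bit one.
pointer_bits = struct.calcsize("P") * 8

# Equivalent check: sys.maxsize only exceeds 2**32 on a 64-bit build.
is_64bit = sys.maxsize > 2**32

print(f"{pointer_bits}-bit Python")
```

A 32-bit process is limited to a few GB of address space regardless of how much RAM the machine has, which is why the build matters here.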

Source

2014-12-18 07:01:57

pd.concat seems to run out of memory for large dataframes. One option is to convert the dfs to matrices and concatenate those.

import numpy as np
import pandas as pd

def concat_df_by_np(df1, df2):
    """
    Accepts two dataframes, converts each to a NumPy array, concatenates them
    horizontally and uses the index of the first dataframe. This is not a concat
    by index but simply by position, so both dataframes should have the same index.
    """
    # DataFrame.as_matrix() was removed in pandas 1.0; .values is the equivalent.
    # np.concatenate already returns a new array, so no extra deepcopy is needed.
    dfout = pd.DataFrame(np.concatenate((df1.values, df2.values), axis=1),
                         index=df1.index,
                         columns=np.concatenate([df1.columns, df2.columns]))
    if not df1.index.equals(df2.index):
        # logging.warning('Indices in concat_df_by_np are not the same')
        print('Indices in concat_df_by_np are not the same')
    return dfout

Be careful, though: this function is not a join but a horizontal append, and the indices are ignored.
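If the two frames hold the same ids in a different order, one way to keep the positional append safe is to reorder the second frame first. The sketch below is an assumption-laden variant, not part of the function above: append_columns_aligned is a hypothetical helper name, it uses DataFrame.to_numpy() (available from pandas 0.24), and it behaves like a left join, so ids that appear only in df2 are dropped.

```python
import numpy as np
import pandas as pd

def append_columns_aligned(df1, df2):
    """Reorder df2 to df1's index, then append its columns by position."""
    # reindex puts df2's rows in df1's order; ids missing from df2 become NaN.
    df2_aligned = df2.reindex(df1.index)
    values = np.concatenate((df1.to_numpy(), df2_aligned.to_numpy()), axis=1)
    return pd.DataFrame(values, index=df1.index,
                        columns=list(df1.columns) + list(df2.columns))

# Same ids, different row order (made-up data).
df1 = pd.DataFrame({"a": [1.0, 2.0, 3.0]}, index=pd.Index([1, 2, 3], name="id"))
df2 = pd.DataFrame({"b": [30.0, 10.0, 20.0]}, index=pd.Index([3, 1, 2], name="id"))

result = append_columns_aligned(df1, df2)
print(result)
```

Here the value 10.0 from df2 ends up on the row for id 1 even though it sat in the second position of df2.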

Source

2017-03-30 13:33:14 horseshoe
