Processing large pandas DataFrames (fuzzy matching)

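I want to do fuzzy matching, where I match strings from a column of a large DataFrame (130,000 rows) against a list (400 rows). The code I wrote was tested on a small sample (matching 3,000 rows against 400 rows) and works fine. It is too large to copy here, but it roughly works like this: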

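1) Normalize the data in the columns.
2) Create the Cartesian product of the columns and calculate the Levenshtein distance.
3) Select the highest-scoring matches and store them in a separate list, 'large_csv_names'.
4) Compare the list 'large_csv_names' to 'large_csv', pull out all intersecting data, and write it to a CSV.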
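Steps 2 and 3 can be sketched without building the full Cartesian product, by scoring one name at a time against the short list and keeping only the best hit. This is a minimal pure-Python sketch, not the asker's code (which was not posted); `levenshtein` and `best_match` are hypothetical helper names, and a dedicated string-matching library would likely be much faster.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def best_match(name, candidates):
    """Return (candidate, distance) for the candidate closest to name."""
    scored = ((c, levenshtein(name, c)) for c in candidates)
    return min(scored, key=lambda pair: pair[1])


print(best_match("jon smith", ["john smith", "jane doe"]))
```

Because only one row of the distance table and one best candidate are held at a time, memory use stays small regardless of how many rows are scored. Note that `best_match` raises `ValueError` if `candidates` is empty.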

Since the Cartesian product contains more than 50 million records, I quickly run into memory errors.
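The size of that product, and how much chunking would shrink it, can be checked with a little arithmetic. The row counts come from the question; the chunk size of 500 is one of the values tried in the question, and `pair_counts` is a hypothetical helper name.

```python
import math


def pair_counts(n_rows, n_names, chunk_size):
    """Return (total pairs, pairs held per chunk, number of chunks)."""
    total = n_rows * n_names
    per_chunk = min(chunk_size, n_rows) * n_names
    n_chunks = math.ceil(n_rows / chunk_size)
    return total, per_chunk, n_chunks


print(pair_counts(130_000, 400, 500))  # (52000000, 200000, 260)
```

With chunks of 500 rows, only 200,000 pairs need to exist at once instead of 52 million, at the cost of 260 passes over the short list.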

That is why I would like to know how to split the large dataset into chunks on which I can then run my script.

So far I have tried:

df_split = np.array_split(df, x)  # x = number of chunks, e.g. 50 or 500
for i in df_split:
    ...  # steps 1-4 as above

As well as:

for chunk in pd.read_csv('large_csv.csv', chunksize=x):  # x = rows per chunk, e.g. 50 or 500
    ...  # steps 1-4 as above

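Neither of these methods seems to work. What I would like to know is how to run the fuzzy matching in chunks: cut the large CSV into pieces, take a piece, run the code, take the next piece, run the code, and so on.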
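One way this could look is sketched below, assuming the `chunksize` reader from the second attempt and a per-row best match instead of a Cartesian product. The function name, the column name, and the cutoff are hypothetical, and scoring uses the standard library's `difflib.get_close_matches` (a similarity ratio) as a stand-in for Levenshtein distance.

```python
import difflib
import io

import pandas as pd


def match_in_chunks(csv_source, column, names, chunksize=500, cutoff=0.8):
    """Fuzzy-match one column against names, one chunk at a time."""
    matched = []
    for chunk in pd.read_csv(csv_source, chunksize=chunksize):
        # Best candidate per row, or None when nothing reaches the cutoff.
        best = chunk[column].astype(str).map(
            lambda value: next(
                iter(difflib.get_close_matches(value, names, n=1, cutoff=cutoff)),
                None,
            )
        )
        hits = chunk.assign(match=best).dropna(subset=["match"])
        if not hits.empty:
            matched.append(hits)  # only matched rows are kept in memory
    return pd.concat(matched, ignore_index=True) if matched else pd.DataFrame()


# Tiny in-memory stand-in for 'large_csv.csv'.
demo_csv = io.StringIO("name\njon smith\nacme corp\nzzzz\n")
result = match_in_chunks(demo_csv, "name", ["john smith", "acme corp."], chunksize=2)
print(result)
```

The combined result can then be written out with `result.to_csv(...)`, which corresponds to step 4. If even the matched rows are too many to hold in memory, each chunk's hits could instead be appended to the output file inside the loop.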

Source

2017-09-03 Michiel V.

You might want to check out [dask](https://dask.pydata.org/en/latest/), which can lazily load DataFrames from disk – Quickbeam2k1

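In the meantime I wrote a script that slices a DataFrame into chunks, each of which can then be processed further. Since I am new to Python, the code is probably a bit messy, but I still wanted to share it with anyone who might be stuck on the same problem.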

import math
import pandas as pd

# df is assumed to be the large DataFrame loaded earlier,
# e.g. df = pd.read_csv('large_csv.csv')
partitions = 3  # number of chunks to split df into
length = len(df)
chunk_size = math.ceil(length / partitions)  # rows per chunk (the last chunk may be smaller)

block_counter0 = 0  # start position of the current slice
while block_counter0 < length:  # stop slicing when df ends
    # end position of the current slice (exclusive), capped at the end of df
    block_counter1 = min(block_counter0 + chunk_size, length)
    df1 = df.iloc[block_counter0:block_counter1]  # temporary df that forms the chunk
    for i in range(len(df1)):
        row = df1.iloc[i]
        # insert operations on a row of df1 here

    block_counter0 = block_counter1  # move on to the next chunk
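The same slicing idea can be expressed more compactly as a generator that yields only the start and stop positions of each block. This is a sketch of one possible refactoring, not part of the original answer; `chunk_bounds` is a hypothetical name.

```python
import math


def chunk_bounds(length, partitions):
    """Yield (start, stop) positions splitting range(length) into at most `partitions` blocks."""
    if length == 0:
        return
    size = math.ceil(length / partitions)
    for start in range(0, length, size):
        yield start, min(start + size, length)


print(list(chunk_bounds(10, 3)))  # [(0, 4), (4, 8), (8, 10)]
```

With a DataFrame this would be used as `for start, stop in chunk_bounds(len(df), 3): df1 = df.iloc[start:stop]`, which keeps the bookkeeping separate from the per-chunk work.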

Source

2017-09-09 18:02:39
