2016-11-25

Fuzzy matching between two DataFrames (shortcuts and campaigns) is slow. I have DataFrame A (df_cam) with cli id and origin:

cli id |   origin 
------------------------------------ 
123 | 1234 M-MKT XYZklm 05/2016 

And DataFrame B (df_dict):

shortcut |   campaign 
------------------------------------ 
M-MKT | Mobile Marketing Outbound 

I know that the example client origin 1234 M-MKT XYZklm 05/2016 actually comes from the campaign Mobile Marketing Outbound, because it contains the keyword M-MKT.

Note that the shortcuts are generic keywords; the algorithm has to decide what matches. The origin could also be M-Marketing, MMKT, or Mob-MKT. I created the shortcut list manually by analyzing all the origins first, and I already use a regex to clean origin before it enters the program.

I want to match client origins to campaigns via the shortcuts and attach a score so I can see how close the match is, as shown below:

cli id | shortcut |   origin   |  campaign   | Score 
--------------------------------------------------------------------------------- 
123 | M-MKT | 1234 M-MKT XYZklm 05/2016 | Mobile Marketing Outbound | 0.93 
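The score column comes from fuzzywuzzy's `fuzz.token_sort_ratio`, which tokenizes both strings, sorts the tokens, and compares the sorted results. A minimal stdlib sketch of the same idea (using `difflib`; the helper name `token_sort_score` is my own, and it returns 0.0-1.0 where fuzzywuzzy returns 0-100):

```python
from difflib import SequenceMatcher

def token_sort_score(a, b):
    """Sort whitespace tokens in each string, then compare the results.

    Mirrors the idea behind fuzz.token_sort_ratio: word order stops
    mattering once both sides are sorted.
    """
    sorted_a = " ".join(sorted(a.split()))
    sorted_b = " ".join(sorted(b.split()))
    return SequenceMatcher(None, sorted_a, sorted_b).ratio()

# Word order no longer matters after sorting:
print(token_sort_score("Mobile Marketing", "Marketing Mobile"))  # 1.0
# A shortcut buried inside a longer origin still scores well above 0:
print(token_sort_score("M-MKT", "1234 M-MKT XYZklm 05/2016"))
```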

Below is my program. It works, but it is really slow: DataFrame A has ~400,000 rows and DataFrame B has ~40 rows.

Is there a way to make it faster?

from fuzzywuzzy import fuzz 
list_values = df_dict['Shortcut'].values.tolist() 

def TopFuzzMatch(tokenA, dict_, position, value): 
    """ 
    Calculates similarity between two tokens and returns TOP match and score 
    ----------------------------------------------------------------------- 
    tokenA: similarity to this token will be calculated 
    dict_: list with shortcuts 
    position: whether I want first, second, third...TOP position 
    value: 0=similarity score, 1=associated shortcut 
    ----------------------------------------------------------------------- 
    """ 
    sim = [(fuzz.token_sort_ratio(x, tokenA), x) for x in dict_] 
    sim.sort(key=lambda tup: tup[0], reverse=True) 
    return sim[position][value] 

df_cam['1st_choice_short'] = df_cam.apply(lambda x: TopFuzzMatch(x['cli_origin'],list_values,0,1), axis=1) 
df_cam['1st_choice_sim'] = df_cam.apply(lambda x: TopFuzzMatch(x['cli_origin'],list_values,0,0), axis=1) 

Note that I also want to compute the 2nd and 3rd best matches to evaluate accuracy.
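Since the full score list only needs to be sorted once per origin, the 1st, 2nd, and 3rd best matches can come out of a single pass instead of re-running the matcher per rank. A hedged stdlib sketch (the `scorer` default stands in for `fuzz.token_sort_ratio`, and the shortcut list is illustrative):

```python
from difflib import SequenceMatcher

def top_n_matches(origin, shortcuts, n=3, scorer=None):
    """Score origin against every shortcut once and return the n best
    (score, shortcut) pairs, best first."""
    if scorer is None:
        scorer = lambda a, b: SequenceMatcher(None, a, b).ratio()
    sim = sorted(((scorer(s, origin), s) for s in shortcuts), reverse=True)
    return sim[:n]

matches = top_n_matches("1234 M-MKT XYZklm 05/2016",
                        ["M-MKT", "E-MAIL", "TELE"])
# matches[0] is the best (score, shortcut) pair, matches[1] the runner-up, ...
```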

EDIT

I found the process.extractOne method, but the speed stays the same. So my code now looks like this:

from fuzzywuzzy import process 

def TopFuzzMatch(token, dict_, value): 
    score = process.extractOne(token, dict_, scorer=fuzz.token_sort_ratio) 
    return score[value] 
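Whatever the scorer, calling the matcher twice per row (once for the shortcut, once for the score) doubles the work. A sketch of filling both columns from a single pass (column names follow the question; `best_match` and its `difflib` scorer are stand-ins of my own, not the question's fuzzywuzzy call):

```python
import pandas as pd
from difflib import SequenceMatcher

def best_match(origin, shortcuts):
    """Return (shortcut, score) for the single best shortcut."""
    scorer = lambda s: SequenceMatcher(None, s, origin).ratio()
    best = max(shortcuts, key=scorer)
    return best, scorer(best)

df_cam = pd.DataFrame({"cli_origin": ["1234 M-MKT XYZklm 05/2016"]})
list_values = ["M-MKT", "E-MAIL"]

# One matcher call per row; expand the (shortcut, score) tuple afterwards.
pairs = df_cam["cli_origin"].map(lambda o: best_match(o, list_values))
df_cam["1st_choice_short"] = pairs.map(lambda p: p[0])
df_cam["1st_choice_sim"] = pairs.map(lambda p: p[1])
```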

Answer


I found a solution: after cleaning the origin column with a regex (no digits or special characters), there are only a few hundred distinct values, so I run the fuzzy matching only on those, which cuts the runtime dramatically.

import re 

import numpy as np 
import pandas as pd 
from fuzzywuzzy import process 

def TopFuzzMatch(df_cam, df_dict): 
    """ 
    Calculates similarity between two tokens and returns the TOP match 
    The idea is to do it only over distinct values in the given DF (takes ages otherwise) 
    ----------------------------------------------------------------------- 
    df_cam: DataFrame with client id and origin 
    df_dict: DataFrame with the abbreviation (shortcut) matched to the description I need 
    ----------------------------------------------------------------------- 
    """ 
    #Clean special characters and numbers 
    df_cam['clean_camp'] = df_cam.apply(lambda x: re.sub('[^A-Za-z]+', '',x['origin']), axis=1) 

    #Get unique values and calculate similarity 
    uq_origin = np.unique(df_cam['clean_camp'].values.ravel()) 
    top_match = [process.extractOne(x, df_dict['Shortcut'])[0] for x in uq_origin] 

    #To DataFrame 
    df_match = pd.DataFrame({'unique': uq_origin}) 
    df_match['top_match'] = top_match 

    #Merge 
    df_cam = pd.merge(df_cam, df_match, how = 'left', left_on = 'clean_camp', right_on = 'unique') 
    df_cam = pd.merge(df_cam, df_dict, how = 'left', left_on = 'top_match', right_on = 'Shortcut') 

    return df_cam 

df_out = TopFuzzMatch(df_cam, df_dict)
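The same "score each distinct value once" idea also works without the unique/merge round-trip: memoize the matcher on the cleaned origin, so repeated values hit a cache. A stdlib sketch (the `difflib`-based scorer and the `SHORTCUTS` tuple are illustrative stand-ins for `process.extractOne` and df_dict):

```python
import re
from functools import lru_cache
from difflib import SequenceMatcher

SHORTCUTS = ("M-MKT", "E-MAIL")  # illustrative dictionary

@lru_cache(maxsize=None)
def match_cleaned(clean_origin):
    """Best shortcut for an already-cleaned origin; cached per distinct value."""
    return max(SHORTCUTS,
               key=lambda s: SequenceMatcher(None, s, clean_origin).ratio())

def match_origin(origin):
    # Same cleaning as the answer: keep letters only.
    return match_cleaned(re.sub('[^A-Za-z]+', '', origin))

# With ~400k rows but only a few hundred distinct cleaned values,
# almost every call is a cache hit.
```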