2016-11-25

Fuzzy matching between two DataFrames (shortcuts and campaigns) is slow. I have DataFrame A (df_cam) with cli id and origin:

cli id |   origin 
------------------------------------ 
123 | 1234 M-MKT XYZklm 05/2016 

And DataFrame B (df_dict):

shortcut |   campaign 
------------------------------------ 
M-MKT | Mobile Marketing Outbound 

I know that the example client origin 1234 M-MKT XYZklm 05/2016 actually comes from the campaign Mobile Marketing Outbound, because it contains the keyword M-MKT.

Note that the shortcuts are generic keywords; the algorithm has to decide what matches. The origin could also be M-Marketing, MMKT, or Mob-MKT. I created the shortcut list manually by analyzing all the origins first, and I already use a regex to clean origin before it enters the program.

I want to match client origins to campaigns via the shortcuts and attach a score so I can see how close the match is, as shown below:

cli id | shortcut |   origin   |  campaign   | Score 
--------------------------------------------------------------------------------- 
123 | M-MKT | 1234 M-MKT XYZklm 05/2016 | Mobile Marketing Outbound | 0.93 
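The score column comes from fuzzywuzzy's `fuzz.token_sort_ratio`, which tokenizes both strings, sorts the tokens, and compares the sorted results. A minimal stdlib sketch of the same idea (using `difflib`; the helper name `token_sort_score` is my own, and it returns 0.0-1.0 where fuzzywuzzy returns 0-100):

```python
from difflib import SequenceMatcher

def token_sort_score(a, b):
    """Sort whitespace tokens in each string, then compare the results.

    Mirrors the idea behind fuzz.token_sort_ratio: word order stops
    mattering once both sides are sorted.
    """
    sorted_a = " ".join(sorted(a.split()))
    sorted_b = " ".join(sorted(b.split()))
    return SequenceMatcher(None, sorted_a, sorted_b).ratio()

# Word order no longer matters after sorting:
print(token_sort_score("Mobile Marketing", "Marketing Mobile"))  # 1.0
# A shortcut buried inside a longer origin still scores well above 0:
print(token_sort_score("M-MKT", "1234 M-MKT XYZklm 05/2016"))
```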

Below is my program. It works, but it is really slow: DataFrame A has ~400,000 rows and DataFrame B has ~40 rows.

Is there a way to make it faster?

from fuzzywuzzy import fuzz 
list_values = df_dict['Shortcut'].values.tolist() 

def TopFuzzMatch(tokenA, dict_, position, value): 
    """ 
    Calculates similarity between two tokens and returns TOP match and score 
    ----------------------------------------------------------------------- 
    tokenA: similarity to this token will be calculated 
    dict_: list with shortcuts 
    position: whether I want first, second, third...TOP position 
    value: 0=similarity score, 1=associated shortcut 
    ----------------------------------------------------------------------- 
    """ 
    sim = [(fuzz.token_sort_ratio(x, tokenA), x) for x in dict_] 
    sim.sort(key=lambda tup: tup[0], reverse=True) 
    return sim[position][value] 

df_cam['1st_choice_short'] = df_cam.apply(lambda x: TopFuzzMatch(x['cli_origin'],list_values,0,1), axis=1) 
df_cam['1st_choice_sim'] = df_cam.apply(lambda x: TopFuzzMatch(x['cli_origin'],list_values,0,0), axis=1) 

Note that I also want to compute the 2nd and 3rd best matches to evaluate accuracy.
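Since the full score list only needs to be sorted once per origin, the 1st, 2nd, and 3rd best matches can come out of a single pass instead of re-running the matcher per rank. A hedged stdlib sketch (the `scorer` default stands in for `fuzz.token_sort_ratio`, and the shortcut list is illustrative):

```python
from difflib import SequenceMatcher

def top_n_matches(origin, shortcuts, n=3, scorer=None):
    """Score origin against every shortcut once and return the n best
    (score, shortcut) pairs, best first."""
    if scorer is None:
        scorer = lambda a, b: SequenceMatcher(None, a, b).ratio()
    sim = sorted(((scorer(s, origin), s) for s in shortcuts), reverse=True)
    return sim[:n]

matches = top_n_matches("1234 M-MKT XYZklm 05/2016",
                        ["M-MKT", "E-MAIL", "TELE"])
# matches[0] is the best (score, shortcut) pair, matches[1] the runner-up, ...
```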

EDIT

I found the process.extractOne method, but the speed stays the same. So my code now looks like this:

from fuzzywuzzy import process 

def TopFuzzMatch(token, dict_, value): 
    score = process.extractOne(token, dict_, scorer=fuzz.token_sort_ratio) 
    return score[value] 
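Whatever the scorer, calling the matcher twice per row (once for the shortcut, once for the score) doubles the work. A sketch of filling both columns from a single pass (column names follow the question; `best_match` and its `difflib` scorer are stand-ins of my own, not the question's fuzzywuzzy call):

```python
import pandas as pd
from difflib import SequenceMatcher

def best_match(origin, shortcuts):
    """Return (shortcut, score) for the single best shortcut."""
    scorer = lambda s: SequenceMatcher(None, s, origin).ratio()
    best = max(shortcuts, key=scorer)
    return best, scorer(best)

df_cam = pd.DataFrame({"cli_origin": ["1234 M-MKT XYZklm 05/2016"]})
list_values = ["M-MKT", "E-MAIL"]

# One matcher call per row; expand the (shortcut, score) tuple afterwards.
pairs = df_cam["cli_origin"].map(lambda o: best_match(o, list_values))
df_cam["1st_choice_short"] = pairs.map(lambda p: p[0])
df_cam["1st_choice_sim"] = pairs.map(lambda p: p[1])
```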

Answer


I found a solution: after cleaning the origin column with a regex (no digits or special characters), there are only a few hundred distinct values, so I run the fuzzy matching only on those, which cuts the runtime dramatically.

import re 

import numpy as np 
import pandas as pd 
from fuzzywuzzy import process 

def TopFuzzMatch(df_cam, df_dict): 
    """ 
    Calculates similarity between two tokens and returns the TOP match 
    The idea is to do it only over distinct values in the given DF (takes ages otherwise) 
    ----------------------------------------------------------------------- 
    df_cam: DataFrame with client id and origin 
    df_dict: DataFrame with the abbreviation (shortcut) matched to the description I need 
    ----------------------------------------------------------------------- 
    """ 
    #Clean special characters and numbers 
    df_cam['clean_camp'] = df_cam.apply(lambda x: re.sub('[^A-Za-z]+', '',x['origin']), axis=1) 

    #Get unique values and calculate similarity 
    uq_origin = np.unique(df_cam['clean_camp'].values.ravel()) 
    top_match = [process.extractOne(x, df_dict['Shortcut'])[0] for x in uq_origin] 

    #To DataFrame 
    df_match = pd.DataFrame({'unique': uq_origin}) 
    df_match['top_match'] = top_match 

    #Merge 
    df_cam = pd.merge(df_cam, df_match, how = 'left', left_on = 'clean_camp', right_on = 'unique') 
    df_cam = pd.merge(df_cam, df_dict, how = 'left', left_on = 'top_match', right_on = 'Shortcut') 

    return df_cam 

df_out = TopFuzzMatch(df_cam, df_dict)
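The same "score each distinct value once" idea also works without the unique/merge round-trip: memoize the matcher on the cleaned origin, so repeated values hit a cache. A stdlib sketch (the `difflib`-based scorer and the `SHORTCUTS` tuple are illustrative stand-ins for `process.extractOne` and df_dict):

```python
import re
from functools import lru_cache
from difflib import SequenceMatcher

SHORTCUTS = ("M-MKT", "E-MAIL")  # illustrative dictionary

@lru_cache(maxsize=None)
def match_cleaned(clean_origin):
    """Best shortcut for an already-cleaned origin; cached per distinct value."""
    return max(SHORTCUTS,
               key=lambda s: SequenceMatcher(None, s, clean_origin).ratio())

def match_origin(origin):
    # Same cleaning as the answer: keep letters only.
    return match_cleaned(re.sub('[^A-Za-z]+', '', origin))

# With ~400k rows but only a few hundred distinct cleaned values,
# almost every call is a cache hit.
```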