2017-02-28 65 views
1

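Case-insensitive replace (mapping)

I posted a "part 1" question that got me to the function I need (see the answer here), but I think this deserves its own question. If it doesn't, I'll delete it.

I want to apply a function to a DataFrame that replaces full state names with their abbreviations (New York -> NY). However, I noticed that in my dataset, if a state name is written in all caps, it does not match the mapping. I have tried to work around this but can't seem to crack it: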

import pandas as pd 
import numpy as np 
dfp = pd.DataFrame({'A' : [np.NaN,np.NaN,3,4,5,5,3,1,5,np.NaN], 
        'B' : [1,0,3,5,0,0,np.NaN,9,0,0], 
        'C' : ['Pharmacy of IDAHO','NY Pharma','NJ Pharmacy','Idaho Rx','CA Herbals','Florida Pharma','AK RX','Ohio Drugs','PA Rx','USA Pharma'], 
        'D' : [123456,123456,1234567,12345678,12345,12345,12345678,123456789,1234567,np.NaN], 
        'E' : ['Assign','Unassign','Assign','Ugly','Appreciate','Undo','Assign','Unicycle','Assign','Unicorn',]}) 

import us 
statez = us.states.mapping('abbr', 'name') 
inv_map = {v: k for k, v in statez.items()} 

def replace_states(company): 
    # find all states that exist in the string 
    state_found = filter(lambda state: state.lower() in company.lower(), statez.values()) 

    # replace each state with its abbreviation 
    for state in state_found: 
        print(state, inv_map[state])
        company = company.replace(state, inv_map[state])
        print("---", company)

    # return the modified string (or original if no states were found) 
    return company 

dfp['C'] = dfp['C'].map(replace_states) 

Output (note that "Pharmacy of IDAHO" is left unchanged):

Idaho ID 
--- Pharmacy of IDAHO 
Idaho ID 
--- ID Rx 
Florida FL 
--- FL Pharma 
Ohio OH 
--- OH Drugs

Is there a way to make this function case-insensitive?

Answers

0

Replacing state names with their abbreviations (case-insensitive, vectorized solution):

t1 = dfp.C.str.split(expand=True) 
t2 = t1.stack().str.title().map(inv_map).unstack() 
t1[t2.notnull()] = t2 
dfp['new'] = t1.stack().groupby(level=0).agg(' '.join) 

Result:

In [152]: x 
Out[152]: 
     A    B                  C            D           E             new
0  NaN  1.0  Pharmacy of IDAHO     123456.0      Assign  Pharmacy of ID
1  NaN  0.0          NY Pharma     123456.0    Unassign       NY Pharma
2  3.0  3.0        NJ Pharmacy    1234567.0      Assign     NJ Pharmacy
3  4.0  5.0           Idaho Rx   12345678.0        Ugly           ID Rx
4  5.0  0.0         CA Herbals      12345.0  Appreciate      CA Herbals
5  5.0  0.0     Florida Pharma      12345.0        Undo       FL Pharma
6  3.0  NaN              AK RX   12345678.0      Assign           AK RX
7  1.0  9.0         Ohio Drugs  123456789.0    Unicycle        OH Drugs
8  5.0  0.0              PA Rx    1234567.0      Assign           PA Rx
9  NaN  0.0         USA Pharma          NaN     Unicorn      USA Pharma

Explanation:

In [155]: t1 = dfp.C.str.split(expand=True) 

In [156]: t1 
Out[156]: 
          0         1      2
0  Pharmacy        of  IDAHO
1        NY    Pharma   None
2        NJ  Pharmacy   None
3     Idaho        Rx   None
4        CA   Herbals   None
5   Florida    Pharma   None
6        AK        RX   None
7      Ohio     Drugs   None
8        PA        Rx   None
9       USA    Pharma   None

In [157]: t2 = t1.stack().str.title().map(inv_map).unstack() 

In [158]: t2 
Out[158]: 
     0    1     2
0  NaN  NaN    ID
1  NaN  NaN  None
2  NaN  NaN  None
3   ID  NaN  None
4  NaN  NaN  None
5   FL  NaN  None
6  NaN  NaN  None
7   OH  NaN  None
8  NaN  NaN  None
9  NaN  NaN  None

In [159]: t1[t2.notnull()] = t2 

In [160]: t1 
Out[160]: 
          0         1     2
0  Pharmacy        of    ID
1        NY    Pharma  None
2        NJ  Pharmacy  None
3        ID        Rx  None
4        CA   Herbals  None
5        FL    Pharma  None
6        AK        RX  None
7        OH     Drugs  None
8        PA        Rx  None
9       USA    Pharma  None
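
The same token-by-token idea can be traced in plain Python, which may make the pandas steps easier to follow. This is only a sketch: `inv_map_sample` is a hypothetical three-entry stand-in for the `inv_map` built in the question, and `abbreviate_tokens` is an illustrative name, not part of pandas or `us`.

```python
# Hypothetical stand-in for inv_map (full state name -> abbreviation).
inv_map_sample = {"Idaho": "ID", "Florida": "FL", "Ohio": "OH"}


def abbreviate_tokens(company, mapping=inv_map_sample):
    """Mirror the split -> title -> map -> join steps for one string."""
    tokens = company.split()  # plays the role of str.split(expand=True)
    # Title-case each token only for the lookup; keep the original token
    # when it is not a known state name.
    replaced = [mapping.get(tok.title(), tok) for tok in tokens]
    return " ".join(replaced)  # plays the role of agg(' '.join)


names = ["Pharmacy of IDAHO", "NY Pharma", "Idaho Rx", "Ohio Drugs"]
result = [abbreviate_tokens(n) for n in names]
print(result)
```

Because the text is split on whitespace, a multi-word name such as "New York" becomes two separate tokens and will not match a single dictionary key in this sketch.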

Replacing state abbreviations with their full names (case-insensitive, vectorized solution):

In [88]: dfp['state'] = dfp.C.str.extract(r'\b([A-Z]{2})\b', expand=False) 

In [89]: dfp 
Out[89]: 
     A    B                  C            D           E  state
0  NaN  1.0  Pharmacy of IDAHO     123456.0      Assign    NaN
1  NaN  0.0          NY Pharma     123456.0    Unassign     NY
2  3.0  3.0        NJ Pharmacy    1234567.0      Assign     NJ
3  4.0  5.0           Idaho Rx   12345678.0        Ugly    NaN
4  5.0  0.0         CA Herbals      12345.0  Appreciate     CA
5  5.0  0.0     Florida Pharma      12345.0        Undo    NaN
6  3.0  NaN              AK RX   12345678.0      Assign     AK
7  1.0  9.0         Ohio Drugs  123456789.0    Unicycle    NaN
8  5.0  0.0              PA Rx    1234567.0      Assign     PA
9  NaN  0.0         USA Pharma          NaN     Unicorn    NaN

In [90]: dfp.C = dfp.C.replace(dfp.state.tolist(), 
           dfp.state.map(statez).tolist(), 
           regex=True) 

In [91]: dfp 
Out[91]: 
     A    B                    C            D           E  state
0  NaN  1.0    Pharmacy of IDAHO     123456.0      Assign    NaN
1  NaN  0.0      New York Pharma     123456.0    Unassign     NY
2  3.0  3.0  New Jersey Pharmacy    1234567.0      Assign     NJ
3  4.0  5.0             Idaho Rx   12345678.0        Ugly    NaN
4  5.0  0.0   California Herbals      12345.0  Appreciate     CA
5  5.0  0.0       Florida Pharma      12345.0        Undo    NaN
6  3.0  NaN            Alaska RX   12345678.0      Assign     AK
7  1.0  9.0           Ohio Drugs  123456789.0    Unicycle    NaN
8  5.0  0.0      Pennsylvania Rx    1234567.0      Assign     PA
9  NaN  0.0           USA Pharma          NaN     Unicorn    NaN
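
For this abbreviation-to-name direction, a callable replacement is one way to leave unknown two-letter tokens such as `RX` untouched. A minimal sketch using only the standard library, where `statez_sample` is a hypothetical subset of the `statez` mapping and `expand_abbreviations` is an illustrative name:

```python
import re

# Hypothetical subset of statez (abbreviation -> full state name).
statez_sample = {"NY": "New York", "NJ": "New Jersey", "AK": "Alaska"}


def expand_abbreviations(company, mapping=statez_sample):
    """Expand standalone two-letter uppercase tokens found in the mapping."""
    return re.sub(
        r"\b[A-Z]{2}\b",
        # Fall back to the matched text when it is not a known abbreviation.
        lambda m: mapping.get(m.group(0), m.group(0)),
        company,
    )


expanded = [expand_abbreviations(s) for s in ["NY Pharma", "AK RX", "USA Pharma"]]
print(expanded)
```

Assuming column `C` holds only strings, the function could be applied with `dfp['C'].map(expand_abbreviations)`, the same way `replace_states` is applied in the question.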
+0

I know it's a bit counterintuitive, but I actually want to go from the full state name to the abbreviated version. For example: 'Ohio -> OH' – MattR

+0

@MattR, hmm..., that makes it more challenging. Let me try another solution... – MaxU

+0

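There have been a few edits, so I'm not sure what I commented on earlier, but the first part does exactly what I need. I have no idea how you did it, though, and it's brilliant. An explanation would be great but isn't required. I appreciate your time and help! – MattR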

0

I would find its index and then use that to replace it case-insensitively:

    # replace each state with its abbreviation
    for state in state_found:
        print(state, inv_map[state])
        index = company.lower().find(state.lower())
        company = company.replace(company[index:index + len(state)], inv_map[state])
        print("---", company)

This preserves the case of all other parts of the string.
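
An alternative that avoids index arithmetic is a single regular expression compiled with `re.IGNORECASE`. The sketch below is not the code from this answer: `inv_map_sample` and `replace_states_ci` are hypothetical names, and the three-entry dict stands in for the `inv_map` from the question.

```python
import re

# Hypothetical stand-in for inv_map (full state name -> abbreviation).
inv_map_sample = {"Idaho": "ID", "Florida": "FL", "New York": "NY"}


def replace_states_ci(company, mapping=inv_map_sample):
    """Replace full state names with abbreviations, ignoring case."""
    if not mapping:
        return company
    # Try longer names first so a longer name wins over a shorter one.
    names = sorted(mapping, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(n) for n in names) + r")\b",
        re.IGNORECASE,
    )
    # Lower-cased keys let the matched text be looked up regardless of case.
    lookup = {name.lower(): abbr for name, abbr in mapping.items()}
    return pattern.sub(lambda m: lookup[m.group(0).lower()], company)


print(replace_states_ci("Pharmacy of IDAHO"))
print(replace_states_ci("new york Pharma"))
```

Only the matched name is rewritten, so the rest of the string keeps its original case. Assuming column `C` contains only strings, it could be applied as in the question with `dfp['C'].map(replace_states_ci)`.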

+0

I apologize for my confusion, but could you explain where to put this code, and perhaps the reasoning behind it? When I put it in my loop, I get crazy output. – MattR

+0

@MattR I've added the surrounding code to help you place it. If it's still not right, please let me know what output you get. – TemporalWolf

+0

I added some sample code so that posters can use my test DataFrame. But here is my current output: 'Idaho ID --- IDPIDhIDaIDrIDmIDaIDcIDyID IDoIDfID IDIIDDIDAIDHIDOID Idaho ID --- IDIIDdIDaIDhIDoID IDRIDxID' – MattR
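
The `ID` inserted between every character in that output is the pattern `str.replace` produces when its first argument is an empty string. That would happen here if the slice `company[index:index + len(state)]` came out empty, for example when `str.find` returns -1; why that occurred in this run is not clear from the thread. A small sketch that reproduces the effect and shows a slice-based variant with a guard; `replace_at` is a hypothetical helper, not part of the answer:

```python
company = "Idaho Rx"

# Replacing the empty string inserts the new text around every character.
garbled = company.replace("", "ID")
print(garbled)  # IDIIDdIDaIDhIDoID IDRIDxID


def replace_at(text, old, new):
    """Replace the first case-insensitive occurrence of old, if there is one."""
    index = text.lower().find(old.lower())
    if index == -1:  # not found: return the text unchanged
        return text
    # Splice by position instead of calling str.replace on a slice.
    return text[:index] + new + text[index + len(old):]


fixed = replace_at("Pharmacy of IDAHO", "Idaho", "ID")
print(fixed)  # Pharmacy of ID
```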