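Extracting strings from a pandas DataFrame

I am here once again, hoping to find a solution to my coding nightmare. I have a dictionary, term_dict, whose keys are terms and whose values are term categories. I also have a DataFrame, data, with ID and Notes columns. The task is to use term_dict to find matches in data.Notes for each data.ID record.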

term_dict{     
    Ibuprofen 800mg  :  Drug 
    Hip Replacement Surgery : Treatment 
    Tylenol AM   : Drug 
    Mild Dislocation  : Treatment 
    Advil     : Drug 
    Fractured Tibia  : Treatment 
    Quinone    : Drug 
    Fever     : Treatment 
    Penicillin 250mg  : Drug 
    Histerectomy   : Treatment 
    Surgical removal of bunion : Treatment 
    Therapy    : Treatment 
    Bunion    : Treatment 
    Hospita X    : Location 
    mg     : Dosage 
    stop     : Exclusion 
} 

data: 
ID  Notes       
604  Take 2 tablets of advil & 3 caps of pen 
     250mg twice daily       
602  Stop pen but cont. with advil 
     as needed for the fracture 
210  2 tabs of Tyl 3x daily for 5 days   
607  nan 
700  surgery scheduled for 01/01/2017 
515  nan          
019  Call my office if bunion pain persist  
     after 3 days 
604  f/up appt. @Hospital X

Here is my code so far:

lists = []
for s in data['Notes']:
    cleanNotes = " " + " ".join(re.split(r'[^a-z 0-9]|[w/]',s.lower())) + " "
    for k, v in term_dict.items():
        k = " %s "%k
        if k in cleanNotes and v != exclusion:
            if k in cleanNotes and v == 'drug':
                lists.append(k)
                data['Drug'] = ':'.join(str(lists))
            elif k in cleanNotes and v == 'location':
                lists.append(k)
                data['Location'] = ' '.join(str(lists))
            elif k in cleanNotes and v == 'treatment':
                lists.append(k)
                data['Treatment'] = ':'.join(str(lists))
            elif k in cleanNotes and v == 'dosage':
                lists.append(k)
                data['Dosage'] = ':'.join(str(lists))
        else:
            for s in data.Notes:
                matches = list(datefinder.find_dates(s.lower()))
                data['Date'] = ', '.join([str(dates) for dates in matches])

...but my output is nothing like what I expect, because the code just fills the new DataFrame columns with matches from earlier records:

data: 
ID  Notes          Drug    Dosage  Location  Treatment Date     
604  Take 2 tablets of advil & 3 caps of pen  advil       Hospital X 
     250mg twice daily       
602  Stop pen but cont. with advil    advil       Hospital X 
     as needed for the fracture 
210  2 tabs of Tyl 3x daily for 5 days   advil  
607  nan           advil 
700  surgery scheduled for 01/01/2017   advil            
515  nan           advil 
019  Call my office if bunion pain persist  advil            
     after 3 days 
604  f/up appt. @Hospital X. cont w/advil  advil       Hospital X

Expected output:

data: 
ID  Notes          Drug    Dosage  Location  Treatment Date     
604  Take 2 tablets of advil & 3 caps of pen  advil:penicilin  0:250mg 
     250mg twice daily       
602  Stop pen but cont. with advil    advil           fracture 
     as needed for the fracture 
210  2 tabs of Tyl 3x daily for 5 days   Tylenol 
607  nan 
700  surgery scheduled for 01/01/2017               surgery  01/01/2017 
515  nan          
019  Call my office if bunion pain persist              bunion 
     after 3 days 
604  f/up appt. @Hospital X. cont w/advil  advil       Hospital X

I would appreciate any help fixing this repetition. Thanks!

Source

2017-07-07 CodeLearner

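What is "words"? Why are you using it? –

@COLDSPEED - It is a clean version of each note in the Notes column. Clean meaning free of any and all special characters – CodeLearner

This is the essence of your error. You are assigning the same value to every element of the column: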

In [114]: import numpy as np
     ...: import pandas as pd

In [115]: df = pd.DataFrame(np.random.randn(50, 4), columns=list('ABCD')) 

In [116]: df.head() 
Out[116]: 
      A   B   C   D 
0 -0.896291 -0.277551 0.926559 0.522212 
1 -0.265559 -1.300435 -0.079514 -1.083569 
2 -0.534509 0.298264 -1.361829 0.750666 
3 0.318937 -0.407164 0.080020 0.499435 
4 -0.161574 -1.012471 0.631092 1.368540 

In [117]: df['NewCol'] = 'something here' 

In [119]: df.head() 
Out[119]: 
      A   B   C   D   NewCol 
0 -0.896291 -0.277551 0.926559 0.522212 something here 
1 -0.265559 -1.300435 -0.079514 -1.083569 something here 
2 -0.534509 0.298264 -1.361829 0.750666 something here 
3 0.318937 -0.407164 0.080020 0.499435 something here 
4 -0.161574 -1.012471 0.631092 1.368540 something here
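
The same effect shows up when the assignment sits inside a loop, as it does in the question. The following is a minimal sketch on a made-up three-row frame (the `Match` column and the sample notes are illustrative, not from the question): each pass replaces the whole column, so only the value from the last iteration survives.

```python
import pandas as pd

# Made-up stand-in for the question's data (illustrative only).
data = pd.DataFrame({"Notes": ["advil", "tylenol", "bunion"]})

# Assigning a scalar to a column replaces every row of that column,
# so each iteration overwrites what the previous one wrote.
for s in data["Notes"]:
    data["Match"] = s

# Every row now holds the value from the final iteration.
last_values = data["Match"].tolist()
print(last_values)
```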

To fix this, you can create the empty columns up front, like this:

In [120]: df = pd.DataFrame(np.random.randn(50, 1), columns=['Notes']) 

In [121]: df['Drug'] = "" 
    ...: df['Location'] = "" 
    ...: df['Treatment'] = "" 
    ...: df['Dosage'] = "" 
    ...: 

In [122]: df.head() 
Out[122]: 
     Notes Drug Location Treatment Dosage 
0 0.325993        
1 -0.561066        
2 0.555040        
3 0.001332        
4 0.400009

When looping over Notes, use enumerate:

for i, s in enumerate(data['Notes']):

Then, when needed, just set the appropriate cell:

df.set_value(i, 'Drug', 'advil')
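
Putting these pieces together, here is a minimal sketch of the per-row approach. Several things in it are assumptions rather than part of the answer: the small `data` frame and `term_dict` are cut-down stand-ins for the question's data, the substring matching is deliberately naive, and `DataFrame.at` is used in place of `set_value`, which was deprecated in pandas 0.21 and removed in pandas 1.0.

```python
import pandas as pd

# Cut-down stand-ins for the question's data (illustrative only).
term_dict = {"advil": "Drug", "bunion": "Treatment", "hospital x": "Location"}
data = pd.DataFrame({
    "ID": ["604", "607", "019", "604"],
    "Notes": [
        "Take 2 tablets of advil",
        float("nan"),
        "Call my office if bunion pain persist",
        "f/up appt. @Hospital X. cont w/advil",
    ],
})

# Create the empty columns up front.
for col in ("Drug", "Treatment", "Location"):
    data[col] = ""

# enumerate() yields positions 0..n-1, which match the labels of the
# default RangeIndex, so they can be used with the label-based .at.
for i, s in enumerate(data["Notes"]):
    if not isinstance(s, str):  # skip missing (NaN) notes
        continue
    note = s.lower()
    found = {}  # rebuilt for every row, so earlier matches do not leak
    for term, category in term_dict.items():
        if term in note:
            found.setdefault(category, []).append(term)
    for category, terms in found.items():
        data.at[i, category] = ":".join(terms)  # writes a single cell

print(data)
```

Collecting matches in a dictionary that is rebuilt for each row is what keeps one record's terms from appearing in the next; the single shared `lists` in the question is never reset between rows.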

Source

2017-07-07 20:00:13

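@COLDSPEED I tried your example solution and got this error: ValueError: cannot reindex from a duplicate axis – CodeLearner

@CodeLearner Could you update your code? I'll take a look... –

@CodeLearner Hmm, it seems your DataFrame has a duplicate index, which for whatever reason prevents you from inserting those values (that is not a by-product of my code, unless it has to do with our pandas versions being different). You can add this line before the loop: 'data = data.reset_index(drop=True)' and I think it should work. –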
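
The thread does not show the asker's actual index, so the following is only a sketch of the suggested fix under the assumption that the index labels repeat. The frame and its `[0, 1, 0]` index are hypothetical, and `DataFrame.at` stands in for `set_value`, which no longer exists in pandas 1.0 and later.

```python
import pandas as pd

# Hypothetical frame whose index labels repeat, for example after
# concatenating or filtering other frames.
data = pd.DataFrame(
    {"Notes": ["advil", "tylenol", "advil again"]},
    index=[0, 1, 0],
)
data["Drug"] = ""

print(data.index.is_unique)  # False: label 0 appears twice

# Renumber the rows 0..n-1 and discard the old labels.
data = data.reset_index(drop=True)

# With unique labels, position i from enumerate() identifies one row.
for i, s in enumerate(data["Notes"]):
    if "advil" in s:
        data.at[i, "Drug"] = "advil"

print(data)
```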
