2017-03-16 89 views
0

我想从正则表达式产生的结果中创建一个熊猫数据框中的新列。pandas函数中的正则表达式

我期待的结果是:

In[1]: df 
Out[1]: 

    valueProduct valueService  totValue 
0  $465580.99  $322532.34 $788113.33 

我的数据框dtypes是:

df.dtypes 

Contracting Office Name    object 
Contracting Office Region    object 
PIID         object 
PIID Agency ID      object 
Major Program       object 
Description of Requirement   object 
Referenced IDV PIID     object 
Completion Date    datetime64[ns] 
Prepared By       object 
Funding Office Name     object 
Funding Agency ID      object 
Funding Agency Name     object 
Funding Office ID      object 
Effective Date    datetime64[ns] 
Fiscal Year       int64 
Ultimate Contract Value    float64 
Count         int64 

1行中题为“要求的说明”一栏有如下的长字符串值(在这一列中的相似字符串值通过数据集):

管理员添加额外的体积和道路工作变化银滑道监护项目 - ALLEGHENY国家产品的森林VALUE =服务$ 465580.99 VALUE =合同的$ 322532.34总额= $ 788113.33

我想成功地写一个正则表达式从这个字符串中提取3项,但仅产生新列的美元价值:

VALUE OF PRODUCT = $465580.99 
VALUE OF SERVICE = $322532.34 
TOTAL VALUE OF CONTRACT = $788113.33 

下面的代码做这个假设在数据帧的字符串进行一个简单的字符串值数据框之外:

text = "STEWARDSHIP ADD ADDITIONAL VOLUME AND ROAD WORK CHANGES SILVER SLIDE STEWARDSHIP PROJECT - ALLEGHENY NATIONAL FOREST VALUE OF PRODUCT = $465580.99 VALUE OF SERVICE = $322532.34 TOTAL VALUE OF CONTRACT = $788113.33" 


pattern = re.compile('(VALUE OF PRODUCT).{1,3}\$\d*\.\d*', re.IGNORECASE) 
getPattern = re.search(pattern, text) 
print (getPattern.group()) 

将产生:

VALUE OF PRODUCT = $465580.99 

我可以为其他两个项目重复此操作。

现在,感觉我在一个数据帧的工作我试图做类似如下:

def valProduct(row): 
    pattern = re.compile('(VALUE OF PRODUCT).{1,3}\$\d*\.\d*', re.IGNORECASE) 
    findPattern = re.search(pattern, row['Description of Requirement']) 
    return findPatter 

df['valueProduct'] = df.apply(lambda row: valProduct(row), axis=1) 

In[2]: sf[['valueProduct']][:1] 
Out[2]: None 

这将产生一个新的列,但其空,但应该至少是表明:

VALUE OF PRODUCT = $465580.99 

任何帮助,非常感谢!

回答

1
import re  

text = "STEWARDSHIP ADD ADDITIONAL VOLUME AND ROAD WORK CHANGES SILVER SLIDE STEWARDSHIP PROJECT - ALLEGHENY NATIONAL FOREST VALUE OF PRODUCT = $465580.99 VALUE OF SERVICE = $322532.34 TOTAL VALUE OF CONTRACT = $788113.33" 

re.findall(r'value.+?\d\b',text, re.I) 

输出

['VALUE OF PRODUCT = $465580', 'VALUE OF SERVICE = $322532', 'VALUE OF CONTRACT = $788113']