str.extract starting from the end of the string in a pandas DataFrame

I have a DataFrame with thousands of rows and two columns that starts like this:

                                         string state
0     the best new york cheesecake rochester ny    ny
1     the best dallas bbq houston tx random str    tx
2  la jolla fish shop of san diego san diego ca    ca
3                                  nothing here    dc

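For each state, I have a regex of all its city names (in lower case), structured like (city1|city2|city3|...), where the order of the cities is arbitrary (but can be changed if needed). For example, the regex for New York state contains both 'new york' and 'rochester' (likewise 'dallas' and 'houston' for Texas, and 'san diego' and 'la jolla' for California).

I want to find out which city appears last in the string (for observations 1, 2, 3 and 4, I would hope to get 'rochester', 'houston', 'san diego' and NaN (or something else), respectively).

I started with str.extract and tried to think of things like reversing the string, but I have reached an impasse.

Thanks so much for any help!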

Source

2017-09-04 user49007

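You can use str.findall, but when there is no match you get an empty list, so an apply is needed to substitute a placeholder. Finally, select the last item of each list with .str[-1]: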

cities = r"new york|dallas|rochester|houston|san diego"

print(df['string'].str.findall(cities)
        .apply(lambda x: x if len(x) >= 1 else ['no match val'])
        .str[-1])
0       rochester
1         houston
2       san diego
3    no match val
Name: string, dtype: object

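(Edit: the condition was corrected to >= 1. With > 1, a row with exactly one match would be reported as no match.)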
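
If NaN is acceptable for rows without a match, as the question suggests, the apply step may not be needed at all: as far as I know, .str[-1] on an empty list yields NaN instead of raising. A minimal sketch, with the sample frame rebuilt from the question (the names last_city and last_city_filled are my own):

```python
import pandas as pd

# Sample data rebuilt from the question.
df = pd.DataFrame({
    "string": [
        "the best new york cheesecake rochester ny",
        "the best dallas bbq houston tx random str",
        "la jolla fish shop of san diego san diego ca",
        "nothing here",
    ],
    "state": ["ny", "tx", "ca", "dc"],
})

cities = r"new york|dallas|rochester|houston|san diego"

# findall returns one list per row; .str[-1] picks the last item,
# and an empty list becomes NaN.
last_city = df["string"].str.findall(cities).str[-1]

# Optional: swap NaN for a placeholder value afterwards.
last_city_filled = last_city.fillna("no match val")
```

Keeping NaN has the advantage that the usual isna/dropna tools work on the result without comparing against a magic string.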

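Another solution is a bit of a hack: use radd to prepend the no-match string to the start of every string, and add that same string to cities as well: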

a = 'no match val'
cities = r"new york|dallas|rochester|houston|san diego" + '|' + a

print(df['string'].radd(a).str.findall(cities).str[-1])
0       rochester
1         houston
2       san diego
3    no match val
Name: string, dtype: object
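
Since the question started from str.extract, one more option (not part of the answer above, only a sketch) is to put a greedy .* in front of the capture group. The greedy prefix swallows as much of the string as it can, so the group ends up on the rightmost city. The sample frame and the name last_city are assumptions for illustration:

```python
import pandas as pd

# Sample data rebuilt from the question.
df = pd.DataFrame({
    "string": [
        "the best new york cheesecake rochester ny",
        "the best dallas bbq houston tx random str",
        "la jolla fish shop of san diego san diego ca",
        "nothing here",
    ],
})

cities = r"new york|dallas|rochester|houston|san diego"

# Greedy ".*" consumes as much as possible before the group,
# so the captured city is the one starting furthest to the right.
# Rows without any city give NaN.
last_city = df["string"].str.extract(r".*(" + cities + r")", expand=False)
```

Note that . does not match newlines by default, and that with overlapping city names this can pick a different match than findall, which scans left to right without overlaps.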

Source

2017-09-04 06:35:42 jezrael

The first solution is good enough; thanks! – user49007

@user49007 - Thank you for the correction. – jezrael

import re

cities = r"new york|dallas|..."

def last_match(s):
    # Collect every city match, left to right, and keep the last one.
    found = re.findall(cities, s)
    return found[-1] if found else ""

df['string'].apply(last_match)
#0 rochester
#1  houston
#2 san diego
#3
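
The question mentions a separate regex per state, which none of the snippets above use. A sketch of how the same findall idea could look up the pattern by the state column; the state_patterns dict and the function name last_match_for_state are hypothetical, filled in from the cities named in the question:

```python
import re
import pandas as pd

# Sample data rebuilt from the question.
df = pd.DataFrame({
    "string": [
        "the best new york cheesecake rochester ny",
        "the best dallas bbq houston tx random str",
        "la jolla fish shop of san diego san diego ca",
        "nothing here",
    ],
    "state": ["ny", "tx", "ca", "dc"],
})

# Hypothetical per-state patterns (assumption: only the cities
# mentioned in the question; real lists would be longer).
state_patterns = {
    "ny": re.compile(r"new york|rochester"),
    "tx": re.compile(r"dallas|houston"),
    "ca": re.compile(r"san diego|la jolla"),
}

def last_match_for_state(row):
    # Pick the regex for this row's state; states without one give None.
    pattern = state_patterns.get(row["state"])
    if pattern is None:
        return None
    found = pattern.findall(row["string"])
    return found[-1] if found else None

last_city = df.apply(last_match_for_state, axis=1)
```

A row-wise apply is slower than the vectorised str methods, so for very large frames it may be worth grouping by state and running str.findall once per group instead.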

Source

2017-09-04 06:08:44 DyZ
