2017-06-13 22 views
3

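Replace one word of a string if it is within a specific number of words of another word

I have a text column named 'DESCRIPTION' in a DataFrame. I need to find every instance where the word "tile" or "tiles" occurs within 6 words of the word "roof", and then change "tile/s" to "rooftiles". I need to do the same for "floor" and "tiles" (changing "tiles" to "floortiles"). This will help distinguish which building trade we are looking at when certain words are used in combination with others.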

To illustrate what I mean, here is an example of the data and my latest, incorrect attempt:

import re
import pandas as pd

s1=pd.Series(["After the storm the roof was damaged and some of the tiles are missing"]) 
s2=pd.Series(["I dropped the saw and it fell on the floor and damaged some of the tiles"]) 
s3=pd.Series(["the roof was leaking and when I checked I saw that some of the tiles were cracked"]) 
df=pd.DataFrame([list(s1), list(s2), list(s3)], columns = ["DESCRIPTION"]) 
df 

The result I am after should look like this (in DataFrame format):

1.After the storm the roof was damaged and some of the rooftiles are missing  
2.I dropped the saw and it fell on the floor and damaged some of the floortiles 
3.the roof was leaking and when I checked I saw that some of the tiles were cracked 

Here I tried to use a regex pattern to replace the word "tiles", but it is completely wrong... Is there a way to do what I want? I am new to Python...

regex=r"(roof)\b\s+([^\s]+\s+){0,6}\b(.*tiles)" 
replacedString=re.sub(regex, r"(roof)\b\s+([^\s]+\s+){0,6}\b(.*rooftiles)", df['DESCRIPTION']) 

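UPDATE: Solution

Thanks for all the help! I managed to get this working using Jan's code with a few additions and tweaks. The final working code is below (using the real file and data, not the example):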

claims_file = pd.read_csv(project_path + claims_filename) # Read input file 
claims_file["LOSS_DESCRIPTION"] = claims_file["LOSS_DESCRIPTION"].fillna('NA') #get rid of encoding errors generated because some text was just 'NA' and it was read in as NaN 
#create the REGEX  
rx = re.compile(r''' 
     (      # outer group 
      \b(floor|roof)  # floor or roof 
      (?:\W+\w+){0,6}\s* # zero to six "words" 
     ) 
     \b(tiles?)\b   # tile or tiles 
     ''', re.VERBOSE) 

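#create the reverse REGEX (tile/tiles first, then floor/roof) 
rx2 = re.compile(r''' 
     \b(tiles?)\b        # group 1: tile or tiles 
     (                   # group 2: the words in between plus floor/roof 
      (?:\W+\w+){0,6}?   # zero to six "words" in between 
      \W+(floor|roof)\b  # group 3: floor or roof, kept in place by group 2 
     ) 
     ''', re.VERBOSE) 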
#apply it to every row of Loss Description: 
claims_file["LOSS_DESCRIPTION"] = claims_file["LOSS_DESCRIPTION"].apply(lambda x: rx.sub(r'\1\2\3', x)) 

#apply the reverse regex: 
claims_file["LOSS_DESCRIPTION"] = claims_file["LOSS_DESCRIPTION"].apply(lambda x: rx2.sub(r'\3\1\2', x)) 

# Write results into CSV file and check results 
claims_file.to_csv(project_path + output_filename, index = False 
         , encoding = 'utf-8') 
+1

Can you post your desired output? – void

Answers

2

You could use a regular-expression solution here:

(      # outer group 
    \b(floor|roof)  # floor or roof 
    (?:\W+\w+){1,6}\s* # one to six "words" 
) 
\b(tiles?)\b   # tile or tiles 

See a demo for the regex on regex101.com.


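Afterwards, just combine the captured parts and put them back together with rx.sub(), then apply this to every item of the DESCRIPTION column, so that you end up with the following code: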

import pandas as pd, re 

s1 = pd.Series(["After the storm the roof was damaged and some of the tiles are missing"]) 
s2 = pd.Series(["I dropped the saw and it fell on the floor and damaged some of the tiles"]) 
s3 = pd.Series(["the roof was leaking and when I checked I saw that some of the tiles were cracked"]) 

df = pd.DataFrame([list(s1), list(s2), list(s3)], columns = ["DESCRIPTION"]) 

rx = re.compile(r''' 
      (      # outer group 
       \b(floor|roof)  # floor or roof 
       (?:\W+\w+){1,6}\s* # one to six "words" 
      ) 
      \b(tiles?)\b   # tile or tiles 
      ''', re.VERBOSE) 

# apply it to every row of "DESCRIPTION" 
df["DESCRIPTION"] = df["DESCRIPTION"].apply(lambda x: rx.sub(r'\1\2\3', x)) 
print(df["DESCRIPTION"]) 


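Note that your original question is not entirely clear on this point, though: this solution only finds tile/tiles after roof, which means a sentence like "Can you give me the tile for the roof, please?" will not be matched (even though the word tile is within the range of six words from roof).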
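To make the one-directional behaviour concrete, here is a minimal check that reuses the pattern from this answer. The variable names `forward` and `backward` are only illustrative; the two sentences are the first example row and the counter-example mentioned above.

```python
import re

# Same pattern as in the answer: floor/roof must come first, then tile(s).
rx = re.compile(r'''
    (                       # group 1: floor/roof plus the words in between
        \b(floor|roof)      # group 2: floor or roof
        (?:\W+\w+){1,6}\s*  # one to six "words"
    )
    \b(tiles?)\b            # group 3: tile or tiles
    ''', re.VERBOSE)

forward = "After the storm the roof was damaged and some of the tiles are missing"
backward = "Can you give me the tile for the roof, please?"

forward_result = rx.sub(r'\1\2\3', forward)    # roof ... tiles -> rewritten
backward_result = rx.sub(r'\1\2\3', backward)  # tile ... roof -> left unchanged

print(forward_result)
print(backward_result)
```

Only the first sentence is rewritten; the second comes back unchanged because nothing in the pattern looks for tile before roof.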

+0

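Thanks Jan! This works perfectly! I see what you mean about the regex not working in both directions... I found a workaround by simply running the code twice... Not sure it is the best way to do it, but it seems to work! I have posted the final code I used as an update – KMM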

2

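I will show you a quick and dirty, incomplete implementation. You can certainly make it more robust and useful. Let's say s is one of your descriptions: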

s = "I dropped the saw and it fell on the roof and damaged roof " +\ 
    "and some of the tiles" 

Let's first break it into words (tokenize; you can remove punctuation at this point if you want):
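The tokenization step can be sketched as follows, assuming a plain whitespace split is good enough; which tokenizer the answer actually used is not shown here, so `s.split()` is an assumption. The only requirement for the later steps is a list named `tokens`.

```python
# Assumption: split on whitespace; punctuation stays attached to the words.
s = ("I dropped the saw and it fell on the roof and damaged roof "
     "and some of the tiles")

tokens = s.split()
print(tokens)
```

With this split, the two roof tokens land at positions 9 and 12 and tiles at position 17.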

​​

Now select the tokens of interest and sort them alphabetically, but remember their original positions in s:

my_tokens = sorted((w.lower(), i) for i,w in enumerate(tokens) 
        if w.lower() in ("roof", "tiles")) 
#[('roof', 9), ('roof', 12), ('tiles', 17)] 

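Combine identical tokens and create a dictionary in which the tokens are the keys and the lists of their positions are the values. Use a dictionary comprehension: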

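import itertools

# groupby needs its input sorted by the grouping key, which my_tokens already is
token_dict = {name: [p0 for _, p0 in pos]
              for name, pos
              in itertools.groupby(my_tokens, key=lambda a: a[0])}
#{'roof': [9, 12], 'tiles': [17]}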

Go through the list of tiles positions, if there are any, and check whether there is a roof nearby; if so, change the word:

for i in token_dict.get('tiles', []):      # empty list if there are no tiles
    for j in token_dict.get('roof', []):   # empty list if there is no roof
        if abs(i-j) <= 6:
            tokens[i] = 'rooftiles'

Finally, join the words back together:

' '.join(tokens) 
#'I dropped the saw and it fell on the roof and damaged roof and some of the rooftiles' 
+0

Thanks DYZ! I got this working on the test set, but I ran into some trouble when I tried to run it on my csv file... I found Jan's solution easier to implement – KMM

0

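The main problem is the .* in front of tiles in your regex. It lets any number of any characters in while still matching. The \b anchors are unnecessary, since they already sit on boundaries between whitespace and non-whitespace. The capturing groups () were not being used either, so I removed them. "roof\s+([^\s]+\s+){0,6}tiles" will match only a roof that is within 6 "words" (groups of non-whitespace characters separated by whitespace) of tiles. To do the replacement, take the matched string minus its last 5 characters, append "rooftiles", and then replace the matched string with the updated one. Alternatively, you could group everything in the regex except tiles with () and then replace the match with that group followed by "rooftiles". You can't use re.sub with a fixed replacement string for this, because it would replace the whole match from roof through tiles rather than just tiles.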
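The grouping alternative mentioned above can be sketched with a backreference in the replacement string, which lets re.sub keep the captured prefix and rewrite only the final word. The pattern below is an illustration along the lines of this answer, not its exact code, and the names `pattern`, `near` and `far` are hypothetical.

```python
import re

# Group 1 captures "roof" plus up to six whitespace-separated words;
# "tiles" stays outside the group, so it is the only part that changes.
pattern = re.compile(r"(roof\s+(?:\S+\s+){0,6})tiles")

near = "After the storm the roof was damaged and some of the tiles are missing"
far = "the roof was leaking and when I checked I saw that some of the tiles were cracked"

# \1 re-inserts the captured prefix unchanged, followed by the new word.
near_result = pattern.sub(r"\1rooftiles", near)
far_result = pattern.sub(r"\1rooftiles", far)

print(near_result)
print(far_result)
```

The second sentence is left alone because more than six words separate roof from tiles.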

1

I could generalize this to more substrings than "roof" and "floor", but this seems like simple code:

for idx,r in enumerate(df.loc[:,'DESCRIPTION']): 
    if "roof" in r and "tile" in r: 
        fill=r[r.find("roof")+4:] 
        fill = fill[0:fill.replace(' ','_',7).find(' ')] 
        sixWords = fill if fill.find('.') == -1 else '' 
        df.loc[idx,'DESCRIPTION'] = r.replace(sixWords,sixWords.replace("tile", "rooftile")) 
    elif "floor" in r and "tile" in r: 
        fill=r[r.find("floor")+5:] 
        fill = fill[0:fill.replace(' ','_',7).find(' ')] 
        sixWords = fill if fill.find('.') == -1 else '' 
        df.loc[idx,'DESCRIPTION'] = r.replace(sixWords,sixWords.replace("tile", "floortile")) 

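Note that this also includes a check for a full stop. You can drop that check by removing the sixWords variable and using fill in its place.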

+0

Thank you for the help! But I get an error with this code: TypeError: argument of type 'float' is not iterable – KMM
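An error of this kind is what a missing value in the column would produce: pandas stores an empty cell as a float NaN, and the `in` operator cannot search inside a float. The sketch below reproduces that situation on a made-up two-row frame (the data is hypothetical) and shows one possible guard, the same `fillna` step used in the update to the question.

```python
import pandas as pd

# Hypothetical data: one real description and one missing cell.
df = pd.DataFrame({"DESCRIPTION": ["roof tile cracked", float("nan")]})

# The missing cell is a float NaN, so `"roof" in r` raises TypeError for it.
try:
    for r in df["DESCRIPTION"]:
        _ = "roof" in r
    error_message = None
except TypeError as exc:
    error_message = str(exc)
print(error_message)

# One possible guard: replace missing values with a placeholder string first.
cleaned = df["DESCRIPTION"].fillna("NA")
flags = [("roof" in r and "tile" in r) for r in cleaned]
print(flags)
```

Skipping non-string cells with an `isinstance(r, str)` check inside the loop would be another option that leaves the missing values as they are.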
