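Replace a word in a string if it is within a certain number of words of another word

I have a text column named 'DESCRIPTION' in a dataframe. I need to find all instances where the word "tile" or "tiles" is within 6 words of the word "roof", and then change "tile/s" to "rooftiles". I need to do the same for "floor" and "tiles" (changing "tiles" to "floortiles"). This will help distinguish which building trade we are looking at when certain words are used in conjunction with other words.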
To illustrate what I mean, an example of the data and my latest (incorrect) attempt follow:
import pandas as pd

s1 = pd.Series(["After the storm the roof was damaged and some of the tiles are missing"])
s2 = pd.Series(["I dropped the saw and it fell on the floor and damaged some of the tiles"])
s3 = pd.Series(["the roof was leaking and when I checked I saw that some of the tiles were cracked"])
df = pd.DataFrame([list(s1), list(s2), list(s3)], columns=["DESCRIPTION"])
df
Afterwards, the result should look like this (in dataframe format):
1. After the storm the roof was damaged and some of the rooftiles are missing
2. I dropped the saw and it fell on the floor and damaged some of the floortiles
3. the roof was leaking and when I checked I saw that some of the tiles were cracked
Here I tried to use a regex pattern to replace the word "tiles", but it is completely wrong... Is there a way to do what I want? I am new to Python...
regex=r"(roof)\b\s+([^\s]+\s+){0,6}\b(.*tiles)"
replacedString=re.sub(regex, r"(roof)\b\s+([^\s]+\s+){0,6}\b(.*rooftiles)", df['DESCRIPTION'])
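For readers following along: two things stop the attempt above from working, independently of the pattern itself. `re.sub` expects a single string as its third argument rather than a pandas Series, and its second argument is a replacement template (with backreferences such as `\1`), not another regex. Below is a minimal sketch on one plain string; `PATTERN` and `tag_tiles` are hypothetical names chosen for illustration, and "within 6 words" is assumed to mean at most six words between the two terms.

```python
import re

# Group 1: the trade word, group 2: up to six intervening words, group 3: tile/tiles.
PATTERN = re.compile(r"\b(roof|floor)\b((?:\W+\w+){0,6}?\W+)(tiles?)\b")

def tag_tiles(text):
    # Rebuild the match, inserting the trade word directly before "tile"/"tiles".
    return PATTERN.sub(r"\1\2\1\3", text)

print(tag_tiles("After the storm the roof was damaged and some of the tiles are missing"))
# After the storm the roof was damaged and some of the rooftiles are missing
```

To run this over a dataframe column, the function can be passed to `Series.apply`, so that `re` only ever sees one string at a time.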
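UPDATE: Solution
Thanks for all the help! I managed to get it working using Jan's code plus a few additions/tweaks. The final working code is below (using the real file and data, not the example):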
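import re
import pandas as pd

# project_path, claims_filename and output_filename are defined elsewhere (not shown)
claims_file = pd.read_csv(project_path + claims_filename)  # read input file
# Some descriptions were just the text 'NA', which was read in as NaN and caused errors; put the text back
claims_file["LOSS_DESCRIPTION"] = claims_file["LOSS_DESCRIPTION"].fillna('NA')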
# create the regex: floor/roof followed by tile/s
rx = re.compile(r'''
    (                      # outer group
      \b(floor|roof)       # floor or roof
      (?:\W+\w+){0,6}\s*   # up to six "words"
    )
    \b(tiles?)\b           # tile or tiles
    ''', re.VERBOSE)
# create the reverse regex: tile/s followed by floor/roof
rx2 = re.compile(r'''
    (                      # outer group
      \b(tiles?)           # tile or tiles
      (?:\W+\w+){0,6}\s*   # up to six "words"
    )
    \b(floor|roof)\b       # roof or floor
    ''', re.VERBOSE)
#apply it to every row of Loss Description:
claims_file["LOSS_DESCRIPTION"] = claims_file["LOSS_DESCRIPTION"].apply(lambda x: rx.sub(r'\1\2\3', x))
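#apply the reverse regex; \3\1\3 prefixes tile/s with floor/roof and keeps the trailing floor/roof in place
#(the earlier template \3\1\2 replaced the trailing "roof"/"floor" with "tiles")
claims_file["LOSS_DESCRIPTION"] = claims_file["LOSS_DESCRIPTION"].apply(lambda x: rx2.sub(r'\3\1\3', x))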
# Write results into CSV file and check results
claims_file.to_csv(project_path + output_filename, index=False, encoding='utf-8')
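As a quick check of the same idea against the example data from the question, here is a self-contained sketch. It is not the production code above verbatim: `GAP`, `forward`, `reverse` and `tag_tiles` are hypothetical names, the fourth row is an invented sentence added only to exercise the reverse word order, and the reverse template assumes the intended result keeps "roof"/"floor" where it was.

```python
import re
import pandas as pd

# Example rows from the question, plus one invented row with the reverse word order.
df = pd.DataFrame({"DESCRIPTION": [
    "After the storm the roof was damaged and some of the tiles are missing",
    "I dropped the saw and it fell on the floor and damaged some of the tiles",
    "the roof was leaking and when I checked I saw that some of the tiles were cracked",
    "Several tiles fell off the roof",
]})

GAP = r"((?:\W+\w+){0,6}?\W+)"  # at most six intervening words
forward = re.compile(r"\b(floor|roof)\b" + GAP + r"(tiles?)\b")
reverse = re.compile(r"\b(tiles?)\b" + GAP + r"(floor|roof)\b")

def tag_tiles(text):
    # roof ... tiles  ->  roof ... rooftiles
    text = forward.sub(r"\1\2\1\3", text)
    # tiles ... roof  ->  rooftiles ... roof
    return reverse.sub(r"\3\1\2\3", text)

df["DESCRIPTION"] = df["DESCRIPTION"].apply(tag_tiles)
print(df["DESCRIPTION"].tolist())
```

The third row stays unchanged because more than six words separate "roof" from "tiles". Words that were already rewritten, such as "rooftiles", are not matched again by the second pass, since `\b` does not match in the middle of a word.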
Can you post your desired output? – void