Extract multiple patterns and save them to a pandas DataFrame [Python]

My text file looks like this:

Description: Text 1 follows <br/> blah blah blah Cause: Cause Text 1 
follows here <br/>Description: Text 2 follows <br/> blah blah 
blah Cause: Cause Text 2 follows here<br/>Description: Text 3 follows <br/> 
blah blah blah Description: Text 4 follows <br/> blah blah 
blah Cause: Cause Text 4 follows<br/>

I want all the Descriptions and Causes in a pandas DataFrame, in a structured format for NLP:

Description    Cause 
Text 1 follows  Cause Text 1 follows here 
Text 2 follows  Cause Text 2 follows here 
Text 3 follows  
Text 4 follows  Cause Text 4 follows here

What I have done so far:

re.findall(r'Description:(.*?)<br/>',textfile) 
re.findall(r'Cause:(.*?)<br/>',textfile)

However, this doesn't let me line up the Descriptions with their Causes when I try to build the larger DataFrame!
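To make the mismatch concrete, here is a minimal sketch. It assumes the sample text above sits on one line in a variable named `textfile`. The two calls return lists of different lengths, so pairing them by position attaches the wrong Cause to Text 3.

```python
import re

# Sample text from above, joined into one string (assumed to be on a single line).
textfile = (
    "Description: Text 1 follows <br/> blah blah blah Cause: Cause Text 1 "
    "follows here <br/>Description: Text 2 follows <br/> blah blah "
    "blah Cause: Cause Text 2 follows here<br/>Description: Text 3 follows <br/> "
    "blah blah blah Description: Text 4 follows <br/> blah blah "
    "blah Cause: Cause Text 4 follows<br/>"
)

descriptions = re.findall(r'Description:(.*?)<br/>', textfile)
causes = re.findall(r'Cause:(.*?)<br/>', textfile)

# Four Descriptions but only three Causes: Text 3 has no Cause,
# so the third Cause found actually belongs to Text 4.
print(len(descriptions), len(causes))
print(causes[2].strip())
```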

Thanks for any input or guidance on how to do this. I'm very new to Python!

Source

2017-02-16 0Ajax0

Try the regex at https://regex101.com/r/bRIOev/1

This is what I came up with.

r"Description:(.*?)<br/>(?:(?!Cause)(?!Description).)*(?:Cause:(.*?)<br/>)?"

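This expression matches a Description together with an optional Cause, so each Description stays correctly "zipped" with its own Cause. A Description with no Cause gets an empty string.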

import re
import pandas
data = re.findall(r"Description:(.*?)<br/>(?:(?!Cause)(?!Description).)*(?:Cause:(.*?)<br/>)?", textfile)
df = pandas.DataFrame(data, columns=("Description", "Cause"))
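For reference, here is a self-contained sketch of the same approach run on the sample text from the question. Two things in it are my own assumptions rather than part of the answer above: the `re.DOTALL` flag, which only matters if the real file contains line breaks inside a Description or Cause, and the `.strip()` calls, which remove the spaces captured around each value. The names `pattern` and `rows` are illustrative.

```python
import re
import pandas as pd

# Sample text from the question, joined into one string.
textfile = (
    "Description: Text 1 follows <br/> blah blah blah Cause: Cause Text 1 "
    "follows here <br/>Description: Text 2 follows <br/> blah blah "
    "blah Cause: Cause Text 2 follows here<br/>Description: Text 3 follows <br/> "
    "blah blah blah Description: Text 4 follows <br/> blah blah "
    "blah Cause: Cause Text 4 follows<br/>"
)

pattern = re.compile(
    r"Description:(.*?)<br/>(?:(?!Cause)(?!Description).)*(?:Cause:(.*?)<br/>)?",
    re.DOTALL,  # assumption: lets '.' cross line breaks if the file is wrapped
)

# findall returns one (description, cause) tuple per match;
# the cause is an empty string when the optional group did not match.
rows = [(d.strip(), c.strip()) for d, c in pattern.findall(textfile)]

df = pd.DataFrame(rows, columns=("Description", "Cause"))
print(df)
```

If the file does contain line breaks, the captured values will still include them, so you may want to collapse whitespace (for example with `" ".join(value.split())`) before building the DataFrame.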

Source

2017-02-16 07:34:05

Perfect :) Thank you! – 0Ajax0
