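I am trying to extract dates from several articles. When I test my regular expression, the pattern matches only part of the information I am interested in, as you can see here: https://regex101.com/r/ATgIeZ/2
This is a sample of the text file: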
|[<p>Advertisement , By MILAN SCHREUER and ALISSA J. RUBIN OCT. 5, 2016
, BRUSSELS — A man wounded two police officers with a knife in Brussels around...] 3004
[<p>Advertisement , By DAVID JOLLY FEB. 8, 2016
, KABUL, Afghanistan — A Taliban suicide bomber killed at least three people on Mo JULY 14, 2034
The pattern and the code I am using for the extraction are:
import re
text_open = open("News_cleaned_definitive.csv")
text_read = text_open.read()
pattern = ("[A-Z]+\.*\s(\d+)\,\s(\d+){4}")
result = re.findall(pattern,text_read)
print(result)
The output from Anaconda is:
[('5', '6'), ('7', '5'), ('1', '6'), .....]
The expected output is:
OCT. 5, 2016, FEB. 8, 2016, JULY 14, 2034 .....
The groups in parentheses match only the digits. What is the expected output? (Also, the regex in your regex tester link is different.)
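
A minimal sketch of what seems to be going on, assuming the goal is to get each full date as a single string. When a pattern contains capture groups, `re.findall` returns only the groups rather than the whole match, and a quantified group such as `(\d+){4}` keeps only its last repetition, which would explain tuples like `('5', '6')`. Dropping the groups and quantifying the digits directly (`\d{4}`) returns the whole date. The `sample` string is a shortened copy of the text in the question, and the names `original`, `fixed`, `original_result` and `fixed_result` are made up for illustration.

```python
import re

# Shortened copy of the sample text from the question (assumption: the real
# file follows the same "MONTH[.] DAY, YEAR" layout).
sample = (
    "[<p>Advertisement , By MILAN SCHREUER and ALISSA J. RUBIN OCT. 5, 2016\n"
    ", BRUSSELS ...]\n"
    "[<p>Advertisement , By DAVID JOLLY FEB. 8, 2016\n"
    ", KABUL, Afghanistan ... JULY 14, 2034\n"
)

# Pattern from the question, written as a raw string. It has two capture
# groups, so findall returns tuples of the groups; (\d+){4} repeats the
# group four times and keeps only the last repetition (one digit).
original = re.compile(r"[A-Z]+\.*\s(\d+)\,\s(\d+){4}")

# Possible fix: no capture groups, so findall returns the whole match.
# [A-Z]+   month name or abbreviation in capitals
# \.?      optional period after an abbreviation
# \d{1,2}  day of the month
# \d{4}    four-digit year
fixed = re.compile(r"[A-Z]+\.?\s\d{1,2},\s\d{4}")

original_result = original.findall(sample)
fixed_result = fixed.findall(sample)

print(original_result)  # [('5', '6'), ('8', '6'), ('14', '4')]
print(fixed_result)     # ['OCT. 5, 2016', 'FEB. 8, 2016', 'JULY 14, 2034']
```

If the groups are still needed for other purposes, wrapping the whole pattern in one outer group, or using `re.finditer` and calling `.group(0)` on each match, should give the same full-date strings.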