Using regular expressions to extract information from an article

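I'm using regular expressions and Beautiful Soup to extract information from an article, but I can't get what I need out of the output. For the date, I only need the first instance returned in the list. I tried iterating over the list, but haven't had much luck. For the author, I want to strip out the href tag and get just the name rather than the whole returned string. I tried a loop and changed some of the regex calls, but I haven't been able to narrow it down. Any guidance would be appreciated. Here is the relevant code: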

import urllib2 
from bs4 import BeautifulSoup 
import re 
from time import * 

url = "http://www.reuters.com/article/2014/02/26/us-afghanistan-usa-militants-idUSBREA1O1SV20140226"

# Parse HTML of article, aka making soup 
soup = BeautifulSoup(urllib2.urlopen(url).read()) 

# Write the article author to the file  
regex = '<p class="byline">(.+?)</p>' 
pattern = re.compile(regex) 
byline = re.findall(pattern,str(soup)) 
txt.write("Author: " + str(byline) + '\n' + '\n') 

# Write the article date to the file  
regex = '<span class="timestamp">(.+?)</span>' 
pattern = re.compile(regex) 
byline = re.findall(pattern,str(soup)) 
txt.write("Date: " + str(byline) + '\n' + '\n')

Source

2014-02-27 user3285763

You don't need regex at all here, just use BeautifulSoup! Also, the date is in the last 8 characters of the URL.
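
To illustrate the point about the URL, here is a minimal sketch of reading the date from its last 8 characters. It assumes the URL always ends in a `YYYYMMDD` run with no trailing slash or query string, which holds for the URL in the question but is not guaranteed for every article. The helper name `date_from_url` is made up for this example, and the snippet is written for Python 3.

```python
from datetime import datetime


def date_from_url(url):
    """Parse the trailing YYYYMMDD digits of a URL into a date object.

    Assumes the last 8 characters are the publication date; strptime
    raises ValueError if they are not.
    """
    return datetime.strptime(url[-8:], "%Y%m%d").date()


url = "http://www.reuters.com/article/2014/02/26/us-afghanistan-usa-militants-idUSBREA1O1SV20140226"
article_date = date_from_url(url)
print(article_date.isoformat())  # prints 2014-02-26
```

Wrapping the call in a `try`/`except ValueError` would be one way to fall back to scraping the page when the URL does not follow this pattern.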

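Can you give an example of how to grab the author with bs4? I've read the Beautiful Soup documentation, but the methods there didn't produce the output I wanted. I'm very new to Python, though, so it may well be a misunderstanding on my part. – user3285763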

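You can use BeautifulSoup to grab what you need with almost exactly the approach you described, just without the regular expressions. Since you already know the characteristics of the tags you're interested in, you can search for them directly using BS4's find: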

# Make some soup
soup = BeautifulSoup(urllib2.urlopen(url).read())

# Extract the byline and date text from their respective tags.
# find() returns None when no tag matches, so .text raises AttributeError.
try:
    byline = soup.find("p", {'class': 'byline'}).text
    date = soup.find("span", {'class': 'timestamp'}).text
except AttributeError:
    print 'byline or timestamp missing!'

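Edit: if you wrap the whole thing in a try/except structure, you can handle the case where the byline is missing and define whatever alternative action should happen instead.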
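
If you would rather stay with the regex approach from the question, the following is a rough sketch of the two things asked for there: taking only the first match, and dropping the anchor tag so only the name text remains. The HTML string is a made-up stand-in (the author name and timestamps are placeholders, not taken from the real page), the helper `first_text` is hypothetical, and the code is written for Python 3. Regexes are fragile against real-world HTML, so treat this as an illustration rather than a replacement for the find approach.

```python
import re

# Hypothetical sample markup; the real page's HTML may differ.
html = (
    '<p class="byline">By <a href="/journalists/jane-doe">Jane Doe</a></p>'
    '<span class="timestamp">Wed Feb 26, 2014 9:00am EST</span>'
    '<span class="timestamp">Wed Feb 26, 2014 9:30am EST</span>'
)

# Matches any single HTML tag, e.g. <a href="..."> or </a>.
TAG = re.compile(r"<[^>]+>")


def first_text(pattern, html):
    """Return the first capture of pattern with inner tags stripped, or None."""
    match = re.search(pattern, html, re.DOTALL)  # search stops at the first match
    if match is None:
        return None  # nothing matched; the caller decides what to do
    return TAG.sub("", match.group(1)).strip()


byline = first_text(r'<p class="byline">(.+?)</p>', html)
date = first_text(r'<span class="timestamp">(.+?)</span>', html)
missing = first_text(r'<p class="dateline">(.+?)</p>', html)

print(byline)   # By Jane Doe
print(date)     # Wed Feb 26, 2014 9:00am EST
print(missing)  # None
```

`re.search` returns only the first match, so there is no list to iterate over, and a `None` result lets the script carry on when a tag is absent instead of raising an error.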

Source

2014-02-28 04:27:00 nickhamlin

You, sir, are a great man. I've been struggling with this for a while. Thank you for taking the time to help a beginner. – user3285763

If the byline doesn't exist, it breaks the code. Is there a way to check whether it is there? Or to have it return nothing and carry on? – user3285763

You bet, see the update above. – nickhamlin
