2014-02-21 90 views
0

Python regular expression to extract relative href links

I have an HTML file with tons of relative href links like this one:

<a href="data/self/dated/station1_140208.txt">Saturday, February 08, 2014</a><br/>

The file also contains tons of other HTTP and FTP links.
I need an output txt file that looks like this:

14/02/08: station1_140208.txt 
14/02/09: station1_140209.txt 
14/02/10: station1_140210.txt 
14/02/11: station1_140211.txt 
14/02/12: station1_140212.txt 

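I tried to write this myself, but it would take me a long time to get used to Python regular expressions.
I can open the source file, apply the specific regular expression (which I have not been able to work out yet), and write the result back to disk.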

I need your help with the regular expression. Thank you.


Use the DOM to extract all the links, and then check which of them are relative.
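A minimal sketch of the parser-based approach suggested in the comment above, using only the standard library. The class name LinkCollector, the function relative_links, and the sample markup are made up for illustration; the idea is to collect every href first and then keep those that have neither a scheme nor a host.

```python
from html.parser import HTMLParser
from urllib.parse import urlsplit


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def relative_links(html):
    """Return hrefs that have neither a scheme (http, ftp, ...) nor a host."""
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    result = []
    for href in collector.hrefs:
        parts = urlsplit(href)
        if not parts.scheme and not parts.netloc:
            result.append(href)
    return result


# Made-up sample: two absolute links and one relative link.
sample = (
    '<a href="http://example.com/a.txt">remote</a>'
    '<a href="ftp://example.com/b.txt">ftp</a>'
    '<a href="data/self/dated/station1_140208.txt">Saturday, February 08, 2014</a>'
)
print(relative_links(sample))
```

With this sample, only the data/self/dated/ link should survive the filter.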

Answers

0
pattern = r'href="data/self/dated/([^"]*)"[^>]*>([\s\S]*?)</a>'

Test:

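import re
# Sample markup; links that do not start with data/self/dated/ will not match.
s = """
<a href="data/self/dated/station1_140208.txt">Saturday, February 08, 2014</a>
<br/>
<a href="data/self/dated/station1_1402010.txt">Saturday, February 10, 2014</a>
<br/>
<a href="data/self/dated/station1_1402012.txt">Saturday, February 12, 2014</a>
<br/>
"""
# Raw string, so that \s and \S reach the regex engine unchanged.
pattern = r'href="data/self/dated/([^"]*)"[^>]*>([\s\S]*?)</a>'
# Each match is a (file name, link text) tuple.
re.findall(pattern, s)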

Output:

[('station1_140208.txt', 'Saturday, February 08, 2014'), ('station1_1402010.txt', 'Saturday, February 10, 2014'), ('station1_1402012.txt', 'Saturday, February 12, 2014')] 
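
To get from those tuples to the lines shown in the question, one option is to read the six digits in the file name as yymmdd. This is only a sketch under that assumption; the name format_links and the stricter pattern (exactly six digits) are my own choices for illustration, not part of the answer above.

```python
import re

# Assumption: file names look like station1_YYMMDD.txt (exactly six digits).
LINK_RE = re.compile(
    r'href="data/self/dated/(station1_(\d{2})(\d{2})(\d{2})\.txt)"'
)


def format_links(html):
    """Return lines such as '14/02/08: station1_140208.txt'."""
    lines = []
    for name, yy, mm, dd in LINK_RE.findall(html):
        lines.append("{0}/{1}/{2}: {3}".format(yy, mm, dd, name))
    return lines


# Made-up sample with one unrelated absolute link mixed in.
html = (
    '<a href="data/self/dated/station1_140208.txt">Saturday, February 08, 2014</a><br/>\n'
    '<a href="http://example.com/other.txt">not wanted</a><br/>\n'
    '<a href="data/self/dated/station1_140209.txt">Sunday, February 09, 2014</a><br/>\n'
)
for line in format_links(html):
    print(line)
```

Note that names with seven digits, such as station1_1402010.txt in the test string above, would not match this stricter pattern.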

Thank you very much, Kowalski, it does exactly what I was hoping for. – user3335418

2

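I know this is not exactly what you asked for, but I thought I would show a way to convert the date in the link text into the format shown in your desired output example (year/month/day). I use BeautifulSoup to read the elements from the HTML.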

from bs4 import BeautifulSoup
import datetime as dt
import re

html = '<a href="data/self/dated/station1_140208.txt">Saturday, February 08, 2014</a><br/>'

# Match only hrefs that contain /station1_<digits>.txt
p = re.compile(r'.*/station1_\d+\.txt')

# Name the parser explicitly so the result does not depend on which parsers are installed
soup = BeautifulSoup(html, 'html.parser')

a_tags = soup.find_all('a', {"href": p})

>>> print(a_tags) # a list of all a tags in the html with a matching href attribute
[<a href="data/self/dated/station1_140208.txt">Saturday, February 08, 2014</a>]

names = [str(a.get('href')).split('/')[-1] for a in a_tags] # str() only matters on Python 2, where the value is unicode

dates = [dt.datetime.strptime(str(a.text), '%A, %B %d, %Y') for a in a_tags] # %d is the day of the month

names and dates are built with list comprehensions.

strptime creates datetime objects from the date strings:

>>> print(names) # a list of all file names from the hrefs
['station1_140208.txt']

>>> print(dates) # a list of all dates as datetime objects
[datetime.datetime(2014, 2, 8, 0, 0)]

toFileData = ["{0}: {1}".format(d.strftime('%y/%m/%d'), n) for d, n in zip(dates, names)]

strftime reformats the datetime objects into your format, and zip pairs each date with its own file name:

>>> print(toFileData)
['14/02/08: station1_140208.txt']

Then write the entries in toFileData to a file.
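
That last step could look roughly like this. The output path is hypothetical, and to_file_data is a stand-in list holding sample entries in the format the question asks for.

```python
import os
import tempfile

# Stand-in for the toFileData list; sample entries in the desired format.
to_file_data = ["14/02/08: station1_140208.txt", "14/02/09: station1_140209.txt"]

# Hypothetical output location; replace it with the path you actually want.
out_path = os.path.join(tempfile.gettempdir(), "station_links.txt")

# Write one entry per line.
with open(out_path, "w") as f:
    for entry in to_file_data:
        f.write(entry + "\n")

# Read the file back to check what was written.
with open(out_path) as f:
    written = f.read().splitlines()
print(written)
```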

For information on the methods I used in the code above, such as soup.find_all() and a.get(), I recommend having a look at the BeautifulSoup documentation. Hope this helps.