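# Assumes BeautifulSoup 4; with BeautifulSoup 3 use: from BeautifulSoup import BeautifulSoup
from bs4 import BeautifulSoup

soup = BeautifulSoup(your_data)
uploaded = []
link_data = []
# Each <font class="detDesc"> holds the description text followed by an <a> link.
for f in soup.findAll("font", {"class":"detDesc"}):
    uploaded.append(f.contents[0])     # text before the <a> tag
    link_data.append(f.a.contents[0])  # text inside the <a> tag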
For example, with the following data:
your_data = """
<font class="detDesc">Uploaded 10-29 18:50, Size 4.36 GiB, ULed by <a class="detDesc" href="/user/NLUPPER002/" title="Browse NLUPPER002">NLUPPER002</a></font>
<div id="meow">test</div>
<font class="detDesc">Uploaded 10-26 19:23, Size 1.16 GiB, ULed by <a class="detDesc" href="/user/NLUPPER002/" title="Browse NLUPPER002">NLUPPER003</a></font>
"""
Running the code above gives you:
>>> print uploaded
[u'Uploaded 10-29 18:50, Size 4.36 GiB, ULed by ', u'Uploaded 10-26 19:23, Size 1.16 GiB, ULed by ']
>>> print link_data
[u'NLUPPER002', u'NLUPPER003']
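To get the text in the exact form you described, you can either post-process the lists or parse the data inside the loop itself. For example: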
>>> [",".join(x.split(",")[:2]).replace(" ", " ") for x in uploaded]
[u'Uploaded 10-29 18:50, Size 4.36 GiB', u'Uploaded 10-26 19:23, Size 1.16 GiB']
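Parsing inside the loop could look something like the sketch below. `parse_description` is a hypothetical helper, not part of BeautifulSoup, and it assumes every description follows the "Uploaded ..., Size ..., ULed by" layout shown in the sample data; a string with fewer fields would raise an IndexError.

```python
def parse_description(text):
    """Split 'Uploaded <date>, Size <size>, ULed by ' into (date, size).

    Hypothetical helper: assumes the comma-separated layout from the sample data.
    """
    # Normalize non-breaking spaces, then keep the first two comma-separated fields.
    fields = [part.strip() for part in text.replace(u"\xa0", u" ").split(",")[:2]]
    # Drop the leading labels so only the values remain.
    uploaded_at = fields[0].replace("Uploaded", "", 1).strip()
    size = fields[1].replace("Size", "", 1).strip()
    return uploaded_at, size


sample = u"Uploaded 10-29 18:50, Size 4.36 GiB, ULed by "
print(parse_description(sample))
```

Inside the loop you would call it as `parse_description(f.contents[0])` and append the result instead of the raw string.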
P.S. If you're a fan of list comprehensions, the solution can be expressed as a one-liner:
output = [(f.contents[0], f.a.contents[0]) for f in soup.findAll("font", {"class":"detDesc"})]
This gives you:
>>> output # list of tuples
[(u'Uploaded 10-29 18:50, Size 4.36 GiB, ULed by ', u'NLUPPER002'), (u'Uploaded 10-26 19:23, Size 1.16 GiB, ULed by ', u'NLUPPER003')]
>>> uploaded, link_data = zip(*output) # split into two separate tuples
>>> uploaded
(u'Uploaded 10-29 18:50, Size 4.36 GiB, ULed by ', u'Uploaded 10-26 19:23, Size 1.16 GiB, ULed by ')
>>> link_data
(u'NLUPPER002', u'NLUPPER003')
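One caveat with `zip(*output)`: if no tags matched, `output` is empty and unpacking into two names raises a ValueError. A minimal sketch of a guard, using a hypothetical helper named `unzip_pairs` that also returns real lists rather than tuples:

```python
def unzip_pairs(pairs):
    """Split a list of 2-tuples into two lists (hypothetical helper)."""
    if not pairs:
        # zip(*[]) yields nothing, so unpacking into two names would fail.
        return [], []
    first, second = zip(*pairs)
    return list(first), list(second)


# Same shape as the `output` list shown above.
output = [
    (u"Uploaded 10-29 18:50, Size 4.36 GiB, ULed by ", u"NLUPPER002"),
    (u"Uploaded 10-26 19:23, Size 1.16 GiB, ULed by ", u"NLUPPER003"),
]
uploaded, link_data = unzip_pairs(output)
```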
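Regular expressions can help you do this easily.
@JohnRiselvato No, regular expressions are almost never a good solution for parsing XML/HTML.
I could dump it as JSON, but that still doesn't solve my problem, because this page's HTML is poorly written. Or so I think! – Hick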