Parsing meta tags with Beautiful Soup and Python

I'm having a problem parsing an HTML page with Beautiful Soup 3 and Python 2.6.

The HTML content looks like this:

content='<div class="egV2_EventReportCardLeftBlockShortWidth"> 
<span class="egV2_EventReportCardTitle">When</span> 
<span class="egV2_EventReportCardBody"> 
<meta itemprop="startDate" content="2012-11-23T10:00:00.0000000"> 
<span class='egV2_archivedDateEnded'>STARTS</span>Fri 23 Nov,10:00AM<br/> 
<meta itemprop="endDate" content="2012-12-03T18:00:00.0000000"> 
<span class='egV2_archivedDateEnded'>ENDS</span>Mon 03 Dec,6:00PM</span> 
<span class="egV2_EventReportCardBody"></span> 
<div class="egV2_div_cal" onclick=" showExportEvent()"> 
<div class="egV2_div_cal_outerFix"> 
<div class="egV2_div_cal_InnerAdjust"> Cal </div> 
</div></div></div>'

I want to get the string "Fri 23 Nov,10:00AM" out of the middle of it and into a variable, so that it can be concatenated and sent back to a PHP page.

To read this content, I am using the following code (the content above was read from the HTML page at http://everguide.com.au/melbourne/event/2012-nov-23/life-与鸟弹簧仓库销售/):

import urllib2 
req = urllib2.Request(URL) 
response = urllib2.urlopen(req) 
html = response.read() 
from BeautifulSoup import BeautifulSoup 
soup = BeautifulSoup(html.decode('utf-8')) 
soup.prettify() 
import re 
for node in soup.findAll(itemprop="name"): 
    n = ''.join(node.findAll(text=True)) 
for node in soup.findAll("div", { "class" : "egV2_EventReportCardLeftBlockShortWidth" }): 
    d = ''.join(node.findAll(text=True)) 
print n,"|", d

This returns:

[(ssh user)]# python testscrape.py 

LIFE with BIRD Spring Warehouse Sale | 
When 
<span class="egV2_EventReportCardDateTitle">STARTS</span> 
STARTSFri 23 Nov,10:00AMENDSMon 03 Dec,6:00PM 
<span class="egV2_EventReportCardDateTitle">ENDS</span> 



Cal 



[(ssh user)]#

(It includes all the line breaks, etc.)

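As you can see at the end of the script, I group the two stripped strings into a single printed line with a separator character in between, so that PHP can read the output back as one string and then split it.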
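The separator approach described above can be sketched as follows. This is only an illustration under stated assumptions: the helper names `clean` and `make_record` are hypothetical, the sample strings are abbreviated from the output shown earlier, and the code targets Python 3 rather than the Python 2.6 used in the question.

```python
def clean(text):
    """Collapse runs of whitespace (including newlines) into single spaces."""
    return " ".join(text.split())


def make_record(fields, sep="|"):
    """Join cleaned fields with a separator that the PHP side can split on."""
    return sep.join(clean(field) for field in fields)


# Sample values abbreviated from the scraped output (assumed for illustration).
name = "LIFE with BIRD Spring Warehouse Sale"
when = "\nWhen \n\nSTARTSFri 23 Nov,10:00AM\n\n"

record = make_record([name, when])
print(record)  # LIFE with BIRD Spring Warehouse Sale|When STARTSFri 23 Nov,10:00AM
```

This only works if the separator can never occur inside a field, so a character that is unlikely to appear in event names is the safer choice.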

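The problem is that the Python code can read the page and store the text, but the result includes all the junk, tags and so on, which confuses the PHP application.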

All I really want to return is:

Fri 23 Nov,10:00AM

Is this happening because I am using the findAll(text=True) method?

How can I drill down and get only the text inside that block, without the span tags?

Any help would be greatly appreciated, thanks.

Rick - Melbourne.

Source

2012-11-25 itsricky

Why not try something like:

In [95]: soup = BeautifulSoup(content) 

In [96]: soup.find("span", {"class": "egV2_archivedDateEnded"}) 
Out[96]: <span class="egV2_archivedDateEnded">STARTS</span> 

In [97]: soup.find("span", {"class": "egV2_archivedDateEnded"}).next 
Out[97]: u'STARTS' 

In [98]: soup.find("span", {"class": "egV2_archivedDateEnded"}).next.next 
Out[98]: u'Fri 23 Nov,10:00AM'

or even

In [99]: soup.find("span", {"class": "egV2_archivedDateEnded"}).nextSibling 
Out[99]: u'Fri 23 Nov,10:00AM'
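For readers without Beautiful Soup installed, the same "text right after the span" idea can be approximated with the standard library. The following is a rough Python 3 sketch, not the Beautiful Soup API: the class name `TextAfterSpan` is hypothetical, the HTML fragment is shortened from the question, and nested spans inside the target span are not handled.

```python
from html.parser import HTMLParser


class TextAfterSpan(HTMLParser):
    """Collect the text node that directly follows each matching </span>."""

    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self.in_target = False     # currently inside a matching <span>
        self.capture_next = False  # a matching </span> was just closed
        self.results = []

    def handle_starttag(self, tag, attrs):
        self.capture_next = False  # any tag ends the "directly follows" window
        if tag == "span" and dict(attrs).get("class") == self.css_class:
            self.in_target = True

    def handle_endtag(self, tag):
        if tag == "span" and self.in_target:
            self.in_target = False
            self.capture_next = True
        else:
            self.capture_next = False

    def handle_data(self, data):
        if self.capture_next:
            self.results.append(data.strip())
            self.capture_next = False


# Fragment shortened from the HTML in the question.
html = (
    '<span class="egV2_EventReportCardBody">'
    '<meta itemprop="startDate" content="2012-11-23T10:00:00.0000000">'
    "<span class='egV2_archivedDateEnded'>STARTS</span>Fri 23 Nov,10:00AM<br/>"
    '<meta itemprop="endDate" content="2012-12-03T18:00:00.0000000">'
    "<span class='egV2_archivedDateEnded'>ENDS</span>Mon 03 Dec,6:00PM</span>"
)

parser = TextAfterSpan("egV2_archivedDateEnded")
parser.feed(html)
parser.close()
print(parser.results[0])  # Fri 23 Nov,10:00AM
```

The first entry in `results` is the start text and the second is the end text, which mirrors what `nextSibling` returns for each span.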

Source

2012-11-25 23:32:25 DSM

Awesome! I hadn't even thought of using nextSibling! The documentation (BS4) isn't exactly the easiest to read through! Cheers. – itsricky

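If you are just trying to extract a single tag that is easily identified by a specific attribute, pyparsing makes this pretty simple (I would go after the meta tag with its ISO8601 time string value):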

from pyparsing import makeHTMLTags,withAttribute 

meta = makeHTMLTags('meta')[0] 
# only want matching <meta> tags if they have the attribute itemprop="startDate" 
meta.setParseAction(withAttribute(itemprop="startDate")) 

# scanString is a generator that yields (tokens,startloc,endloc) triples, we just 
# want the tokens 
firstmatch = next(meta.scanString(content))[0]

Now convert it to a datetime object, which can be formatted any way you like, written to a database, used to compute elapsed times, etc.:

from datetime import datetime 
dt = datetime.strptime(firstmatch.content[:19], "%Y-%m-%dT%H:%M:%S") 

print (firstmatch.content) 
print (dt)

Prints:

2012-11-23T10:00:00.0000000 
2012-11-23 10:00:00
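To get from the ISO8601 value back to the display format asked for in the question, something along these lines might work. It is a Python 3 sketch using only the standard library instead of pyparsing; `MetaCollector` and `format_event_time` are hypothetical names, and the `%a`, `%b` and `%p` codes assume the default C/English locale.

```python
from datetime import datetime
from html.parser import HTMLParser


class MetaCollector(HTMLParser):
    """Map each <meta itemprop="..."> to the value of its content attribute."""

    def __init__(self):
        super().__init__()
        self.items = {}

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            attrs = dict(attrs)
            if "itemprop" in attrs:
                self.items[attrs["itemprop"]] = attrs.get("content")


def format_event_time(iso_string):
    """Turn '2012-11-23T10:00:00.0000000' into 'Fri 23 Nov,10:00AM'."""
    # Keep only the first 19 characters, dropping the fractional seconds.
    dt = datetime.strptime(iso_string[:19], "%Y-%m-%dT%H:%M:%S")
    hour = dt.strftime("%I").lstrip("0")  # 12-hour clock, no leading zero
    return dt.strftime("%a %d %b,") + hour + dt.strftime(":%M%p")


# The two meta tags from the HTML in the question.
sample = (
    '<meta itemprop="startDate" content="2012-11-23T10:00:00.0000000">'
    '<meta itemprop="endDate" content="2012-12-03T18:00:00.0000000">'
)

collector = MetaCollector()
collector.feed(sample)
collector.close()

start_label = format_event_time(collector.items["startDate"])
end_label = format_event_time(collector.items["endDate"])
print(start_label)  # Fri 23 Nov,10:00AM
print(end_label)    # Mon 03 Dec,6:00PM
```

Working from the meta tag rather than the visible text means the value does not depend on how the page happens to render the date.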

Source

2013-06-23 02:38:06 PaulMcG
