2013-07-01 54 views

I want to extract a tag from an HTML file in Python without using BeautifulSoup. For example, I want to extract the following tag from an HTML file:

class="el" href="atsc__root__raised__cosine.html" target="_self">atsc_root_raised_cosine 

<a class="el" href="atsc__root__raised__cosine.html" target="_self">atsc_root_raised_cosine</a> 

Any ideas?

Why don't you want to use BeautifulSoup? There may well be a good reason, but including it would make this question more useful to other people.

That is not a tag, it is just a fragment of HTML. What are you trying to do?

Answers

For basic DOM parsing, you can use the XML parser in the standard library, xml.dom.minidom.

Here is an example (from the documentation) that uses it to convert XML to HTML:

import xml.dom.minidom

document = """\
<slideshow>
<title>Demo slideshow</title>
<slide><title>Slide title</title>
<point>This is a demo</point>
<point>Of a program for processing slides</point>
</slide>

<slide><title>Another demo slide</title>
<point>It is important</point>
<point>To have more than</point>
<point>one slide</point>
</slide>
</slideshow>
"""

# Parse the XML string into a DOM tree.
dom = xml.dom.minidom.parseString(document)

def getText(nodelist):
    # Join the data of all text nodes in the given node list.
    rc = []
    for node in nodelist:
        if node.nodeType == node.TEXT_NODE:
            rc.append(node.data)
    return ''.join(rc)

def handleSlideshow(slideshow):
    print("<html>")
    handleSlideshowTitle(slideshow.getElementsByTagName("title")[0])
    slides = slideshow.getElementsByTagName("slide")
    handleToc(slides)
    handleSlides(slides)
    print("</html>")

def handleSlides(slides):
    for slide in slides:
        handleSlide(slide)

def handleSlide(slide):
    handleSlideTitle(slide.getElementsByTagName("title")[0])
    handlePoints(slide.getElementsByTagName("point"))

def handleSlideshowTitle(title):
    print("<title>%s</title>" % getText(title.childNodes))

def handleSlideTitle(title):
    print("<h2>%s</h2>" % getText(title.childNodes))

def handlePoints(points):
    print("<ul>")
    for point in points:
        handlePoint(point)
    print("</ul>")

def handlePoint(point):
    print("<li>%s</li>" % getText(point.childNodes))

def handleToc(slides):
    # Table of contents: one paragraph per slide title.
    for slide in slides:
        title = slide.getElementsByTagName("title")[0]
        print("<p>%s</p>" % getText(title.childNodes))

handleSlideshow(dom)
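
The same module can be pointed at the anchor from the question. This is only a minimal sketch, and it assumes the input is well-formed XML: minidom raises a parse error on typical real-world HTML (unclosed tags, unquoted attributes). The helper name extract_links is made up for illustration and is not part of any library.

```python
import xml.dom.minidom

def extract_links(markup):
    """Return (class, href, text) for every <a> element in well-formed markup.

    Hypothetical helper for illustration; assumes the markup parses as XML.
    """
    dom = xml.dom.minidom.parseString(markup)
    links = []
    for link in dom.getElementsByTagName("a"):
        # Join the text nodes that sit directly under the element.
        text = "".join(
            node.data for node in link.childNodes
            if node.nodeType == node.TEXT_NODE
        )
        links.append((link.getAttribute("class"), link.getAttribute("href"), text))
    return links

# The anchor from the question.
fragment = (
    '<a class="el" href="atsc__root__raised__cosine.html" '
    'target="_self">atsc_root_raised_cosine</a>'
)

print(extract_links(fragment))
```

getAttribute returns an empty string when the attribute is missing, so the helper does not need a special case for anchors without a class.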

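Take a look at the XML API that Python provides. It explains how to access attributes and elements, and it includes some HTML examples as well. You can also create a parser object.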
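
The answer does not say which parser object it has in mind. One standard-library option, assumed here, is html.parser.HTMLParser (Python 3 module name), which tolerates HTML that is not well-formed XML. A rough sketch follows; the AnchorCollector class and its attribute names are hypothetical.

```python
from html.parser import HTMLParser

class AnchorCollector(HTMLParser):
    """Collect attributes and text of every <a> element (illustrative helper)."""

    def __init__(self):
        super().__init__()
        self.anchors = []      # finished anchors: {"attrs": {...}, "text": "..."}
        self._current = None   # anchor being read, or None outside <a>...</a>

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs.
        if tag == "a":
            self._current = {"attrs": dict(attrs), "text": ""}

    def handle_data(self, data):
        # Accumulate text only while inside an anchor.
        if self._current is not None:
            self._current["text"] += data

    def handle_endtag(self, tag):
        if tag == "a" and self._current is not None:
            self.anchors.append(self._current)
            self._current = None

# Sample input: the anchor from the question plus an unclosed <br>,
# which an XML parser would reject.
page = (
    '<p>See <a class="el" href="atsc__root__raised__cosine.html" '
    'target="_self">atsc_root_raised_cosine</a><br></p>'
)

parser = AnchorCollector()
parser.feed(page)
parser.close()

for anchor in parser.anchors:
    print(anchor["attrs"].get("href"), anchor["text"])
```

This sketch does not handle anchors nested inside other anchors, which HTML does not allow anyway.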
