Processing an HTML file with Python

I don't know much about HTML. How can I strip the markup from a page and keep only the text? For example, if the HTML page reads:

<meta name="title" content="How can I make money at home online? No gimmacks please? - Yahoo! Answers"> 
<title>How can I make money at home online? No gimmicks please? - Yahoo! Answers</title>

I just want to extract this:

How can I make money at home online? No gimmicks please? - Yahoo! Answers

I am reusing this function:

import re

def striphtml(data):
    # Replace anything that looks like a tag with a single space.
    p = re.compile(r'<.*?>')
    return p.sub(' ', data)

But it still doesn't do what I want it to do.

The function above is called like this:

for lines in filehandle.readlines(): 

     #k = str(section[6].strip()) 
     myFile.write(lines) 

     lines = striphtml(lines) 
     content.append(lines)

Source

2012-01-09 Fraz

Possible duplicate of [Parsing HTML in Python](http://stackoverflow.com/questions/717541/parsing-html-in-python), [Processing HTML files with Python](http://stackoverflow.com/q/7694637) – Sathya 2012-01-09 02:45:43

Check this question: http://stackoverflow.com/questions/328356/extracting-text-from-html-file-using-python – mgibsonbr 2012-01-09 02:47:15

Don't use regular expressions to parse HTML/XML. Try http://www.crummy.com/software/BeautifulSoup/ instead.

from BeautifulSoup import BeautifulSoup 
soup = BeautifulSoup('Your resource<title>hi</title>') 
soup.title.string # Your title string.
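
If BeautifulSoup is not installed, the same idea can be sketched with the standard library's `html.parser` module. This is a minimal sketch, not the answer author's code: the names `TitleParser`, `extract_title`, and `sample` are made up for illustration, and it assumes Python 3 and that only the first `<title>` element matters.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text found inside the first <title> element."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self._done = False
        self._parts = []

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lower-cased; attributes such as the
        # <meta content="..."> value are never treated as text.
        if tag == "title" and not self._done:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self._done = True

    def handle_data(self, data):
        if self._in_title:
            self._parts.append(data)

    @property
    def title(self):
        return "".join(self._parts).strip()


def extract_title(html):
    """Return the text of the first <title>, or '' if there is none."""
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    return parser.title


# Hypothetical input modelled on the page in the question.
sample = (
    '<meta name="title" content="ignored attribute text">\n'
    "<title>How can I make money at home online? "
    "No gimmicks please? - Yahoo! Answers</title>"
)
print(extract_title(sample))
```

Because the parser tracks which element it is inside, the `<meta>` attribute is ignored and only the title text comes back, which a tag-stripping regex applied line by line cannot distinguish.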

Source

2012-01-09 02:47:46

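I usually use http://lxml.de/ for HTML parsing! It is very easy to use, it makes getting at tags simple, and you can query the document with XPath, which keeps things simple and fast.

Here is an example of how I use it, from a script I wrote to read an XML feed and count the words: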

https://gist.github.com/1425228

You can also find more examples in the documentation: http://lxml.de/lxmlhtml.html
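
The linked gist is not reproduced here. As a rough illustration of the "read an XML feed and count the words" idea, here is a sketch that swaps in the standard library's `xml.etree.ElementTree` in place of lxml so it runs without extra packages. The feed content, the `count_words` helper, and the choice of the `description` element are all assumptions made for the example.

```python
import re
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical feed content, made up for illustration.
FEED = """<rss><channel>
  <item><title>Parsing HTML in Python</title>
        <description>Use a parser, not a regex.</description></item>
  <item><title>Python and XML</title>
        <description>A parser handles nesting for you.</description></item>
</channel></rss>"""


def count_words(xml_text, path=".//item/description"):
    """Count lower-cased words in every element matched by `path`."""
    root = ET.fromstring(xml_text)
    counts = Counter()
    for node in root.findall(path):
        # itertext() yields the element's text including nested children.
        text = "".join(node.itertext())
        counts.update(re.findall(r"[a-z']+", text.lower()))
    return counts


counts = count_words(FEED)
print(counts.most_common(3))
```

`ElementTree` only understands a limited subset of XPath; lxml accepts full XPath 1.0 expressions through its `xpath()` method, which is the feature the answer is recommending.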

Source

2012-01-09 02:56:31

Use an HTML parser for this. One option is BeautifulSoup.

To get the text content of the page:

from BeautifulSoup import BeautifulSoup 


soup = BeautifulSoup(your_html) 
text_nodes = soup.findAll(text = True) 
result = ' '.join(text_nodes)
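
For comparison, collecting the text nodes can also be sketched with the standard library alone. This is a hedged sketch, not BeautifulSoup's behaviour: `TextExtractor`, `html_to_text`, and `page` are hypothetical names, and the sketch deliberately skips the contents of `<script>` and `<style>` elements, which is an extra assumption about what counts as page text.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects non-empty text nodes, ignoring script and style bodies."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        # Keep only visible, non-whitespace text.
        if text and not self._skip_depth:
            self.chunks.append(text)


def html_to_text(html):
    """Return the page text as one space-separated string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Hypothetical page used only to exercise the sketch.
page = (
    "<html><head><title>hi</title><style>p {color: red}</style></head>"
    "<body><p>Hello <b>world</b></p></body></html>"
)
print(html_to_text(page))
```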


Source

2012-01-09 02:58:21 soulcheck
