Strip spaces

I have a string containing some HTML tags, as follows:

"<p>   This is a   test   </p>"

I want to strip all the extra spaces between the tags. I have tried the following:

In [1]: import re

In [2]: val = "<p>   This is a   test   </p>"

In [3]: re.sub(r"\s{2,}", "", val)
Out[3]: '<p>This is atest</p>'

In [4]: re.sub(r"\s\s+", "", val)
Out[4]: '<p>This is atest</p>'

In [5]: re.sub(r"\s+", "", val)
Out[5]: '<p>Thisisatest</p>'

But I could not get the desired result, i.e. <p>This is a test</p>

How can I achieve this?

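Source

2013-11-23 Amyth

Try

val = re.sub(r'\s+<', '<', val)
val = re.sub(r'>\s+', '>', val)

However, this is too simplistic for general real-world use, where angle brackets are not necessarily always part of a tag (think of <code> blocks, <script> blocks, etc.). You should use a proper HTML parser for anything like that.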
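As a rough sketch of how the two substitutions could be combined, and of why the caveat matters, the snippet below wraps them in a helper. The name strip_tag_spaces and the extra step that collapses inner runs of whitespace are assumptions added for illustration; they are not part of the answer itself.

```python
import re

def strip_tag_spaces(html: str) -> str:
    """Hypothetical helper: trim whitespace next to tags, collapse inner runs."""
    html = re.sub(r">\s+", ">", html)    # drop whitespace right after a tag
    html = re.sub(r"\s+<", "<", html)    # drop whitespace right before a tag
    return re.sub(r"\s{2,}", " ", html)  # collapse remaining runs to one space

print(strip_tag_spaces("<p>   This is a   test   </p>"))
# <p>This is a test</p>

# The caveat in practice: whitespace that matters inside <pre> is lost too.
print(strip_tag_spaces("<pre>  x = 1\n  y = 2  </pre>"))
# <pre>x = 1 y = 2</pre>
```

The second call shows the regex has no idea which element it is inside, which is the argument for a real parser when the input is not this simple.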

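Source

2013-11-23 11:34:23 tripleee

Try using an HTML parser like BeautifulSoup:

from bs4 import BeautifulSoup as BS
s = "<p> This is a test </p>"
soup = BS(s, 'html.parser')  # name the parser explicitly so no <html>/<body> wrapper is added
p = soup.find('p')
p.string = ' '.join(p.text.split())  # split() drops every run of whitespace, join() puts single spaces back
print(soup)

<p>This is a test</p>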

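Source

2013-11-23 11:35:43 TerryA

This might help:

import re

val = "<p> This is a test </p>"
re_strip_p = re.compile("<p>|</p>")  # matches the opening or the closing <p> tag

# Remove the tags, trim the text, then wrap it in <p>...</p> again
val = '<p>%s</p>' % re_strip_p.sub('', val).strip()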

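Source

2013-11-23 11:37:55 flyer

You can try this:

# Python backreferences are written \1 and \2; on Python 3.5+ a group that did not match is replaced with an empty string
re.sub(r'\s+(</)|(<[^/][^>]*>)\s+', r'\1\2', val)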

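Source

2013-11-23 11:38:14

From the question, I see that you are parsing a very specific HTML string. Regular expressions are quick and dirty, but they are not recommended here; use an XML parser instead. Note: XML is stricter than HTML, so if you think your input might not be well-formed XML, use BeautifulSoup as @Haidro suggested.

For your case, you would do something like this: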

>>> import xml.etree.ElementTree as ET
>>> p = ET.fromstring("<p> This is a test </p>")
>>> p.text.strip()
'This is a test'
>>> p.text = p.text.strip()  # If you want to perform more operations on the string, do it here.
>>> ET.tostring(p, encoding='unicode')  # encoding='unicode' returns str rather than bytes on Python 3
'<p>This is a test</p>'
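For input with more than one element, the same idea might be extended by walking the whole tree. This is only a sketch: normalize_space and clean_tree are hypothetical names, and it assumes well-formed XML whose elements are block-like, since dropping the whitespace around inline elements such as <b> would glue neighbouring words together.

```python
import xml.etree.ElementTree as ET

def normalize_space(text):
    """Collapse whitespace runs; return None if the text is None or only whitespace."""
    if text is None:
        return None
    return " ".join(text.split()) or None

def clean_tree(root):
    """Normalize the text inside and the tail after every element (assumed block-like)."""
    for elem in root.iter():
        elem.text = normalize_space(elem.text)
        elem.tail = normalize_space(elem.tail)
    return root

root = ET.fromstring("<div> <p> one   two </p> <p> three </p> </div>")
print(ET.tostring(clean_tree(root), encoding="unicode"))
# <div><p>one two</p><p>three</p></div>
```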

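Source

2013-11-23 11:41:50 SuperSaiyan

>>> import re
>>> s = '<p>   This is a   test   </p>'
>>> s = re.sub(r'(\s)(\s*)', r'\g<1>', s)  # collapse each run of whitespace to its first character
>>> s
'<p> This is a test </p>'
>>> s = re.sub(r'>\s*', '>', s)  # drop whitespace after a tag
>>> s = re.sub(r'\s*<', '<', s)  # drop whitespace before a tag
>>> s
'<p>This is a test</p>'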

Source

2013-11-23 11:47:45 ndpu
