Error: can't use a string pattern on a bytes-like object when using findall

I am using Python 3.2.3 and applying a string pattern to an html object that holds the HTML fetched from a particular URL. I compile the pattern:

regex = '<title>(.+?)</title>'
pattern = re.compile(regex)

and then search with it:

titles = re.findall(pattern, html)
print(titles)

The html object is read from the response like this:

html = response.read()

I get the error "can't use a string pattern on a bytes-like object". I have tried using:

regex = b'<title>(.+?)</title>'

but that adds a "b" prefix to my results. Why? Thanks.


2014-02-27 Nikhil

What is 'html', and [why are you using regular expressions to parse HTML?](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454)

What is the html object? Try using 'str(html)'. What happens? – slezica

Which HTML parser for Python would you recommend, Ignacio? – Nikhil

urllib.request responses give you bytes, not Unicode strings. That is why the re pattern needs to be a bytes object as well, and why you get bytes results back.
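As a minimal sketch of the mismatch described above, the snippet below uses a hard-coded bytes literal as a stand-in for what response.read() returns. The sample HTML and the variable names are assumptions for illustration, not taken from the question. A str pattern raises the error, while a bytes pattern works but returns bytes:

```python
import re

# Assumed stand-in for response.read(): raw bytes, not str.
html_bytes = b"<html><head><title>Example Page</title></head></html>"

str_pattern = re.compile("<title>(.+?)</title>")     # str pattern
bytes_pattern = re.compile(b"<title>(.+?)</title>")  # bytes pattern

# Mixing a str pattern with bytes input raises TypeError.
try:
    str_pattern.findall(html_bytes)
    mismatch_error = None
except TypeError as exc:
    mismatch_error = exc

# A bytes pattern matches bytes input, and the matches are bytes too.
bytes_titles = bytes_pattern.findall(html_bytes)

print(mismatch_error)
print(bytes_titles)
```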

You can decode the response using the encoding the server gives you in the HTTP headers:

html = response.read() 
# no codec set? We default to UTF-8 instead, a reasonable assumption 
codec = response.info().get_param('charset', 'utf8') 
html = html.decode(codec)

Now you have Unicode and can use a Unicode (str) regular expression as well.
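To try the decoding step without a network connection, here is a hedged sketch that builds the headers by hand. The decode_body helper, the sample body, and the ISO-8859-1 charset are assumptions for illustration. email.message.Message stands in for the object returned by response.info(), which for HTTP responses is a subclass of it:

```python
import re
from email.message import Message


def decode_body(body, headers):
    """Decode raw bytes using the charset from Content-Type (hypothetical helper)."""
    # Fall back to UTF-8 when the header carries no charset parameter.
    codec = headers.get_param("charset", "utf8")
    return body.decode(codec)


# Assumed sample data: a header and body as a server might send them.
headers = Message()
headers["Content-Type"] = "text/html; charset=iso-8859-1"
body = "<title>Caf\u00e9 menu</title>".encode("iso-8859-1")

html = decode_body(body, headers)

# html is now str, so the original str pattern works unchanged.
titles = re.findall("<title>(.+?)</title>", html)
print(titles)
```

With the text decoded up front, the results come back as ordinary strings with no b prefix.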

If the server lies about the encoding, or no encoding is set and the UTF-8 default is also wrong, the code above can still raise a UnicodeDecodeError.
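One possible way to cope with that failure, which is not part of the answer above, is to catch the error and decode again with errors="replace", accepting that undecodable bytes become U+FFFD. The decode_with_fallback helper below is a hypothetical name, and whether lossy decoding is acceptable depends on what you do with the text:

```python
def decode_with_fallback(body, declared_codec="utf8"):
    """Decode bytes, substituting U+FFFD for undecodable bytes (hypothetical helper)."""
    try:
        return body.decode(declared_codec)
    except (UnicodeDecodeError, LookupError):
        # UnicodeDecodeError: the bytes do not match the declared codec.
        # LookupError: the declared codec name is unknown to Python.
        return body.decode("utf8", errors="replace")


# Assumed sample: ISO-8859-1 bytes wrongly treated as UTF-8.
mislabelled = "Caf\u00e9".encode("iso-8859-1")
recovered = decode_with_fallback(mislabelled, "utf8")
print(repr(recovered))
```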

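In any case, return values shown as b'...' are bytes objects: raw string data that has not yet been decoded to Unicode. They are nothing to worry about if you know the correct encoding for the data.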
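As a short illustration of that last point, assuming the data is UTF-8 (the sample input is made up), the b prefix is only how Python displays a bytes object, and decoding each match yields a plain string:

```python
import re

# Assumed sample input; a bytes pattern returns bytes matches.
raw_titles = re.findall(b"<title>(.+?)</title>", b"<title>Example Page</title>")

# Decode each match once you know the encoding (UTF-8 assumed here).
text_titles = [title.decode("utf8") for title in raw_titles]

print(raw_titles)
print(text_titles)
```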


2014-02-27 23:14:42

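This reflects the general rule for reading and writing string data: decode input to Unicode when you read it, and encode Unicode strings just before you write them. All text inside your program should be handled as Unicode. – holdenweb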
