2016-08-10 108 views

Python web scraping: meaning of the symbols

In the code below, what does each element of the pattern string in re.sub('<[^>]*>|[\n]|\[[0-9]*\]', '', htmlread) mean?

import urllib2 
import re 

htmltext = urllib2.urlopen("https://en.wikipedia.org/wiki/Linkin_Park") 
htmlread = htmltext.read() 
htmlread = re.sub('<[^>]*>|[\n]|\[[0-9]*\]', '', htmlread) 
regex = '(?<=Linkin Park was founded)(.*)(?=the following year.)' 
pattern = re.compile(regex) 
htmlread = re.findall(pattern, htmlread) 
print "Linkin Park was founded" + htmlread[0] + "the following year." 

See http://stackoverflow.com/questions/22937618/reference-what-does-this-regex-mean

Answers


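The line htmlread = re.sub('<[^>]*>|[\n]|\[[0-9]*\]', '', htmlread) removes any of the following from htmlread:

  • anything enclosed between < and > (that is, HTML tags), OR
  • newline characters, OR
  • digits enclosed in square brackets, or empty square brackets (for example [12] or [])

The | in the pattern separates these three alternatives, and the empty string '' passed as the second argument is what each match is replaced with.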
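To make the three alternatives concrete, here is a minimal sketch that applies each one separately and then the combined pattern. The sample string is made up for illustration (it is not the real Wikipedia page), and the patterns are written as raw strings (r'...'), which is an assumption on my part so the backslashes survive unchanged on Python 3; the pattern itself is the same as in the question.

```python
import re

# Hypothetical sample text standing in for the downloaded page source.
sample = "<p>Linkin Park[1] is a band.</p>\n<b>Formed</b> in 1996[].\n"

# Each alternative of the pattern, applied on its own:
# '<', then any run of characters that are not '>', then '>'
no_tags = re.sub(r'<[^>]*>', '', sample)
# a character class that contains only the newline character
no_newlines = re.sub(r'[\n]', '', sample)
# a literal '[', zero or more digits, then a literal ']'
no_refs = re.sub(r'\[[0-9]*\]', '', sample)

# '|' joins the three alternatives, so a single pass removes all of them.
cleaned = re.sub(r'<[^>]*>|[\n]|\[[0-9]*\]', '', sample)
print(cleaned)  # prints: Linkin Park is a band.Formed in 1996.
```

Note that removing the newline leaves "band." and "Formed" joined without a space, which is a side effect of deleting line breaks rather than replacing them with a space.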

There is a useful community wiki post on this here: Reference - What does this regex mean?


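It replaces every match of the pattern with "" (the empty string), which means the matched text is removed from the htmlread variable.

To understand the individual symbols, please read more about regular expressions.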
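The effect of the empty replacement string can be seen in a small sketch. The input text below is hypothetical, and re.subn is used only to show how many matches were replaced; it is not part of the code in the question.

```python
import re

text = "a[1]b[22]c[]"  # hypothetical input, not taken from the page

# Empty replacement: every match is deleted.
deleted = re.sub(r'\[[0-9]*\]', '', text)

# Non-empty replacement: every match is swapped for '#', for comparison.
marked = re.sub(r'\[[0-9]*\]', '#', text)

# re.subn returns the new string together with the number of replacements.
result, count = re.subn(r'\[[0-9]*\]', '', text)

print(deleted)  # prints: abc
print(marked)   # prints: a#b#c#
print(count)    # prints: 3
```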