How can I extract information from an HTML document using Python?

I need to extract some data from an HTML page with Python.

This is the code I have so far:

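import urllib  # Python 2; in Python 3, urlopen lives in urllib.request

# Fetch the match page and read its raw HTML
recent = urllib.urlopen("http://gamebattles.majorleaguegaming.com/ps4/call-of-duty-ghosts/team/TeamCrYpToNGamingEU/match?id=46057240")
recentsource = recent.read()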

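I now need to take this source and print a list of the gamertags of the players in the other team's table on that page.

How would I do that?

Thanks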


2014-09-28 crossboy007

Use BeautifulSoup: http://www.crummy.com/software/BeautifulSoup/ – 2014-09-28 02:01:43

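Take a look at the Beautiful Soup module; it is a wonderful text parser for HTML.

If you don't want to install it, or can't, you can download the source code and put the .py files in the same directory as your program.

To do this, download and extract the code from the website, then copy the "bs4" directory into the same folder as your Python script.

Then put this at the top of your code: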

from bs4 import BeautifulSoup 
# or 
from bs4 import BeautifulSoup as bs 
# To type bs instead of BeautifulSoup every single time you use it

You can learn how to use it from other Stack Overflow questions, or by reading the documentation.
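As a rough sketch of what this might look like for a page like the one in the question: the sample markup below is made up, and it assumes each team's table follows an h3 heading ending in "Match Players" and that every player row has five cells, with the username, role and network ID in the last three. That layout, the helper name extract_players and the dictionary keys are assumptions for illustration, so check them against the real page source.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup; the real page layout may differ.
sample_html = """
<h3>Team A Match Players</h3>
<table>
  <tr><th>#</th><th>Rank</th><th>Username</th><th>Role</th><th>Network ID</th></tr>
  <tr><td>1</td><td>-</td><td>alice</td><td><b>Leader</b></td><td>alice_psn</td></tr>
  <tr><td>2</td><td>-</td><td>bob</td><td>Member</td><td>bob_psn</td></tr>
</table>
"""


def extract_players(html):
    """Return {team name: [player dicts]} for every 'Match Players' table."""
    soup = BeautifulSoup(html, "html.parser")
    teams = {}
    for heading in soup.find_all("h3"):
        title = heading.get_text(strip=True)
        if not title.endswith("Match Players"):
            continue
        team_name = title[: -len("Match Players")].strip()
        # Assumed: the player table is the first <table> after the heading.
        table = heading.find_next("table")
        if table is None:
            continue
        players = []
        for row in table.find_all("tr"):
            cells = [td.get_text(strip=True) for td in row.find_all("td")]
            # Header rows contain <th> cells only, so they yield no <td> text.
            if len(cells) == 5:
                players.append({
                    "username": cells[2],
                    "role": cells[3],
                    "network_id": cells[4],
                })
        teams[team_name] = players
    return teams


for team, players in extract_players(sample_html).items():
    print("Team: %s" % team)
    for p in players:
        print("- %s: %s (%s)" % (p["role"], p["username"], p["network_id"]))
```

To run it against the real page, pass the downloaded source (recentsource in the question) to extract_players instead of sample_html.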


2014-09-28 02:06:25 Electron

You can use html2text for this job, or you can use nltk.

Sample code:

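# Python 2 with NLTK 2.x; NLTK 3 no longer implements nltk.clean_html()
import nltk
from urllib import urlopen

url = "http://any-url"
html = urlopen(url).read()
raw = nltk.clean_html(html)  # strip the markup, leaving plain text

print(raw)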
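If you only need the visible text and would rather not depend on a third-party package, a minimal sketch using the standard library's html.parser could look like this. The class name TextExtractor, the helper html_to_text and the sample markup are made up for illustration, and the approach is deliberately simple: it keeps text nodes and skips the contents of script and style elements.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text found outside script/style elements.
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def html_to_text(html):
    """Return the text content of an HTML string, one chunk per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


# Hypothetical sample markup
sample = (
    "<html><head><style>p {color: red}</style></head>"
    "<body><h3>Team A</h3><p>Leader: &amp; Captain</p></body></html>"
)
print(html_to_text(sample))
```

This gives plain text only; to pick out specific table cells you would still need a structure-aware parser.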


2014-09-28 02:14:20

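pyparsing has some useful constructs for pulling data out of HTML pages, and the results tend to be self-structuring and self-naming (provided you set up the parser/scanner correctly). Here is a pyparsing solution for this particular web page: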

from pyparsing import * 

# for stripping HTML tags 
anyTag,anyClose = makeHTMLTags(Word(alphas,alphanums+":_")) 
commonHTMLEntity.setParseAction(replaceHTMLEntity) 
stripHTML = lambda tokens: (commonHTMLEntity | Suppress(anyTag | anyClose)).transformString(''.join(tokens))    

# make pyparsing expressions for HTML opening and closing tags 
# (suppress all from results, as there is no interesting content in the tags or their attributes) 
h3,h3End = map(Suppress,makeHTMLTags("h3")) 
table,tableEnd = map(Suppress,makeHTMLTags("table")) 
tr,trEnd = map(Suppress,makeHTMLTags("tr")) 
th,thEnd = map(Suppress,makeHTMLTags("th")) 
td,tdEnd = map(Suppress,makeHTMLTags("td")) 

# nothing interesting in column headings - parse them, but suppress the results 
colHeading = Suppress(th + SkipTo(thEnd) + thEnd) 

# simple routine for defining data cells, with optional results name 
colData = lambda name='' : td + SkipTo(tdEnd)(name) + tdEnd 

playerListing = Group(tr + colData() + colData() +
                      colData("username") +
                      colData().setParseAction(stripHTML)("role") +
                      colData("networkID") +
                      trEnd)

teamListing = (h3 + ungroup(SkipTo("Match Players" + h3End, failOn=h3))("name") + "Match Players" + h3End +
               table + tr + colHeading*5 + trEnd +
               Group(OneOrMore(playerListing))("players"))



for team in teamListing.searchString(recentsource):
    # use this to print out names and structures of results
    # print(team.dump())
    print("Team: %s" % team.name)
    for player in team.players:
        print("- %s: %s (%s)" % (player.role, player.username, player.networkID))
        # or like this
        # print("- %(role)s: %(username)s (%(networkID)s)" % player)
    print("")

Prints:

Team: Team CrYpToN Gaming EU 
- Leader: CrYpToN_Crossy (CrYpToN_Crossy) 
- Captain: Juddanorty (CrYpToN_Judd) 
- Member: BLaZe_Elfy (CrYpToN_Elfy) 

Team: eXCeL™ 
- Leader: Caaahil (Caaahil) 
- Member: eSportsmanship (eSportsmanship) 
- Member: KillBoy-NL (iClown-x)


2014-09-28 03:45:03 PaulMcG
