2011-08-06 141 views

Parsing HTML tables

To start, here is my current code in full:

import urllib 
from BeautifulSoup import BeautifulSoup 
import sgmllib 
import re 

page = ('http://www.sec.gov/Archives/edgar/data/'
        '8177/000114036111018563/form10k.htm')

sock = urllib.urlopen(page) 
raw = sock.read() 
soup = BeautifulSoup(raw) 

tablelist = soup.findAll('table') 

class MyParser(sgmllib.SGMLParser):

    def parse(self, segment):
        self.feed(segment)
        self.close()

    def __init__(self, verbose=0):
        sgmllib.SGMLParser.__init__(self, verbose)
        self.descriptions = []
        self.inside_td_element = 0
        self.starting_description = 0

    def start_td(self, attributes):
        for name, value in attributes:
            if name == "valign":
                self.inside_td_element = 1
                self.starting_description = 1
            else:
                self.inside_td_element = 1
                self.starting_description = 1

    def end_td(self):
        self.inside_td_element = 0

    def handle_data(self, data):
        if self.inside_td_element:
            if self.starting_description:
                self.descriptions.append(data)
                self.starting_description = 0
            else:
                self.descriptions[-1] += data

    def get_descriptions(self):
        return self.descriptions

counter = 0 
trlist = [] 
dtablelist = [] 

while counter < len(tablelist): 
    trsegment = tablelist[counter].findAll('tr') 
    trlist.append(trsegment) 
    strsegment = str(trsegment) 
    myparser = MyParser() 
    myparser.parse(strsegment) 
    sub = myparser.get_descriptions() 
    dtablelist.append(sub) 
    counter = counter + 1 

ex = [] 

dtablelist = [s for s in dtablelist if s != ex] 

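What I'm trying to accomplish is to take all the tables from the HTML document and reprint them into an Excel spreadsheet. When I create trlist, the output looks like this: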

print trlist[1] 
[<tr> 
<td valign="top" width="25%"><font style="DISPLAY: inline; FONT-WEIGHT: bold; FONT-SIZE: 10pt; FONT-FAMILY: times new roman; TEXT-DECORATION: underline">&#160;</font></td>
<td valign="top" width="25%"> 
<div style="DISPLAY: block; MARGIN-LEFT: 0pt; TEXT-INDENT: 0pt; MARGIN-RIGHT: 0pt" align="center"><font style="DISPLAY: inline; FONT-WEIGHT: bold; FONT-SIZE: 10pt; FONT-FAMILY: times new roman; TEXT-DECORATION: underline">Title of each class</font></div> 
</td> 
<td valign="top" width="25%" style="TEXT-ALIGN: center"><font style="DISPLAY: inline; FONT-WEIGHT: bold; FONT-SIZE: 10pt; FONT-FAMILY: times new roman; TEXT-DECORATION: underline">Name of exchange</font></td> 
<td valign="top" width="25%" style="TEXT-ALIGN: center"><font style="DISPLAY: inline; FONT-WEIGHT: bold; FONT-SIZE: 10pt; FONT-FAMILY: times new roman; TEXT-DECORATION: underline">&#160;</font></td> 
</tr>, <tr> 
<td valign="top" width="25%"><font style="DISPLAY: inline; FONT-SIZE: 10pt; FONT-FAMILY: times new roman">&#160;</font></td> 
<td valign="top" width="25%"> 
<div style="DISPLAY: block; MARGIN-LEFT: 0pt; TEXT-INDENT: 0pt; MARGIN-RIGHT: 0pt" align="center"><font style="DISPLAY: inline; FONT-SIZE: 10pt; FONT-FAMILY: times new roman"><font style="DISPLAY: inline; FONT-WEIGHT: bold">Common Stock, par value</font> </font></div> 
</td> 
<td valign="top" width="25%"> 
<div style="DISPLAY: block; MARGIN-LEFT: 0pt; TEXT-INDENT: 0pt; MARGIN-RIGHT: 0pt" align="center"> 
<div style="DISPLAY: block; MARGIN-LEFT: 0pt; TEXT-INDENT: 0pt; MARGIN-RIGHT: 0pt" align="center"><font style="DISPLAY: inline; FONT-WEIGHT: bold; FONT-SIZE: 10pt; FONT-FAMILY: times new roman"><font style="FONT-WEIGHT: bold"><font style="FONT-WEIGHT: bold"><font style="FONT-WEIGHT: bold">NASDAQ Global Market</font></font></font></font></div>
</div> 
</td> 
<td valign="top" width="25%"><font style="DISPLAY: inline; FONT-WEIGHT: bold; FONT-SIZE: 10pt; FONT-FAMILY: times new roman">&#160;</font></td> 
</tr>,... 

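As you can see, trlist keeps each individual table row (<tr>) as its own item, which is what I want. But when I run each trlist item through my sgmllib parser to retrieve the content between the tags, I get this output: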

print dtablelist[1] 
['\nTitle of each class\n', 'Name of exchange', '\nCommon Stock, par value\n', '\n\nNASDAQ Global Market\n\n', '\n$1.00 per share\n'] 

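As you can see, the output has each cell's content as its own individual string, rather than a list of the contents of each table row (<tr>). So basically the output I want is: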

[['\nTitle of each class\n', 'Name of exchange'], ['\nCommon Stock, par value\n', '\n\nNASDAQ Global Market\n\n'], ['\n$1.00 per share\n']] 

Is this because I have to convert trlist to a string before I parse it with MyParser? Does anyone know a workaround that would let me parse a list within a list?


Why are you using two different parsers instead of using BeautifulSoup for the whole thing? (And why are you importing BeautifulSoup twice?) – kindall


Importing BeautifulSoup twice was a mistake. Also, I'm using sgmllib to parse because when I do trsegment = tablelist[counter].findAll('tr'), it returns list-type output rather than Tag or BeautifulSoup-type output. – kr21

Answers

2

Use lxml.html:

>>> import lxml.html 
>>> data = ["<tr><td>test</td><td>help</td></tr>", "<tr><td>data1</td><td>data2</td></tr>"] 
>>> [lxml.html.fromstring(tr).xpath(".//text()") for tr in data] 
[['test', 'help'], ['data1', 'data2']] 

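Here is some more complete code. It stores the text in a list that contains one list per table; each table holds a list of tr entries, and each tr holds a list of all its text.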

import urllib 
import lxml.html 

data = urllib.urlopen('http://www.sec.gov/Archives/edgar/data/8177/000114036111018563/form10k.htm').read() 
tree = lxml.html.fromstring(data) 

tables = [] 
for tbl in tree.iterfind('.//table'): 
    tele = [] 
    tables.append(tele) 
    for tr in tbl.iterfind('.//tr'):
        text = [e.strip() for e in tr.xpath('.//text()') if len(e.strip()) > 0]
        tele.append(text)

print tables 
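
Since the stated goal is to get the tables into an Excel spreadsheet, here is a minimal sketch of one way to dump a nested tables list like the one above into CSV, which Excel can open. It is written for Python 3 and uses only the standard library; the function name tables_to_csv and the sample data are assumptions made for illustration, not part of the original code.

```python
import csv
import io


def tables_to_csv(tables, out):
    """Write every table to the file-like object `out`.

    `tables` is assumed to be a list of tables, each a list of rows,
    each row a list of cell strings. Tables are separated by a blank row.
    """
    writer = csv.writer(out)
    for index, table in enumerate(tables):
        if index > 0:
            writer.writerow([])  # blank row between consecutive tables
        for row in table:
            writer.writerow(row)


# Hypothetical sample shaped like the desired output in the question.
sample = [
    [["Title of each class", "Name of exchange"],
     ["Common Stock, par value", "NASDAQ Global Market"]],
    [["$1.00 per share"]],
]

buffer = io.StringIO()
tables_to_csv(sample, buffer)
print(buffer.getvalue())
```

To write a real file instead of an in-memory buffer, open it with newline='' as the csv module documentation recommends, for example open('tables.csv', 'w', newline='').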

Hope this helps, cheers!


Yes, this is exactly what I was looking for, thanks a lot! – kr21

1

In case someone is searching for a solution to the same problem but is using Python 3:

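You don't have to use an external library, even if you are using Python 3. For parsing HTML tables, the SGMLParser class has been replaced by HTMLParser from html.parser. I have written a simple class derived from HTMLParser; it is available in a GitHub repo. It only keeps track of the scope of the current <td>, <tr>, and <table> tags. Its advantage over using etree is that it works correctly on HTML that is not XML-compliant, and it does not use any external libraries.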
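
To illustrate the idea of remembering the current table, row, and cell scope, here is a rough sketch of what such a parser could look like. This is not the code from the repo; the class name TableTextParser and its attributes are hypothetical, and the sketch assumes tables are not nested inside each other.

```python
from html.parser import HTMLParser


class TableTextParser(HTMLParser):
    """Hypothetical sketch: collect cell text as tables -> rows -> cells.

    Assumption: tables are not nested. A nested table would have its
    rows mixed up with those of the enclosing table.
    """

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.tables = []
        self._row = None   # cells of the <tr> being read, or None
        self._cell = None  # text pieces of the <td>/<th> being read, or None

    def _close_cell(self):
        if self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None

    def _close_row(self):
        self._close_cell()
        if self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._close_row()  # tolerate a missing </tr>
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._close_cell()  # tolerate a missing </td>
            self._cell = []

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._close_cell()
        elif tag in ("tr", "table"):
            self._close_row()

    def handle_data(self, data):
        # Text inside nested tags such as <font> or <div> still lands
        # in the current cell, because only table tags change the scope.
        if self._cell is not None:
            self._cell.append(data)


sample = (
    "<table>"
    "<tr><td><div>Title of each class</div></td><td>Name of exchange</td></tr>"
    "<tr><td>\nCommon Stock, par value\n</td><td><b>NASDAQ</b> Global Market</td></tr>"
    "</table>"
    "<table><tr><td>$1.00 per share</td></tr></table>"
)

parser = TableTextParser()
parser.feed(sample)
parser.close()
print(parser.tables)
```

Because the row list is only appended to the table when the row closes, each <tr> ends up as its own list, which is the grouping the question asks for.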

You can use the class (named HTMLTableParser here) as follows:

import urllib.request 
from html_table_parser import HTMLTableParser 

target = 'http://www.twitter.com' 

# get website content 
req = urllib.request.Request(url=target) 
f = urllib.request.urlopen(req) 
xhtml = f.read().decode('utf-8') 

# instantiate the parser and feed it 
p = HTMLTableParser() 
p.feed(xhtml) 
print(p.tables) 

The output of this is a list of 2D lists representing the tables. It might look like this:

[[[' ', ' Anmelden ']], 
[['Land', 'Code', 'Für Kunden von'], 
    ['Vereinigte Staaten', '40404', '(beliebig)'], 
    ['Kanada', '21212', '(beliebig)'], 
    ... 
    ['3424486444', 'Vodafone'], 
    [' Zeige SMS-Kurzwahlen für andere Länder ']]]
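
The rows in output like this can have different lengths and padded cell text. If a rectangular grid is needed, for example before writing to a spreadsheet, a small helper along these lines might do. The name normalize_table and the fill parameter are assumptions for illustration, and the sample only reuses rows visible in the output above.

```python
def normalize_table(table, fill=""):
    """Strip each cell and pad short rows to the width of the widest row."""
    cleaned = [[cell.strip() for cell in row] for row in table]
    width = max((len(row) for row in cleaned), default=0)
    return [row + [fill] * (width - len(row)) for row in cleaned]


sample = [
    ["Land", "Code", "Für Kunden von"],
    ["Kanada", "21212", "(beliebig)"],
    ["3424486444", "Vodafone"],
    [" Zeige SMS-Kurzwahlen für andere Länder "],
]

rect = normalize_table(sample)
for row in rect:
    print(row)
```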