Python parsing help

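Can someone help me with parsing? I'm having a lot of trouble. I'm parsing information from this site.

Below are a few lines of code that extract data from a table with 2 titles and 4 values: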

for x in soup.findAll(attrs={'valign':'top'}):
    print(x.contents)
    make_list = x.contents
    print(make_list[1]) #trying to select one of the values on the list.

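When I try to print with the make_list[1] line, I get an error. If I take out the last 2 lines, I get the HTML I want in list format, but I can't seem to separate the individual items or filter them (strip out the HTML tags). Can anyone help?

Here is a sample of the output that I want to narrow down. I don't know the right regular expression: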

['\n', <td align="left">Western Mutual/Residence <a href="http://interactive.web.insurance.ca.gov/companyprofile/companyprofile?event=companyProfile&amp;doFunction=getCompanyProfile&amp;eid=3303"><small>(Info)</small></a></td>, '\n', <td align="left"><div align="right">           355</div></td>, '\n', <td align="left"><div align="right">250</div></td>, '\n', <td align="left"> </td>, '\n', <td align="left">Western Mutual/Residence <a href="http://interactive.web.insurance.ca.gov/companyprofile/companyprofile?event=companyProfile&amp;doFunction=getCompanyProfile&amp;eid=3303"><small>(Info)</small></a></td>, '\n', <td align="left"><div align="right">           320</div></td>, '\n', <td align="left"><div align="right">500</div></td>, '\n']

Source

2015-09-11 Kenny Truong

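What is the expected output? – The6thSense

"It gets an error." What is the error? – Kevin

@Kevin IndexError: list index out of range –

If you are trying to parse the results from that site, the following should work: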

from bs4 import BeautifulSoup

html_doc = "....add your html...."  # placeholder: put the page's HTML here
soup = BeautifulSoup(html_doc, 'html.parser')
rows = []
tables = soup.find_all('table')
t1 = t2 = None

# Find the second from last table
for t3 in tables:
    t1, t2 = t2, t3

# Collect the stripped text of every cell, row by row
for row in t1.find_all('tr'):
    cols = row.find_all(['td', 'th'])
    cols = [col.text.strip() for col in cols]
    rows.append(cols)

# Collate the two columns
data = [cols[0:3] for cols in rows]
data.extend([cols[4:7] for cols in rows[1:]])

for row in data:
    print("{:40} {:15} {}".format(row[0], row[1], row[2]))

This gives me output that looks like:

Company Name        Annual Premium Deductible 
AAA (Interinsurance Exchange) (Info)  N/A    250 
Allstate (Info)       315    250 
American Modern (Info)     N/A    250 
Amica Mutual (Info)      259    250 
Bankers Standard (Info)     N/A    250 
California Capital (Info)    160    250 
Century National (Info)     N/A    250 
.....

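How does it work?

The page mainly displays a table, so that is what we need to find. The first step is to get a list of the tables.

The site uses tables for several parts of the page. The structure of the page will probably stay the same between requests, at least.

The table we need is almost the last one on the page (but not the last), so I decided to iterate over the available tables and pick the second from last. t1, t2 and t3 are just a workaround to keep hold of the most recent values during the iteration.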
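The rotation can be tried out on a plain list. This is only a sketch: the list of strings is a hypothetical stand-in for whatever soup.find_all('table') returns, and second_from_last is a made-up helper name. When there are at least two items, negative indexing (tables[-2]) picks the same element.

```python
# Hypothetical stand-in for the list returned by soup.find_all('table').
tables = ["header table", "search form", "results table", "footer table"]


def second_from_last(items):
    """Return the second-from-last item using the same rotation as above."""
    previous = None  # plays the role of t1
    current = None   # plays the role of t2
    for item in items:  # item plays the role of t3
        previous, current = current, item
    return previous  # stays None when there are fewer than two items


picked = second_from_last(tables)
print(picked)
print(picked == tables[-2])
```

The rotation returns None instead of raising when the page has fewer than two tables, whereas tables[-2] would raise an IndexError in that case.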

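From here, HTML tables usually have a fairly standard structure of TR and TD elements. This one also uses TH for the header row. With the table in hand, BeautifulSoup lets us enumerate all of its rows.

For each row, we can then find all of the columns. If you print what comes back, you will see every entry in each row, and from that you can work out which indexes to use when slicing it.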
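To see which indexes are worth slicing, it can help to print every cell next to its position. The sketch below assumes a small made-up table shaped like the site's layout (two column groups with an empty cell between them); the HTML and the second group's numbers are hypothetical, not taken from the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical table: two column groups separated by an empty spacer cell.
html_doc = """
<table>
  <tr><th>Company Name</th><th>Annual Premium</th><th>Deductible</th><th></th>
      <th>Company Name</th><th>Annual Premium</th><th>Deductible</th></tr>
  <tr valign="top">
      <td>Allstate <a href="#"><small>(Info)</small></a></td>
      <td><div align="right">   315</div></td>
      <td><div align="right">250</div></td>
      <td> </td>
      <td>Allstate <a href="#"><small>(Info)</small></a></td>
      <td><div align="right">   290</div></td>
      <td><div align="right">500</div></td></tr>
</table>
"""

soup = BeautifulSoup(html_doc, "html.parser")
rows = []
for row in soup.find_all("tr"):
    # Passing a list matches both <td> and <th>; get_text() drops the tags.
    cells = [cell.get_text().strip() for cell in row.find_all(["td", "th"])]
    rows.append(cells)

# Print each cell with its index to see where the spacer column sits.
for cells in rows:
    for index, text in enumerate(cells):
        print(index, repr(text))
```

Searching for td and th inside each tr, rather than indexing into row.contents, skips the '\n' strings that sit between the tags.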

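The site lays the output out in two groups of columns with a blank column in between. I build two lists, one for each group of columns, and then append the second group to the bottom of the first for display.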
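The slicing and the format widths can be checked without any HTML. A minimal sketch, assuming already-parsed rows of 7 cells with an empty spacer at index 3; the row values are hypothetical sample data.

```python
# Hypothetical parsed rows: 7 cells each, with an empty spacer at index 3.
rows = [
    ["Company Name", "Annual Premium", "Deductible", "",
     "Company Name", "Annual Premium", "Deductible"],
    ["Allstate (Info)", "315", "250", "", "Allstate (Info)", "290", "500"],
    ["Amica Mutual (Info)", "259", "250", "", "Amica Mutual (Info)", "240", "500"],
]

# Left group: cells 0, 1 and 2 of every row, header row included.
data = [cells[0:3] for cells in rows]
# Right group: cells 4, 5 and 6, skipping the header so it is not repeated.
data.extend(cells[4:7] for cells in rows[1:])

for name, premium, deductible in data:
    # {:40} and {:15} pad the first two fields to 40 and 15 characters.
    print("{:40} {:15} {}".format(name, premium, deductible))
```

The slice bounds 0:3 and 4:7 come straight from the cell positions, and 40 and 15 are just column widths chosen so the longest names and headings line up.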

Source

2015-09-11 12:33:41

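OMG thank you, I will try this... this is completely different from what I had in mind... but I don't understand how you found t1, t2, t3 in the web page? How did you do that? How do I find those so I can recognize tables in the future? Thank you, I will try this and let you know how it works :) –

How did you know to look specifically for 'td' and 'th'? What I've been doing is right-clicking, choosing Inspect Element, and trying to read and understand that code lol. –

Also, how did you get the numbers 0, 3, 4, 7, 40, 15? Haha, sorry to bother you by asking... but thank you too! –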
