Rearranging parsed HTML data with Python

I have very little programming experience, so please forgive my ignorance.

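I'm trying to parse the 'Key Statistics' page from Yahoo! Finance, specifically this page. I've been playing around with BeautifulSoup and was able to extract the data I want, but then I hit a mental block. I would like the data to be displayed like this: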

measure[i]: value[i] 
. 
. 
measure[n]: value[n]

but the result my script produces is:

measure[i] 
. 
.  
measure[n] 
value[i] 
. 
. 
value[n]

Here is my attempt at joining the two data fields, which raises an error:

measure = soup.findAll('td', {'class':'yfnc_tablehead1'}, width='74%') 
value = soup.findAll('td', {'class':'yfnc_tabledata1'}) 

for incident in measure: 
    x = incident.contents 

for incident2 in value: 
    y = incident2.contents 

data = x + y 

print ': '.join(data)

I would also like to remove unwanted characters from these values, but I'll read the re.compile and re.sub documentation for that.

Thanks for any input.

Source

2012-02-14 user1205632

data = x + y

The + operator concatenates the two lists. If you want to pair up the corresponding items of the lists, try the zip() function:

data = zip(x,y) 
for m,v in data: 
    print m,v
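
To make the difference concrete, here is a minimal sketch with made-up lists standing in for the scraped cells (the names and values are illustrative, not taken from the actual page). It is written for Python 3, where print is a function and zip returns an iterator, hence the list() call.

```python
# Hypothetical stand-ins for the scraped measure and value cells.
measures = ['Market Cap', 'Trailing P/E']
values = ['10B', '15.2']

# + concatenates: all measures come first, then all values.
concatenated = measures + values

# zip pairs items by position: (measures[0], values[0]), and so on.
paired = list(zip(measures, values))

for m, v in paired:
    print(m, v)
```

With +, the result is one flat list of four strings, which is why the original output listed every measure before any value.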

Also,

for incident in measure: 
    x = incident.contents

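This overwrites x on every iteration of the loop, so at the end x holds only the last value assigned rather than the collection of all of them. Here you probably do want to use the + operator, like this: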

x = []  # start from an empty list so that += has something to extend
for incident in measure: 
    x += incident.contents # x += y is the same as x = x + y

Of course, the same applies to the other loop.
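
Putting the accumulation and the pairing together, the whole thing might look roughly like the sketch below. Plain nested lists stand in for what incident.contents would return from BeautifulSoup, the sample strings are invented, and Python 3 syntax is assumed.

```python
# Each inner list mimics the .contents of one <td> cell (hypothetical data).
measure_cells = [['Market Cap'], ['Enterprise Value'], ['Trailing P/E']]
value_cells = [['10B'], ['12B'], ['15.2']]

x = []
for cell in measure_cells:
    x += cell  # extend x on each pass instead of overwriting it

y = []
for cell in value_cells:
    y += cell

# Pair each measure with the value at the same position and format it.
lines = [': '.join(pair) for pair in zip(x, y)]
for line in lines:
    print(line)
```

Note that ': '.join(pair) only works when every item is a string; real .contents lists can also hold tag objects, which would need converting first.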

Source

2012-02-14 23:50:46 yurib

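Thank you for your help. Your approach actually removed the need for me to go in and strip the unwanted tags, but as you mentioned, only the last value is displayed. What (efficient) method would you suggest to replace the 'for' loops I implemented, so that the whole collection of values is displayed? – user1205632 2012-02-15 04:33:43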

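Please disregard that comment! I removed the for loops and implemented your suggestion. Now I just need to use BeautifulSoup to clean up the unwanted tags. – user1205632 2012-02-15 04:43:34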

measures = ['1', '2', '3', '4'] 
values = ['a', 'b', 'c', 'd'] 

for pair in zip(measures, values): 
    print ': '.join(pair) 

# 1: a 
# 2: b 
# 3: c 
# 4: d

About zip:

Type:  builtin_function_or_method 
Base Class: <type 'builtin_function_or_method'> 
String Form:<built-in function zip> 
Namespace: Python builtin 
Docstring: 
zip(seq1 [, seq2 [...]]) -> [(seq1[0], seq2[0] ...), (...)] 

Return a list of tuples, where each tuple contains the i-th element 
from each of the argument sequences. The returned list is truncated 
in length to the length of the shortest argument sequence.
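
The truncation mentioned in the docstring matters if the two scraped lists ever differ in length, because the unmatched items are dropped silently. A small sketch with invented data, assuming Python 3 (where zip returns an iterator rather than a list, and itertools.zip_longest can pad the shorter side instead):

```python
from itertools import zip_longest

measures = ['1', '2', '3']
values = ['a', 'b']  # one value is missing

# zip stops at the shorter list, so measure '3' disappears.
truncated = list(zip(measures, values))

# zip_longest keeps every item and fills the gaps with a placeholder.
padded = list(zip_longest(measures, values, fillvalue='N/A'))

for pair in padded:
    print(': '.join(pair))
```

Comparing len(measures) with len(values) before zipping is a cheap way to notice a mismatch early.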

Source

2012-02-14 23:53:38
