2014-09-25 53 views
0

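Parsing with BeautifulSoup and getting the result in a specific format

I'm a beginner. I've started developing with BeautifulSoup and Python, and I'd like to get the result as plain text, without any HTML tags or other non-text elements.

This is what I'm doing in Python: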

#!/usr/bin/env python 

import urllib2 
from bs4 import BeautifulSoup 

html_content = urllib2.urlopen("http://www.demo.com/index.php") 

soup = BeautifulSoup(html_content, "lxml") 

# COMMENTS COUNT 
count_comment = soup.find("span", "sidebar-comment__label") 
count_comment 
count_comment_final = count_comment.find_next("meta") 


# READ COUNT 
count_read = soup.find("span", "sidebar-read__label js-read") 
count_read 
count_read_final = count_read.find_next("meta") 

# PRINT RESULT 
print count_comment_final 
print count_read_final 

My HTML looks like this:

<div class="box"> 
     <span class="sidebar-comment__label">Comments</span> 
     <meta itemprop="interactionCount" content="Comments:115"> 
</div> 


<div class="box"> 
     <span class="sidebar-read__label js-read">Read</span> 
     <meta itemprop="interactionCount" content="Read:10"> 
</div> 

And I get this:

<meta content="Comments:115" itemprop="interactionCount"/> 
<meta content="Read:10" itemprop="interactionCount"/> 

I would like to get this:

You've 115 comments 
You've 10 read 

First, is this possible?

Second, is my code any good?

Third, can you help me? ;-)

Answers

1

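As the output clearly shows, count_comment_final and count_read_final are tags. You need to extract the value of the content attribute from both tags. This is done with count_comment_final['content'], which gives Comments:115; then use split(':') to separate the label from the number.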

#!/usr/bin/env python 

import urllib2 
from bs4 import BeautifulSoup 

html_content = urllib2.urlopen("http://www.demo.com/index.php") 

soup = BeautifulSoup(html_content, "lxml") 

# COMMENTS COUNT: locate the label span, then the <meta> that follows it
count_comment = soup.find("span", "sidebar-comment__label")
count_comment_final = count_comment.find_next("meta")

# READ COUNT: same approach for the read counter
count_read = soup.find("span", "sidebar-read__label js-read")
count_read_final = count_read.find_next("meta")

# PRINT RESULT 
print count_comment_final['content'].split(':')[1] 
print count_read_final['content'].split(':')[1] 
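
To get the exact sentences asked for in the question, the extracted number can be dropped into a format string. The following is only a sketch under a few assumptions: it is written for Python 3, it inlines the sample HTML from the question instead of downloading it, it uses the built-in "html.parser" rather than "lxml" so that no extra parser is needed, and interaction_count is a hypothetical helper name that does not appear in the original code.

```python
from bs4 import BeautifulSoup

# Sample markup copied from the question, inlined so the sketch runs offline.
HTML = """
<div class="box">
    <span class="sidebar-comment__label">Comments</span>
    <meta itemprop="interactionCount" content="Comments:115">
</div>
<div class="box">
    <span class="sidebar-read__label js-read">Read</span>
    <meta itemprop="interactionCount" content="Read:10">
</div>
"""


def interaction_count(soup, label_class):
    """Hypothetical helper: number stored in the <meta> after the labelled <span>."""
    label = soup.find("span", class_=label_class)
    meta = label.find_next("meta")
    # content looks like "Comments:115"; keep the part after the first colon.
    return int(meta["content"].split(":", 1)[1])


soup = BeautifulSoup(HTML, "html.parser")
comments = interaction_count(soup, "sidebar-comment__label")
reads = interaction_count(soup, "sidebar-read__label")

print("You've {} comments".format(comments))
print("You've {} read".format(reads))
```

Converting to int is optional; it simply makes the sketch fail loudly if the part after the colon is not a number.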
+0

Almost there: it displays "Comments" and "Read" instead of "115" and "10". – TwinyTwice 2014-09-25 05:18:41

+0

Use 'split(':')[1]'. Sorry – nu11p01n73R 2014-09-25 05:20:17
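
The index matters because split(':') returns a list with the label first and the number second. A small Python 3 illustration, using the attribute value from the question as a plain string (the variable names are just for this example):

```python
content = "Comments:115"  # the attribute value shown in the question

parts = content.split(":")
label = parts[0]  # the text before the colon
count = parts[1]  # the text after the colon

print(label)
print(count)
```

So index 0 yields the label that the first comment complains about, and index 1 yields the number.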

1

count_comment_final and count_read_final are tag elements. To strip off the Comments: prefix, first read the content attribute:

count_comment_final.get('content') 

This gives output like this:

'Comments:115' 

So you can get the comment count like this:

count_comment_final.get('content').split(':')[1] 

The same applies to count_read_final:

count_read_final.get('content').split(':')[1]
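
One practical difference between the two answers: tag.get('content') returns None when the attribute is absent, whereas tag['content'] raises KeyError. The sketch below illustrates this in Python 3 with "html.parser"; the attribute name data-missing and the helper count_from are hypothetical and exist only for this example.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<meta itemprop="interactionCount" content="Read:10">', "html.parser"
)
meta = soup.find("meta")

present = meta.get("content")        # the attribute exists
missing = meta.get("data-missing")   # hypothetical attribute, not in the markup

# Square-bracket access raises KeyError for an absent attribute.
try:
    meta["data-missing"]
    raised = False
except KeyError:
    raised = True


def count_from(tag, default=0):
    """Hypothetical helper: number after the colon, or default if unavailable."""
    value = tag.get("content")
    if value is None or ":" not in value:
        return default
    number = value.partition(":")[2]
    return int(number) if number.isdigit() else default


print(present, missing, raised, count_from(meta))
```

Which form to prefer is a judgment call: get() suits pages where the attribute may be absent, while square brackets surface unexpected markup immediately.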