2014-10-31 34 views
0

Simple web scraper formatting: how can I fix this? I have the following code:

import requests
from bs4 import BeautifulSoup


def posts_spider():
    url = 'http://www.reddit.com/r/nosleep/new/'
    source_code = requests.get(url)
    plain_text = source_code.text
    soup = BeautifulSoup(plain_text)
    for link in soup.findAll('a', {'class': 'title'}):
        href = "http://www.reddit.com" + link.get('href')
        title = link.string
        print(title)
        print(href)
        print("\n")


def get_single_item_data():
    item_url = 'http://www.reddit.com/r/nosleep/new/'
    source_code = requests.get(item_url)
    plain_text = source_code.text
    soup = BeautifulSoup(plain_text)
    for rating in soup.findAll('div', {'class': 'score unvoted'}):
        print(rating.string)


posts_spider()
get_single_item_data()

The output is:

My light.. I'm seeing and feeling things.. what's happening? 
http://www.reddit.com/r/nosleep/comments/2kw0nu/my_light_im_seeing_and_feeling_things_whats/ 


Why being the first to move in a new Subdivision is not the most brilliant idea... 
http://www.reddit.com/r/nosleep/comments/2kw010/why_being_the_first_to_move_in_a_new_subdivision/ 


I Am Falling. 
http://www.reddit.com/r/nosleep/comments/2kvxvt/i_am_falling/ 


Heidi 
http://www.reddit.com/r/nosleep/comments/2kvrnf/heidi/ 


I remember everything 
http://www.reddit.com/r/nosleep/comments/2kvrjs/i_remember_everything/ 


To Lieutenant Griffin Stone 
http://www.reddit.com/r/nosleep/comments/2kvm9p/to_lieutenant_griffin_stone/ 


The woman in my room 
http://www.reddit.com/r/nosleep/comments/2kvir0/the_woman_in_my_room/ 


Dr. Margin's Guide to New Monsters: The Guest, or, An Update 
http://www.reddit.com/r/nosleep/comments/2kvhe5/dr_margins_guide_to_new_monsters_the_guest_or_an/ 


The Evil Woman (part 5) 
http://www.reddit.com/r/nosleep/comments/2kva73/the_evil_woman_part_5/ 


Blood for the blood god, The first of many. 
http://www.reddit.com/r/nosleep/comments/2kv9gx/blood_for_the_blood_god_the_first_of_many/ 


An introduction to the beginning of my journey 
http://www.reddit.com/r/nosleep/comments/2kv8s0/an_introduction_to_the_beginning_of_my_journey/ 


A hunter..of sorts. 
http://www.reddit.com/r/nosleep/comments/2kv8oz/a_hunterof_sorts/ 


Void Trigger 
http://www.reddit.com/r/nosleep/comments/2kv84s/void_trigger/ 


What really happened to Amelia Earhart 
http://www.reddit.com/r/nosleep/comments/2kv80r/what_really_happened_to_amelia_earhart/ 


I Used To Be Fine Being Alone 
http://www.reddit.com/r/nosleep/comments/2kv2ks/i_used_to_be_fine_being_alone/ 


The Green One 
http://www.reddit.com/r/nosleep/comments/2kuzre/the_green_one/ 


Elevator 
http://www.reddit.com/r/nosleep/comments/2kuwxu/elevator/ 


Scary story told by my 4 year old niece- The Guy With Really Big Scary Claws 
http://www.reddit.com/r/nosleep/comments/2kuwjz/scary_story_told_by_my_4_year_old_niece_the_guy/ 


Cranial Nerve Zero 
http://www.reddit.com/r/nosleep/comments/2kuw7c/cranial_nerve_zero/ 


Mom's Story About a Ghost Uncle 
http://www.reddit.com/r/nosleep/comments/2kuvhs/moms_story_about_a_ghost_uncle/ 


It snowed. 
http://www.reddit.com/r/nosleep/comments/2kutp6/it_snowed/ 


The pocket watch I found at a store 
http://www.reddit.com/r/nosleep/comments/2kusru/the_pocket_watch_i_found_at_a_store/ 


You’re Going To Die When You Are 23 
http://www.reddit.com/r/nosleep/comments/2kur3m/youre_going_to_die_when_you_are_23/ 


The Customer: Part Two 
http://www.reddit.com/r/nosleep/comments/2kumac/the_customer_part_two/ 


Dimenhydrinate 
http://www.reddit.com/r/nosleep/comments/2kul8e/dimenhydrinate/ 


• 
• 
• 
• 
• 
12 
12 
76 
4 
2 
4 
6 
4 
18 
2 
6 
13 
5 
16 
2 
2 
14 
48 
1 
13 

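What I want is to print each post's score right next to the post it belongs to, so I can tell at a glance how many points a post has. Right now the titles and links are printed in one "block" and the scores in a separate "block". Thanks in advance for your help!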

Have you tried this: http://www.reddit.com/dev/api? – 2014-10-31 15:36:26

Specifically: http://www.reddit.com/r/python/new.json – 2014-10-31 15:37:32
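
To illustrate the JSON suggestion in the comments, here is a minimal sketch. It assumes the listing response has the usual `data` → `children` → `data` nesting with `title`, `score` and `permalink` fields; the sample payload and the `posts_from_listing` helper are made up for illustration and are not taken from a real response.

```python
import json

# Hypothetical, trimmed-down sample of a listing response (not real data).
SAMPLE_JSON = """
{
  "kind": "Listing",
  "data": {
    "children": [
      {"kind": "t3", "data": {"title": "First post", "score": 12,
                              "permalink": "/r/nosleep/comments/aaa/first_post/"}},
      {"kind": "t3", "data": {"title": "Second post", "score": 76,
                              "permalink": "/r/nosleep/comments/bbb/second_post/"}}
    ]
  }
}
"""


def posts_from_listing(listing, base_url="http://www.reddit.com"):
    """Return (title, absolute_url, score) for every post in a parsed listing."""
    posts = []
    for child in listing["data"]["children"]:
        post = child["data"]
        posts.append((post["title"], base_url + post["permalink"], post["score"]))
    return posts


if __name__ == "__main__":
    for title, href, score in posts_from_listing(json.loads(SAMPLE_JSON)):
        print(score, title, href)
```

With live data you would parse the body of a request to the `.json` URL instead of `SAMPLE_JSON`. Since each entry already carries its own title and score, no pairing step is needed.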

Answers

1

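You can do it in a single pass by iterating over the div elements with class="thing" (think of it as iterating over the posts). For each div, get the link and the rating: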

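from urllib.parse import urljoin  # Python 3; on Python 2 use: from urlparse import urljoin

from bs4 import BeautifulSoup
import requests


def posts_spider():
    url = 'http://www.reddit.com/r/nosleep/new/'
    # Name the parser explicitly so the result does not depend on which parsers are installed.
    soup = BeautifulSoup(requests.get(url).content, 'html.parser')
    # Each div.thing is one post, so the title and score found inside it belong together.
    for thing in soup.select('div.thing'):
        link = thing.find('a', {'class': 'title'})
        rating = thing.find('div', {'class': 'score'})
        href = urljoin("http://www.reddit.com", link.get('href'))

        print(link.string, href, rating.string)


posts_spider()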

FYI, div.thing is a CSS selector that matches all divs with class="thing".
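
To see the selector and the per-post lookup work without hitting the network, here is a small sketch run against inline markup. The HTML is a hypothetical, heavily simplified stand-in for the real page, and `extract_posts` is a name introduced here for illustration only.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical, simplified markup standing in for the listing page.
SAMPLE_HTML = """
<div class="thing">
  <div class="score unvoted">12</div>
  <a class="title" href="/r/nosleep/comments/aaa/first_post/">First post</a>
</div>
<div class="thing">
  <div class="score unvoted">76</div>
  <a class="title" href="/r/nosleep/comments/bbb/second_post/">Second post</a>
</div>
"""


def extract_posts(html, base_url="http://www.reddit.com"):
    """Return one (title, absolute_url, score) tuple per div.thing."""
    soup = BeautifulSoup(html, "html.parser")
    posts = []
    for thing in soup.select("div.thing"):
        link = thing.find("a", {"class": "title"})
        # 'score' matches class="score unvoted" too, because class is multi-valued.
        score = thing.find("div", {"class": "score"})
        if link is None or score is None:
            continue  # skip entries that lack a title or a score
        posts.append((
            link.get_text(strip=True),
            urljoin(base_url, link.get("href")),
            score.get_text(strip=True),
        ))
    return posts


if __name__ == "__main__":
    for title, href, score in extract_posts(SAMPLE_HTML):
        print(score, title, href)
```

Because both lookups are scoped to the same `thing`, a title can never be paired with another post's score, which is the problem with scanning the page twice.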

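You posted this literally a minute before I was about to post the same thing. As a side note, I believe the rating should be `find('span', {'class': 'rank'})` – Anzel 2014-10-31 15:53:35

@Anzel Yes, I was considering that, but then I saw that the OP is using 'score', and I think that is what the OP really means. Let's wait and see. Thanks. – alecxe 2014-10-31 15:54:37

You're right! The OP is using 'score unvoted', how unusual – Anzel 2014-10-31 16:01:21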