2015-04-14 23 views
1

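Getting text from the most popular news stories

I am trying to scrape cnn.com's most popular news stories, extract the articles behind the top ten links, and save them as text so that I can count the most frequently used words in them. My code does not seem to be picking up the top links on the page. Any help would be appreciated. How can I make it look only at the first 10 links on cnn.com/mostpopular?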

import urllib2 
from bs4 import BeautifulSoup 

html = urllib2.urlopen('http://www.cnn.com/mostpopular/').read() 
soup = BeautifulSoup(html) 
for item in soup.find_all(attrs={'class': 'cnnWCBoxContent'}): 
    for link in item.find_all('a'): 
        for item in link.get('href'): 
            #soups = BeautifulSoup(item) 
            #soups.find_all(
            print item 

Answers

1

To get what you are interested in, you need to find the "cnnMostPopularTabs1" div and then collect all of the "cnnMPContentHeadline" divs inside it:

from bs4 import BeautifulSoup

import requests 

r = requests.get("http://edition.cnn.com/mostpopular/") 

data = BeautifulSoup(r.content).find("div",{"id":"cnnMostPopularTabs1"}).find_all("div",{"class":"cnnMPContentHeadline"}) 

from pprint import pprint as pp 
pp([d.a["href"] for d in data]) 

Output:

['http://edition.cnn.com/2014/12/30/world/out-of-the-phone-instagram-photography/index.html', 
'http://edition.cnn.com/2014/12/29/living/feat-ivf-mom-gives-birth-quads/index.html', 
'http://edition.cnn.com/2014/08/28/world/asia/north-korea-inoki-japan-wrestling/index.html', 
'http://edition.cnn.com/2014/12/16/travel/best-destinations-2015/index.html', 
'http://edition.cnn.com/2014/12/26/opinion/soussan-weingarten-gender-equality/index.html', 
'http://edition.cnn.com/2014/12/09/opinion/yang-mark-wahlberg/index.html', 
'http://edition.cnn.com/2014/12/04/tech/innovation/make-create-innovate-bloodhound-supersonic-car/index.html', 
'http://edition.cnn.com/2014/12/29/politics/obama-golf-hawaii/index.html', 
'http://edition.cnn.com/2014/12/10/sport/football/twitter-trends-sport-world-cup-mario-balotelli-list/index.html', 
'http://edition.cnn.com/2014/12/19/travel/new-2015-hotels/index.html'] 

You could also slice the result of find_all("div",{"class":"cnnMPContentHeadline"}):

data = BeautifulSoup(r.content).find_all("div",{"class":"cnnMPContentHeadline"}) 
from pprint import pprint as pp 
pp([d.a["href"] for d in data[:10]]) 

Output:

['http://edition.cnn.com/2014/12/30/world/out-of-the-phone-instagram-photography/index.html', 
'http://edition.cnn.com/2014/12/29/living/feat-ivf-mom-gives-birth-quads/index.html', 
'http://edition.cnn.com/2014/08/28/world/asia/north-korea-inoki-japan-wrestling/index.html', 
'http://edition.cnn.com/2014/12/16/travel/best-destinations-2015/index.html', 
'http://edition.cnn.com/2014/12/26/opinion/soussan-weingarten-gender-equality/index.html', 
'http://edition.cnn.com/2014/12/09/opinion/yang-mark-wahlberg/index.html', 
'http://edition.cnn.com/2014/12/04/tech/innovation/make-create-innovate-bloodhound-supersonic-car/index.html', 
'http://edition.cnn.com/2014/12/29/politics/obama-golf-hawaii/index.html', 
'http://edition.cnn.com/2014/12/10/sport/football/twitter-trends-sport-world-cup-mario-balotelli-list/index.html', 
'http://edition.cnn.com/2014/12/19/travel/new-2015-hotels/index.html'] 

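I would advise against slicing, because there is always a chance that the page will contain more or fewer links.

To get the paragraph text, you can find the cnn_strylftcntnt div and then call find_all_next for the p tags: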
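The difference between the two approaches can be sketched offline. The markup below is a hypothetical, heavily simplified stand-in for the real page (the example.com URLs and the second container id are assumptions made for illustration), and `headline_links` is an illustrative helper, not part of the original answer. It shows that scoping the search to the container returns only that container's headlines, however many there happen to be:

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup standing in for the real page.
SAMPLE_HTML = """
<div id="cnnMostPopularTabs1">
  <div class="cnnMPContentHeadline"><a href="http://example.com/a">Story A</a></div>
  <div class="cnnMPContentHeadline"><a href="http://example.com/b">Story B</a></div>
</div>
<div id="cnnMostPopularTabs2">
  <div class="cnnMPContentHeadline"><a href="http://example.com/c">Story C</a></div>
</div>
"""


def headline_links(html, container_id="cnnMostPopularTabs1"):
    """Return the headline hrefs found inside one container div."""
    soup = BeautifulSoup(html, "html.parser")
    container = soup.find("div", {"id": container_id})
    if container is None:
        # The container is missing, so there is nothing to collect.
        return []
    headlines = container.find_all("div", {"class": "cnnMPContentHeadline"})
    # Skip headline divs that have no usable anchor.
    return [d.a["href"] for d in headlines
            if d.a is not None and d.a.has_attr("href")]


links = headline_links(SAMPLE_HTML)
print(links)
```

A page-wide find_all on the class would also pick up Story C here, so a fixed slice only gives the right answer while the first container keeps exactly the expected number of headlines.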

for link in (d.a["href"] for d in data): 
    r = requests.get(link) 
    div = BeautifulSoup(r.content).find("div",{"class":"cnn_strylftcntnt"}) 
    if div: 
        print("Text for {}".format(link)) 
        print("".join([p.text for p in div.find_all_next("p")])) 
    else: 
        print("No text for link {}".format(link)) 
    print() 

Output:

Text for http://edition.cnn.com/2014/12/30/world/out-of-the-phone-instagram-photography/index.html 
(CNN) -- Gone are the days of the grainy camera phone images with the resolution of a poor imitation Monet. Today's smartphone cameras are so advanced that mobile photography is becoming an art form in its own right, turning photo-sharing apps like Instagram into portable galleries for amateur photographers, and professionals like street style photographer Tommy Ton and chief official White House photographer Pete Souza."You have the dark room in your pocket," says Pierre Le Govic, the Paris-based founder of Out of the Phone, the world's first publishing house dedicated to mobile photography.This month, Out of the Phone follows its debut publication, last year's book of mobile photos from two-time Pulitzer Prize-nominated photographer Richard Koci Hernandez, with Out of the Phone: The Mobile Photo Book 2014, a diverse selection of 100 Instagram images taken by users from 25 countries.Read: The decaying splendor of abandoned Italian nightclubsDemocratizing photography Before founding Out of the Phone in 2013, Le Govic ran a fine art photography printing company that counted Daido Moriyama and William Eggleston as clients. He first started following mobile photography on Instagram in 2011, and was surprised and impressed by the quality of work that hobbyists were creating."Now there are many well known photographers who use the platform, but at the very beginning, there were many people who didn't know so much about photography, and these were the kind of people that I wanted to showcase," he says. 
"But on the other hand, it was also something confusing because there are too many images."The desire to curate what he was seeing, coupled with a longtime ambition to create books, led him to give publishing a try.While Le Govic had preselected a number of established photographers to feature in this year's inaugural anthology (he's hoping it will become an annual publication), he also gave Instagram users the chance to put themselves up for consideration, using the hashtag #outofthephone to nominate their best works. He was astounded to receive over 20,000 submissions.What was he looking for in a successful entry? Technical skill was understandably important, but Le Govic says he also sought something less tangible."At the end, what is important is the story and the sensibility of the photographer ... It's a mix between a good story, a good composition," he says. "Photography, for me, is a sort of fresh air, a way to look at things differently. So I'm looking for that sort of feeling when I look at pictures."Preserving "moments of grace"Now that The Mobile Photo Book has been published, Le Govic is looking forward to promoting his concept and expanding. He's looking to start hiring in the New Year (so far, it's been a one-man operation), and solicit investors and partners. Several projects are set for release next year, including books from award-winning documentary photographer Benjamin Lowy, and other photographers he believes are using the medium to its fullest.Read: Behind the scenes at the legendary Studio 54"Some images deserve to get to paper because it's a kind of memory," he says. 
"If I can help to keep memory of interesting moments, some moments of grace perhaps...I think it's interesting to fix them on paper and to alert to people not to forget them."Out of the Phone: The Mobile Photo Book 2014 is available for purchase online.Unseen pictures of the Rolling Stones and Pink FloydSupercar Shangri-La: Full throttle through Italy's 'Motor Valley'This aerial photographer captures the eerie geometry of lifeA peek inside Europe's most prestigious photography festival 

Text for http://edition.cnn.com/2014/12/29/living/feat-ivf-mom-gives-birth-quads/index.html 
(CNN) -- A Utah couple whose journey through in-vitro fertilization captivated the nation welcomed quadruplets -- two sets of identical twins -- Sunday.Ashley and Tyson Gardner said they are "overwhelmed with joy" after the birth of Indie, Esme, Scarlett and Evangeline by Caesarean section at Utah Valley Regional Medical Center in Provo. Three of the newborns weighed a little more than 2 pounds at delivery. The fourth weighed slightly less than 2 pounds, according to the hospital.The Gardners announced the news on the Facebook page where they share news about the pregnancy."Mom and babies are doing incredible!!! We are so happy with how everything turned out today! The doctors, nurses, and staff were incredible!! More updates to follow soon!!"The Pleasant Grove couple conceived two sets of identical twins this summer with the help of in-vitro fertilization. In October, Ashley Gardner had emergency laser surgery in California to save one set suffering from twin-to-twin transfusion syndrome, the hospital said in a news release. She began staying in an antepartum suite at Utah Valley Regional in November after doctors decided hospital bed rest was necessary.The four girls, dubbed the "Quad Squad" by the hospital, were due March 11. Doctors decided to deliver them 12 weeks early after discovering that Ashley Gardner had ruptured some membranes and her contractions continued to progress in intensity, the hospital said.Complications leading to premature delivery are common in multiple gestations, whether achieved naturally or though IVF, said Dr. Andrew Toledo, CEO of Reproductive Biology Associates in Atlanta, the largest IVF program in the Southeast. 
But data show that women who achieve pregnancy through IVF have a slightly higher rate of complications compared with patients who conceive naturally.It's also extremely rare for both embryos to split, but it's more common in IVF pregnancies compared with patients who conceive naturally, he said.In a YouTube video posted Sunday morning from the hospital, Tyson Gardner said that mom and the babies were doing well after a night in the hospital and that they expected the quads to come in the next couple of days."We need lots of prayers the next 48 hours," Ashley Gardner said from her hospital bed.The Gardners tried for years to get pregnant. Finally, they learned in July that their first in-vitro fertilization attempt was successful. But the real surprise came during the ultrasound, when they learned she was pregnant with quadruplets.A friend in the room captured the priceless look on her face in a picture that took the Internet by storm. In one week, the Gardners' Facebook page grew by nearly 16,000 likes to 24,300. Today, it has almost 300,000 Facebook fans, and the TV network TLC is following them for a series set to air in 2015.Well-wishers flooded their Facebook page Monday with congratulations and requests for pictures."Congratulations," one person said. "Wishing you health and happiness for many years to come." 

Text for http://edition.cnn.com/2014/08/28/world/asia/north-korea-inoki-japan-wrestling/index.html 
Pyongyang (CNN) -- It is exceedingly rare for Western journalists to be allowed inside the Democratic Peoples Republic of Korea (DPRK) -- commonly known as North Korea. It is even less common for an American reporter to visit this reclusive nation, home to nearly 25 million people who are essentially isolated from the rest of the world.Yet here I am, an American member of a CNN crew, reporting from Pyongyang about the latest high profile sporting event to sweep this city since a bizarre basketball tournament earlier this year.You probably remember when American NBA star Dennis Rodman organized a basketball tournament in Pyongyang.Rodman was widely criticized in the United States for befriending the DPRK's Supreme Leader Kim Jong Un, whose authoritarian regime has been accused by a United Nations panel of widespread human rights abuses, charges that North Korea strongly denies. 'Sports diplomacy'Outside press were not invited to cover Rodman's trip. This time, CNN is among a handful of news organizations granted rare access to Pyongyang to cover the International Pro Wrestling Festival.Retired Japanese wrestling star turned politician Kanji "Antonio" Inoki is organizing the event. In his professional heyday, Inoki fought in a memorable and bizarre 1976 match in Tokyo with boxing great Muhammad Ali. Today, as an aging member of the Japanese parliament, he is once again in the headlines for his latest attempt at what he calls "sports diplomacy" between Japan and North Korea.Inoki is holding the event in the home country of Rikidozan, his late wrestling mentor. He says it will bring together professional fighters from the United States, China, and several other countries. 
The wrestlers are also scheduled to tour Pyongyang and interact with North Korean fans.Our journey so farAfter landing in Pyongyang, we headed to our hotel,which sits on its own island.Complete with a microbrewery, the hotel tries to give journalists on this trip a Western experience, serving simple Western-style omelettes and potatoes for breakfast. Dinner was a Korean-style meal.Taking a look around the city, we saw some people holding cell phones, which looked like small Blackberrys. People weren't blindly walking about with their eyes locked on the screen; a common sight in Western cities.These were not touch-screen phones, instead gadgets where people can access the internal net and visit certain North Korean sites like government sites and the country's largest newspaper.On Friday morning, we visited the birthplace of North Korean founder, Kim Il Sung. This site is considered sacred -- every North Korean who visits the capital goes there. Bus loads of school children, who took a 23-hour trip from a northern rural province, arrived at the site to take a look.Asked about how they felt about being there, the students recited facts about the place. Even when our minders encouraged them to speak with us, it appeared they were shy or nervous facing foreigners and TV cameras.We headed to the Munsu Water Park, a park with water slides and pools, that current leader, Kim Jong Un, is said to have personally scrutinized 113 times. There weren't many children there, though many North Korean families appeared to be enjoying the activities.The rest of Friday will be spent visiting a new pediatric hospital and a sports village -- all in Pyongyang.During our tightly-controlled five-day trip, we will be under the constant supervision of government minders. We are staying in a hotel on an island -- in the middle of a river -- and we aren't allowed to leave without our government-assigned escorts. 
We expect them to monitor what we shoot and step-in to stop us if we point our cameras in the wrong direction.We expect to see only what the government will allow us to see -- the landmarks of Pyongyang, omnipresent tributes to the Kim family regime, and majestic displays of patriotic pageantry.Thawing relationsThis unusual visit to the Hermit Kingdom comes at a time when years of frosty relations between Tokyo and Pyongyang could be beginning to thaw.In July, Japanese Prime Minister Shinzo Abe eased several unilateral sanctions on North Korea after the two countries made progress in talks about Japanese citizens kidnapped by the North Korean regime during the Cold War.The Japanese government says North Korean operatives kidnapped at least 17 Japanese citizens in the late 1970s and early 1980s and possibly dozens more.In 2002, North Korea shocked the international community by admitting to the kidnappings and returning five victims to Japan. But questions still linger about the fate of the remaining 12 confirmed abductees and the other suspected cases.A North Korean "Special Investigative Committee" of about 30 government officials is expected to update the Japanese government in the next few weeks on the status of missing Japanese citizens. 
Families of the abducted hope renewed diplomacy between the two countries will bring long-awaited answers.Among the Japanese sanctions lifted is a restriction asking its citizens not to travel to North Korea, which opens the door for more Japanese tourists to embark on commercial tours of the country.Behind the curtainOur flight on North Korea's only airline (one of just 10 scheduled flights a week) was packed with mostly Japanese press and an eclectic group of wrestlers who will tour Pyonyang and entertain crowds who rarely see anything like this in their country.At a press conference, one North Korean official said he hopes the event will bring the DPRK closer to Japan after years of tension.Even though decades of isolation and crippling sanctions have left North Korea struggling economically and lagging far behind much of the developed world in terms of technology and infrastructure -- the nation is nearly unrivaled in its ability to mobilize tens of thousands of citizens to put on a spectacular show.It remains yet to be seen if we will get a glimpse behind the curtain to witness the true reality of life in one of the most secretive places on Earth.I asked our government minders if they'd be willing to show us what life is really like for regular people in North Korea. They said they'd ask their superiors and get back to us.READ: Dennis Rodman returns after visit to North KoreaREAD: Abductee's parents finally meet North Korean granddaughter 

........... 

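I can only include a couple of the outputs, because there is a 30,000-character limit.

You will also get no text for the following link, because that page has no cnn_strylftcntnt element: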

No text for link http://edition.cnn.com/2014/12/04/tech/innovation/make-create-innovate-bloodhound-supersonic-car/index.html 

If you want a word count, use a collections.Counter dict, lowercasing the text and stripping punctuation from the words:

from collections import Counter, OrderedDict 
from itertools import chain 
from string import punctuation 

all_links_counters = OrderedDict() 

# [0:1] limits this run to the first link; drop the slice to process every link
for link in [d.a["href"] for d in data][0:1]: 
    r = requests.get(link) 
    div = BeautifulSoup(r.content).find("div", {"class": "cnn_strylftcntnt"}) 
    if div: 
        print("Text for {}".format(link)) 
        # lowercase each paragraph and split it into words
        words = chain.from_iterable(p.text.lower().split() for p in div.find_all_next("p")) 
        # strip surrounding punctuation from each word before counting
        all_links_counters[link] = Counter(word.strip(punctuation) for word in words) 
    else: 
        print("No text for link {}".format(link)) 
    print() 

print(all_links_counters) 

Example output for the first link:

[Counter({'the': 33, 'of': 23, 'to': 20, 'a': 15, 'and': 13, 'he': 10, 'photography': 8, 'for': 7, 'mobile': 7, 'in': 7, 'was': 6, 'that': 6, 'are': 6, 'photographer': 6, 'phone': 6, 'is': 6, 'at': 5, 'govic': 5, 'le': 5, 'out': 5, 'says': 5, "it's": 4, 'so': 4, 'instagram': 4, 'images': 4, 'photographers': 4, 'looking': 4, 'book': 4, 'with': 3, 'on': 3, 'what': 3, 'people': 3, 'moments': 3, 'its': 3, 'but': 3, 'i': 3, 'there': 3, 'photo': 3, 'from': 3, 'also': 3, 'many': 3, 'were': 3, 'this': 3, '': 2, 'now': 2, "he's": 2, 'interesting': 2, 'some': 2, 'publishing': 2, 'like': 2, 'other': 2, 'an': 2, 'house': 2, 'been': 2, 'important': 2, 'first': 2, '2014': 2, 'by': 2, 'because': 2, 'pictures': 2, 'read': 2, 'grace': 2, "year's": 2, 'year': 2, 'memory': 2, 'books': 2, 'publication': 2, 'good': 2, 'it': 2, 'sort': 2, 'something': 2, 'look': 2, 'story': 2, 'who': 2, 'art': 2, 'paper': 2, 'using': 2, 'kind': 2, 'them': 2, 'users': 2, 'studio': 1, 'company': 1, 'souza': 1, 'founding': 1, 'hashtag': 1, 'longtime': 1, 'give': 1, 'countries': 1, 'resolution': 1, 'less': 1, 'alert': 1, 'professionals': 1, 'air': 1, 'investors': 1, '54': 1, 'eggleston': 1, 'fullest': 1, 'month': 1, 'galleries': 1, 'very': 1, 'apps': 1, 'things': 1, 'following': 1, '2011': 1, 'documentary': 1, 'rolling': 1, 'creating': 1, 'create': 1, 'differently': 1, 'stones': 1, 'successful': 1, 'much': 1, 'composition': 1, 'eerie': 1, 'next': 1, 'feature': 1, 'best': 1, 'floyd': 1, 'far': 1, 'medium': 1, 'one-man': 1, 'pete': 1, 'prestigious': 1, 'street': 1, 'set': 1, 'published': 1, 'legendary': 1, 'when': 1, 'partners': 1, 'two-time': 1, 'your': 1, 'has': 1, 'follows': 1, 'ran': 1, 'valley': 1, 'hoping': 1, 'dark': 1, 'not': 1, 'understandably': 1, 'aerial': 1, 'right': 1, 'shangri-la': 1, 'submissions': 1, 'up': 1, "europe's": 1, 'pocket': 1, 'started': 1, 'smartphone': 1, 'decaying': 1, 'inside': 1, 'camera': 1, 'confusing': 1, 'nightclubs': 1, 'you': 1, 'sought': 1, 'cameras': 1, 'think': 1, 
'2013': 1, 'own': 1, 'democratizing': 1, 'counted': 1, 'splendor': 1, 'award-winning': 1, 'hiring': 1, 'portable': 1, 'projects': 1, 'festival': 1, 'themselves': 1, 'richard': 1, 'most': 1, 'turning': 1, 'quality': 1, 'astounded': 1, "italy's": 1, 'diverse': 1, 'life': 1, 'entry': 1, 'believes': 1, 'have': 1, 'works': 1, 'geometry': 1, 'gone': 1, 'fine': 1, 'can': 1, 'mix': 1, 'photo-sharing': 1, "didn't": 1, 'while': 1, 'selection': 1, 'fix': 1, 'new': 1, 'put': 1, 'ambition': 1, "i'm": 1, 'beginning': 1, 'know': 1, 'hernandez': 1, 'preserving': 1, 'skill': 1, 'gave': 1, 'keep': 1, 'peek': 1, 'paris-based': 1, 'start': 1, 'pierre': 1, 'me': 1, 'into': 1, 'motor': 1, 'imitation': 1, 'online': 1, 'style': 1, 'ton': 1, 'days': 1, 'if': 1, 'including': 1, 'annual': 1, 'purchase': 1, 'concept': 1, 'photos': 1, 'led': 1, 'advanced': 1, 'hand': 1, 'between': 1, 'chance': 1, 'him': 1, 'will': 1, 'had': 1, 'white': 1, 'lowy': 1, 'too': 1, 'before': 1, 'end': 1, 'chief': 1, 'pink': 1, 'koci': 1, 'several': 1, 'available': 1, 'become': 1, 'amateur': 1, 'through': 1, 'wanted': 1, 'technical': 1, 'curate': 1, 'italian': 1, 'about': 1, 'unseen': 1, 'well': 1, 'becoming': 1, 'impressed': 1, 'sensibility': 1, 'full': 1, 'outofthephone': 1, 'moriyama': 1, 'receive': 1, 'their': 1, 'help': 1, 'benjamin': 1, 'grainy': 1, 'forward': 1, 'deserve': 1, 'monet': 1, 'abandoned': 1, 'william': 1, 'forget': 1, 'get': 1, 'use': 1, 'way': 1, 'prize-nominated': 1, 'promoting': 1, 'throttle': 1, 'expanding': 1, 'hobbyists': 1, 'try': 1, 'operation': 1, 'coupled': 1, 'showcase': 1, 'scenes': 1, "today's": 1, 'taken': 1, 'these': 1, 'tommy': 1, "world's": 1, 'anthology': 1, 'official': 1, 'debut': 1, 'behind': 1, 'work': 1, 'pulitzer': 1, '25': 1, '100': 1, 'number': 1, 'perhaps...i': 1, 'known': 1, 'fresh': 1, 'founder': 1, 'cnn': 1, 'seeing': 1, 'feeling': 1, 'desire': 1, 'established': 1, 'poor': 1, '20,000': 1, 'supercar': 1, 'preselected': 1, 'nominate': 1, 'printing': 1, 'daido': 1, 'over': 
1, 'form': 1, 'captures': 1, 'last': 1, 'solicit': 1, 'his': 1, 'release': 1, 'room': 1, 'as': 1, 'surprised': 1, 'platform': 1, 'tangible': 1, 'clients': 1, 'consideration': 1, 'inaugural': 1, 'dedicated': 1})] 
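The counting step can be tried on its own without any network access. The following is a minimal sketch that uses only the standard library; `count_words` and the sample sentence are illustrative names and data, not part of the answer above. It applies the same lowercase, split, and strip-punctuation steps, and additionally drops tokens that become empty after stripping (the `'': 2` entry in the output above comes from such tokens):

```python
from collections import Counter
from string import punctuation


def count_words(text):
    """Count lowercase words after stripping surrounding punctuation."""
    words = (w.strip(punctuation) for w in text.lower().split())
    # Tokens made only of punctuation (such as "--") become empty; skip them.
    return Counter(w for w in words if w)


sample = 'He says: "It\'s a mix between a good story, a good composition."'
counts = count_words(sample)
print(counts.most_common(2))  # [('a', 3), ('good', 2)]

# Counters can be added together to get totals across several articles.
total = counts + count_words("A good story -- a good one.")
print(total["good"])  # 4
```

Note that str.strip only removes punctuation at the ends of a token, so contractions such as "it's" and hyphenated words are kept whole.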

I don't think you need to look up the div. It looks like the only elements whose class is set to "cnnMPContentHeadline" are the headline links themselves. – RNikoopour


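@RNikoopour, that is not a reliable approach unless you want to slice. What happens if there are more or fewer headlines? Slicing also creates a new list, so there are several reasons not to use it. –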


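I don't follow you. Could you explain a bit more? The way the page is set up, there are always the top 15 stories, and each article has the class "cnnMPContentHeadline", so I don't see why you need to search for the "div" or why searching by class alone is unreliable. – RNikoopour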

0
import requests 
import bs4 

response = requests.get('http://www.cnn.com/mostpopular/') 
soup = bs4.BeautifulSoup(response.text) 
links = [] 
for i in soup.find_all(class_='cnnMPContentHeadline')[:10]: 
    links.append((i.text.strip(), i.find('a')['href'])) 

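This will give you a list of tuples containing the article title and link. You would then iterate over this list, request each link, and extract the article content from the response.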

for title, link in links: 
    response = requests.get(link) 
    # Get article information from response
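One way to fill in the extraction step is sketched below. It is run against a hypothetical inline document rather than a live article, and it assumes the paragraphs are nested inside a div with the class cnn_strylftcntnt (the class used in the other answer). The real page layout may differ, so treat `article_text` as an illustrative helper rather than a tested scraper for cnn.com:

```python
from bs4 import BeautifulSoup

# Hypothetical article markup; the real pages are far more complex.
SAMPLE_ARTICLE = """
<div class="cnn_strylftcntnt">
  <p>First paragraph.</p>
  <p>Second paragraph.</p>
</div>
<p>Footer text outside the article.</p>
"""


def article_text(html, css_class="cnn_strylftcntnt"):
    """Return the article's paragraph text, or None if the container is absent."""
    soup = BeautifulSoup(html, "html.parser")
    div = soup.find("div", {"class": css_class})
    if div is None:
        return None
    # Join with newlines so sentences from adjacent paragraphs do not run together.
    return "\n".join(p.get_text(strip=True) for p in div.find_all("p"))


text = article_text(SAMPLE_ARTICLE)
print(text)
```

Inside the loop, this would be called as article_text(response.text), with a None result meaning that no article body was found for that link.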