2016-06-25 17 views

Scraping data from a page: I'm trying to follow the links on published posts so I can save their text. I'm partly there; I just need to tweak a few things, which is why I'm here. Instead of different posts I'm getting duplicates, and on top of that they're wrapped in brackets. How do I follow the a link to a specific post and scrape the text from it? What I'm getting now is

[[<div class="article-body" id="image-description"><p>Kanye West premiered 
     the music video for "Famous" off his "The Life of Pablo" album to a 
     sold out audience in Los Angeles. The video features nude versions of George W. Bush. 
     Donald Trump. Anna Wintour. Rihanna. Chris Brown. Taylor Swift. 
     Kanye West. Kim Kardashian. Ray J. Amber Rose. Caitlyn Jenner. 
    Bill Cosby (in that order).</p></div>], 

And here's my code:

def sprinkle(): 
    url_two = 'http://www.example.com' 
    html = requests.get(url_two, headers=headers) 
    soup = BeautifulSoup(html.text, 'html5lib') 
    titles = soup.find_all('div', {'class': 'entry-pos-1'}) 

    def make_soup(url): 
        the_comments_page = requests.get(url, headers=headers) 
        soupdata = BeautifulSoup(the_comments_page.text, 'html5lib') 
        comment = soupdata.find_all('div', {'class': 'article-body'}) 
        return comment 

    comment_links = [url_two + link.a.get('href') for link in titles] 

    soup = [make_soup(comments) for comments in comment_links] 
    # soup = make_soup(comments) 
    # print(soup) 

    entries = [{'href': url_two + div.a.get('href'), 
                'src': url_two + div.a.img.get('data-original'), 
                'text': div.find('p', 'entry-title').text, 
                'comments': soup 
                } for div in titles][:6] 

    return entries 

I feel like I'm close. This is all new to me. Any help would be great.


Those are called lists, by the way. To get rid of the brackets, you need to iterate over them and extract what you want. Also, how is it that your code is so similar to this user's: http://stackoverflow.com/questions/38022573/whats-the-proper-syntax-to-follow-a-link-using-beautifulsoup-requests-in-a-dj? –
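The comment's point can be sketched in isolation: `find_all` returns a list of Tag objects, not a string, which is why printing it shows square brackets; iterating over it and calling `.get_text()` on each tag yields plain text. A minimal sketch, using an inline HTML string in place of a fetched page:

```python
from bs4 import BeautifulSoup

# Inline HTML standing in for a fetched article page.
html = '<div class="article-body"><p>First paragraph.</p><p>Second paragraph.</p></div>'

soup = BeautifulSoup(html, "html.parser")
# find_all returns a ResultSet (a list of Tag objects), hence the brackets
# when it is printed directly.
divs = soup.find_all("div", {"class": "article-body"})
# Iterate the list and pull out the text of each <p> inside each div.
texts = [p.get_text() for div in divs for p in div.find_all("p")]
print(texts)  # ['First paragraph.', 'Second paragraph.']
```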

Answers


I figured it out:

def sprinkle(): 
    url_two = 'http://www.vladtv.com' 
    html = requests.get(url_two, headers=headers) 
    soup = BeautifulSoup(html.text, 'html5lib') 
    titles = soup.find_all('div', {'class': 'entry-pos-1'}) 

    def make_soup(url): 
        the_comments_page = requests.get(url, headers=headers) 
        soupdata = BeautifulSoup(the_comments_page.text, 'html5lib') 
        comment = soupdata.find('div', {'class': 'article-body'}) 
        para = comment.find_all('p') 
        return para 

    entries = [{'href': url_two + div.a.get('href'), 
                'src': url_two + div.a.img.get('data-original'), 
                'text': div.find('p', 'entry-title').text, 
                'comments': make_soup(url_two + div.a.get('href')) 
                } for div in titles][:6] 

    return entries 

I'm still trying to clean up the result, though.
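Since `make_soup` in the accepted answer returns a list of `<p>` Tag objects, cleaning the result down to plain text is a matter of joining each tag's `.get_text()` output. A minimal sketch, with hypothetical sample paragraphs standing in for the scraped ones:

```python
from bs4 import BeautifulSoup

# Two hypothetical paragraphs, as make_soup's find_all('p') would return them.
paras = BeautifulSoup(
    "<p>Kanye West premiered the video.</p><p>It sold out.</p>",
    "html.parser",
).find_all("p")

# strip=True trims surrounding whitespace from each paragraph's text.
plain = "\n".join(p.get_text(strip=True) for p in paras)
print(plain)  # Kanye West premiered the video.\nIt sold out.
```

The same join could be applied to the `'comments'` value of each entry before returning it, so callers receive strings rather than Tag objects.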
