2017-08-15 103 views
0

Extracting text from <p></p> with BeautifulSoup

I want to get the news article from this link. My code is:

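import requests
from bs4 import BeautifulSoup

def get_news_details(news_url):
    # Download the page and parse it with the built-in parser
    source = requests.get(news_url)
    plain_text = source.text
    soup = BeautifulSoup(plain_text, "html.parser")
    # The article body lives inside <div class="big-img-box">
    content = soup.findAll('div', {'class' : 'big-img-box'})
    print(content[0].findAll('p'))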

The output is:

[<p></p>, <p></p>, <p></p>, <p></p>, <p></p>, <p></p>] 

The value of content:

<div class="big-img-box"> 
<div class="left-imgs"> 
<figure> 
<img alt="iOS developer hints possibility of 4K Apple TV" class="img-responsive" src="http://www.aninews.in/contentimages/detail/appletv.jpg"/> 
<figcaption><span class="heading-inner-span"></span></figcaption> 
</figure> 
<div class="mb10"></div> 
</div> 
<p></p>  New York [USA], August 6 <a class="highlights" href="http://aninews.in/" target="_blank">(ANI)</a>: The latest designs from Apple's HomePod firmware revealed that the tech giant is hinting the launch of a <span class="highlights"><a href="http://aninews.in/keysearch/keyword-search/4k-apple-tv.html"> 4K Apple TV</a></span> with high dynamic range (HDR) support for both <span class="highlights"><a href="http://aninews.in/keysearch/keyword-search/hdr10.html"> HDR10 </a></span> and <span class="highlights"><a href="http://aninews.in/keysearch/keyword-search/dolby-vision.html"> Dolby Vision</a></span>.<p></p>  While the current range of Apple's TV set-top box is incompatible to 4K technology, <span class="highlights"><a href="http://aninews.in/keysearch/keyword-search/ios.html">iOS</a></span> developer <span class="highlights"><a href="http://aninews.in/keysearch/keyword-search/guilherme-rambo.html"> Guilherme Rambo</a></span> revealed that the company is hinting an adoption of the ultra high-definition format, reports <span class="highlights"><a href="http://aninews.in/keysearch/keyword-search/the-verge.html">The Verge</a></span>.<p></p>  Reports of the new range of Apple TV have surfaced time and again over the past few months, starting February this year.<p></p>  It is said that implementing the HDR and 4K content will prove to b beneficial for the company, rather than a simpler resolution, since popular online movie and television platforms like <span class="highlights"><a href="http://aninews.in/keysearch/keyword-search/netflix.html"> Netflix</a></span> and <span class="highlights"><a href="http://aninews.in/keysearch/keyword-search/amazon.html"> Amazon</a></span> support the two high-definition formats.<p></p>  Last month, iTunes started listing movies as supporting 4K and <span class="highlights"><a href="http://aninews.in/keysearch/keyword-search/hdr.html"> HDR</a></span> in users' purchase histories, thus providing more thrust to the speculations of 
the 4K <span class="highlights"><a href="http://aninews.in/keysearch/keyword-search/apple.html"> Apple</a></span> TV. <a class="highlights" href="http://aninews.in/" target="_blank">(ANI)</a><p></p> 
</div> 

I can use content[0].text, but that gives me a rather clumsy version of the article that I cannot format.

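When I inspect the page in Chrome, the article text appears to sit inside the tags, as <p>article_text</p>. In content, however, it shows up as <p></p>article_text, with the text after an empty tag pair. If the former version appeared in soup, I could get the output I want. What should I do?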

Answers

2

It depends on what you mean by formatting. You can make the text "tidier" in a fairly simple way.

>>> import bs4 
>>> import requests 
>>> page = requests.get('http://www.aninews.in/newsdetail-Nw/MzI4NDIy/ios-developer-hints-possibility-of-4k-apple-tv.html').content 
>>> soup = bs4.BeautifulSoup(page, 'lxml') 
>>> big_img_box = soup.select('.big-img-box') 

Get all of the text and strip the leading and trailing whitespace.

>>> big_img_box[0].text.strip() 
"New York [USA], August 6 (ANI): The latest designs from Apple's HomePod firmware revealed that the tech giant is hinting the launch of a 4K Apple TV with high dynamic range (HDR) support for both HDR10 and Dolby Vision.  While the current range of Apple's TV set-top box is incompatible to 4K technology, iOS developer Guilherme Rambo revealed that the company is hinting an adoption of the ultra high-definition format, reports The Verge.  Reports of the new range of Apple TV have surfaced time and again over the past few months, starting February this year.  It is said that implementing the HDR and 4K content will prove to b beneficial for the company, rather than a simpler resolution, since popular online movie and television platforms like Netflix and Amazon support the two high-definition formats.  Last month, iTunes started listing movies as supporting 4K and HDR in users' purchase histories, thus providing more thrust to the speculations of the 4K Apple TV. (ANI)" 

Go one step further and collapse each run of two or more internal whitespace characters into a single space.

>>> import re 
>>> re.sub(r'\s{2,}', ' ', big_img_box[0].text.strip()) 
"New York [USA], August 6 (ANI): The latest designs from Apple's HomePod firmware revealed that the tech giant is hinting the launch of a 4K Apple TV with high dynamic range (HDR) support for both HDR10 and Dolby Vision. While the current range of Apple's TV set-top box is incompatible to 4K technology, iOS developer Guilherme Rambo revealed that the company is hinting an adoption of the ultra high-definition format, reports The Verge. Reports of the new range of Apple TV have surfaced time and again over the past few months, starting February this year. It is said that implementing the HDR and 4K content will prove to b beneficial for the company, rather than a simpler resolution, since popular online movie and television platforms like Netflix and Amazon support the two high-definition formats. Last month, iTunes started listing movies as supporting 4K and HDR in users' purchase histories, thus providing more thrust to the speculations of the 4K Apple TV. (ANI)" 
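If "formatting" means keeping the paragraph breaks, one possible approach is to treat each empty <p></p> as a paragraph separator and collect the text between them. The following is only a sketch under assumptions: the sample markup is a shortened, made-up stand-in for the real page, `split_paragraphs` is a hypothetical helper name, and it assumes that any <div> directly inside the box is an image wrapper that can be skipped.

```python
import re

from bs4 import BeautifulSoup
from bs4.element import Tag

# Shortened, made-up stand-in for the markup shown in the question.
SAMPLE_HTML = """
<div class="big-img-box">
<div class="left-imgs"><figure><img alt="x" src="x.jpg"/></figure></div>
<p></p>  First paragraph with a <a href="#">link</a>.<p></p>  Second
paragraph. <a href="#">(ANI)</a><p></p>
</div>
"""


def split_paragraphs(box):
    """Split the box's text into paragraphs at each direct <p> child."""
    paragraphs = []
    current = []
    for node in box.children:
        if isinstance(node, Tag) and node.name == "p":
            # An (empty) <p> marks the boundary between two paragraphs.
            paragraphs.append("".join(current))
            current = []
        elif isinstance(node, Tag) and node.name == "div":
            # Assumption: nested <div>s only hold the image, so skip them.
            continue
        elif isinstance(node, Tag):
            # Inline tags such as <a> or <span>: keep only their text.
            current.append(node.get_text())
        else:
            # Plain text node between tags.
            current.append(str(node))
    paragraphs.append("".join(current))
    # Collapse whitespace runs and drop paragraphs that end up empty.
    cleaned = [re.sub(r"\s+", " ", p).strip() for p in paragraphs]
    return [p for p in cleaned if p]


box = BeautifulSoup(SAMPLE_HTML, "html.parser").find("div", class_="big-img-box")
for paragraph in split_paragraphs(box):
    print(paragraph)
```

On the real page the same function would be called on the first element returned by soup.select('.big-img-box'); whether the result matches the article's visual paragraphs depends on the page continuing to use empty <p></p> pairs as separators.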
+0

This works for me (I did mean "tidy up"; thanks for clarifying). But I wonder why Chrome's page inspector ('<p>text</p>') and BeautifulSoup's version ('<p></p>text') differ? – Aroonalok

+0

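I'm not sure. However, I would say that when browser software or BeautifulSoup encounters a page that was not coded to conform to the standard, it has to do something with that code in order to display it. Chrome's designers may have gone in one direction when they met this problem, and BeautifulSoup's in another. In this case the results are slightly different. –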
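A small experiment can show how parsers repair the same invalid markup in different ways. This is a sketch, not a diagnosis of the page in question: the one-line fragment below is an illustrative example rather than markup from the article, `parse_with_each` is a hypothetical helper name, and lxml and html5lib are optional parsers that may not be installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# An opening <a> followed by a stray closing </p>: invalid HTML.
MALFORMED = "<a></p>"


def parse_with_each(markup, parsers=("html.parser", "lxml", "html5lib")):
    """Return the repaired tree per parser, or None if it is not installed."""
    results = {}
    for name in parsers:
        try:
            results[name] = str(BeautifulSoup(markup, name))
        except FeatureNotFound:
            # Raised by BeautifulSoup when the requested parser is missing.
            results[name] = None
    return results


for name, tree in parse_with_each(MALFORMED).items():
    print(name, "->", tree if tree is not None else "(parser not installed)")
```

With the built-in html.parser the stray </p> is simply dropped, while the other parsers, when available, wrap the fragment in a full document and may treat the stray tag differently. A browser applies its own repair rules as well, so its inspector need not match any of these trees.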

+1

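@BillBell Hey Bill, I just wanted to show some appreciation for your great support of this StackOverflow tag and of the community. Thank you, you are a good person. I wish you all the best, and I just wanted you to know how much we appreciate your help. –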
