Python - Checking whether an article has an author


I am trying to write a Python script that checks whether an article has an author.

I wrote the following:

import requests

s = "https://www.nytimes.com/2017/08/18/us/politics/steve-bannon-trump-white-house.html?hp&action=click&pgtype=Homepage&clickSource=story-heading&module=a-lede-package-region&region=top-news&WT.nav=top-news"

def checkForAuthor(): 
    r = requests.get(s) 
    return "By" in r.text 

print(checkForAuthor())

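The problem is that checkForAuthor returns True even when there is no author, because it searches for the word "By" across the entire HTML content. Is there better logic for finding the author without searching the whole document? For example, searching only within the header, so that I would not need to search the article body at all. I need this to be generic, like a search engine, so that it gives a result for any site I look up. I'm not sure what is available for this.

Source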

2017-08-19 Kobbi Gal

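You should use a proper library to parse the HTML and check only the tags you are interested in.

The key part of scraping data from a web page is inspecting the page's HTML source so you can extract the data correctly. In the link you provided, the following lines contain author information.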

<meta name="author" content="Maggie Haberman, Michael D. Shear and Glenn Thrush" /> 
<meta name="byl" content="By MAGGIE HABERMAN, MICHAEL D. SHEAR and GLENN THRUSH" /> 
<meta property="article:author" content="https://www.nytimes.com/by/maggie-haberman" /> 
<meta property="article:author" content="https://www.nytimes.com/by/michael-d-shear" /> 
<meta property="article:author" content="https://www.nytimes.com/by/glenn-thrush" />

There are others, but these should help. To parse these tags, you can use an HTML parsing library.
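As a rough illustration of checking only these meta tags instead of the whole page text, here is a minimal sketch using the standard-library html.parser module. The names AUTHOR_KEYS, AuthorMetaParser and find_authors are hypothetical, and the set of keys is an assumption based on the tags shown above. Other sites may use different ones.

```python
from html.parser import HTMLParser

# Assumed set of author-bearing meta keys, taken from the sample tags above.
# "author" and "byl" appear as name=..., "article:author" as property=...
AUTHOR_KEYS = {"author", "byl", "article:author"}


class AuthorMetaParser(HTMLParser):
    """Collect the content of <meta> tags that carry author information."""

    def __init__(self):
        super().__init__()
        self.authors = []

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <meta ... />.
        if tag != "meta":
            return
        attrs = dict(attrs)
        key = (attrs.get("name") or attrs.get("property") or "").lower()
        content = attrs.get("content")
        if key in AUTHOR_KEYS and content:
            self.authors.append(content)


def find_authors(html):
    """Return the content values of all author-related meta tags in html."""
    parser = AuthorMetaParser()
    parser.feed(html)
    return parser.authors


sample = """<head>
<meta name="author" content="Maggie Haberman, Michael D. Shear and Glenn Thrush" />
<meta property="article:author" content="https://www.nytimes.com/by/maggie-haberman" />
</head>"""

print(find_authors(sample))
# A page that merely contains the word "By" yields no authors.
print(bool(find_authors("<p>By the way, no byline here.</p>")))
```

Because only meta tags are inspected, a stray "By" in the article body no longer counts as an author. In practice the HTML string would come from something like requests.get(url).text.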

Source

2017-08-19 11:41:44 TrigonaMinima

To parse the HTML and find the data you need, you should use the BeautifulSoup library.

In the HTML of your page, there is a meta tag containing the author:

<meta content="By MAGGIE HABERMAN, MICHAEL D. SHEAR and GLENN THRUSH" name="byl"/>

So, to check whether there is an author, you need to find that tag by its name (byl):

import requests 
from bs4 import BeautifulSoup 

s = "https://www.nytimes.com/2017/08/18/us/politics/steve-bannon-trump-white-house.html?hp&action=click&pgtype=Homepage&clickSource=story-heading&module=a-lede-package-region&region=top-news&WT.nav=top-news" 

def checkForAuthor(): 
    soup = BeautifulSoup(requests.get(s).content, 'html.parser') 
    meta = soup.find('meta', {'name': 'byl'}) 
    return meta is not None

In fact, you can also get the authors' names with meta["content"]

Source
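If individual names are needed, the byline string can be split further. This is a minimal sketch that assumes the byline follows the "By A, B and C" pattern seen in the byl tag above. split_byline is a hypothetical helper, and bylines on other sites may well be formatted differently.

```python
import re


def split_byline(byline):
    """Split a byline such as 'By A, B and C' into a list of names."""
    # Drop a leading "By " (case-insensitive), if present.
    text = re.sub(r"^\s*by\s+", "", byline, flags=re.IGNORECASE)
    # Split on commas (optionally followed by "and") or on a standalone "and".
    parts = re.split(r"\s*,\s*(?:and\s+)?|\s+and\s+", text, flags=re.IGNORECASE)
    return [p.strip() for p in parts if p.strip()]


names = split_byline("By MAGGIE HABERMAN, MICHAEL D. SHEAR and GLENN THRUSH")
print(names)
```

Here the argument would be the value of meta["content"] from the function above.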

2017-08-19 11:57:57 Ricardo
