
Scraping articles from a website with Python 3.4, BeautifulSoup and requests

https://xueqiu.com/yaodewang 

I want to scrape all of this user's articles. I am using BeautifulSoup and requests like this:

import requests 
from bs4 import BeautifulSoup 
url = 'https://xueqiu.com/yaodewang' 
header = {'user-Agent':'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.76 Mobile Safari/537.36'} 
r = requests.get(url,headers = header).content 
soup = BeautifulSoup(r,'lxml') 
artile = soup.find_all('ul',{'class':'status-list'}) 
print(artile) 

But this is all it returns:

[] 

So I tried other selectors, like these:

# art = soup.find_all('div',{'class':'allStatuses no-head'}) 
# art = soup.find_all('div',{'class':'status_bd'}) 
# art = soup.find_all('div',{'class':'status_content container active tab-pane'}) 

However, these returned text that was not what I wanted. I want the article text itself, as shown in this screenshot: [image]

I need your help. Thank you very much!

Answer


The data you want is not actually inside the element with the status-list class. If you view the page source, you will find an empty container instead:

<div class="status_bd"> 
    <div id="statusLists" class="allStatuses no-head"></div> 
</div> 

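Instead, the statuses are located inside a script element. You need to find that element, extract the desired object from it, load it from JSON into a Python dictionary, and then pull out the information you need: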

import json
import re

import requests
from bs4 import BeautifulSoup

url = 'https://xueqiu.com/yaodewang'
headers = {
    'user-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.76 Mobile Safari/537.36'
}
r = requests.get(url, headers=headers).content
soup = BeautifulSoup(r, 'lxml')

# The statuses are assigned to SNB.data.statuses as a JSON object inside a script.
pattern = re.compile(r"SNB\.data\.statuses = ({.*?});", re.MULTILINE | re.DOTALL)
# Find the script element whose text matches the pattern.
script = soup.find("script", text=pattern)

# Capture the JSON object and load it into a Python dictionary.
data = json.loads(pattern.search(script.text).group(1))
for item in data["statuses"]:
    print(item["description"])

Prints:

The best advice: Remember common courtesy and act toward others as you want them to act toward you. 
Lighten up! It&#39;s the weekend. we&#39;re just having a little fun! Industrial Bank is expected to rise,next week... 
... 
点.点.点... 点到这个,学位、学历、成绩单翻译一下要50块、100块的... 
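
One caveat with the non-greedy `({.*?});` pattern: it stops at the first `};` it meets, so it would cut the JSON short if an article's text happened to contain that sequence. Below is a minimal sketch of an alternative, assuming the page still assigns a JSON object to `SNB.data.statuses`. It uses the regex only to locate the assignment and then lets `json.JSONDecoder.raw_decode` read exactly one JSON value from that position. The `sample_script` string and the `extract_statuses` helper are made up for illustration and are not taken from the real page.

```python
import json
import re

# Hypothetical stand-in for the text of the <script> element; the real page
# content will differ.
sample_script = (
    'SNB.data.statuses = {"statuses": ['
    '{"description": "first post"}, '
    '{"description": "tricky }; text"}]};\n'
    'SNB.data.other = {};'
)

# Matches only the left-hand side of the assignment, including the "=".
ASSIGNMENT = re.compile(r"SNB\.data\.statuses\s*=\s*")


def extract_statuses(script_text):
    """Return the object assigned to SNB.data.statuses as a Python dict."""
    match = ASSIGNMENT.search(script_text)
    if match is None:
        raise ValueError("SNB.data.statuses assignment not found")
    # raw_decode parses one JSON value starting at the given index and
    # ignores whatever follows it (here, the trailing ';' and more script).
    data, _end = json.JSONDecoder().raw_decode(script_text, match.end())
    return data


data = extract_statuses(sample_script)
descriptions = [item["description"] for item in data["statuses"]]
print(descriptions)
```

With real data you would pass the script element's text in place of `sample_script`. The rest of the approach above stays the same.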

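Thank you very much. This is the correct method. But I would like to know: once I know the content is located in a script, how do I come up with a regular expression like this one: pattern = re.compile(r"SNB\.data\.statuses = ({.*?});", re.MULTILINE | re.DOTALL)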


Another question: I want to get a list of articles, but right now I get a string. I want a result like this: [str01, str02, ...]


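@championCh Sure, just extract the script text and experiment with it, for example on [regex101](https://regex101.com/). As for your second question, I think you are asking how to put the results into a list: `articles = [item["description"] for item in data["statuses"]]`. Hope that helps. – alecxe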
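
To make the advice above concrete, here is a rough sketch of how one might work out what to match. It collects the text of every script element, looks for a keyword you expect in the data (here `statuses`), and prints a little context around each hit. That context is what you would paste into regex101. The sketch uses only the standard library and a made-up `sample_html`; `ScriptCollector` and `show_context` are illustrative names, not part of BeautifulSoup or requests.

```python
import html
from html.parser import HTMLParser

# Made-up page source for illustration; the real page will differ.
sample_html = """
<html><body>
<div id="statusLists" class="allStatuses no-head"></div>
<script>var unrelated = 1;</script>
<script>SNB.data.statuses = {"statuses": [{"description": "It&#39;s the weekend"}]};</script>
</body></html>
"""


class ScriptCollector(HTMLParser):
    """Collect the raw text of every <script> element."""

    def __init__(self):
        super().__init__()
        self.in_script = False
        self.scripts = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.in_script = True
            self.scripts.append("")

    def handle_endtag(self, tag):
        if tag == "script":
            self.in_script = False

    def handle_data(self, data):
        # Script text may arrive in several chunks, so append to the last entry.
        if self.in_script:
            self.scripts[-1] += data


def show_context(scripts, keyword, width=30):
    """Return a short snippet around the keyword for each script containing it."""
    snippets = []
    for text in scripts:
        pos = text.find(keyword)
        if pos != -1:
            snippets.append(text[max(0, pos - width):pos + len(keyword) + width])
    return snippets


collector = ScriptCollector()
collector.feed(sample_html)
snippets = show_context(collector.scripts, "statuses")
for snippet in snippets:
    print(snippet)

# Descriptions may carry HTML entities such as &#39;; html.unescape decodes them.
clean = html.unescape("It&#39;s the weekend")
print(clean)
```

The same idea works with BeautifulSoup by looping over `soup.find_all("script")`. The standard-library parser is used here only so the sketch runs without extra packages.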