Scraping articles with Python 3.4, BeautifulSoup and requests

The website I want to scrape:

https://xueqiu.com/yaodewang

I want to scrape all of his articles. I am using BeautifulSoup and requests like this:

import requests 
from bs4 import BeautifulSoup 
url = 'https://xueqiu.com/yaodewang' 
header = {'user-Agent':'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.76 Mobile Safari/537.36'} 
r = requests.get(url,headers = header).content 
soup = BeautifulSoup(r,'lxml') 
articles = soup.find_all('ul',{'class':'status-list'})
print(articles)

But the result is empty:

[]

So I tried other selectors like these:

# art = soup.find_all('div',{'class':'allStatuses no-head'}) 
# art = soup.find_all('div',{'class':'status_bd'}) 
# art = soup.find_all('div',{'class':'status_content container active tab-pane'})

But they returned the wrong text. I want the content of the articles.

I need your help. Thank you very much!

Source

2016-05-01 champion Ch

The data you need is not actually in an element with the status-list class. If you view the page source, you will find an empty container instead:

<div class="status_bd"> 
    <div id="statusLists" class="allStatuses no-head"></div> 
</div>
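A quick offline check can confirm this. The sketch below parses only the markup shown above (an assumed stand-in for what the server returns before any JavaScript runs) and uses the built-in `html.parser` so that `lxml` is not required:

```python
from bs4 import BeautifulSoup

# Assumption: this literal mirrors the container as it appears in the raw page source.
html_source = """
<div class="status_bd">
    <div id="statusLists" class="allStatuses no-head"></div>
</div>
"""

soup = BeautifulSoup(html_source, "html.parser")

# There is no <ul class="status-list"> in the static markup.
status_lists = soup.find_all("ul", {"class": "status-list"})

# The container exists, but it holds no text.
container = soup.find("div", id="statusLists")
container_text = container.get_text(strip=True)

print(status_lists)          # []
print(repr(container_text))  # ''
```

This is why every selector aimed at the rendered list comes back empty: the list is built in the browser, not sent in the HTML.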

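Instead, the statuses are located inside a script element. You need to find that element, extract the desired object from it, load it from JSON into a Python dictionary, and pull out the information you need: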

import json 
import re 
import requests 
from bs4 import BeautifulSoup 

url = 'https://xueqiu.com/yaodewang' 
headers = { 
    'user-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.76 Mobile Safari/537.36' 
} 
r = requests.get(url, headers=headers).content 
soup = BeautifulSoup(r, 'lxml') 

pattern = re.compile(r"SNB\.data\.statuses = ({.*?});", re.MULTILINE | re.DOTALL) 
script = soup.find("script", text=pattern) 

data = json.loads(pattern.search(script.text).group(1)) 
for item in data["statuses"]: 
    print(item["description"])

Prints:

The best advice: Remember common courtesy and act toward others as you want them to act toward you. 
Lighten up! It&#39;s the weekend. we&#39;re just having a little fun! Industrial Bank is expected to rise,next week... 
... 
点.点.点... 点到这个，学位、学历、成绩单翻译一下要50块、100块的...
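The `&#39;` sequences in the output above are HTML character references left inside the description strings. If plain text is wanted, one option is the standard library's `html.unescape`. The sketch below runs it on a sample copied from the output; the variable names are only illustrative:

```python
import html

# Sample string copied from the printed output above.
raw = "Lighten up! It&#39;s the weekend. we&#39;re just having a little fun!"

# Convert character references such as &#39; back into the characters they encode.
clean = html.unescape(raw)
print(clean)  # Lighten up! It's the weekend. we're just having a little fun!
```

In the loop above, the same call could wrap `item["description"]` before printing.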

Source

2016-05-01 02:24:49 alecxe

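Thank you very much. It is the correct method. But I would like to know: if I know the content is located in a script, how do I come up with a regular expression like this one: `pattern = re.compile(r"SNB\.data\.statuses = ({.*?});", re.MULTILINE | re.DOTALL)`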

Another question: I want to get a list of articles, but right now I get a string. I want a result like this: [str01, str02, ...]

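@championCh Sure, just extract the script text and experiment with it, for example on [regex101](https://regex101.com/). As for your second question, I think you are asking how to put the results into a list: `articles = [item["description"] for item in data["statuses"]]`. Hope that helps. – alecxe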
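Both points in this comment can be tried offline. The sketch below is a minimal illustration, not the page's real data: `script_text` is a made-up, shortened stand-in for the text of the script element, and only the `SNB.data.statuses` name and the `statuses`/`description` keys come from the answer above.

```python
import json
import re

# Assumption: invented sample that imitates the shape of the real script text.
script_text = """
SNB.data.profile = {"id": 1};
SNB.data.statuses = {"count": 2, "statuses": [
    {"id": 101, "description": "first article"},
    {"id": 102, "description": "second article"}
]};
SNB.data.other = {};
"""

# Anchor on the literal variable name (dots escaped), then lazily capture
# from the opening brace up to the first "};". DOTALL lets "." match newlines.
pattern = re.compile(r"SNB\.data\.statuses = ({.*?});", re.DOTALL)

data = json.loads(pattern.search(script_text).group(1))

# Collect the descriptions into a list instead of printing them one by one.
articles = [item["description"] for item in data["statuses"]]
print(articles)  # ['first article', 'second article']
```

The pattern is found by looking at the script text for the assignment that holds the data and matching everything between its opening brace and the closing `};`. One caveat of the lazy match: it would stop too early if a description itself contained the characters `};`.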
