Web scraping and saving the result as JSON


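I want to scrape a website with BeautifulSoup, like this:

1. From the home page, take just the names of the 40 categories.
2. Then go into each category, e.g. (startupstash.com/ideageneration/), where there will be some subcategories.
3. Now go into each subcategory, say the first one, startupstash.com/resource/milanote/, and take the content details.
4. Do the same for all 40 categories, every subcategory, and every subcategory's details.

Could someone please give me an idea of how to approach this, a way to do it with BeautifulSoup, or possibly some code? This is what I have tried so far: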

import requests 
from bs4 import BeautifulSoup 
headers={'User-Agent':'Mozilla/5.0'} 


base_url="http://startupstash.com/" 
req_home_page=requests.get(base_url,headers=headers) 
soup=BeautifulSoup(req_home_page.text, "html5lib") 
links_tag=soup.find_all('li', {'class':'categories-menu-item'}) 
titles_tag=soup.find_all('span',{'class':'name'}) 
links,titles=[],[] 

for link in links_tag: 
    links.append(link.a.get('href')) 
#print(links) 
for title in titles_tag: 
    titles.append(title.getText()) 
print("HOME PAGE TITLES ARE \n",titles)                
#HOME PAGE RESULT TITLE FINISH HERE 

for i in range(0, len(links)):
    req_inside_page = requests.get(links[i], headers=headers)
    page_store = BeautifulSoup(req_inside_page.text, "html5lib")
    jump_to_next = page_store.find_all('div', {'class': 'company-listing more'})
    nextlinks = []
    for div in jump_to_next:
        nextlinks.append(div.a.get("href"))
    print("DETAIL OF THE LINKS IN EVERY CATEGORIES SCRAPPED HERE \n", nextlinks)  # SCRAPED THE WEBSITES IN EVERY CATEGORY

    for j in range(0, len(nextlinks)):
        req_final_page = requests.get(nextlinks[j], headers=headers)
        page_stored = BeautifulSoup(req_final_page.text, 'html5lib')
        detail_content = page_stored.find('div', {'class': 'company-page-body body'})
        details, website = [], []
        for content in detail_content:
            details.append(content.string)
        print("DESCRIPTION ABOUT THE WEBSITE \n", details)  # SCRAPED THE DETAILS OF THE WEBSITE

        detail_website = page_stored.find('div', {'id': "company-page-contact-details"})
        table = detail_website.find('table')
        for tr in table.find_all('tr')[2:]:
            tds = tr.find_all('td')[1:]
            for td in tds:
                website.append(td.a.get('href'))
                print("VISIT THE WEBSITE \n", website)

Source

2017-04-26 pupu

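What exactly is your question? Please describe what you have tried and what you could not get working. Nobody is going to write the whole scraper for you. – VeGABAU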

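@VeGABAU I just need an approach for handling this whole site: from the first page I need all the category names, second, go into each category, and third, take the details section from the third-level page. – pupu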

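Well, first you need to add a "User-Agent" to your request headers to imitate a web browser (please don't abuse the site).
Then you can extract the links from the first page with this line: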

links = [ li.a.get('href') for li in soup.find_all('li', {'class':'categories-menu-item'}) ]
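
As a minimal sketch of that first step, the line above can be wrapped in a function that also picks up the category names. The sample HTML is only an assumption about how the menu is marked up (the `categories-menu-item` and `name` classes come from the question's code, and the second URL is made up), so the real page may differ. The function name `extract_categories` is hypothetical, and the built-in `html.parser` is used here so no extra parser has to be installed.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet standing in for the home page; the real markup may differ.
SAMPLE_HOME = """
<ul>
  <li class="categories-menu-item">
    <a href="http://startupstash.com/ideageneration/"><span class="name">Idea Generation</span></a>
  </li>
  <li class="categories-menu-item">
    <a href="http://startupstash.com/example-category/"><span class="name">Example Category</span></a>
  </li>
  <li class="categories-menu-item">no link in this item</li>
</ul>
"""


def extract_categories(html):
    """Return a list of {"name", "url"} dicts, one per category menu item."""
    soup = BeautifulSoup(html, "html.parser")
    categories = []
    for li in soup.find_all("li", {"class": "categories-menu-item"}):
        if li.a is None:  # skip menu items that carry no link
            continue
        name_tag = li.find("span", {"class": "name"})
        # Fall back to the link text when there is no <span class="name">.
        name = name_tag.get_text(strip=True) if name_tag else li.a.get_text(strip=True)
        categories.append({"name": name, "url": li.a.get("href")})
    return categories


categories = extract_categories(SAMPLE_HOME)
print(categories)
```

For the live site, the string passed in would be `requests.get(base_url, headers=headers).text` from the question's code.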

Then loop over those links and collect the links found on each of those pages:

links = [ div.a.get('href') for div in soup.find_all('div', { 'class' : 'company-listing-more' }) ]
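
A rough sketch of that second step follows. Note that the question searches for `'company-listing more'` while the line above uses `'company-listing-more'`; which one matches depends on the actual markup, and the sample HTML here simply assumes the latter. The function name `extract_resource_links` and the second sample URL are hypothetical. `urljoin` is used on the assumption that some links might be relative.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical category-page snippet; the real markup may differ.
SAMPLE_CATEGORY = """
<div class="company-listing-more"><a href="/resource/milanote/">More</a></div>
<div class="company-listing-more"><a href="http://startupstash.com/resource/example/">More</a></div>
<div class="company-listing-more">no link here</div>
"""


def extract_resource_links(html, page_url):
    """Return absolute URLs of the resource pages linked from one category page."""
    soup = BeautifulSoup(html, "html.parser")
    resource_links = []
    for div in soup.find_all("div", {"class": "company-listing-more"}):
        if div.a is None or not div.a.get("href"):
            continue  # skip listings without a usable link
        # Resolve relative hrefs against the URL of the page they came from.
        resource_links.append(urljoin(page_url, div.a.get("href")))
    return resource_links


resource_links = extract_resource_links(
    SAMPLE_CATEGORY, "http://startupstash.com/ideageneration/"
)
print(resource_links)
```

In a real loop each category URL would be fetched first and its `response.text` passed in. Keeping the category links and the resource links in differently named lists avoids overwriting the first list with the second.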

Finally, get the content:

content = soup.find('div', { 'class' : 'company-page-body body'}).text
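
None of the snippets so far covers the "save as JSON" part of the title, so here is one possible sketch using only the standard library. The record layout, the helper names `build_record` and `save_as_json`, the output file name, and the placeholder description are all assumptions made for illustration; the `description` field is where the `.text` string from the last step would go.

```python
import json
import os
import tempfile


def build_record(category_name, resources):
    """Shape one category as a JSON-friendly dict (hypothetical layout)."""
    return {
        "category": category_name,
        "resources": [
            {"url": url, "description": description.strip()}
            for url, description in resources
        ],
    }


def save_as_json(records, path):
    """Write the scraped records to `path` as UTF-8 JSON."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(records, fh, ensure_ascii=False, indent=2)


# Placeholder data standing in for scraped text; not real site content.
records = [
    build_record(
        "Idea Generation",
        [("http://startupstash.com/resource/milanote/", "  Placeholder description.  ")],
    )
]

# Written to the temp directory here only to keep the example side-effect free.
out_path = os.path.join(tempfile.gettempdir(), "startupstash_sample.json")
save_as_json(records, out_path)

# Read the file back to confirm it round-trips.
with open(out_path, encoding="utf-8") as fh:
    loaded = json.load(fh)
print(loaded)
```

Since `soup.find(...)` returns `None` when the `div` is missing, checking the result before calling `.text` would keep one odd page from stopping the whole run.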

Source

2017-04-26 20:00:15

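Heartfelt thanks, dear @adam – pupu

It isn't hard; all you need to do is inspect the HTML and pick the right tags –

@adam, I tried several more approaches, but nothing gets executed. I have changed the code above; if possible, please check it once and let me know what the error is in the home page part. The only output was "finished in x seconds"... – pupu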
