2016-10-13

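Python append() and removing HTML tags

I need some help. My output looks wrong. How can I append the values of dept, job_title, and job_location correctly? Also, the dept values contain HTML tags. How do I remove those tags?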

My code:

import requests
from bs4 import BeautifulSoup
from pprint import pprint

response = requests.get("http://hortonworks.com/careers/open-positions/")
soup = BeautifulSoup(response.text, "html.parser")

jobs = []

div_main = soup.select("div#careers_list")

for div in div_main:
    dept = div.find_all("h4", class_="department_title")
    div_career = div.find_all("div", class_="career")
    title = []
    location = []
    for dv in div_career:
        job_title = dv.find("div", class_="title").get_text().strip()
        title.append(job_title)
        job_location = dv.find("div", class_="location").get_text().strip()
        location.append(job_location)

    job = {
        "job_location": location,
        "job_title": title,
        "job_dept": dept
    }
    jobs.append(job)
pprint(jobs)

It should look like:

{'job_dept': 'Consulting',
 'job_location': 'Chicago, IL',
 'job_title': 'Sr. Consultant - Central'}

One value for each variable.

+1

Please show your output...

+0

The output shows job_dept: all departments, job_location: all locations, job_title: all titles.

Answer


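The HTML structure is sequential, not hierarchical: the department headings and the job entries are siblings. So you have to iterate through the list of elements and update the department title as you go: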

import requests
from bs4 import BeautifulSoup, Tag
from pprint import pprint

headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:21.0) Gecko/20130331 Firefox/21.0'}
response = requests.get("http://hortonworks.com/careers/open-positions/", headers=headers)

soup = BeautifulSoup(response.text, "html.parser")

jobs = []

div_main = soup.select("div#careers_list")

for div in div_main:
    # Most recent department heading seen so far
    department_title = ""
    # Walk the direct children in document order
    for element in div:
        # Skip text nodes and tags without a class attribute
        if isinstance(element, Tag) and "class" in element.attrs:
            if "department_title" in element.attrs["class"]:
                department_title = element.get_text().strip()
            elif "career" in element.attrs["class"]:
                location = element.select("div.location")[0].get_text().strip()
                title = element.select("div.title")[0].get_text().strip()
                job = {
                    "job_location": location,
                    "job_title": title,
                    "job_dept": department_title
                }
                jobs.append(job)

pprint(jobs)

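To try the "remember the last heading" idea without hitting the live site, here is a minimal sketch that uses only the standard library's `html.parser` in place of BeautifulSoup. The markup in `SAMPLE_HTML`, the `JobParser` class, and the `parse_jobs` helper are all hypothetical; the sample only assumes the flat layout described above, and every entry other than the Consulting one is made up for illustration.

```python
from html.parser import HTMLParser

# Hypothetical markup that mimics the flat layout described above:
# headings and job entries are siblings, not parent and child.
SAMPLE_HTML = """
<div id="careers_list">
  <h4 class="department_title">Consulting</h4>
  <div class="career">
    <div class="title">Sr. Consultant - Central</div>
    <div class="location">Chicago, IL</div>
  </div>
  <h4 class="department_title">Engineering</h4>
  <div class="career">
    <div class="title">Software Engineer</div>
    <div class="location">Santa Clara, CA</div>
  </div>
  <div class="career">
    <div class="title">QA Engineer</div>
    <div class="location">Remote</div>
  </div>
</div>
"""


class JobParser(HTMLParser):
    """Collects one dict per job, tagging each with the most recent heading."""

    def __init__(self):
        super().__init__()
        self.jobs = []
        self.current_dept = ""  # last department heading seen so far
        self.field = None       # which text field is being read, if any

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if "department_title" in classes:
            self.current_dept = ""
            self.field = "dept"
        elif "career" in classes:
            # Start a new job under whatever department came last.
            self.jobs.append(
                {"job_dept": self.current_dept, "job_title": "", "job_location": ""}
            )
        elif "title" in classes and self.jobs:
            self.field = "job_title"
        elif "location" in classes and self.jobs:
            self.field = "job_location"

    def handle_data(self, data):
        if self.field == "dept":
            self.current_dept += data.strip()
        elif self.field:
            self.jobs[-1][self.field] += data.strip()

    def handle_endtag(self, tag):
        # Any closing tag ends the text field currently being read.
        self.field = None


def parse_jobs(html):
    parser = JobParser()
    parser.feed(html)
    return parser.jobs


jobs = parse_jobs(SAMPLE_HTML)
```

Each job ends up as its own dict with a single department, title, and location, which is the shape the question asks for. Because the text is collected as plain strings, no HTML tags remain in `job_dept`.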
+0

I get this error when I run this: if isinstance(element, Tag) and element.attrs.has_key("class"): AttributeError: 'dict' object has no attribute 'has_key'

+0

I updated my answer so it works with Python 3. – nullop
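For context on that error: as the message itself says, the attribute mapping is a plain dict, and Python 3 dicts no longer have `has_key()`. A small sketch of the replacement follows, using an ordinary dict with illustrative values as a stand-in for a tag's attributes.

```python
# A plain dict standing in for a tag's attribute mapping (illustrative values).
attrs = {"class": ["career"], "id": "job-1"}

# Python 2 only: attrs.has_key("class")
# Python 3 removed dict.has_key(); the `in` operator works in both versions.
has_class = "class" in attrs

# dict.get() avoids a KeyError when the attribute is absent.
classes = attrs.get("class", [])
is_career = "career" in classes

# Check whether the old method is still available (it is not on Python 3).
has_key_available = hasattr(attrs, "has_key")
```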

+0

Wow. Amazing. It works well, and the output is correct. I'm using PyCharm. In the part "job_dept": department_title, department_title is highlighted. It says: Name 'department_title' can be not defined
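That message is a static-analysis hint, not a runtime error. One plausible cause, assuming the version that was run assigned `department_title` only inside the `if` branch, is that the checker cannot prove the name exists when a job entry comes before any heading. Giving the variable a default before the loop removes the doubt. A minimal sketch with hypothetical names (`label_items`, `rows`):

```python
def label_items(rows):
    """Pair each item with the most recent heading seen before it."""
    current_heading = ""  # defined up front, so it always exists when read
    labelled = []
    for kind, text in rows:
        if kind == "heading":
            current_heading = text
        elif kind == "item":
            labelled.append((current_heading, text))
    return labelled


# Illustrative input: the first item appears before any heading.
rows = [
    ("item", "orphan"),
    ("heading", "Consulting"),
    ("item", "Sr. Consultant - Central"),
]
result = label_items(rows)
```

Without the default on the first line of the function, the "orphan" item would raise a NameError at runtime, which is the situation the warning points at.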