2012-02-02 22 views

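Python: scraping search results from monster.com

I have seen the examples for scraping Google results, but they do not apply here. I simply want to go into the code, change the parameters, and have the script run the search and scrape the job title, location, and date. This is what I have so far. Any help would be great, and thanks in advance.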

I want the script to run a search on monster.com with the given parameters (Software, Engineer, CA) and scrape the results.

#! /usr/bin/python 
import re 
import requests 
from urllib import urlopen 
from BeautifulSoup import BeautifulSoup 

parameters = ["Software","Engineer","CA"] 
base_url = "http://careers.boozallen.com/search?q=" 
search_string = "+".join(parameters) 

final_url = base_url + search_string 

a = requests.get(final_url) 
raw_string = a.text.strip() 


soup = BeautifulSoup(raw_string) 

job_urls = soup.findAll(name = 'a', attrs = { 'class': 'jobTitle fnt11_js' }) 

for job_url in job_urls: 

    print job_url.text 
    print 

raw_input("Press enter to close: ") 

I know that the code below works as a standard scrape.

handle = urlopen("http://jobsearch.monster.com/search/Engineer_5?q=Software&where=AZ&rad=20&sort=rv.di.dt") 
responce = handle.read() 
soup = BeautifulSoup(responce) 

job_urls = soup.findAll(name = 'a', attrs = { 'class': 'jobTitle fnt11_js' }) 
for job_url in job_urls: 
    print job_url.text 
    print 
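
One way to make the search terms easy to change is to keep them as plain values and let the standard library encode the query string. The sketch below is written for Python 3 and assumes that the `q`, `where`, and `rad` query parameters seen in the monster.com URL above mean keyword, location, and radius. The `build_search_url` helper and the bare `/search/` path are illustrative assumptions, not something documented by the site, which may also have changed since.

```python
from urllib.parse import urlencode

# Base path is an assumption modelled on the URL in the question.
BASE_URL = "http://jobsearch.monster.com/search/"

def build_search_url(keywords, location, radius=None):
    """Return a search URL for the given keywords and location (hypothetical helper)."""
    params = {"q": " ".join(keywords), "where": location}
    if radius is not None:
        params["rad"] = radius
    # urlencode escapes spaces as "+" and joins the pairs with "&".
    return BASE_URL + "?" + urlencode(params)

print(build_search_url(["Software", "Engineer"], "CA"))
```

The resulting string can then be passed to `requests.get` or `urlopen` in place of the hard-coded URL.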
Maybe you need to put "&" instead of "+" in your search_string? – 2012-02-02 16:39:24

Tried that, still no results. Thanks. Why was this downvoted? I am just asking for help with my project. I thought it would work, and I need help. – Garrett 2012-02-02 16:41:57

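What exactly are you looking for? If you have a question about something specific that is not working, you can ask it, but we cannot solve the whole problem for you. – silent1mezzo 2012-02-02 16:33:23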

Answers


If you point your browser at http://careers.boozallen.com/search?q=software+engineer+CA and inspect the HTML, you will see markup like this:

<tr class="dbOutputRow2"> 
    <td style="width: 400px;" class="colTitle" headers="hdrTitle"><span class="jobTitle"><a href="http://careers.boozallen.com/job/San-Diego-Network-Engineer%2C-Senior-Job-CA-92101/1645793/">Network Engineer, Senior Job</a></span></td> 
    <td style="width: auto;" class="colLocation" headers="hdrLocation"><span class="jobLocation">San Diego, CA, US</span></td> 
    <td style="width: 155px;" class="colDate" headers="hdrDate" nowrap="nowrap"><span class="jobDate">Jan 5, 2012</span></td> 

The information you are looking for is in the <span> tags whose class attribute equals jobTitle, jobLocation, or jobDate.

Here is how you could scrape those bits using lxml:

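import urllib2  # Python 2; in Python 3 use urllib.request instead
import lxml.html as LH

url = 'http://careers.boozallen.com/search?q=software+engineer+CA'
doc = LH.parse(urllib2.urlopen(url))

def text_content(iterable):
    # Yield the text of each element, including text inside nested tags.
    for elt in iterable:
        yield elt.text_content()

# Select every span holding a title, location, or date, in document order.
data = text_content(doc.xpath('''//span[@class = "jobTitle"
                                     or @class = "jobLocation"
                                     or @class = "jobDate"]'''))

# zip(*[data]*3) pulls three consecutive items from the same generator per row.
for title, location, date in zip(*[data]*3):
    print(title,location,date)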

which yields

('Title', 'Location', 'Date') 
('Network Engineer, Senior Job', 'San Diego, CA, US', 'Jan 5, 2012') 
('Network Integration Engineer, Mid Job', 'San Diego, CA, US', 'Jan 12, 2012') 
('Systems Engineer, Senior Job', 'San Diego, CA, US', 'Jan 31, 2012') 
('Enterprise Architect, Senior Job', 'Washington, DC, US', 'Jan 23, 2012') 
... 
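
The `zip(*[data]*3)` call groups the flat stream into rows because the list holds three references to the same iterator, so each output tuple consumes three consecutive items. A small offline illustration in Python 3, with placeholder values standing in for the scraped text:

```python
# Placeholder values standing in for scraped span text (not real results).
flat = iter(["Title A", "City A", "Jan 1", "Title B", "City B", "Jan 2"])

# Three references to one iterator: zip pulls three consecutive items per row.
rows = list(zip(*[flat] * 3))

for title, location, date in rows:
    print(title, location, date)
```

Note that `zip` stops at the shortest input, so a trailing group with fewer than three items is silently dropped.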
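Thanks for the response, but is there any way to change things so that it works with an import of xml.etree.ElementTree or similar? I cannot use lxml in my current environment. Thanks. – Garrett 2012-02-03 16:07:35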
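
For environments without lxml, one possible direction is the standard library's `html.parser` module (named `HTMLParser` in Python 2). `xml.etree.ElementTree` expects well-formed XML, which real HTML pages often are not. The following is only a sketch for Python 3, run against a trimmed copy of the row shown in the answer rather than the live page. The `JobSpanParser` class and the `WANTED` and `SAMPLE` names are illustrative, not part of any library.

```python
from html.parser import HTMLParser

WANTED = ("jobTitle", "jobLocation", "jobDate")

class JobSpanParser(HTMLParser):
    """Collect the text inside <span> tags whose class is in WANTED."""

    def __init__(self):
        super().__init__()
        self.depth = 0      # greater than 0 while inside a wanted span
        self.buffer = []    # text pieces of the span being read
        self.values = []    # finished span texts, in document order

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        if self.depth:
            self.depth += 1  # nested span inside a wanted one
        elif dict(attrs).get("class") in WANTED:
            self.depth = 1
            self.buffer = []

    def handle_endtag(self, tag):
        if tag == "span" and self.depth:
            self.depth -= 1
            if self.depth == 0:
                self.values.append("".join(self.buffer).strip())

    def handle_data(self, data):
        if self.depth:
            self.buffer.append(data)

# Trimmed copy of the row from the answer, used here instead of a live request.
SAMPLE = '''<tr class="dbOutputRow2">
<td class="colTitle"><span class="jobTitle"><a href="#">Network Engineer, Senior Job</a></span></td>
<td class="colLocation"><span class="jobLocation">San Diego, CA, US</span></td>
<td class="colDate"><span class="jobDate">Jan 5, 2012</span></td>
</tr>'''

parser = JobSpanParser()
parser.feed(SAMPLE)
parser.close()

# Group the flat list into (title, location, date) rows.
rows = list(zip(*[iter(parser.values)] * 3))
print(rows)
```

To use it on the real page, the downloaded HTML text would be passed to `parser.feed` in place of `SAMPLE`.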
