2012-09-12 103 views

Python mechanize: unable to parse a POST form for automated submission

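I want to download a page, fill in a form, and submit it. I like Python and came across mechanize. I can download the page successfully and verify that it contains two forms, but mechanize does not recognize the second form (method POST), even though I can confirm that the page data mechanize downloaded clearly contains it. As a result, I cannot even modify the values and submit the form I am interested in. I am on Python 2.6.1 on OS X 10.6.8. Any suggestions are much appreciated.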

My code:

import mechanize 
br = mechanize.Browser() 
br.set_handle_robots(False) # no robots 
br.set_handle_refresh(False) # can sometimes hang without this 
br.addheaders = [('User-agent', 'Mozilla/6.0 (X11; U; i686; en-US; rv:1.9.0.1) Gecko/2008071615 OS X 10.2 Firefox/3.0.1')] 
url = 'http://www.abcd.com/test.html' 
response = br.open(url) 

Using response.read() or response.get_data(), I can verify that the page contains two forms, as shown below:

<form id="lookupFormX" action="/lookup/" onSubmit="return submitLookupForm('lookupForm', 'download');" method="GET"> 

       <label style="font-weight:normal; font-size:85%; margin-right:5px;">View a Site Report </label> 
       <input type="hidden" name="facet" style="margin-right:2px; font-weight:normal; font-size:85%;" value="sitereport" readonly/> 

       <input style="margin-right:2px; font-weight:normal; font-size:85%;" name="q" type="text" id="railtext_v11pt" value="e.g. yahoo.com" 
         onfocus="clearDefaultNote(this,'e.g. yahoo.com');" 
         onblur="addDefaultNote(this,'e.g. yahoo.com');" /> 
       <a style="margin-right:10px;" href="#" onclick="submitLookupForm('lookupFormX');"><img src="/images/nav_right.gif" /></a> 
      </form> 

<br> 
<FORM action="userfeedbackpost.html" id="friendForm" name="friendForm" method="post"> 
<TABLE id="userfeedbacktable" BORDER=0 style="padding:left:0px; margin-left:0px;"> 
    <TR> 
     <TD style="width:200px;padding-left:10px">Your Name:</TD> 
     <TD style="width:200px" ><input name="your_name" type="text" SIZE=35/></TD> 

     <TD style="width:250px;text-align:right;padding-right:10px">Your E-mail:</TD> 
     <TD style="width:140px" ><input name="your_email" type="text" SIZE=35/></TD> 
    </TR> 
    <!-- <TR></TR> --> 
    <TR> 
     <TD style="width:200px;padding-left:10px">Subject:</TD> 
     <TD colspan="3" ><input name="subject" type="text" style="width:648px" SIZE=106/></TD> 
    </TR> 
    <!-- <TR></TR> --> 
    <TR> 
     <TD style="width:200px;padding-left:10px">URL this concerns:</TD> 
     <TD colspan="3" ><input name="url" type="text" style="width:648px" SIZE=106/></TD> 
    </TR> 
    <!-- <TR></TR> --> 
    <TR> 
     <TD style="width:200px;padding-left:10px">User ID:</TD> 
     <TD style="width:200px" ><input name="test_id" type="text" SIZE=35/></TD> 

     <TD style="width:250px;text-align:right;padding-right:10px">Type of inquiry:</TD> 
     <TD style="width:140px" > 
      <SELECT name="type" id="type" style="width:262px" onchange="makeSelection()"> 
       <OPTION value="Choose">Choose One</OPTION> 
       <OPTION value="Bug report">Report an error</OPTION> 
       <OPTION value="Helpful Information">Send us a suggestion</OPTION> 
       <OPTION value="Other">Other</OPTION> 
      </SELECT> 
     </TD> 
    </TR> 
    <!-- <TR></TR> --> 
    <TR id="infoPanel" style="display:none"> 
     <TD style="width:200px;padding-left:10px">Facet in question:</TD> 
     <TD style="width:200px" > 
      <SELECT name="facet" style="width:263px" id="facet"> 
       <OPTION selected value="Choose">Choose One</OPTION> 
       <OPTION value="Annoyances">Annoyances</OPTION> 
       <OPTION value="Downloads">Downloads</OPTION> 
       <OPTION value="Links">Links</OPTION> 
      </SELECT> 
     </TD> 

     <TD style="width:250px;text-align:right;padding-right:10px">Are you the site owner?:</TD> 
     <TD style="width:140px" > 
      <input type="radio" id="siteowner_yes" name="siteowner" value="Yes">&nbsp;Yes&nbsp;&nbsp;&nbsp; 
      <input type="radio" id="siteowner_no" name="siteowner" value="No" checked>&nbsp;No 
     </TD> 
    </TR> 
    <!-- <TR></TR> --> 
    <TR> 
     <TD style="width:200px;padding-left:10px" >Your Message:</TD> 
     <TD colspan=3><textarea class=userfeedbackTA NAME=message ROWS=12 COLS=80 style="width:646px;"></textarea></TD> 
    </TR> 
    <!-- <TR></TR> --> 
</TABLE> 

<br/><br/> <a href="javascript:document.getElementById('friendForm').submit();" class="btnOrangeLrg"><span>Send Your Feedback or Question.</span></a><br/> 
<br/><br/> P.S. We will use the information above only to help provide you feedback. This information will not be used for any other purpose. 

</FORM> 

However, mechanize lists only the following:

Form name: None 
<GET http://www.test.com/lookup/ application/x-www-form-urlencoded 
    <HiddenControl(facet=sitereport) (readonly)> 
    <TextControl(q=e.g. yahoo.com)>> 

when I run this code:

for form in br.forms(): 
    print "Form name:", form.name 
    print form 

My question: how can I access the second form? (Using nr=1 gives me an error.)

Edit:

I also tried this version, with the same result: the second form does not show up.

request = mechanize.Request(url) 
request.add_header("User-agent", "Mozilla/6.0 (X11; U; i686; en-US; rv:1.9.0.1) Gecko/2008071615 OS X 10.2 Firefox/3.0.1") 
response = mechanize.urlopen(request) 
forms = mechanize.ParseResponse(response, backwards_compat=False) 
response.close() 

for form in forms: 
    print form 

Edit 2:

I also tried changing my code to look like this:

import cookielib

# Cookie Jar
cj = cookielib.LWPCookieJar()
br.set_cookiejar(cj)

# Browser options 
br.set_handle_equiv(True) 
br.set_handle_redirect(True) 
br.set_handle_referer(True) 
br.set_handle_robots(False) 

# Follows refresh 0 but not hangs on refresh > 0 
br.set_handle_refresh(mechanize._http.HTTPRefreshProcessor(), max_time=1) 

br.addheaders = [ 
('Cookie','mbox=PC#1327356910232-537677#1410633293|check#true#1347561353|session#1347561287712-498080#1347563153; s_cc=true; s_sq=%5B%5BB%5D%5D; s_nr=1347561671754-Repeat'),\ 
('Accept-Charset','ISO-8859-1,utf-8;q=0.7,*;q=0.3'),\ 
('Accept-Encoding','gzip,deflate,sdch'),\ 
('Accept-Language','en-US,en'),\ 
('Cache-Control','max-age=0'),\ 
('Connection','keep-alive'),\ 
('Referer','http://www.siteadvisor.com'),\ 
('User-Agent','Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.89 Safari/537.1') 
] 

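I picked up the header values from my browser and tried to plug them into the mechanize browser instance. However, I still see only the one form.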

Answers

0

You should have provided the URL, because when I put the HTML you posted (with your forms) into a variable named text, this is what happens:

In [61]: forms = mech.ParseString(text, 'fake') # imported mechanize as mech 

In [62]: for form in forms: print form; print '-'*5 
    ....: 
<GET fake application/x-www-form-urlencoded> 
----- 
<GET /lookup/ application/x-www-form-urlencoded 
    <HiddenControl(facet=sitereport) (readonly)> 
    <TextControl(q=e.g. yahoo.com)>> 
----- 
<friendForm POST userfeedbackpost.html application/x-www-form-urlencoded 
    <TextControl(your_name=)> 
    <TextControl(your_email=)> 
    <TextControl(subject=)> 
    <TextControl(url=)> 
    <TextControl(test_id=)> 
    <SelectControl(type=[*Choose, Bug report, Helpful Information, Other])> 
    <SelectControl(facet=[*Choose, Annoyances, Downloads, Links])> 
    <RadioControl(siteowner=[Yes, *No])> 
    <TextareaControl(message=)>> 
----- 

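The first one is the default form (added by the parser); ignore it. After that, both of your forms are parsed without any trouble.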
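Once the POST form does show up, selecting it by name and filling it in could look like the sketch below. The helper fill_feedback_form and the sample values are hypothetical; the control names are taken from the form dump above. The br argument is assumed to behave like mechanize.Browser, that is, to offer select_form(name=...) and item assignment.

```python
def fill_feedback_form(br, fields, form_name="friendForm"):
    """Select a form by name on a mechanize-style browser and set its controls.

    `br` is assumed to behave like mechanize.Browser: it must provide
    select_form(name=...) and support item assignment (br["control"] = value).
    """
    br.select_form(name=form_name)
    for control_name, value in fields.items():
        br[control_name] = value
    return br


# Hypothetical sample values; only the control names come from the page.
example_fields = {
    "your_name": "Jane Doe",
    "your_email": "jane@example.com",
    "subject": "Test",
    "type": ["Bug report"],  # select controls take a list of values
    "siteowner": ["No"],     # radio controls also take a list
}
```

With a real browser you would call br.open(url) first, then fill_feedback_form(br, example_fields), and finally br.submit(). Selecting by name rather than by nr avoids depending on how many forms the parser happens to report.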


Yes, you are right. The URL is http://www.siteadvisor.com/userfeedback.html – JPK

2

I had a similar problem, and I found that modifying my headers and including RobustFactory() to handle 'bad' HTML solved it.

 
import mechanize
br = mechanize.Browser(factory=mechanize.RobustFactory())
br.set_handle_robots(False)
br.addheaders = [('User-agent','Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.6) Gecko/20070725 Firefox/2.0.0.6')]

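I arrived at this after a lot of fiddling with the headers. The solution works in the general case as well as for the specific URL I was using, but adding: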

 
br.addheaders.append(['Accept-Encoding', 'gzip'])

may be necessary if the URL you are trying to access is gzip-compressed. You can check whether that is the case here: http://checkgzipcompression.com/
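You can also check locally whether the body you received is still gzip-compressed, since a form parser fed compressed bytes would not see any HTML at all. The sketch below uses only the standard library and is written for Python 3; maybe_gunzip is a hypothetical helper name, and the check relies on the two-byte gzip magic number at the start of the data.

```python
import gzip
import io

GZIP_MAGIC = b"\x1f\x8b"  # the first two bytes of any gzip stream


def maybe_gunzip(body):
    """Return the decompressed body if it is gzip data, else return it unchanged."""
    if body[:2] == GZIP_MAGIC:
        return gzip.GzipFile(fileobj=io.BytesIO(body)).read()
    return body
```

If the value returned differs from the raw body, the server sent compressed content and the page needs to be decompressed before any forms can be found in it.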