Submitting data through a web form and extracting the results

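My Python level is novice. I have never written a web scraper or crawler. I have written Python code that connects to an API and extracts the data I want, but for some of the extracted data I also want to get the gender of the author. I found this website, http://bookblog.net/gender/genie.php, but the downside is that there is no API available. I was wondering how to write a Python script that submits data to the form on the page and extracts the returned data. It would be a great help if I could get some guidance on this.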

Here is the form DOM:

<form action="analysis.php" method="POST"> 
<textarea cols="75" rows="13" name="text"></textarea> 
<div class="copyright">(NOTE: The genie works best on texts of more than 500 words.)</div> 
<p> 
<b>Genre:</b> 
<input type="radio" value="fiction" name="genre"> 
fiction&nbsp;&nbsp; 
<input type="radio" value="nonfiction" name="genre"> 
nonfiction&nbsp;&nbsp; 
<input type="radio" value="blog" name="genre"> 
blog entry 
</p> 
<p> 
</form>

The DOM of the results page:

<p> 
<b>The Gender Genie thinks the author of this passage is:</b> 
male! 
</p>

Source

2011-12-04 Null-Hypothesis

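There is no need to use mechanize; just send the correct form data in a POST request.

Also, using regular expressions to parse HTML is a bad idea. You are better off using an HTML parser such as lxml.html.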

import requests
import lxml.html as lh


def gender_genie(text, genre):
    url = 'http://bookblog.net/gender/analysis.php'
    caption = 'The Gender Genie thinks the author of this passage is:'

    # Field names match the form's textarea and radio inputs.
    form_data = {
        'text': text,
        'genre': genre,
        'submit': 'submit',
    }

    response = requests.post(url, data=form_data)

    tree = lh.document_fromstring(response.content)

    # The verdict is the text right after the <b> caption, i.e. its "tail".
    return tree.xpath("//b[text()=$caption]", caption=caption)[0].tail.strip()


if __name__ == '__main__':
    print(gender_genie('I have a beard!', 'blog'))
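
If installing third-party packages is a hurdle, the same idea (POST the form fields, then read the text after the bold caption) can be sketched with the Python 3 standard library alone. This is only a sketch: the names `build_request`, `CaptionTailParser`, `parse_result` and `gender_genie_stdlib` are made up for illustration, the `submit` field is copied from the answer above, and it assumes the site still responds with the result markup shown in the question.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode
from urllib.request import Request, urlopen

URL = "http://bookblog.net/gender/analysis.php"
CAPTION = "The Gender Genie thinks the author of this passage is:"


def build_request(text, genre):
    """Build (but do not send) the POST request for the analysis form."""
    form_data = {"text": text, "genre": genre, "submit": "submit"}
    body = urlencode(form_data).encode("ascii")
    # Passing data= makes urllib send a POST with a form-encoded body.
    return Request(URL, data=body)


class CaptionTailParser(HTMLParser):
    """Collect the first non-empty text that directly follows <b>CAPTION</b>."""

    def __init__(self):
        super().__init__()
        self.in_bold = False
        self.bold_text = ""
        self.capturing = False
        self.result = None

    def handle_starttag(self, tag, attrs):
        # Any new tag ends the text run that follows the caption.
        self.capturing = False
        if tag == "b":
            self.in_bold = True
            self.bold_text = ""

    def handle_endtag(self, tag):
        if tag == "b" and self.in_bold:
            self.in_bold = False
            # Start capturing only if this <b> held the caption.
            self.capturing = self.bold_text.strip() == CAPTION
        else:
            self.capturing = False

    def handle_data(self, data):
        if self.in_bold:
            self.bold_text += data
        elif self.capturing and self.result is None and data.strip():
            self.result = data.strip()


def parse_result(html):
    """Return the text after the caption, or None if the caption is absent."""
    parser = CaptionTailParser()
    parser.feed(html)
    parser.close()
    return parser.result


def gender_genie_stdlib(text, genre):
    """Send the form and parse the reply (needs network access)."""
    with urlopen(build_request(text, genre)) as response:
        html = response.read().decode("utf-8", errors="replace")
    return parse_result(html)
```

Calling `gender_genie_stdlib('I have a beard!', 'blog')` would perform the actual request; `build_request` and `parse_result` can be tried offline against the markup from the question.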

Source

2011-12-04 17:59:32 Acorn

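I tried to do easy_install lxml.html but got the following error: easy_install lxml.html Searching for lxml.html Reading http://pypi.python.org/simple/lxml.html/ Couldn't find index page for 'lxml.html' (maybe misspelled?) Scanning index of all packages (this may take a while) Reading http://pypi.python.org/simple/ No local packages or download links found for lxml.html error: Could not find suitable distribution for Requirement.parse('lxml.html')

In a module import, a "." between two names means that the second name lives inside the first one. The module you want to install is lxml. – Acorn

Thanks, I realized that right after posting the comment. Thanks again.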
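
To make the point about dotted names concrete, here is a small sketch using a standard-library package, so it runs without installing anything. The helper `top_level_package` is a made-up name. Keep in mind that the name you install does not always equal the top-level import name; for lxml it does.

```python
import importlib


def top_level_package(dotted_name):
    """Return the part before the first dot, i.e. the outermost package."""
    return dotted_name.partition(".")[0]


# xml.etree is a subpackage that lives inside the standard-library package xml,
# in the same way that lxml.html lives inside lxml.
submodule = importlib.import_module("xml.etree")
assert submodule.__name__ == "xml.etree"

# The thing to install for "import lxml.html" is the outer package.
print(top_level_package("lxml.html"))
```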

You can use mechanize; see the examples for details.

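from mechanize import ParseResponse, urlopen, urljoin

uri = "http://bookblog.net"

# Fetch the page and parse the forms it contains.
response = urlopen(urljoin(uri, "/gender/genie.php"))
forms = ParseResponse(response, backwards_compat=False)
form = forms[0]

# print(form)

# A radio group takes a list containing the selected value.
form['text'] = 'cheese'
form['genre'] = ['fiction']

# form.click() builds the request that submitting the form would send.
print(urlopen(form.click()).read())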
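
Before filling in a form it helps to know which fields it expects. The sketch below is not mechanize; it is a rough standard-library approximation (Python 3) that lists the action, method and field names of the form from the question. `FormInspector` and `inspect_form` are hypothetical names, and `FORM_HTML` is a trimmed copy of the form DOM posted above.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Trimmed copy of the form markup from the question.
FORM_HTML = """
<form action="analysis.php" method="POST">
<textarea cols="75" rows="13" name="text"></textarea>
<input type="radio" value="fiction" name="genre">
<input type="radio" value="nonfiction" name="genre">
<input type="radio" value="blog" name="genre">
</form>
"""


class FormInspector(HTMLParser):
    """Record the form's action, method, and named fields."""

    def __init__(self):
        super().__init__()
        self.action = None
        self.method = None
        self.fields = {}  # field name -> list of radio choices (may be empty)

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            self.action = attrs.get("action")
            self.method = (attrs.get("method") or "GET").upper()
        elif tag in ("input", "textarea", "select") and "name" in attrs:
            choices = self.fields.setdefault(attrs["name"], [])
            if attrs.get("type") == "radio":
                choices.append(attrs.get("value"))


def inspect_form(html):
    inspector = FormInspector()
    inspector.feed(html)
    inspector.close()
    return inspector.action, inspector.method, inspector.fields


action, method, fields = inspect_form(FORM_HTML)
# The action is relative to the page that contains the form.
target = urljoin("http://bookblog.net/gender/genie.php", action)
print(method, target, fields)
```

The resolved target is the analysis.php URL under /gender/, and the field names are `text` and `genre`, which are the keys set on the form above.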

Source

2011-12-04 17:39:26

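Thank you very much for the reply. It sounds like mechanize is a module I have to install? A quick test in the terminal gave a no-module error. I'm on a Mac, so I should be able to use easy_install to get mechanize.

Oh, right, it is an external module. Yes, you can do easy_install mechanize.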

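You can use mechanize to submit the form and retrieve the content, and the re module to pull out what you want. For example, the script below runs against the text of your own question: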

import re 
from mechanize import Browser 

text = """ 
My python level is Novice. I have never written a web scraper 
or crawler. I have written a python code to connect to an api and 
extract the data that I want. But for some the extracted data I want to 
get the gender of the author. I found this web site 
http://bookblog.net/gender/genie.php but downside is there isn't an api 
available. I was wondering how to write a python to submit data to the 
form in the page and extract the return data. It would be a great help 
if I could get some guidance on this.""" 

browser = Browser() 
browser.open("http://bookblog.net/gender/genie.php") 

browser.select_form(nr=0) 
browser['text'] = text 
browser['genre'] = ['nonfiction'] 

response = browser.submit() 

content = response.read() 

result = re.findall(
    r'<b>The Gender Genie thinks the author of this passage is:</b> (\w*)!', content) 

print(result[0])

What does it do? It creates a mechanize.Browser and goes to the given URL:

browser = Browser() 
browser.open("http://bookblog.net/gender/genie.php")

Then it selects the form (since there is only one form to fill in, it will be the first):

browser.select_form(nr=0)

Then it sets the form's entries...

browser['text'] = text 
browser['genre'] = ['nonfiction']

...and submits it:

response = browser.submit()

Now we get the result:

content = response.read()

We know the result has this form:

<b>The Gender Genie thinks the author of this passage is:</b> male!

So we create a regular expression that matches it and use re.findall():

result = re.findall(
    r'<b>The Gender Genie thinks the author of this passage is:</b> (\w*)!', 
    content)

Now the result is available for you to use:

print(result[0])
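
One caveat with indexing `result[0]`: it raises an IndexError when the page contains no match. The result DOM in the question also shows the verdict on the line after the caption, so the response may have more than a single space there. The following is a slightly more forgiving sketch, not part of the script above; `extract_gender` is a hypothetical helper name, and it accepts either text or bytes since the type of `response.read()` depends on the Python version.

```python
import re

# \s* tolerates spaces or line breaks between the caption and the verdict.
RESULT_RE = re.compile(
    r"<b>The Gender Genie thinks the author of this passage is:</b>\s*(\w+)!"
)


def extract_gender(content):
    """Return the verdict word (e.g. 'male'), or None if it is not found."""
    if isinstance(content, bytes):
        content = content.decode("utf-8", errors="replace")
    match = RESULT_RE.search(content)
    return match.group(1) if match else None
```

With this helper, the last two steps would become `gender = extract_gender(content)`, followed by a check for `None` before using the value.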

Source

2011-12-04 17:48:29 brandizzi

Thanks a lot. Such a clear explanation for a newcomer like me; this is a fantastic answer. I wish I could upvote it more than once ;)
