Python Wikipedia automatic downloader

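[Using Python 3.1] Does anyone have an idea how to make a Python 3 application that lets the user write a text file containing several words separated by commas? The program should read the file and download the Wikipedia page for each requested item. For example, if they typed hello, python-3, chicken it would go to Wikipedia and download http://www.wikipedia.com/wiki/hello, http://www.wikip ...... Does anyone think they can do this?

When I say "download", I mean downloading the text; images don't matter.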

Source

2011-03-11 Alex

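This sounds like homework to me. If you want some help, put in some effort and show us some code. – ierax 2011-03-11 21:28:51

I have an idea how to make it, yes. Tell me yours and I'll tell you mine. – 2011-03-11 21:30:48

^^^^ bofh! ^^^^ – tiagoboldt 2011-03-11 21:42:11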

Look up urllib.request.

Source
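As a rough illustration of that pointer, here is a minimal sketch of fetching a page with urllib.request. The names USER_AGENT, build_request and fetch_text are made up for this example, the User-Agent string is a placeholder, and the code assumes a reasonably recent Python 3 rather than 3.1 specifically.

```python
import urllib.request

# Hypothetical identifier; any descriptive User-Agent string would do.
USER_AGENT = "wiki-downloader-example/0.1"


def build_request(url):
    """Wrap a URL in a Request that carries an explicit User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": USER_AGENT})


def fetch_text(url, timeout=10):
    """Download a URL and return the response body decoded to str."""
    with urllib.request.urlopen(build_request(url), timeout=timeout) as response:
        # Fall back to UTF-8 when the server does not declare a charset.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset)
```

Calling fetch_text("http://en.wikipedia.org/wiki/hello") would then return the page's HTML as a string, assuming the request succeeds.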

2011-03-12 11:21:46

You have described how to make such a program. So what is the question?

You read the file, split it on commas, and download the URLs. Done!

Source
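The reading and splitting steps could look roughly like this in Python 3. This is only a sketch: BASE_URL, parse_terms, term_to_url and urls_from_file are names invented for the example, and the English Wikipedia article path is an assumption.

```python
import urllib.parse

# Assumption: articles live under the English Wikipedia /wiki/ path.
BASE_URL = "https://en.wikipedia.org/wiki/"


def parse_terms(text):
    """Split comma-separated text into stripped, non-empty terms."""
    return [term.strip() for term in text.split(",") if term.strip()]


def term_to_url(term):
    """Build an article URL: spaces become underscores, the rest is percent-encoded."""
    return BASE_URL + urllib.parse.quote(term.replace(" ", "_"))


def urls_from_file(path):
    """Read a comma-separated file and return one article URL per term."""
    with open(path, encoding="utf-8") as handle:
        return [term_to_url(term) for term in parse_terms(handle.read())]
```

Each URL returned by urls_from_file can then be passed to whatever download routine you use.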

2011-03-12 09:46:15

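I know how to do the other parts, like reading the text file... but I don't know how to download the pages? – Alex 2011-03-12 10:28:44

@Alex: You use urllib. – 2011-03-12 12:48:37

Check the code below. It downloads the HTML without images, but you can get the URLs from the XML file that is being parsed in order to access them.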

# Python 2 code: uses print statements, urllib.urlopen and urllib2.
from time import sleep
import urllib
import urllib2
from xml.dom import minidom, Node

def main():
    print "Hello World"

    keywords = []

    # Read the keywords, one per line.
    key_file = open("example.txt", 'r')
    if key_file:
        temp_lines = key_file.readlines()

        for keyword_line in temp_lines:
            keywords.append(keyword_line.rstrip("\n"))

        key_file.close()

    print "Total keywords: %d" % len(keywords)
    for keyword in keywords:
        # Search Wikipedia for the keyword and parse the XML reply.
        url = "http://en.wikipedia.org/w/api.php?format=xml&action=opensearch&search=" + keyword
        xmldoc = minidom.parse(urllib.urlopen(url))
        root_node = xmldoc.childNodes[0]

        # Find the <Section> element that holds the results.
        section_node = None
        for node in root_node.childNodes:
            if node.nodeType == Node.ELEMENT_NODE and \
                    node.nodeName == "Section":
                section_node = node
                break

        if section_node is not None:
            # Collect every <Item>, one per search result.
            items = []
            for node in section_node.childNodes:
                if node.nodeType == Node.ELEMENT_NODE and \
                        node.nodeName == "Item":
                    items.append(node)

            if len(items) == 0:
                print "NO results found"
            else:
                print "\nResults found for " + keyword + ":\n"
                for item in items:
                    for node in item.childNodes:
                        if node.nodeType == Node.ELEMENT_NODE and \
                                node.nodeName == "Text":
                            if len(node.childNodes) == 1:
                                print node.childNodes[0].data.encode('utf-8')

                # Name the output file after the first result's title.
                # The "Html" directory must already exist.
                file_name = None
                for node in items[0].childNodes:
                    if node.nodeType == Node.ELEMENT_NODE and \
                            node.nodeName == "Text":
                        if len(node.childNodes) == 1:
                            file_name = "Html\\%s.html" % node.childNodes[0].data.encode('utf-8')
                            break

                if file_name is not None:
                    file = open(file_name, 'w')
                    if file:
                        # Download the first result's page and save it.
                        for node in items[0].childNodes:
                            if node.nodeType == Node.ELEMENT_NODE and \
                                    node.nodeName == "Url":
                                if len(node.childNodes) == 1:
                                    user_agent = 'Mozilla/4.0 (compatible; MSIE 7.0b; Windows NT 6.0)'
                                    header = { 'User-Agent' : user_agent }
                                    request = urllib2.Request(url=node.childNodes[0].data, headers=header)
                                    file.write(urllib2.urlopen(request).read())
                                    file.close()
                                    break

    print "Sleeping"
    sleep(2)

if __name__ == "__main__":
    main()
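Since the question asks for Python 3, here is one possible Python 3 take on the XML parsing step, using xml.etree.ElementTree instead of minidom. It is a sketch rather than a port of the code above: local_name and extract_items are hypothetical helpers, the element names Item, Text and Url are taken from that code, it collects Item elements anywhere in the document instead of only under Section, and SAMPLE is a made-up document, not a real API response.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Drop any '{namespace}' prefix that ElementTree puts on a tag."""
    return tag.rsplit("}", 1)[-1]


def extract_items(xml_text):
    """Return (title, url) pairs for every Item that has both Text and Url."""
    results = []
    for element in ET.fromstring(xml_text).iter():
        if local_name(element.tag) != "Item":
            continue
        # Map each child's tag name to its stripped text content.
        fields = {local_name(child.tag): (child.text or "").strip()
                  for child in element}
        if fields.get("Text") and fields.get("Url"):
            results.append((fields["Text"], fields["Url"]))
    return results


# Made-up sample in the shape the code above expects; not a real response.
SAMPLE = """<SearchSuggestion>
  <Section>
    <Item>
      <Text>Hello</Text>
      <Url>http://en.wikipedia.org/wiki/Hello</Url>
    </Item>
  </Section>
</SearchSuggestion>"""
```

With this in place, extract_items(SAMPLE) yields the title and URL of each result, and the URL of the first pair is the page to download.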

Source

2012-08-03 22:05:01 JunkTester

You shouldn't answer homework questions with code, especially when the asker shows little code and a lot of "ideas". – hayalci 2012-10-04 21:14:30
