
I want to use urllib3 in a simple thread pool to fetch a few wiki pages. The script creates one connection for every thread (I don't understand why) and hangs forever. Any tip, advice, or simple example of urllib3 with threading?

import threadpool 
from urllib3 import connection_from_url 

HTTP_POOL = connection_from_url(url, timeout=10.0, maxsize=10, block=True) 

def fetch(url, fields): 
    kwargs={'retries':6} 
    return HTTP_POOL.get_url(url, fields, **kwargs) 

pool = threadpool.ThreadPool(5) 
requests = threadpool.makeRequests(fetch, iterable) 
[pool.putRequest(req) for req in requests] 

@Lennart's script gets this error:

http://en.wikipedia.org/wiki/2010-11_Premier_League
http://en.wikipedia.org/wiki/List_of_MythBusters_episodes
http://en.wikipedia.org/wiki/List_of_Top_Gear_episodes
http://en.wikipedia.org/wiki/List_of_Unicode_characters
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/threadpool.py", line 156, in run
    result = request.callable(*request.args, **request.kwds)
  File "crawler.py", line 9, in fetch
    print url, conn.get_url(url)
AttributeError: 'HTTPConnectionPool' object has no attribute 'get_url'

(The URL output was interleaved with the tracebacks because the threads print concurrently; the same traceback was printed once per thread.)

After adding import threadpool; import urllib3 and tpool = threadpool.ThreadPool(4) to @user318904's code, I get this error:

Traceback (most recent call last):
  File "crawler.py", line 21, in <module>
    tpool.map_async(fetch, urls)
AttributeError: ThreadPool instance has no attribute 'map_async'

Answers

Answer 1:

Obviously it creates one connection per thread; how else is each thread supposed to fetch a page? And you try to use the same connection, made from one URL, for all URLs. That can hardly be what you intended.

This code works just fine:

import threadpool 
from urllib3 import connection_from_url 

def fetch(url): 
    kwargs={'retries':6} 
    conn = connection_from_url(url, timeout=10.0, maxsize=10, block=True) 
    print url, conn.get_url(url) 
    print "Done!" 

pool = threadpool.ThreadPool(4) 
urls = ['http://en.wikipedia.org/wiki/2010-11_Premier_League', 
     'http://en.wikipedia.org/wiki/List_of_MythBusters_episodes', 
     'http://en.wikipedia.org/wiki/List_of_Top_Gear_episodes', 
     'http://en.wikipedia.org/wiki/List_of_Unicode_characters', 
     ] 
requests = threadpool.makeRequests(fetch, urls) 

[pool.putRequest(req) for req in requests] 
pool.wait() 
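
If your urllib3 version has no get_url (as the AttributeError above suggests), the same fetch can be written against the request() method instead; a minimal sketch, assuming a urllib3 release that provides HTTPConnectionPool.request():

import threadpool 
from urllib3 import connection_from_url 

def fetch(url): 
    conn = connection_from_url(url, timeout=10.0, maxsize=10, block=True) 
    # request() issues a GET and returns a response object with a .data attribute 
    r = conn.request('GET', url) 
    print url, len(r.data) 
    print "Done!" 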
Answer 2:

I use something like this:

# excluding setup for threadpool etc 

upool = urllib3.HTTPConnectionPool('en.wikipedia.org', block=True) 

urls = ['/wiki/2010-11_Premier_League', 
        '/wiki/List_of_MythBusters_episodes', 
        '/wiki/List_of_Top_Gear_episodes', 
        '/wiki/List_of_Unicode_characters', 
        ] 

def fetch(path): 
    # add error checking 
    return upool.get_url(path).data 

# ThreadPool here must provide map_async (e.g. multiprocessing.pool.ThreadPool); 
# the threadpool module's ThreadPool does not, hence the AttributeError above. 
tpool = ThreadPool() 

tpool.map_async(fetch, urls) 

# either wait on the result object or give map_async a callback function for the results 
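
A minimal sketch of those two options, assuming tpool is a multiprocessing.pool.ThreadPool (which, unlike the threadpool module's ThreadPool, does provide map_async):

from multiprocessing.pool import ThreadPool 

tpool = ThreadPool(4) 

# Option 1: keep the AsyncResult and block on it 
result = tpool.map_async(fetch, urls) 
pages = result.get()  # waits until every fetch has finished 

# Option 2: have map_async hand the list of results to a callback 
def done(pages): 
    print "fetched %d pages" % len(pages) 

tpool.map_async(fetch, urls, callback=done) 
tpool.close() 
tpool.join() 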
Answer 3:

Threaded programming is hard, so I wrote workerpool to make exactly what you're doing easy.

More specifically, see the Mass Downloader example.

To do the same thing with urllib3, it looks something like this:

import urllib3 
import workerpool 

# Shared urllib3 connection pool, kept distinct from the worker pool below 
http_pool = urllib3.connection_from_url("foo", maxsize=3) 

def download(url): 
    r = http_pool.get_url(url) 
    # TODO: Do something with r.data 
    print "Downloaded %s" % url 

# Initialize a pool, 5 threads in this case 
pool = workerpool.WorkerPool(size=5) 

# The ``download`` method will be called with a line from the second 
# parameter for each job (each line still carries its trailing newline). 
pool.map(download, open("urls.txt").readlines()) 

# Send shutdown jobs to all threads, and wait until all the jobs have been completed 
pool.shutdown() 
pool.wait() 

For more sophisticated code, have a look at workerpool.EquippedWorker (and the tests here for example usage). You can make the pool the toolbox you pass in.
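
A minimal sketch of that idea using only the APIs shown above, sharing one urllib3 pool across all jobs via a closure (the exact EquippedWorker/toolbox signature isn't shown here, so this sticks to plain WorkerPool):

import urllib3 
import workerpool 

# one shared connection pool acts as the "toolbox" for every job 
http_pool = urllib3.connection_from_url("http://en.wikipedia.org", maxsize=3) 

def make_fetch(toolbox): 
    # each job closes over the shared connection pool 
    def fetch(path): 
        return toolbox.get_url(path).data 
    return fetch 

pool = workerpool.WorkerPool(size=5) 
pool.map(make_fetch(http_pool), ["/wiki/List_of_Unicode_characters"]) 
pool.shutdown() 
pool.wait() 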