2016-10-01

How quickly does python-requests close its sockets? (python, requests, threading)

I'm trying to do this with Python requests. Here is my code:

import threading
import resource
import time
import sys
import requests

# Maximum open-file limit, used by the thread limiter.
maxOpenFileLimit = resource.getrlimit(resource.RLIMIT_NOFILE)[0] # For example, it shows 50.
threadLimiter = maxOpenFileLimit # Upper bound on simultaneously active threads.

# Will use one session for every Thread. 
requestSessions = requests.Session() 
# Making requests Pool bigger to prevent [Errno -3] when socket stacked in CLOSE_WAIT status. 
adapter = requests.adapters.HTTPAdapter(pool_maxsize=(maxOpenFileLimit+100)) 
requestSessions.mount('http://', adapter) 
requestSessions.mount('https://', adapter) 

def threadAction(a1, a2): 
    global number 
    time.sleep(1) # My actions with Requests for each thread. 
    print number = number + 1 

number = 0 # Count of complete actions 

ThreadActions = [] # Action tasks. 
for i in range(50): # I have 50 websites I need to do in parallel threads. 
    a1 = i 
    for n in range(10): # Every website I need to hit from 10 threads.
        a2 = n
        ThreadActions.append(threading.Thread(target=threadAction, args=(a1, a2)))


for item in ThreadActions:
    # But I can't do more than 50 Threads at once, because of maxOpenFileLimit.
    while True:
        # Thread limiter, analogue of BoundedSemaphore.
        if (int(threading.activeCount()) < threadLimiter):
            item.start()
            break
        else:
            continue

for item in ThreadActions: 
    item.join() 

The thing is, once I reach 50 threads, the thread limiter starts waiting for some thread to finish its work. Here is the problem: after the script reaches the limiter, `lsof -i | grep python | wc -l` shows far fewer than 50 active connections, but before the limiter it showed all of the <= 50 processes. Why does this happen? Or should I use requests.close() instead of requests.session() to stop it from holding on to already-opened sockets?


Your thread limiter goes into a tight loop and consumes most of the processing time. Try slowing it down with something like `sleep(.1)`. Better still, use a queue limited to 50 requests and have your threads read from it. – tdelaney
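The queue approach suggested above can be sketched roughly like this (a Python 3 sketch with hypothetical names; the real per-site request logic would go where the comment is). A bounded `queue.Queue` makes the producer block instead of spin, and a fixed set of worker threads drains it:

```python
import queue
import threading

results = []
results_lock = threading.Lock()

def worker(q):
    while True:
        item = q.get()
        if item is None:        # sentinel: no more work for this worker
            q.task_done()
            break
        # ... do the HTTP request for `item` here ...
        with results_lock:      # list.append is atomic, but be explicit
            results.append(item)
        q.task_done()

q = queue.Queue(maxsize=50)     # at most 50 tasks queued at once
threads = [threading.Thread(target=worker, args=(q,)) for _ in range(10)]
for t in threads:
    t.start()

for task in range(500):         # 50 sites x 10 tasks, as in the question
    q.put(task)                 # blocks when the queue is full -- no busy loop

for _ in threads:
    q.put(None)                 # one sentinel per worker
for t in threads:
    t.join()
```

Because `put()` blocks on a full queue, the main thread sleeps in the kernel rather than burning CPU the way the `while True` limiter does.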


For raising the OS-level limit for your user, look up [ulimit](http://stackoverflow.com/questions/6774724/why-python-has-limit-for-count-of-file-handles) and [fs.file-max](https://cs.uwaterloo.ca/~brecht/servers/openfiles.html). After that, for raising the limit from inside Python, look up [setrlimit](https://coderwall.com/p/ptq7rw/increase-open-files-limit-and-drop-privileges-in-python). And of course, make sure you are not needlessly running a busy while-loop and that your code multiplexes properly. – blackpen


Yes, I understand that, and in my real script I use a BoundedSemaphore. But why, after the script hits the limit, does `lsof -i | grep python | wc -l` show a much lower number? – passwd
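For reference, the BoundedSemaphore mentioned here can replace the busy while-loop entirely (a minimal Python 3 sketch; `limited_action` is a hypothetical stand-in for the real per-site work). The main thread blocks in `acquire()` until a running thread releases a slot:

```python
import threading

limit = threading.BoundedSemaphore(50)  # at most 50 threads in flight
done = []
done_lock = threading.Lock()

def limited_action(i):
    try:
        # ... the real HTTP work would happen here ...
        with done_lock:
            done.append(i)
    finally:
        limit.release()      # free the slot even if the work raises

threads = []
for i in range(200):
    limit.acquire()          # blocks instead of spinning on activeCount()
    t = threading.Thread(target=limited_action, args=(i,))
    t.start()
    threads.append(t)

for t in threads:
    t.join()
```

Each `acquire()` is paired with exactly one `release()`, so `BoundedSemaphore` never raises its over-release `ValueError`.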

Answer


Your limiter is a tight loop that eats up most of your processing time. Use a thread pool to limit the number of workers instead.

import multiprocessing.pool
import time
import resource
import requests

maxOpenFileLimit = resource.getrlimit(resource.RLIMIT_NOFILE)[0]

# Will use one session for every Thread. 
requestSessions = requests.Session() 
# Making requests Pool bigger to prevent [Errno -3] when socket stacked in CLOSE_WAIT status. 
adapter = requests.adapters.HTTPAdapter(pool_maxsize=(maxOpenFileLimit+100)) 
requestSessions.mount('http://', adapter) 
requestSessions.mount('https://', adapter) 

def threadAction(a1, a2): 
    global number 
    time.sleep(1) # My actions with Requests for each thread. 
    print number = number + 1 # DEBUG: this doesn't update number and wouldn't be
                              # thread safe if it did

number = 0 # Count of complete actions 

pool = multiprocessing.pool.ThreadPool(50) # chunksize is a map() argument, not a constructor argument.

ThreadActions = [] # Action tasks. 
for i in range(50): # I have 50 websites I need to do in parallel threads. 
    a1 = i 
    for n in range(10): # Every website I need to hit from 10 threads.
        a2 = n
        ThreadActions.append((a1, a2))

pool.map(lambda args: threadAction(*args), ThreadActions, chunksize=1)
pool.close()
pool.join()

Does multiprocessing work faster than threads? And how does it affect processor load? – passwd


It's a tradeoff... and Windows is different from Linux. With multiprocessing, data needs to be serialized between parent and child (and on Windows, typically more context needs to be serialized, because the child doesn't get a clone of the parent's memory space), but you don't have to worry about the GIL. Higher CPU load and/or lower data overhead make multiprocessing work better. But if you are mostly I/O bound, a thread pool is fine. – tdelaney
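For the I/O-bound case described here, a thread pool is usually enough, because blocking calls release the GIL while they wait. A minimal Python 3 sketch (`time.sleep` stands in for a blocking HTTP request; `io_bound_task` is a hypothetical name):

```python
from multiprocessing.pool import ThreadPool
import time

def io_bound_task(n):
    time.sleep(0.01)  # stands in for a blocking HTTP request
    return n * 2

# Blocking I/O releases the GIL, so 20 threads overlap their waits
# without the serialization cost of separate processes.
with ThreadPool(20) as pool:
    results = pool.map(io_bound_task, range(100))
```

`ThreadPool` shares the `multiprocessing.Pool` API (`map`, `imap`, `apply_async`), so switching between threads and processes for a CPU-bound workload is mostly a one-line change.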