2016-12-02 53 views
1

我正在处理发生在nodejs应用程序中的奇怪的100%CPU使用问题。该应用程序非常庞大,我不确定它造成这种情况的位置。该应用程序由cluster_mode中的pm2进行管理。Nodejs由于clock_gettime/gettimeofday/futex,100%的CPU使用率?

我所知道的是在CPU使用率过高时,strace输出这样的:

[email protected]:/# strace -p 4350 -c 
Process 4350 attached 
% time  seconds usecs/call  calls errors syscall 
------ ----------- ----------- --------- --------- ---------------- 
100.00 0.000031   0  3388   clock_gettime 
    0.00 0.000000   0   1   read 
    0.00 0.000000   0   2   write 
    0.00 0.000000   0   1   rt_sigreturn 
------ ----------- ----------- --------- --------- ---------------- 
100.00 0.000031     3392   total 

[email protected]:~# strace -p 3367 -r -c 
Process 3367 attached 
^CProcess 3367 detached 
% time  seconds usecs/call  calls errors syscall 
------ ----------- ----------- --------- --------- ---------------- 
75.00 0.000939   0  91973   gettimeofday 
25.00 0.000313   0  39417   clock_gettime 
------ ----------- ----------- --------- --------- ---------------- 
100.00 0.001252    131390   total 

整个应用程序是在这一点上反应迟钝。约5分钟后,pm2会捡起并重新启动过程中由于“零内存报”:

2016-12-03-20:29:05 PM2 [PM2][WORKER] Process 1 restarted because it uses 0 memory and has ONLINE status 
2016-12-03-20:29:05 PM2 Stopping app:api-v2 id:1 
2016-12-03-20:29:06 PM2 Process with pid 3367 still not killed, retrying... 
2016-12-03-20:29:06 PM2 Process with pid 3367 still not killed, retrying... 
2016-12-03-20:29:06 PM2 Process with pid 3367 still not killed, retrying... 
2016-12-03-20:29:06 PM2 Process with pid 3367 still not killed, retrying... 
2016-12-03-20:29:06 PM2 Process with pid 3367 still not killed, retrying... 
2016-12-03-20:29:06 PM2 Process with pid 3367 still not killed, retrying... 
2016-12-03-20:29:06 PM2 Process with pid 3367 still not killed, retrying... 
2016-12-03-20:29:06 PM2 Process with pid 3367 still not killed, retrying... 
2016-12-03-20:29:07 PM2 Process with pid 3367 still not killed, retrying... 
2016-12-03-20:29:07 PM2 Process with pid 3367 still not killed, retrying... 
2016-12-03-20:29:07 PM2 Process with pid 3367 still not killed, retrying... 
2016-12-03-20:29:07 PM2 Process with pid 3367 still not killed, retrying... 
2016-12-03-20:29:07 PM2 Process with pid 3367 still not killed, retrying... 
2016-12-03-20:29:07 PM2 Process with pid 3367 still not killed, retrying... 
2016-12-03-20:29:07 PM2 Process with pid 3367 still not killed, retrying... 
2016-12-03-20:29:07 PM2 Process with pid 3367 still alive after 1600ms, sending it SIGKILL now... 
2016-12-03-20:29:07 PM2 App name:api-v2 id:1 disconnected 
2016-12-03-20:29:07 PM2 App [api-v2] with id [1] and pid [3367], exited with code [0] via signal [SIGKILL] 
2016-12-03-20:29:07 PM2 Starting execution sequence in -cluster mode- for app name:api-v2 id:1 
2016-12-03-20:29:07 PM2 App name:api-v2 id:1 online 

显然,这是由PM2错误引起的:https://github.com/Unitech/pm2/issues/2492。但是,如果他们修正了这个错误,它将不会重新启动该进程,从而让它停滞不前,现在我别无选择,只能继续使用旧版本。

如果我使用timestrace启动进程,那么:

real 0m45.765s 
user 0m3.349s 
sys 0m0.340s 
[email protected]:~/$ strace -cf node /var/www/api-v2.js 
Process 4020 attached 
... 
% time  seconds usecs/call  calls errors syscall 
------ ----------- ----------- --------- --------- ---------------- 
99.09 9.247853  4538  2038  26 futex 
    0.18 0.016793   1  17777   clock_gettime 
    0.16 0.015384   12  1262   epoll_wait 
    0.11 0.010522   116  91   poll 
    0.09 0.008339   2  5237  2437 stat 
    0.08 0.007856   6  1234   write 
    0.05 0.004309   3  1571   close 
    0.03 0.003150   2  1790   read 
    0.03 0.003150   2  1333  248 open 
    0.03 0.003046   11  265   mmap 
    0.02 0.002049   2  1186   lstat 
    0.02 0.001617   4  378   madvise 
    0.02 0.001535   2  917   fstat 
    0.02 0.001518   1  1773   gettimeofday 
    0.01 0.001096   1  1224  35 epoll_ctl 
    0.01 0.000983   3  329  37 connect 
    0.01 0.000792   1  667  329 accept4 
    0.01 0.000734   10  76   brk 
    0.01 0.000617   2  338   pread 
    0.00 0.000315   2  155   socket 
    0.00 0.000265   9  30   sendmmsg 
    0.00 0.000184   1  144   munmap 
    0.00 0.000162   1  113   mprotect 
    0.00 0.000125   4  35   sendto 
    0.00 0.000114   7  16   setsockopt 
    0.00 0.000078   1  60   recvfrom 
    0.00 0.000071   1  105   recvmsg 
    0.00 0.000064   2  35   writev 
    0.00 0.000052   7   8   clone 
    0.00 0.000049   2  20  20 access 
    0.00 0.000043   0  192   getsockname 
    0.00 0.000029   7   4   getdents 
    0.00 0.000024   1  36   bind 
    0.00 0.000023   23   1   readlink 
    0.00 0.000020   1  35   getsockopt 
    0.00 0.000019   19   1   execve 
    0.00 0.000018   0  86   9 ioctl 
    0.00 0.000011   2   5   rt_sigprocmask 
    0.00 0.000009   5   2   openat 
    0.00 0.000006   1  11   getcwd 
    0.00 0.000005   5   1   lseek 
    0.00 0.000005   0  35   rt_sigaction 
    0.00 0.000003   3   1   arch_prctl 
    0.00 0.000000   0   1   listen 
    0.00 0.000000   0  14   uname 
    0.00 0.000000   0   2   getrlimit 
    0.00 0.000000   0   2   getuid 
    0.00 0.000000   0   1   getgid 
    0.00 0.000000   0   1   geteuid 
    0.00 0.000000   0   1   getegid 
    0.00 0.000000   0   4   prctl 
    0.00 0.000000   0   1   setrlimit 
    0.00 0.000000   0   1   set_tid_address 
    0.00 0.000000   0   1   clock_getres 
    0.00 0.000000   0   9   set_robust_list 
    0.00 0.000000   0   1   eventfd2 
    0.00 0.000000   0   1   epoll_create1 
    0.00 0.000000   0   2   dup3 
    0.00 0.000000   0   2   pipe2 
------ ----------- ----------- --------- --------- ---------------- 
100.00 9.333037     40661  3141 total 

我没有在我的代码的任何setTimeout电话,但我想我有一个不依赖。我查阅了最近的更改,它似乎不涉及递归调用或永不结束的循环。

我观察到零内存泄漏,即pm2内存大小不会增加超时。以前相同的程序运行了2个月而没有重新启动,负载类似。服务器具有更多的CPU,RAM,交换资源。

这个问题在Ubuntu的一些常规维护(升级nodejs的apt-get升级,以及mongodb和npm依赖性升级)后开始出现。 nodejs升级从4.6.1升级到4.6.2。但是当我降级到4.6.1时,问题仍然存在。我尝试了4.4.7和6.9.1,没有任何版本看起来没有问题。

如何调试此问题?我从哪说起呢?

+0

是否可以记录传入呼叫以构建模拟以在调试环境中进入该状况? – Bernhard

+0

你是否正在运行pm2观看文件更改?我会禁用,看看是否导致高CPU使用率。 –

+0

你需要更多地缩小这个范围。这是否仅在特定位代码被执行时才会发生?你可以在不同的物理机器上重现它吗? – tadman

回答

0

我的问题中的调试技术是不正确的。我只是在进行这样的调试,因为Google搜索结果指出了我在搜索'nodejs 100%cpu利用率'时的这种方式。他们原来是误导人的。

正确的技术是让节点本身允许通过node --debug=7000进行调试。在CPU利用率高的时候,运行调试客户端node debug localhost:7001。并通过pause暂停执行。尝试暂停并运行几次,您将能够确定执行的位置。

原来是一个无限循环的情况,即for (i=10; i>=0; i++)

如果其他人搜索类似行为的解决方案,我将在此留下问题和答案。