我正在处理发生在nodejs应用程序中的奇怪的100%CPU使用问题。该应用程序非常庞大,我不确定它造成这种情况的位置。该应用程序由cluster_mode中的pm2进行管理。Nodejs由于clock_gettime/gettimeofday/futex,100%的CPU使用率?
我所知道的是在CPU使用率过高时,strace
输出这样的:
[email protected]:/# strace -p 4350 -c
Process 4350 attached
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
100.00 0.000031 0 3388 clock_gettime
0.00 0.000000 0 1 read
0.00 0.000000 0 2 write
0.00 0.000000 0 1 rt_sigreturn
------ ----------- ----------- --------- --------- ----------------
100.00 0.000031 3392 total
[email protected]:~# strace -p 3367 -r -c
Process 3367 attached
^CProcess 3367 detached
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
75.00 0.000939 0 91973 gettimeofday
25.00 0.000313 0 39417 clock_gettime
------ ----------- ----------- --------- --------- ----------------
100.00 0.001252 131390 total
整个应用程序是在这一点上反应迟钝。约5分钟后,pm2
会捡起并重新启动过程中由于“零内存报”:
2016-12-03-20:29:05 PM2 [PM2][WORKER] Process 1 restarted because it uses 0 memory and has ONLINE status
2016-12-03-20:29:05 PM2 Stopping app:api-v2 id:1
2016-12-03-20:29:06 PM2 Process with pid 3367 still not killed, retrying...
2016-12-03-20:29:06 PM2 Process with pid 3367 still not killed, retrying...
2016-12-03-20:29:06 PM2 Process with pid 3367 still not killed, retrying...
2016-12-03-20:29:06 PM2 Process with pid 3367 still not killed, retrying...
2016-12-03-20:29:06 PM2 Process with pid 3367 still not killed, retrying...
2016-12-03-20:29:06 PM2 Process with pid 3367 still not killed, retrying...
2016-12-03-20:29:06 PM2 Process with pid 3367 still not killed, retrying...
2016-12-03-20:29:06 PM2 Process with pid 3367 still not killed, retrying...
2016-12-03-20:29:07 PM2 Process with pid 3367 still not killed, retrying...
2016-12-03-20:29:07 PM2 Process with pid 3367 still not killed, retrying...
2016-12-03-20:29:07 PM2 Process with pid 3367 still not killed, retrying...
2016-12-03-20:29:07 PM2 Process with pid 3367 still not killed, retrying...
2016-12-03-20:29:07 PM2 Process with pid 3367 still not killed, retrying...
2016-12-03-20:29:07 PM2 Process with pid 3367 still not killed, retrying...
2016-12-03-20:29:07 PM2 Process with pid 3367 still not killed, retrying...
2016-12-03-20:29:07 PM2 Process with pid 3367 still alive after 1600ms, sending it SIGKILL now...
2016-12-03-20:29:07 PM2 App name:api-v2 id:1 disconnected
2016-12-03-20:29:07 PM2 App [api-v2] with id [1] and pid [3367], exited with code [0] via signal [SIGKILL]
2016-12-03-20:29:07 PM2 Starting execution sequence in -cluster mode- for app name:api-v2 id:1
2016-12-03-20:29:07 PM2 App name:api-v2 id:1 online
显然,这是由PM2错误引起的:https://github.com/Unitech/pm2/issues/2492。但是,如果他们修正了这个错误,它将不会重新启动该进程,从而让它停滞不前,现在我别无选择,只能继续使用旧版本。
如果我使用time
和strace
启动进程,那么:
real 0m45.765s
user 0m3.349s
sys 0m0.340s
[email protected]:~/$ strace -cf node /var/www/api-v2.js
Process 4020 attached
...
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
99.09 9.247853 4538 2038 26 futex
0.18 0.016793 1 17777 clock_gettime
0.16 0.015384 12 1262 epoll_wait
0.11 0.010522 116 91 poll
0.09 0.008339 2 5237 2437 stat
0.08 0.007856 6 1234 write
0.05 0.004309 3 1571 close
0.03 0.003150 2 1790 read
0.03 0.003150 2 1333 248 open
0.03 0.003046 11 265 mmap
0.02 0.002049 2 1186 lstat
0.02 0.001617 4 378 madvise
0.02 0.001535 2 917 fstat
0.02 0.001518 1 1773 gettimeofday
0.01 0.001096 1 1224 35 epoll_ctl
0.01 0.000983 3 329 37 connect
0.01 0.000792 1 667 329 accept4
0.01 0.000734 10 76 brk
0.01 0.000617 2 338 pread
0.00 0.000315 2 155 socket
0.00 0.000265 9 30 sendmmsg
0.00 0.000184 1 144 munmap
0.00 0.000162 1 113 mprotect
0.00 0.000125 4 35 sendto
0.00 0.000114 7 16 setsockopt
0.00 0.000078 1 60 recvfrom
0.00 0.000071 1 105 recvmsg
0.00 0.000064 2 35 writev
0.00 0.000052 7 8 clone
0.00 0.000049 2 20 20 access
0.00 0.000043 0 192 getsockname
0.00 0.000029 7 4 getdents
0.00 0.000024 1 36 bind
0.00 0.000023 23 1 readlink
0.00 0.000020 1 35 getsockopt
0.00 0.000019 19 1 execve
0.00 0.000018 0 86 9 ioctl
0.00 0.000011 2 5 rt_sigprocmask
0.00 0.000009 5 2 openat
0.00 0.000006 1 11 getcwd
0.00 0.000005 5 1 lseek
0.00 0.000005 0 35 rt_sigaction
0.00 0.000003 3 1 arch_prctl
0.00 0.000000 0 1 listen
0.00 0.000000 0 14 uname
0.00 0.000000 0 2 getrlimit
0.00 0.000000 0 2 getuid
0.00 0.000000 0 1 getgid
0.00 0.000000 0 1 geteuid
0.00 0.000000 0 1 getegid
0.00 0.000000 0 4 prctl
0.00 0.000000 0 1 setrlimit
0.00 0.000000 0 1 set_tid_address
0.00 0.000000 0 1 clock_getres
0.00 0.000000 0 9 set_robust_list
0.00 0.000000 0 1 eventfd2
0.00 0.000000 0 1 epoll_create1
0.00 0.000000 0 2 dup3
0.00 0.000000 0 2 pipe2
------ ----------- ----------- --------- --------- ----------------
100.00 9.333037 40661 3141 total
我没有在我的代码的任何setTimeout
电话,但我想我有一个不依赖。我查阅了最近的更改,它似乎不涉及递归调用或永不结束的循环。
我观察到零内存泄漏,即pm2内存大小不会增加超时。以前相同的程序运行了2个月而没有重新启动,负载类似。服务器具有更多的CPU,RAM,交换资源。
这个问题在Ubuntu的一些常规维护(升级nodejs的apt-get升级,以及mongodb和npm依赖性升级)后开始出现。 nodejs升级从4.6.1升级到4.6.2。但是当我降级到4.6.1时,问题仍然存在。我尝试了4.4.7和6.9.1,没有任何版本看起来没有问题。
如何调试此问题?我从哪说起呢?
是否可以记录传入呼叫以构建模拟以在调试环境中进入该状况? – Bernhard
你是否正在运行pm2观看文件更改?我会禁用,看看是否导致高CPU使用率。 –
你需要更多地缩小这个范围。这是否仅在特定位代码被执行时才会发生?你可以在不同的物理机器上重现它吗? – tadman