snmpd stalled in recvfrom()? Check Recv-Q
Not so long ago I had an issue with a monitoring system that paged about SNMP checks failing on a number of servers. A quick check here and there (logs, strace, tcpdump, etc.) revealed that snmpd had stalled in recvfrom() without sending a single packet in response to the constant queries from our monitoring system. Everything seemed OK except that “netstat -s” showed a steady increase in the “Udp: packet receive errors” counter. ss to the rescue:
# ss -ianump \( sport = *:161 \)
State      Recv-Q Send-Q   Local Address:Port   Peer Address:Port
UNCONN     262680 0                    *:161               *:*      users:(("snmpd",52984,7))
Matching 262680 against “sysctl net.core.rmem_default” suggested that the receive buffer (Recv-Q) was filling up, but why? A closer look at the logs turned up the following segfault:
cmanicd[55673]: segfault at 0 ip 00007f041e721081 sp 00007f040e16c700 error 4 in libnetsnmp.so.20.0.0[7f041e6a1000+a0000]
It turned out to be a well-known issue with the NIC Agent (cmanicd):
http://h20564.www2.hpe.com/hpsc/doc/public/display?docId=emr_na-c04912220&sp4ts.oid=316583
So it looked like our guy. Starting cmanicd again immediately solved the problem:
[root@slon02db12 ~]# ss -ianump \( sport = *:161 \)
State      Recv-Q Send-Q   Local Address:Port   Peer Address:Port
UNCONN     0      0                    *:161               *:*      users:(("snmpd",52984,7))
Recv-Q dropped to zero and the server turned green in the monitoring dashboard. Bingo. Problem solved, so now it’s time for the upgrade.
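If you would rather script the Recv-Q check than eyeball ss output, here is a minimal Python sketch. It assumes the usual /proc/net/udp layout on Linux (hex local address:port in the second column, hex tx_queue:rx_queue in the fifth, and a decimal drops counter as the last column on reasonably recent kernels). The function name udp_queues is mine, not part of any tool, and the rx_queue value should correspond to what ss reports as Recv-Q.

```python
def udp_queues(proc_text, port):
    """Return a list of (rx_queue_bytes, drops) for UDP sockets bound to `port`.

    `proc_text` is the content of /proc/net/udp, passed in as a string.
    `drops` is None when the kernel does not expose that column.
    """
    results = []
    for line in proc_text.splitlines()[1:]:  # first line is the column header
        fields = line.split()
        if len(fields) < 5:
            continue
        # local_address looks like "00000000:00A1" (hex IP : hex port)
        local_port = int(fields[1].rsplit(":", 1)[1], 16)
        if local_port != port:
            continue
        # fifth field is "tx_queue:rx_queue", both hex byte counts
        rx_queue = int(fields[4].split(":")[1], 16)
        # assumption: the drops counter is the 13th (last) column
        drops = int(fields[12]) if len(fields) > 12 else None
        results.append((rx_queue, drops))
    return results
```

On a live box you would feed it open("/proc/net/udp").read() and port 161. An rx_queue close to net.core.rmem_default together with a growing drops counter is the same picture as above: the socket owner is not reading.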
Btw, if you don’t know how to read a Linux segfault message (I didn’t before this issue), the following note should fix that:
Nov 27 15:26:19 machine kernel: fmg[6335]: segfault at 00000000ffffd2dc rip 00000000ffffd2dc rsp 00000000ffffd1bc error 15
What does the kernel message mean, in detail?
- The rip value is the instruction pointer register value, the rsp is the stack pointer register value.
- The error value is a bit mask of page fault error code bits (from arch/x86/mm/fault.c):
 * bit 0 == 0: no page found        1: protection fault
 * bit 1 == 0: read access          1: write access
 * bit 2 == 0: kernel-mode access   1: user-mode access
 * bit 3 ==                         1: use of reserved bit detected
 * bit 4 ==                         1: fault was an instruction fetch

Here’s the error bit definition:

enum x86_pf_error_code {
    PF_PROT  = 1 << 0,
    PF_WRITE = 1 << 1,
    PF_USER  = 1 << 2,
    PF_RSVD  = 1 << 3,
    PF_INSTR = 1 << 4,
};
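In my case the error code was 4, i.e. only bit 2 is set: a user-mode read of a page that was not found. Together with the fault address of 0, this means cmanicd tried to read address zero from user space, which reeks of a NULL pointer dereference.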
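To avoid doing the bit arithmetic by hand next time, here is a small Python sketch that decodes the error value using the bit meanings quoted above. The names PF_BITS and decode_pf_error are made up for illustration, and only the five bits listed in the note are covered.

```python
# (mask, meaning when the bit is clear, meaning when the bit is set)
PF_BITS = [
    (1 << 0, "no page found", "protection fault"),
    (1 << 1, "read access", "write access"),
    (1 << 2, "kernel-mode access", "user-mode access"),
    (1 << 3, None, "use of reserved bit detected"),
    (1 << 4, None, "fault was an instruction fetch"),
]


def decode_pf_error(code):
    """Return human-readable descriptions for an x86 page fault error code."""
    descriptions = []
    for mask, when_clear, when_set in PF_BITS:
        text = when_set if code & mask else when_clear
        if text is not None:  # bits 3 and 4 only mean something when set
            descriptions.append(text)
    return descriptions


print(decode_pf_error(4))   # the cmanicd segfault
print(decode_pf_error(15))  # the fmg example from the note
```

For error 4 this gives "no page found", "read access" and "user-mode access", which matches the reading above.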