Ticket 1534 - Memory Leak in slurmdbd
Summary: Memory Leak in slurmdbd
Status: RESOLVED CANNOTREPRODUCE
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmdbd (show other tickets)
Version: 14.11.4
Hardware: Linux Linux
: --- 3 - Medium Impact
Assignee: David Bigagli
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2015-03-13 07:44 MDT by Paul Edmon
Modified: 2015-03-27 06:55 MDT (History)
2 users (show)

See Also:
Site: Harvard University
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA SIte: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed:
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Attachments
slurmdbd.conf (695 bytes, text/plain)
2015-03-13 15:44 MDT, Paul Edmon
Details
valgrind Log (188.44 KB, text/plain)
2015-03-18 08:08 MDT, Paul Edmon
Details
valgrind Log 2 (341.14 KB, text/plain)
2015-03-20 01:28 MDT, Paul Edmon
Details

Note You need to log in before you can comment on or make changes to this ticket.
Description Paul Edmon 2015-03-13 07:44:06 MDT
Since we've upgraded to 14.11.4 I've noticed that slurmdbd seems to have a memory leak.  I don't know where, but several times over the past 2 weeks our master which hosts the DB runs out of memory due to the slurmdbd consuming it all, this then causes all sorts of havoc.

Here is a shot of current top:

[root@holy-slurm01 mysql]# top

top - 15:41:55 up  5:56,  6 users,  load average: 3.47, 3.44, 3.45
Tasks: 1562 total,   1 running, 1561 sleeping,   0 stopped,   0 zombie
Cpu(s):  2.7%us,  0.6%sy,  0.0%ni, 96.6%id,  0.1%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:  264498708k total, 102553984k used, 161944724k free,   471728k buffers
Swap:  4194296k total,        0k used,  4194296k free, 27860636k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND                                                                                                                                      
43528 slurm     20   0 19.7g 2.9g 3132 S 100.9  1.2 373:36.82 slurmctld                                                                                                                                   
 4821 mysql     20   0 11.2g 347m 5160 S 78.2  0.1 402:31.48 mysqld                                                                                                                                       
 4989 slurm     20   0 64.4g  63g 3028 S 21.0 25.4  68:42.30 slurmdbd                                                                                                                                     
 5269 root      20   0 14216 2368  912 R  1.3  0.0   0:03.93 top                                                                                                                                          
 3936 root      20   0 62472  11m 1800 S  0.7  0.0   1:31.93 syslog-ng                                                                                                                                    
44421 root      20   0 17068 1188  692 D  0.7  0.0   0:12.74 cp                                                                                                                                           
  302 root      20   0     0    0    0 S  0.3  0.0   0:01.76 events/43                                                                                                                                    
  394 root      20   0     0    0    0 S  0.3  0.0   0:04.85 kblockd/0                                                                                                                                    
  864 root      39  19     0    0    0 S  0.3  0.0   5:05.03 kipmi0                                                                                                                                       
 3518 root      20   0     0    0    0 S  0.3  0.0   0:14.70 flush-253:0                                                                                                                                  
 3985 root      20   0     0    0    0 S  0.3  0.0   0:04.64 kondemand/23                                                                                                                                 
 3998 root      20   0     0    0    0 S  0.3  0.0   0:04.89 kondemand/36                                                                                                                                 
 4013 root      20   0     0    0    0 S  0.3  0.0   0:22.58 kondemand/51                                                                                                                                 
 4332 root      20   0  134m  37m 2392 S  0.3  0.0   1:06.91 mcollectived                                                                                                                                 
 4541 root      20   0     0    0    0 S  0.3  0.0   0:00.57 nfsd                                                                                                                                         
 4612 munge     20   0  218m 3096 1148 S  0.3  0.0   1:55.46 munged   

You can see that slurmdbd is using up 64 GB and seems to be gaining a GB every hour or so.  I'm going to restart slurmdbd to keep this down, but we should isolate the leak.  I'm not sure where it would be though.  Let me know if there are any diagnostics you want me to run to track this back.  Thanks.
Comment 1 Paul Edmon 2015-03-13 07:44:54 MDT
I will note that the actual slurm db in the mysql directory is 34 GB in size.
Comment 2 David Bigagli 2015-03-13 08:33:50 MDT
Hi Paul, did you try to run slurmdbd under valgrind for a while? 
I will try to do it in house to see if I catch anything using my database.

David
Comment 3 Moe Jette 2015-03-13 08:40:09 MDT
It may also help if you send your config files (including slurmdbd.conf), but remove the mysql password ;)
Comment 4 Paul Edmon 2015-03-13 15:44:01 MDT
Created attachment 1716 [details]
slurmdbd.conf

Sure.  Here is our conf.  I won't really have a chance to run it with 
valgrind until Monday.  slurmdbd seems stable currently.  So I suspect 
it may be that a specific query sets off the run away.

-Paul Edmon-

On 3/13/2015 4:40 PM, bugs@schedmd.com wrote:
>
> *Comment # 3 <http://bugs.schedmd.com/show_bug.cgi?id=1534#c3> on bug 
> 1534 <http://bugs.schedmd.com/show_bug.cgi?id=1534> from Moe Jette 
> <mailto:jette@schedmd.com> *
> It may also help if you send your config files (including slurmdbd.conf), but
> remove the mysql password ;)
> ------------------------------------------------------------------------
> You are receiving this mail because:
>
>   * You reported the bug.
>
Comment 5 David Bigagli 2015-03-14 05:30:22 MDT
That seems to be the case. I ran overnight with different load under valgrind and so far no leaks. EAGAIN!

David
Comment 6 Paul Edmon 2015-03-14 06:43:48 MDT
Next time it happens I will see if I can strace it and figure out what's up.

-Paul Edmon-

On 3/14/2015 1:30 PM, bugs@schedmd.com wrote:
>
> *Comment # 5 <http://bugs.schedmd.com/show_bug.cgi?id=1534#c5> on bug 
> 1534 <http://bugs.schedmd.com/show_bug.cgi?id=1534> from David Bigagli 
> <mailto:david@schedmd.com> *
> That seems to be the case. I ran overnight with different load under valgrind
> and so far no leaks. EAGAIN!
>
> David
> ------------------------------------------------------------------------
> You are receiving this mail because:
>
>   * You reported the bug.
>
Comment 7 Paul Edmon 2015-03-17 01:35:49 MDT
So it happened again.  This is the snapshot.  I also managed to get gdb to get a core dump which is 175GB in size.  Let me know what you want me to do with it.

[root@holy-slurm01 ~]# top

top - 09:10:29 up 3 days, 23:24,  3 users,  load average: 3.72, 3.52, 3.66
Tasks: 1549 total,   2 running, 1547 sleeping,   0 stopped,   0 zombie
Cpu(s):  4.2%us,  2.6%sy,  0.0%ni, 93.0%id,  0.1%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:  264498708k total, 207899796k used, 56598912k free,   159460k buffers
Swap:  4194296k total,        0k used,  4194296k free, 16401816k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND                                                                                                                                      
 4821 mysql     20   0 11.2g 389m 5172 S 127.4  0.2   2065:55 mysqld                                                                                                                                      
27127 slurm     20   0  178g 174g 1940 S 117.3 69.1   1010:20 slurmdbd                                                                                                                                    
 1926 slurm     20   0 23.1g 3.6g 3452 S 100.2  1.4   1469:46 slurmctld                                                                                                                                   
 3936 root      20   0 62472  11m 1804 R 99.2  0.0 720:11.08 syslog-ng                                                                                                                                    
52959 root      20   0 98.6m  592  512 S  2.0  0.0  11:26.12 tail          


tail -f /var/log/messsages
Mar 16 22:53:28 holy-slurm01 slurmdbd[27127]: error: pack32: Buffer size limit exceeded (-55586 > -65536)
Mar 16 22:53:28 holy-slurm01 slurmdbd[27127]: error: pack32: Buffer size limit exceeded (-55586 > -65536)
Mar 16 22:53:28 holy-slurm01 slurmdbd[27127]: error: pack32: Buffer size limit exceeded (-55586 > -65536)
Mar 16 22:53:28 holy-slurm01 slurmdbd[27127]: error: pack_time: Buffer size limit exceeded (-55586 > -65536)
Mar 16 22:53:28 holy-slurm01 slurmdbd[27127]: error: pack32: Buffer size limit exceeded (-55586 > -65536)
Mar 16 22:53:28 holy-slurm01 slurmdbd[27127]: error: pack32: Buffer size limit exceeded (-55586 > -65536)
Mar 16 22:53:28 holy-slurm01 slurmdbd[27127]: error: pack32: Buffer size limit exceeded (-55586 > -65536)
Mar 16 22:53:28 holy-slurm01 slurmdbd[27127]: error: packmem: Buffer size limit exceeded (-55574 > -65536)
Mar 16 22:53:28 holy-slurm01 slurmdbd[27127]: error: pack32: Buffer size limit exceeded (-55586 > -65536)
Mar 16 22:53:28 holy-slurm01 slurmdbd[27127]: error: pack32: Buffer size limit exceeded (-55586 > -65536)
Mar 16 22:53:28 holy-slurm01 slurmdbd[27127]: error: pack32: Buffer size limit exceeded (-55586 > -65536)
Mar 16 22:53:28 holy-slurm01 slurmdbd[27127]: error: pack64: Buffer size limit exceeded (-55586 > -65536)
Mar 16 22:53:28 holy-slurm01 slurmdbd[27127]: error: pack64: Buffer size limit exceeded (-55586 > -65536)
Mar 16 22:53:28 holy-slurm01 slurmdbd[27127]: error: pack64: Buffer size limit exceeded (-55586 > -65536)
Mar 16 22:53:28 holy-slurm01 slurmdbd[27127]: error: pack32: Buffer size limit exceeded (-55586 > -65536)
Mar 16 22:53:28 holy-slurm01 slurmdbd[27127]: error: packdouble: Buffer size limit exceeded (-55586 > -65536)^C
Mar 16 22:53:28 holy-slurm01 slurmdbd[27127]: error: packdouble: Buffer size limit exceeded (-55586 > -65536)
Mar 16 22:53:28 holy-slurm01 slurmdbd[27127]: error: packdouble: Buffer size limit exceeded (-55586 > -65536)

strace -fp 27127
[pid 34726] connect(11, {sa_family=AF_FILE, path="/dev/log"}, 110) = 0
[pid 34726] sendto(11, "<27>Mar 17 09:12:30 slurmdbd[271"..., 101, MSG_NOSIGNAL, NULL, 0) = 101
[pid 34726] close(11)                   = 0
[pid 34726] futex(0x74da00, FUTEX_WAKE_PRIVATE, 1) = 1
[pid 28265] <... futex resumed> )       = 0
[pid 34726] socket(PF_FILE, SOCK_DGRAM|SOCK_CLOEXEC, 0 <unfinished ...>
[pid 28265] futex(0x74da00, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
[pid 34726] <... socket resumed> )      = 11
[pid 34726] connect(11, {sa_family=AF_FILE, path="/dev/log"}, 110) = -1 EPROTOTYPE (Protocol wrong type for socket)
[pid 34726] close(11)                   = 0
[pid 34726] socket(PF_FILE, SOCK_STREAM|SOCK_CLOEXEC, 0) = 11
[pid 34726] connect(11, {sa_family=AF_FILE, path="/dev/log"}, 110) = 0
[pid 34726] sendto(11, "<27>Mar 17 09:12:30 slurmdbd[271"..., 101, MSG_NOSIGNAL, NULL, 0) = 101
[pid 34726] close(11)                   = 0
[pid 34726] futex(0x74da00, FUTEX_WAKE_PRIVATE, 1) = 1
[pid 59996] <... futex resumed> )       = 0
[pid 34726] socket(PF_FILE, SOCK_DGRAM|SOCK_CLOEXEC, 0 <unfinished ...>
[pid 59996] futex(0x74da00, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
[pid 34726] <... socket resumed> )      = 11
[pid 34726] connect(11, {sa_family=AF_FILE, path="/dev/log"}, 110) = -1 EPROTOTYPE (Protocol wrong type for socket)
[pid 34726] close(11)                   = 0
[pid 34726] socket(PF_FILE, SOCK_STREAM|SOCK_CLOEXEC, 0) = 11
[pid 34726] connect(11, {sa_family=AF_FILE, path="/dev/log"}, 110) = 0
[pid 34726] sendto(11, "<27>Mar 17 09:12:30 slurmdbd[271"..., 97, MSG_NOSIGNAL, NULL, 0) = 97
[pid 34726] close(11)                   = 0
[pid 34726] futex(0x74da00, FUTEX_WAKE_PRIVATE, 1) = 1
[pid 28265] <... futex resumed> )       = 0
[pid 34726] socket(PF_FILE, SOCK_DGRAM|SOCK_CLOEXEC, 0 <unfinished ...>
[pid 28265] futex(0x74da00, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
[pid 34726] <... socket resumed> )      = 11
[pid 34726] connect(11, {sa_family=AF_FILE, path="/dev/log"}, 110) = -1 EPROTOTYPE (Protocol wrong type for socket)
[pid 34726] close(11)                   = 0
[pid 34726] socket(PF_FILE, SOCK_STREAM|SOCK_CLOEXEC, 0) = 11
[pid 34726] connect(11, {sa_family=AF_FILE, path="/dev/log"}, 110) = 0
[pid 34726] sendto(11, "<27>Mar 17 09:12:30 slurmdbd[271"..., 97, MSG_NOSIGNAL, NULL, 0) = 97
[pid 34726] close(11)                   = 0
[pid 34726] futex(0x74da00, FUTEX_WAKE_PRIVATE, 1) = 1
[pid 59996] <... futex resumed> )       = 0
[pid 34726] socket(PF_FILE, SOCK_DGRAM|SOCK_CLOEXEC, 0 <unfinished ...>
[pid 59996] futex(0x74da00, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
[pid 34726] <... socket resumed> )      = 11
[pid 34726] connect(11, {sa_family=AF_FILE, path="/dev/log"}, 110) = -1 EPROTOTYPE (Protocol wrong type for socket)
[pid 34726] close(11)                   = 0
[pid 34726] socket(PF_FILE, SOCK_STREAM|SOCK_CLOEXEC, 0) = 11
[pid 34726] connect(11, {sa_family=AF_FILE, path="/dev/log"}, 110) = 0
[pid 34726] sendto(11, "<27>Mar 17 09:12:30 slurmdbd[271"..., 97, MSG_NOSIGNAL, NULL, 0) = 97
[pid 34726] close(11)                   = 0
[pid 34726] futex(0x74da00, FUTEX_WAKE_PRIVATE, 1) = 1
[pid 28265] <... futex resumed> )       = 0
[pid 34726] socket(PF_FILE, SOCK_DGRAM|SOCK_CLOEXEC, 0 <unfinished ...>
[pid 28265] futex(0x74da00, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
[pid 34726] <... socket resumed> )      = 11
[pid 34726] connect(11, {sa_family=AF_FILE, path="/dev/log"}, 110) = -1 EPROTOTYPE (Protocol wrong type for socket)
[pid 34726] close(11)                   = 0
[pid 34726] socket(PF_FILE, SOCK_STREAM|SOCK_CLOEXEC, 0) = 11
[pid 34726] connect(11, {sa_family=AF_FILE, path="/dev/log"}, 110) = 0
[pid 34726] sendto(11, "<27>Mar 17 09:12:30 slurmdbd[271"..., 97, MSG_NOSIGNAL, NULL, 0) = 97
[pid 34726] close(11)                   = 0
[pid 34726] futex(0x74da00, FUTEX_WAKE_PRIVATE, 1) = 1
[pid 34726] socket(PF_FILE, SOCK_DGRAM|SOCK_CLOEXEC, 0 <unfinished ...>
[pid 59996] <... futex resumed> )       = 0
[pid 34726] <... socket resumed> )      = 11
[pid 59996] futex(0x74da00, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
[pid 34726] connect(11, {sa_family=AF_FILE, path="/dev/log"}, 110) = -1 EPROTOTYPE (Protocol wrong type for socket)
[pid 34726] close(11)                   = 0
[pid 34726] socket(PF_FILE, SOCK_STREAM|SOCK_CLOEXEC, 0) = 11
[pid 34726] connect(11, {sa_family=AF_FILE, path="/dev/log"}, 110) = 0
[pid 34726] sendto(11, "<27>Mar 17 09:12:30 slurmdbd[271"..., 97, MSG_NOSIGNAL, NULL, 0) = 97
[pid 34726] close(11)                   = 0
[pid 34726] futex(0x74da00, FUTEX_WAKE_PRIVATE, 1) = 1
[pid 34726] socket(PF_FILE, SOCK_DGRAM|SOCK_CLOEXEC, 0 <unfinished ...>
[pid 28265] <... futex resumed> )       = 0
[pid 34726] <... socket resumed> )      = 11
[pid 28265] futex(0x74da00, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
[pid 34726] connect(11, {sa_family=AF_FILE, path="/dev/log"}, 110) = -1 EPROTOTYPE (Protocol wrong type for socket)
[pid 34726] close(11)                   = 0
[pid 34726] socket(PF_FILE, SOCK_STREAM|SOCK_CLOEXEC, 0) = 11
[pid 34726] connect(11, {sa_family=AF_FILE, path="/dev/log"}, 110) = 0
[pid 34726] sendto(11, "<27>Mar 17 09:12:30 slurmdbd[271"..., 97, MSG_NOSIGNAL, NULL, 0) = 97
[pid 34726] close(11)                   = 0
[pid 34726] futex(0x74da00, FUTEX_WAKE_PRIVATE, 1) = 1
[pid 34726] socket(PF_FILE, SOCK_DGRAM|SOCK_CLOEXEC, 0 <unfinished ...>
[pid 59996] <... futex resumed> )       = 0
[pid 34726] <... socket resumed> )      = 11
[pid 59996] futex(0x74da00, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
[pid 34726] connect(11, {sa_family=AF_FILE, path="/dev/log"}, 110) = -1 EPROTOTYPE (Protocol wrong type for socket)
[pid 34726] close(11)                   = 0
[pid 34726] socket(PF_FILE, SOCK_STREAM|SOCK_CLOEXEC, 0) = 11
[pid 34726] connect(11, {sa_family=AF_FILE, path="/dev/log"}, 110) = 0
[pid 34726] sendto(11, "<27>Mar 17 09:12:30 slurmdbd[271"..., 97, MSG_NOSIGNAL, NULL, 0) = 97
[pid 34726] close(11)                   = 0
[pid 34726] futex(0x74da00, FUTEX_WAKE_PRIVATE, 1) = 1
[pid 34726] socket(PF_FILE, SOCK_DGRAM|SOCK_CLOEXEC, 0 <unfinished ...>
[pid 28265] <... futex resumed> )       = 0
[pid 34726] <... socket resumed> )      = 11
[pid 28265] futex(0x74da00, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
[pid 34726] connect(11, {sa_family=AF_FILE, path="/dev/log"}, 110) = -1 EPROTOTYPE (Protocol wrong type for socket)
[pid 34726] close(11)                   = 0
[pid 34726] socket(PF_FILE, SOCK_STREAM|SOCK_CLOEXEC, 0) = 11
[pid 34726] connect(11, {sa_family=AF_FILE, path="/dev/log"}, 110) = 0
[pid 34726] sendto(11, "<27>Mar 17 09:12:30 slurmdbd[271"..., 97, MSG_NOSIGNAL, NULL, 0) = 97
[pid 34726] close(11)                   = 0
[pid 34726] futex(0x74da00, FUTEX_WAKE_PRIVATE, 1) = 1
[pid 34726] socket(PF_FILE, SOCK_DGRAM|SOCK_CLOEXEC, 0 <unfinished ...>
[pid 59996] <... futex resumed> )       = 0
[pid 34726] <... socket resumed> )      = 11
[pid 59996] futex(0x74da00, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
[pid 34726] connect(11, {sa_family=AF_FILE, path="/dev/log"}, 110) = -1 EPROTOTYPE (Protocol wrong type for socket)
[pid 34726] close(11)                   = 0
[pid 34726] socket(PF_FILE, SOCK_STREAM|SOCK_CLOEXEC, 0) = 11
[pid 34726] connect(11, {sa_family=AF_FILE, path="/dev/log"}, 110) = 0
[pid 34726] sendto(11, "<27>Mar 17 09:12:30 slurmdbd[271"..., 97, MSG_NOSIGNAL, NULL, 0) = 97
[pid 34726] close(11)                   = 0
[pid 34726] futex(0x74da00, FUTEX_WAKE_PRIVATE, 1) = 1
[pid 28265] <... futex resumed> )       = 0
[pid 34726] socket(PF_FILE, SOCK_DGRAM|SOCK_CLOEXEC, 0 <unfinished ...>
[pid 28265] futex(0x74da00, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
[pid 34726] <... socket resumed> )      = 11
[pid 34726] connect(11, {sa_family=AF_FILE, path="/dev/log"}, 110) = -1 EPROTOTYPE (Protocol wrong type for socket)
[pid 34726] close(11)                   = 0
[pid 34726] socket(PF_FILE, SOCK_STREAM|SOCK_CLOEXEC, 0) = 11
[pid 34726] connect(11, {sa_family=AF_FILE, path="/dev/log"}, 110) = 0
[pid 34726] sendto(11, "<27>Mar 17 09:12:30 slurmdbd[271"..., 97, MSG_NOSIGNAL, NULL, 0) = 97
[pid 34726] close(11)                   = 0
[pid 34726] futex(0x74da00, FUTEX_WAKE_PRIVATE, 1) = 1
[pid 34726] socket(PF_FILE, SOCK_DGRAM|SOCK_CLOEXEC, 0 <unfinished ...>
[pid 59996] <... futex resumed> )       = 0
[pid 34726] <... socket resumed> )      = 11
[pid 59996] futex(0x74da00, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
[pid 34726] connect(11, {sa_family=AF_FILE, path="/dev/log"}, 110) = -1 EPROTOTYPE (Protocol wrong type for socket)
[pid 34726] close(11)                   = 0
[pid 34726] socket(PF_FILE, SOCK_STREAM|SOCK_CLOEXEC, 0) = 11
[pid 34726] connect(11, {sa_family=AF_FILE, path="/dev/log"}, 110) = 0
[pid 34726] sendto(11, "<27>Mar 17 09:12:30 slurmdbd[271"..., 97, MSG_NOSIGNAL, NULL, 0) = 97
[pid 34726] close(11)                   = 0
[pid 34726] futex(0x74da00, FUTEX_WAKE_PRIVATE, 1) = 1
[pid 34726] socket(PF_FILE, SOCK_DGRAM|SOCK_CLOEXEC, 0 <unfinished ...>
[pid 28265] <... futex resumed> )       = 0
[pid 34726] <... socket resumed> )      = 11
[pid 28265] futex(0x74da00, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
[pid 34726] connect(11, {sa_family=AF_FILE, path="/dev/log"}, 110) = -1 EPROTOTYPE (Protocol wrong type for socket)
[pid 34726] close(11)                   = 0
[pid 34726] socket(PF_FILE, SOCK_STREAM|SOCK_CLOEXEC, 0) = 11
[pid 34726] connect(11, {sa_family=AF_FILE, path="/dev/log"}, 110) = 0
[pid 34726] sendto(11, "<27>Mar 17 09:12:30 slurmdbd[271"..., 97, MSG_NOSIGNAL, NULL, 0) = 97
[pid 34726] close(11)                   = 0
[pid 34726] futex(0x74da00, FUTEX_WAKE_PRIVATE, 1) = 1
[pid 59996] <... futex resumed> )       = 0
[pid 34726] socket(PF_FILE, SOCK_DGRAM|SOCK_CLOEXEC, 0 <unfinished ...>
[pid 59996] futex(0x74da00, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
[pid 34726] <... socket resumed> )      = 11
[pid 34726] connect(11, {sa_family=AF_FILE, path="/dev/log"}, 110) = -1 EPROTOTYPE (Protocol wrong type for socket)
[pid 34726] close(11)                   = 0
[pid 34726] socket(PF_FILE, SOCK_STREAM|SOCK_CLOEXEC, 0) = 11
[pid 34726] connect(11, {sa_family=AF_FILE, path="/dev/log"}, 110) = 0
[pid 34726] sendto(11, "<27>Mar 17 09:12:30 slurmdbd[271"..., 100, MSG_NOSIGNAL, NULL, 0) = 100
[pid 34726] close(11)                   = 0
[pid 34726] futex(0x74da00, FUTEX_WAKE_PRIVATE, 1) = 1
[pid 34726] socket(PF_FILE, SOCK_DGRAM|SOCK_CLOEXEC, 0 <unfinished ...>
[pid 28265] <... futex resumed> )       = 0
[pid 34726] <... socket resumed> )      = 11
[pid 28265] futex(0x74da00, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
[pid 34726] connect(11, {sa_family=AF_FILE, path="/dev/log"}, 110) = -1 EPROTOTYPE (Protocol wrong type for socket)
[pid 34726] close(11)                   = 0
[pid 34726] socket(PF_FILE, SOCK_STREAM|SOCK_CLOEXEC, 0) = 11
[pid 34726] connect(11, {sa_family=AF_FILE, path="/dev/log"}, 110) = 0
[pid 34726] sendto(11, "<27>Mar 17 09:12:30 slurmdbd[271"..., 97, MSG_NOSIGNAL, NULL, 0) = 97
[pid 34726] close(11)                   = 0
[pid 34726] futex(0x74da00, FUTEX_WAKE_PRIVATE, 1) = 1
[pid 59996] <... futex resumed> )       = 0
[pid 34726] socket(PF_FILE, SOCK_DGRAM|SOCK_CLOEXEC, 0 <unfinished ...>
[pid 59996] futex(0x74da00, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
[pid 34726] <... socket resumed> )      = 11
[pid 34726] connect(11, {sa_family=AF_FILE, path="/dev/log"}, 110) = -1 EPROTOTYPE (Protocol wrong type for socket)
[pid 34726] close(11)                   = 0
[pid 34726] socket(PF_FILE, SOCK_STREAM|SOCK_CLOEXEC, 0) = 11
[pid 34726] connect(11, {sa_family=AF_FILE, path="/dev/log"}, 110) = 0
[pid 34726] sendto(11, "<27>Mar 17 09:12:30 slurmdbd[271"..., 97, MSG_NOSIGNAL, NULL, 0) = 97
[pid 34726] close(11)                   = 0
[pid 34726] futex(0x74da00, FUTEX_WAKE_PRIVATE, 1) = 1
[pid 28265] <... futex resumed> )       = 0
[pid 28265] futex(0x74da00, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
[pid 34726] socket(PF_FILE, SOCK_DGRAM|SOCK_CLOEXEC, 0) = 11
[pid 34726] connect(11, {sa_family=AF_FILE, path="/dev/log"}, 110) = -1 EPROTOTYPE (Protocol wrong type for socket)
[pid 34726] close(11)                   = 0
[pid 34726] socket(PF_FILE, SOCK_STREAM|SOCK_CLOEXEC, 0) = 11

[root@holy-slurm01 ~]# strace -fcp 27127
Process 27127 attached with 10 threads - interrupt to quit
^CProcess 27127 detached
Process 27129 detached
Process 27130 detached
Process 27978 detached
Process 27979 detached
Process 1932 detached
Process 34726 detached
Process 28265 detached
Process 59996 detached
Process 25005 detached
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 71.29    8.234048         363     22667        21 futex
 17.16    1.981699     1981699         1           accept
  9.84    1.136828     1136828         1           restart_syscall
  0.91    0.105260         196       536           read
  0.23    0.026794           1     22496           socket
  0.22    0.025100          47       536           write
  0.15    0.017585           1     22496     11248 connect
  0.12    0.013963           1     22497           close
  0.06    0.006398           1     11249           sendto
  0.01    0.001176           2       536           poll
  0.01    0.001006           7       147           mprotect
  0.00    0.000053          27         2           fcntl
  0.00    0.000000           0         1           rt_sigprocmask
  0.00    0.000000           0         1           clone
------ ----------- ----------- --------- --------- ----------------
100.00   11.549910                103166     11269 total
[root@holy-slurm01 ~]#
Comment 8 David Bigagli 2015-03-17 04:31:58 MDT
We would like to have a look at the stack. If you can do:

gdb slurmdbd core

then at the gdb prompt type:

thread apply all where

David
Comment 9 Paul Edmon 2015-03-17 04:37:23 MDT
[root@holy-slurm01 log]# gdb slurmdbd core.27127
GNU gdb (GDB) Red Hat Enterprise Linux (7.2-60.el6_4.1)
Copyright (C) 2010 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later 
<http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /usr/sbin/slurmdbd...done.
[New Thread 27129]
[New Thread 27130]
[New Thread 27978]
[New Thread 27979]
[New Thread 1932]
[New Thread 34726]
[New Thread 28265]
[New Thread 59996]
[New Thread 40251]
[New Thread 27127]
Reading symbols from /lib64/libdl.so.2...(no debugging symbols 
found)...done.
Loaded symbols for /lib64/libdl.so.2
Reading symbols from /lib64/libpthread.so.0...(no debugging symbols 
found)...done.
[Thread debugging using libthread_db enabled]
Loaded symbols for /lib64/libpthread.so.0
Reading symbols from /lib64/libc.so.6...(no debugging symbols found)...done.
Loaded symbols for /lib64/libc.so.6
Reading symbols from /lib64/ld-linux-x86-64.so.2...(no debugging symbols 
found)...done.
Loaded symbols for /lib64/ld-linux-x86-64.so.2
Reading symbols from /usr/lib64/slurm/auth_munge.so...done.
Loaded symbols for /usr/lib64/slurm/auth_munge.so
Reading symbols from /usr/lib64/libmunge.so.2...(no debugging symbols 
found)...done.
Loaded symbols for /usr/lib64/libmunge.so.2
Reading symbols from /usr/lib64/slurm/accounting_storage_mysql.so...done.
Loaded symbols for /usr/lib64/slurm/accounting_storage_mysql.so
Reading symbols from /usr/lib64/mysql/libmysqlclient_r.so.16...(no 
debugging symbols found)...done.
Loaded symbols for /usr/lib64/mysql/libmysqlclient_r.so.16
Reading symbols from /lib64/libz.so.1...(no debugging symbols found)...done.
Loaded symbols for /lib64/libz.so.1
Reading symbols from /lib64/libcrypt.so.1...(no debugging symbols 
found)...done.
Loaded symbols for /lib64/libcrypt.so.1
Reading symbols from /lib64/libnsl.so.1...(no debugging symbols 
found)...done.
Loaded symbols for /lib64/libnsl.so.1
Reading symbols from /lib64/libm.so.6...(no debugging symbols found)...done.
Loaded symbols for /lib64/libm.so.6
Reading symbols from /usr/lib64/libssl.so.10...(no debugging symbols 
found)...done.
Loaded symbols for /usr/lib64/libssl.so.10
Reading symbols from /usr/lib64/libcrypto.so.10...(no debugging symbols 
found)...done.
Loaded symbols for /usr/lib64/libcrypto.so.10
Reading symbols from /usr/lib64/libfreebl3.so...(no debugging symbols 
found)...done.
Loaded symbols for /usr/lib64/libfreebl3.so
Reading symbols from /lib64/libgssapi_krb5.so.2...(no debugging symbols 
found)...done.
Loaded symbols for /lib64/libgssapi_krb5.so.2
Reading symbols from /lib64/libkrb5.so.3...(no debugging symbols 
found)...done.
Loaded symbols for /lib64/libkrb5.so.3
Reading symbols from /lib64/libcom_err.so.2...(no debugging symbols 
found)...done.
Loaded symbols for /lib64/libcom_err.so.2
Reading symbols from /lib64/libk5crypto.so.3...(no debugging symbols 
found)...done.
Loaded symbols for /lib64/libk5crypto.so.3
Reading symbols from /lib64/libkrb5support.so.0...(no debugging symbols 
found)...done.
Loaded symbols for /lib64/libkrb5support.so.0
Reading symbols from /lib64/libkeyutils.so.1...(no debugging symbols 
found)...done.
Loaded symbols for /lib64/libkeyutils.so.1
Reading symbols from /lib64/libresolv.so.2...(no debugging symbols 
found)...done.
Loaded symbols for /lib64/libresolv.so.2
Reading symbols from /lib64/libselinux.so.1...(no debugging symbols 
found)...done.
Loaded symbols for /lib64/libselinux.so.1
Reading symbols from /lib64/libgcc_s.so.1...(no debugging symbols 
found)...done.
Loaded symbols for /lib64/libgcc_s.so.1
Reading symbols from /lib64/libnss_files.so.2...(no debugging symbols 
found)...done.
Loaded symbols for /lib64/libnss_files.so.2
Core was generated by `/usr/sbin/slurmdbd'.
#0  0x000000325520f4b5 in sigwait () from /lib64/libpthread.so.0
Missing separate debuginfos, use: debuginfo-install 
slurm-slurmdbd-14.11.4-1fasrc01.el6.x86_64
(gdb) thread apply all where

Thread 10 (Thread 0x7fcc6a02e700 (LWP 27127)):
#0  0x000000325520822d in pthread_join () from /lib64/libpthread.so.0
#1  0x000000000042ffe7 in main (argc=<value optimized out>, argv=<value 
optimized out>) at slurmdbd.c:256

Thread 9 (Thread 0x7fc4a3fff700 (LWP 40251)):
#0  0x000000325520e75d in read () from /lib64/libpthread.so.0
#1  0x0000003255e574b9 in vio_read_buff () from 
/usr/lib64/mysql/libmysqlclient_r.so.16
#2  0x0000003255e5861f in ?? () from /usr/lib64/mysql/libmysqlclient_r.so.16
#3  0x0000003255e58a25 in my_net_read () from 
/usr/lib64/mysql/libmysqlclient_r.so.16
#4  0x0000003255e5304a in cli_safe_read () from 
/usr/lib64/mysql/libmysqlclient_r.so.16
#5  0x0000003255e537c9 in ?? () from /usr/lib64/mysql/libmysqlclient_r.so.16
#6  0x0000003255e51b6c in mysql_real_query () from 
/usr/lib64/mysql/libmysqlclient_r.so.16
#7  0x00007fcc69bd7cea in _mysql_query_internal (db_conn=0x7fc49d5c0c80,
     query=0x7fc49d79bfc0 "select t1.account, t1.array_max_tasks, 
t1.array_task_str, t1.cpus_alloc, t1.cpus_req, t1.derived_ec, 
t1.derived_es, t1.exit_code, t1.id_array_job, t1.id_array_task, 
t1.id_assoc, t1.id_block, t1.id_gro"...) at mysql_common.c:142
#8  0x00007fcc69bd80c5 in mysql_db_query_ret (mysql_conn=0x7fc49e94d3f0,
     query=0x7fc49d79bfc0 "select t1.account, t1.array_max_tasks, 
t1.array_task_str, t1.cpus_alloc, t1.cpus_req, t1.derived_ec, 
t1.derived_es, t1.exit_code, t1.id_array_job, t1.id_array_task, 
t1.id_assoc, t1.id_block, t1.id_gro"..., last=false) at mysql_common.c:829
#9  0x00007fcc69bc1e61 in _cluster_get_jobs (mysql_conn=0x7fc49e94d3f0, 
uid=<value optimized out>, job_cond=<value optimized out>) at 
as_mysql_jobacct_process.c:464
#10 as_mysql_jobacct_process_get_jobs (mysql_conn=0x7fc49e94d3f0, 
uid=<value optimized out>, job_cond=<value optimized out>) at 
as_mysql_jobacct_process.c:1592
#11 0x000000000042521c in _get_jobs_cond (slurmdbd_conn=0x7fcc60000fb0, 
in_buffer=0x7fc49e94d3c0, out_buffer=0x7fc4a3ffee58, uid=0x7fc4a3ffee78) 
at proc_req.c:1376
#12 0x000000000042a701 in proc_req (slurmdbd_conn=0x7fcc60000fb0, 
msg=<value optimized out>, msg_size=<value optimized out>, first=false, 
out_buffer=0x7fc4a3ffee58, uid=0x7fc4a3ffee78)
     at proc_req.c:286
#13 0x000000000042f09c in _service_connection (arg=<value optimized 
out>) at rpc_mgr.c:232
#14 0x00000032552079d1 in start_thread () from /lib64/libpthread.so.0
#15 0x0000003254ae88fd in clone () from /lib64/libc.so.6

Thread 8 (Thread 0x7fcc68f3a700 (LWP 59996)):
#0  0x000000325520e264 in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x0000003255209508 in _L_lock_854 () from /lib64/libpthread.so.0
#2  0x00000032552093d7 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x0000000000448772 in log_msg (level=LOG_LEVEL_DEBUG4, 
fmt=0x7fcc69bdbc7f "%d(%s:%d) query\n%s", args=0x7fcc68f39a70) at log.c:964
#4  0x0000000000448ea5 in debug4 (fmt=<value optimized out>) at log.c:1218
#5  0x00007fcc69bc2634 in _cluster_get_jobs (mysql_conn=0x7fcc5345fa70, 
uid=<value optimized out>, job_cond=<value optimized out>) at 
as_mysql_jobacct_process.c:742
#6  as_mysql_jobacct_process_get_jobs (mysql_conn=0x7fcc5345fa70, 
uid=<value optimized out>, job_cond=<value optimized out>) at 
as_mysql_jobacct_process.c:1592
#7  0x000000000042521c in _get_jobs_cond (slurmdbd_conn=0x7fcc60000a70, 
in_buffer=0x7fcc5345fa40, out_buffer=0x7fcc68f39e58, uid=0x7fcc68f39e78) 
at proc_req.c:1376
#8  0x000000000042a701 in proc_req (slurmdbd_conn=0x7fcc60000a70, 
msg=<value optimized out>, msg_size=<value optimized out>, first=false, 
out_buffer=0x7fcc68f39e58, uid=0x7fcc68f39e78)
     at proc_req.c:286
#9  0x000000000042f09c in _service_connection (arg=<value optimized 
out>) at rpc_mgr.c:232
#10 0x00000032552079d1 in start_thread () from /lib64/libpthread.so.0
#11 0x0000003254ae88fd in clone () from /lib64/libc.so.6

Thread 7 (Thread 0x7fcc6903b700 (LWP 28265)):
#0  0x000000325520e264 in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x0000003255209508 in _L_lock_854 () from /lib64/libpthread.so.0
#2  0x00000032552093d7 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x0000000000448772 in log_msg (level=LOG_LEVEL_DEBUG4, 
fmt=0x7fcc69bdbc7f "%d(%s:%d) query\n%s", args=0x7fcc6903aa70) at log.c:964
---Type <return> to continue, or q <return> to quit---
#4  0x0000000000448ea5 in debug4 (fmt=<value optimized out>) at log.c:1218
#5  0x00007fcc69bc2634 in _cluster_get_jobs (mysql_conn=0x7fcc465fc880, 
uid=<value optimized out>, job_cond=<value optimized out>) at 
as_mysql_jobacct_process.c:742
#6  as_mysql_jobacct_process_get_jobs (mysql_conn=0x7fcc465fc880, 
uid=<value optimized out>, job_cond=<value optimized out>) at 
as_mysql_jobacct_process.c:1592
#7  0x000000000042521c in _get_jobs_cond (slurmdbd_conn=0x7fcc60000db0, 
in_buffer=0x7fcc58817070, out_buffer=0x7fcc6903ae58, uid=0x7fcc6903ae78) 
at proc_req.c:1376
#8  0x000000000042a701 in proc_req (slurmdbd_conn=0x7fcc60000db0, 
msg=<value optimized out>, msg_size=<value optimized out>, first=false, 
out_buffer=0x7fcc6903ae58, uid=0x7fcc6903ae78)
     at proc_req.c:286
#9  0x000000000042f09c in _service_connection (arg=<value optimized 
out>) at rpc_mgr.c:232
#10 0x00000032552079d1 in start_thread () from /lib64/libpthread.so.0
#11 0x0000003254ae88fd in clone () from /lib64/libc.so.6

Thread 6 (Thread 0x7fcc6923d700 (LWP 34726)):
#0  0x0000003254adb47d in close () from /lib64/libc.so.6
#1  0x0000003254ae4544 in openlog_internal () from /lib64/libc.so.6
#2  0x0000003254ae4e07 in __vsyslog_chk () from /lib64/libc.so.6
#3  0x0000003254ae5090 in syslog () from /lib64/libc.so.6
#4  0x0000000000448a7b in log_msg (level=<value optimized out>, 
fmt=<value optimized out>, args=0x7fcc6923cb40) at log.c:1074
#5  0x0000000000449265 in error (fmt=<value optimized out>) at log.c:1154
#6  0x0000000000450e5a in packmem (valp=0x7fb702d7b0b0 "jenny01", 
size_val=8, buffer=0x7fa28a579fb0) at pack.c:520
#7  0x00000000004a5d30 in slurmdb_pack_job_rec (object=0x7fb702d7ad90, 
rpc_version=7168, buffer=0x7fa28a579fb0) at slurmdb_pack.c:3572
#8  0x00000000004aebc4 in slurmdbd_pack_list_msg (msg=0x7fcc6923ccd0, 
rpc_version=7168, type=<value optimized out>, buffer=0x7fa28a579fb0) at 
slurmdbd_defs.c:3535
#9  0x000000000042526c in _get_jobs_cond (slurmdbd_conn=<value optimized 
out>, in_buffer=<value optimized out>, out_buffer=0x7fcc6923ce58, 
uid=<value optimized out>) at proc_req.c:1384
#10 0x000000000042a701 in proc_req (slurmdbd_conn=0x7fcc60000c10, 
msg=<value optimized out>, msg_size=<value optimized out>, first=false, 
out_buffer=0x7fcc6923ce58, uid=0x7fcc6923ce78)
     at proc_req.c:286
#11 0x000000000042f09c in _service_connection (arg=<value optimized 
out>) at rpc_mgr.c:232
#12 0x00000032552079d1 in start_thread () from /lib64/libpthread.so.0
#13 0x0000003254ae88fd in clone () from /lib64/libc.so.6

Thread 5 (Thread 0x7fcc6913c700 (LWP 1932)):
#0  0x000000325520e264 in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x0000003255209508 in _L_lock_854 () from /lib64/libpthread.so.0
#2  0x00000032552093d7 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x0000000000448772 in log_msg (level=LOG_LEVEL_DEBUG4, 
fmt=0x7fcc69bdbfc6 "got %d commits", args=0x7fcc6913bc00) at log.c:964
#4  0x0000000000448ea5 in debug4 (fmt=<value optimized out>) at log.c:1218
#5  0x00007fcc69ba9441 in acct_storage_p_commit 
(mysql_conn=0x7fcc54000ce0, commit=true) at accounting_storage_mysql.c:2229
#6  0x000000000042a080 in proc_req (slurmdbd_conn=0x7fcc600008d0, 
msg=<value optimized out>, msg_size=<value optimized out>, first=<value 
optimized out>, out_buffer=0x7fcc6913be58,
     uid=<value optimized out>) at proc_req.c:476
#7  0x000000000042f09c in _service_connection (arg=<value optimized 
out>) at rpc_mgr.c:232
#8  0x00000032552079d1 in start_thread () from /lib64/libpthread.so.0
#9  0x0000003254ae88fd in clone () from /lib64/libc.so.6

Thread 4 (Thread 0x7fcc6933e700 (LWP 27979)):
#0  0x0000003254aaca3d in nanosleep () from /lib64/libc.so.6
#1  0x0000003254aac8b0 in sleep () from /lib64/libc.so.6
#2  0x00000000004305ba in _rollup_handler (db_conn=0x26adb10) at 
slurmdbd.c:605
#3  0x00000032552079d1 in start_thread () from /lib64/libpthread.so.0
#4  0x0000003254ae88fd in clone () from /lib64/libc.so.6

Thread 3 (Thread 0x7fcc6943f700 (LWP 27978)):
---Type <return> to continue, or q <return> to quit---
#0  0x000000325520ea5d in accept () from /lib64/libpthread.so.0
#1  0x0000000000490456 in slurm_accept_msg_conn (fd=<value optimized 
out>, addr=<value optimized out>) at 
slurm_protocol_socket_implementation.c:480
#2  0x000000000042e90b in rpc_mgr (no_data=<value optimized out>) at 
rpc_mgr.c:132
#3  0x00000032552079d1 in start_thread () from /lib64/libpthread.so.0
#4  0x0000003254ae88fd in clone () from /lib64/libc.so.6

Thread 2 (Thread 0x7fcc695c6700 (LWP 27130)):
#0  0x0000003254aaca3d in nanosleep () from /lib64/libc.so.6
#1  0x0000003254aac8b0 in sleep () from /lib64/libc.so.6
#2  0x00000000004306d5 in _commit_handler (db_conn=<value optimized 
out>) at slurmdbd.c:662
#3  0x00000032552079d1 in start_thread () from /lib64/libpthread.so.0
#4  0x0000003254ae88fd in clone () from /lib64/libc.so.6

Thread 1 (Thread 0x7fcc696c7700 (LWP 27129)):
#0  0x000000325520f4b5 in sigwait () from /lib64/libpthread.so.0
#1  0x000000000043090a in _signal_handler (no_data=<value optimized 
out>) at slurmdbd.c:722
#2  0x00000032552079d1 in start_thread () from /lib64/libpthread.so.0
#3  0x0000003254ae88fd in clone () from /lib64/libc.so.6
(gdb)


On 03/17/2015 12:31 PM, bugs@schedmd.com wrote:
>
> *Comment # 8 <http://bugs.schedmd.com/show_bug.cgi?id=1534#c8> on bug 
> 1534 <http://bugs.schedmd.com/show_bug.cgi?id=1534> from David Bigagli 
> <mailto:david@schedmd.com> *
> We would like to have a look at the stack. If you can do:
>
> gdb slurmdbd core
>
> then at the gdb prompt type:
>
> thread apply all where
>
> David
> ------------------------------------------------------------------------
> You are receiving this mail because:
>
>   * You reported the bug.
>
Comment 10 David Bigagli 2015-03-17 05:09:37 MDT
Thanks. The thread 6 is the suspicious one. What it seems happening is a wrong
malloc somehow that allocate huge amount of memory. Do you remember if you ran 
sacct at that time? How many jobs you have in the database?

David
Comment 11 Paul Edmon 2015-03-17 05:13:40 MDT
We have about 35 million at this point (about 40 GB in size).  We have 
it set up to purge, but we haven't had a chance to run on yet as we 
haven't hit the 1st of the month.

We have various people querying the database, so I don't really know 
what sacct command caused this.  I anticipate it was caused by one 
though, but what specifically was queried I can't say as its not just us 
but users querying the database.

-Paul Edmon-

On 03/17/2015 01:09 PM, bugs@schedmd.com wrote:
>
> *Comment # 10 <http://bugs.schedmd.com/show_bug.cgi?id=1534#c10> on 
> bug 1534 <http://bugs.schedmd.com/show_bug.cgi?id=1534> from David 
> Bigagli <mailto:david@schedmd.com> *
> Thanks. The thread 6 is the suspicious one. What it seems happening is a wrong
> malloc somehow that allocate huge amount of memory. Do you remember if you ran
> sacct at that time? How many jobs you have in the database?
>
> David
> ------------------------------------------------------------------------
> You are receiving this mail because:
>
>   * You reported the bug.
>
Comment 12 David Bigagli 2015-03-17 06:27:25 MDT
We are debugging on our side. Would it be possible to run slurmdbd under valgrind?

>valgrind --leak-check=full --show-reachable=yes --log-file=Log ./slurmdbd 

Under what memory limit you run slurmdbd if any? ulimit -a.

David
Comment 13 Paul Edmon 2015-03-17 07:14:13 MDT
Sure, I can do that when I have a moment.  I till take me a bit to set 
that up.

We run it with ulimit -a

-Paul Edmon-

On 3/17/2015 2:27 PM, bugs@schedmd.com wrote:
>
> *Comment # 12 <http://bugs.schedmd.com/show_bug.cgi?id=1534#c12> on 
> bug 1534 <http://bugs.schedmd.com/show_bug.cgi?id=1534> from David 
> Bigagli <mailto:david@schedmd.com> *
> We are debugging on our side. Would it be possible to run slurmdbd under
> valgrind?
>
> >valgrind --leak-check=full --show-reachable=yes --log-file=Log ./slurmdbd
>
> Under what memory limit you run slurmdbd if any? ulimit -a.
>
> David
> ------------------------------------------------------------------------
> You are receiving this mail because:
>
>   * You reported the bug.
>
Comment 14 Paul Edmon 2015-03-18 02:16:45 MDT
Okay, I have it running now.  Just have to wait and see if it reproduces.


-Paul Edmon-

On 03/17/2015 02:27 PM, bugs@schedmd.com wrote:
>
> *Comment # 12 <http://bugs.schedmd.com/show_bug.cgi?id=1534#c12> on 
> bug 1534 <http://bugs.schedmd.com/show_bug.cgi?id=1534> from David 
> Bigagli <mailto:david@schedmd.com> *
> We are debugging on our side. Would it be possible to run slurmdbd under
> valgrind?
>
> >valgrind --leak-check=full --show-reachable=yes --log-file=Log ./slurmdbd
>
> Under what memory limit you run slurmdbd if any? ulimit -a.
>
> David
> ------------------------------------------------------------------------
> You are receiving this mail because:
>
>   * You reported the bug.
>
Comment 15 David Bigagli 2015-03-18 08:04:49 MDT
Do you see anything in the valgrind log file? Say you run sacct does valgrind print something?

David
Comment 16 Paul Edmon 2015-03-18 08:08:00 MDT
Created attachment 1740 [details]
valgrind Log
Comment 17 Paul Edmon 2015-03-18 08:09:34 MDT
Just attached the valgrind Log currently.  slurmdbd is still running with valgrind, but nothing else has been printed yet and slurmdbd hasn't run away in terms of memory currently.
Comment 18 David Bigagli 2015-03-18 08:16:30 MDT
The log looks good up to now. Still reachable and global variables and possibly lost are caches, these messages are expected.

David
Comment 19 Paul Edmon 2015-03-20 01:28:52 MDT
Created attachment 1745 [details]
valgrind Log 2
Comment 20 Paul Edmon 2015-03-20 01:29:58 MDT
So valgrind exited yesterday.  We have slurmdbd running now with out valgrind.  I don't know if there was memory leak detected, but the log is attached.
Comment 21 David Bigagli 2015-03-20 05:28:46 MDT
The log does not show leaks. We are still investigating and will update this
problem as soon as we found something.

David
Comment 22 Paul Edmon 2015-03-20 05:29:53 MDT
Thanks.  I'm not sure why valgrind closed then.  Does it have a built in 
time limit?  Let me know if you want me to do anything else from my end.

-Paul Edmon-

On 03/20/2015 01:28 PM, bugs@schedmd.com wrote:
>
> *Comment # 21 <http://bugs.schedmd.com/show_bug.cgi?id=1534#c21> on 
> bug 1534 <http://bugs.schedmd.com/show_bug.cgi?id=1534> from David 
> Bigagli <mailto:david@schedmd.com> *
> The log does not show leaks. We are still investigating and will update this
> problem as soon as we found something.
>
> David
> ------------------------------------------------------------------------
> You are receiving this mail because:
>
>   * You reported the bug.
>
Comment 23 David Bigagli 2015-03-20 05:37:33 MDT
I don't think valgrind has a run time limit. Perhaps the slurmdbd exit, is there
anything in the slurmdbd log file?

David
Comment 24 Paul Edmon 2015-03-20 05:41:02 MDT
Ah, looks like puppet killed it and restarted it since service slurmdbd 
status would have shown up as not operating.  So it was innocuous.

-Paul Edmon-

On 03/20/2015 01:37 PM, bugs@schedmd.com wrote:
>
> *Comment # 23 <http://bugs.schedmd.com/show_bug.cgi?id=1534#c23> on 
> bug 1534 <http://bugs.schedmd.com/show_bug.cgi?id=1534> from David 
> Bigagli <mailto:david@schedmd.com> *
> I don't think valgrind has a run time limit. Perhaps the slurmdbd exit, is
> there
> anything in the slurmdbd log file?
>
> David
> ------------------------------------------------------------------------
> You are receiving this mail because:
>
>   * You reported the bug.
>
Comment 25 Paul Edmon 2015-03-23 03:13:55 MDT
Just FYI on the current status of this.  I haven't seen this repeat 
itself recently.  Will let you know if it does.  We are planning on 
upgrading to 14.11.5 on April 7th.  Let me know if we should not proceed.

-Paul Edmon-

On 03/20/2015 01:37 PM, bugs@schedmd.com wrote:
>
> *Comment # 23 <http://bugs.schedmd.com/show_bug.cgi?id=1534#c23> on 
> bug 1534 <http://bugs.schedmd.com/show_bug.cgi?id=1534> from David 
> Bigagli <mailto:david@schedmd.com> *
> I don't think valgrind has a run time limit. Perhaps the slurmdbd exit, is
> there
> anything in the slurmdbd log file?
>
> David
> ------------------------------------------------------------------------
> You are receiving this mail because:
>
>   * You reported the bug.
>
Comment 26 David Bigagli 2015-03-23 04:15:29 MDT
I think upgrading is fine. We ran slurmdbd under different configurations with valgrind and found no leaks. What is the memory limit under which you run slurmdbd, you can see it ulimit -a from the shell that starts the daemon.

David
Comment 27 Paul Edmon 2015-03-23 05:06:13 MDT
So this runs as the slurm user in our environment

bash-4.1$ ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 2066240
max locked memory       (kbytes, -l) unlimited
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 10240
cpu time               (seconds, -t) unlimited
max user processes              (-u) 1024
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

-Paul Edmon-

On 03/23/2015 12:15 PM, bugs@schedmd.com wrote:
>
> *Comment # 26 <http://bugs.schedmd.com/show_bug.cgi?id=1534#c26> on 
> bug 1534 <http://bugs.schedmd.com/show_bug.cgi?id=1534> from David 
> Bigagli <mailto:david@schedmd.com> *
> I think upgrading is fine. We ran slurmdbd under different configurations with
> valgrind and found no leaks. What is the memory limit under which you run
> slurmdbd, you can see it ulimit -a from the shell that starts the daemon.
>
> David
> ------------------------------------------------------------------------
> You are receiving this mail because:
>
>   * You reported the bug.
>
Comment 28 David Bigagli 2015-03-27 06:36:56 MDT
Hi,
  this cannot be reproduced anymore? In this case should we close this ticket and reopen it if it appears again.

David
Comment 29 Paul Edmon 2015-03-27 06:52:47 MDT
Yeah, I haven't seen the leak since.  Will let you know if it reoccurs.

-Paul Edmon-

On 03/27/2015 02:36 PM, bugs@schedmd.com wrote:
>
> *Comment # 28 <http://bugs.schedmd.com/show_bug.cgi?id=1534#c28> on 
> bug 1534 <http://bugs.schedmd.com/show_bug.cgi?id=1534> from David 
> Bigagli <mailto:david@schedmd.com> *
> Hi,
>    this cannot be reproduced anymore? In this case should we close this ticket
> and reopen it if it appears again.
>
> David
> ------------------------------------------------------------------------
> You are receiving this mail because:
>
>   * You reported the bug.
>
Comment 30 David Bigagli 2015-03-27 06:55:40 MDT
Close for now.

David