Bug 5064 - slurmctld failing with pthread_create error Resource temporarily unavailable
Summary: slurmctld failing with pthread_create error Resource temporarily unavailable
Status: RESOLVED INFOGIVEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmctld
Version: 17.11.2
Hardware: Linux
OS: Linux
Importance: --- 3 - Medium Impact
Assignee: Alejandro Sanchez
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2018-04-12 22:37 MDT by James Powell
Modified: 2018-04-18 03:03 MDT
CC List: 1 user

See Also:
Site: CSIRO
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA SIte: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed:
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Attachments
slurm.conf (10.26 KB, text/plain)
2018-04-12 22:41 MDT, James Powell
environment (2.38 KB, text/plain)
2018-04-12 22:43 MDT, James Powell
slurmctld log (18.44 MB, application/x-gzip)
2018-04-12 22:54 MDT, James Powell

Description James Powell 2018-04-12 22:37:26 MDT
Hi Support,

We recently upgraded to 17.11.2 (from 16.05.x, as part of a Bright Cluster Manager upgrade), and now that users have started submitting jobs (1000+) slurmctld is failing with one of the following errors:

fatal: prolog_slurmctld: pthread_create error Resource temporarily unavailable

or 

fatal: _slurmctld_rpc_mgr: pthread_create error Resource temporarily unavailable

These seem to occur just before scheduling, leaving our cluster idle.

Any advice on what to look at would be most appreciated.

I've tried reverting to the default "SchedulerParameters" (by commenting out that line in slurm.conf), and that made no difference.

After some experimentation, reducing the number of jobs considered for scheduling does help, in that it at least allows a couple of scheduling cycles to complete before failing.

So, as a temporary workaround, I'm able to get slurmctld to run through a couple of scheduling cycles before failing by limiting all users to 100 running jobs (via MAXJOBS). Since Bright restarts slurmctld upon failure, we're able to continue processing for now.
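
For reference, a rough sketch of how such a per-user limit can be applied, assuming accounting associations are in use (the user name below is just a placeholder):

# cap a single (placeholder) user at 100 running jobs
sacctmgr modify user where name=jdoe set MaxJobs=100
# verify the association limit
sacctmgr show assoc where user=jdoe format=User,MaxJobs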

Will attach slurm.conf, logs & some environment settings.

Cheers

James
Comment 1 James Powell 2018-04-12 22:41:45 MDT
Created attachment 6624 [details]
slurm.conf
Comment 2 James Powell 2018-04-12 22:43:43 MDT
Created attachment 6625 [details]
environment
Comment 3 James Powell 2018-04-12 22:54:27 MDT
Created attachment 6626 [details]
slurmctld log
Comment 4 Alejandro Sanchez 2018-04-13 03:16:59 MDT
Hi James. Looking at the slurmctld limits you reported:

cm01:~ # cat /proc/3438/limits
Limit                     Soft Limit           Hard Limit           Units
...
Max processes             515222               515222               processes
Max open files            4096                 4096                 files
...

Will you try increasing the Max open files limit, then restart the daemon and see if things improve?
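
For example, one way to do that, assuming slurmctld is managed by systemd (the drop-in name and the 65536 value below are only illustrative):

# raise the open-files limit via a systemd drop-in, then restart
mkdir -p /etc/systemd/system/slurmctld.service.d
printf '[Service]\nLimitNOFILE=65536\n' > /etc/systemd/system/slurmctld.service.d/override.conf
systemctl daemon-reload && systemctl restart slurmctld
# confirm the new limit on the running daemon
grep 'Max open files' /proc/$(pidof slurmctld)/limits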

Also, looking at your logs I see this a couple of times:

[2018-04-13T14:26:08.768] error: chdir(/var/log): Permission denied

Can you check the permissions there?

There are also a number of errors due to slurm.conf not being consistent across all nodes in the cluster. Please make sure it's in sync. Thanks.
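
One rough way to spot-check that, assuming pdsh (or a similar parallel shell) is available; the host list and slurm.conf path below are placeholders:

# compare slurm.conf checksums across the cluster and collate identical output
pdsh -w node[001-010] md5sum /etc/slurm/slurm.conf | dshbak -c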
Comment 5 James Powell 2018-04-15 17:26:39 MDT
(In reply to Alejandro Sanchez from comment #4)
> Hi James. Looking at the slurmctld limits you reported:
> 
> cm01:~ # cat /proc/3438/limits
> Limit                     Soft Limit           Hard Limit           Units
> ...
> Max processes             515222               515222               processes
> Max open files            4096                 4096                 files
> ...
> 
> Will you try increasing the Max open files limit, then restart the daemon
> and see if things improve?

cm01:~ # systemctl status slurmctld
● slurmctld.service - Slurm controller daemon
   Loaded: loaded (/usr/lib/systemd/system/slurmctld.service; disabled; vendor preset: disabled)
   Active: active (running) since Mon 2018-04-16 09:12:09 AEST; 32s ago
  Process: 18863 ExecStart=/cm/shared/apps/slurm/17.11.2/sbin/slurmctld $SLURMCTLD_OPTIONS (code=exited, status=0/SUCCESS)
 Main PID: 18868 (slurmctld)
    Tasks: 15 (limit: 512)
   CGroup: /system.slice/slurmctld.service
           └─18868 /cm/shared/apps/slurm/17.11.2/sbin/slurmctld
...
cm01:~ # cat /proc/18868/limits 
Limit                     Soft Limit           Hard Limit           Units     
...
Max processes             515222               515222               processes 
Max open files            65536                65536                files    
...
cm01:~ # grep fatal /var/log/slurmctld
[2018-04-16T00:16:00.092] fatal: _slurmctld_rpc_mgr: pthread_create error Resource temporarily unavailable
[2018-04-16T00:34:00.160] fatal: _slurmctld_rpc_mgr: pthread_create error Resource temporarily unavailable
...
[2018-04-16T09:00:00.118] fatal: _slurmctld_rpc_mgr: pthread_create error Resource temporarily unavailable
[2018-04-16T09:12:00.081] fatal: _slurmctld_rpc_mgr: pthread_create error Resource temporarily unavailable

The fatal errors are less frequent, but I suspect that may be a consequence of the low number of queued jobs over the weekend.

> Also, looking at your logs I see this a couple of times:
> 
> [2018-04-13T14:26:08.768] error: chdir(/var/log): Permission denied
> 
> can you check the permissions there?

cm01:~ # ls -ld /var/log/
drwxr-xr-x 23 root root 12288 Apr 16 00:03 /var/log/
cm01:~ # ls -l /var/log/slurmctld
-rw-r----- 1 slurm slurm 5110471 Apr 16 09:20 /var/log/slurmctld

Changing /var/log to 777 permissions does remove that message from the slurmctld log.
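
A less drastic alternative we could try is a dedicated, slurm-owned log directory (the paths below are just illustrative):

# create a directory slurmctld can chdir into and write its log
mkdir -p /var/log/slurm
chown slurm:slurm /var/log/slurm
# then point slurm.conf at it:
# SlurmctldLogFile=/var/log/slurm/slurmctld.log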
 
> There are also a number of errors due to slurm.conf not being consistent
> across all nodes in the cluster. Please make sure it's in sync. Thanks.

All nodes in the cluster use a link to the same slurm.conf on shared storage. We see this error occasionally and it's puzzling.

Cheers

James
Comment 6 Alejandro Sanchez 2018-04-16 03:44:29 MDT
A couple more suggestions:

Is the slurmctld.service file configured with:
TasksMax=infinity in the [Service] section?

Check/increase the system-wide:
/proc/sys/kernel/threads-max

Increase even more Max open files.
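
For reference, a rough sketch of those checks and changes, assuming a systemd drop-in as for the open-files limit (all values and the drop-in path are only illustrative):

# set TasksMax alongside the higher open-files limit, then restart
printf '[Service]\nTasksMax=infinity\nLimitNOFILE=262144\n' > /etc/systemd/system/slurmctld.service.d/override.conf
systemctl daemon-reload && systemctl restart slurmctld
systemctl show -p TasksMax slurmctld.service

# check the system-wide thread limit and raise it if it looks low, e.g.:
sysctl kernel.threads-max
# sysctl -w kernel.threads-max=2000000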
Comment 7 James Powell 2018-04-17 00:10:32 MDT
(In reply to Alejandro Sanchez from comment #6)
> A couple more suggestions:
> 
> Is the slurmctld.service file configured with:
> TasksMax=infinity in the [Service] section?

No, it wasn't; added it now:

cm01:~ # cat /usr/lib/systemd/system/slurmctld.service
...
[Service]
Type=forking
EnvironmentFile=-/etc/sysconfig/slurmctld
ExecStart=/cm/shared/apps/slurm/17.11.2/sbin/slurmctld $SLURMCTLD_OPTIONS
ExecReload=/bin/kill -HUP $MAINPID
PIDFile=/var/run/slurmctld.pid
LimitNOFILE=262144
TasksMax=infinity
...

> Check/increase the system-wide:
> /proc/sys/kernel/threads-max

cm01:~ # cat /proc/sys/kernel/threads-max
1030760

That's sufficient, I think.

> Increase even more Max open files.

Increased from 64k to 256k.

Either the increase in Max open files or the addition of TasksMax=infinity has settled slurmctld: 5h20m so far without a fatal error.

cm01:~ # systemctl status slurmctld.service 
● slurmctld.service - Slurm controller daemon
   Loaded: loaded (/usr/lib/systemd/system/slurmctld.service; disabled; vendor preset: disabled)
   Active: active (running) since Tue 2018-04-17 10:46:25 AEST; 5h 20min ago
...

cm01:~ # cat /proc/34530/limits 
Limit                     Soft Limit           Hard Limit           Units     
... 
Max processes             515222               515222               processes 
Max open files            262144               262144               files     
...

cm01:~ # systemctl show -p TasksMax slurmctld.service 
TasksMax=18446744073709551615

I'll increase MAXJOBS to 1000 (where we were before upgrading) and see if we remain stable. Appreciate the help.

Cheers

James
Comment 8 James Powell 2018-04-17 18:43:35 MDT
It's been 24 hours since making the changes and there have been no further fatal errors. I'm calling it solved. Thanks, Support.

Cheers

James
Comment 9 Alejandro Sanchez 2018-04-18 03:03:25 MDT
(In reply to James Powell from comment #8)
> It's been 24 hours since making the changes and there have been no further
> fatal errors. I'm calling it solved. Thanks, Support.
> 
> Cheers
> 
> James

Glad to see that. Closing the bug; please reopen if needed.