Created attachment 17437 [details] runlog for busted run

We upgraded to 20.11.2 (commit bde072c607) on Monday as part of our normal upgrade pattern. The upgrade has for the most part gone well, but users have been noticing the following error:

[root@holy7c22501 ~]# srun -n 1 --pty -p serial_requeue -w seasmicro26 /bin/bash
srun: job 12914079 queued and waiting for resources
srun: job 12914079 has been allocated resources
slurmstepd: error: Failed to invoke task plugins: task_p_pre_launch error
srun: error: seasmicro26: task 0: Exited with exit code 1

It doesn't happen all the time, and it only happens on certain nodes. I haven't figured out what is going on. I've attached a debug run for you to look over. Let me know if you have any ideas as to what is happening.

-Paul Edmon-
Hi Paul,

We've seen this error crop up in 20.11 with a few customers and are trying to get to the bottom of it. Could you attach your current slurm.conf?

Thanks,
-Michael
Created attachment 17446 [details] slurm.conf
Hey Paul,

When you run slurmd in the foreground (slurmd -D), it actually does not pick up all the logs (the stepd logs do not get printed). So could you attach the slurmd.log snippet that corresponds to the busted run? The stepd logs should tell us more about what caused the task_p_pre_launch error.

Thanks,
Michael
So the slurm log does not show any slurmstepd stuff:

Jan 12 14:14:23 seasmicro26 slurmd[31823]: Node configuration differs from hardware: CPUs=12:24(hw) Boards=1:1(hw) SocketsPerBoard=2:2(hw) CoresPerSocket=6:$
Jan 12 14:14:23 seasmicro26 slurmd[31823]: Can not stat gres.conf file (/etc/slurm/gres.conf), using slurm.conf data
Jan 12 14:14:23 seasmicro26 slurmd[31823]: topology/tree: init: topology tree plugin loaded
Jan 12 14:14:23 seasmicro26 slurmd[31823]: route/topology: init: route topology plugin loaded
Jan 12 14:14:23 seasmicro26 slurmd[31823]: Resource spec: system cgroup memory limit set to 4096 MB
Jan 12 14:14:23 seasmicro26 slurmd[31823]: task/affinity: init: task affinity plugin loaded with CPU mask 0xffffff
Jan 12 14:14:23 seasmicro26 slurmd[31823]: cred/munge: init: Munge credential signature plugin loaded
Jan 12 14:14:23 seasmicro26 slurmd[31823]: slurmd version 20.11.2 started
Jan 12 14:14:23 seasmicro26 slurmd[31823]: slurmd started on Tue, 12 Jan 2021 14:14:23 -0500
Jan 12 14:14:24 seasmicro26 slurmd[31823]: CPUs=12 Boards=1 Sockets=2 Cores=6 Threads=1 Memory=128664 TmpDisk=901823 Uptime=51395 CPUSpecList=(null) Feature$
Jan 12 14:14:31 seasmicro26 slurmd[31823]: launch task StepId=12913850.0 request from UID:0 GID:0 HOST:10.31.142.192 PORT:59652
Jan 12 14:14:31 seasmicro26 slurmd[31823]: task/affinity: lllp_distribution: JobId=12913850 implicit auto binding: cores,one_thread, dist 8192
Jan 12 14:14:31 seasmicro26 slurmd[31823]: task/affinity: _task_layout_lllp_cyclic: _task_layout_lllp_cyclic
Jan 12 14:14:31 seasmicro26 slurmd[31823]: task/affinity: _lllp_generate_cpu_bind: _lllp_generate_cpu_bind jobid [12913850]: mask_cpu,one_thread, 0x001000
Jan 12 14:14:34 seasmicro26 slurm_epilog[31924]: Starting epilog for job 12913850 by root...
Jan 12 14:14:34 seasmicro26 slurm_epilog[31929]: Epilog for job 12913850 by root complete.
Jan 12 14:14:37 seasmicro26 slurmd[31823]: got shutdown request

That's the log for that job.
Odds are I don't have the debug flags set correctly. I'm happy to run another test job, just let me know what debug flags you want.
(In reply to Paul Edmon from comment #4)
> That's the log for that job. Odds are I don't have the debug flags set
> correctly. I'm happy to run another test job, just let me know what debug
> flags you want.

Hmm, you should see some logs from the step. Set the slurmd log level to debug2, restart the slurmd, and reproduce the problem. I'm expecting the slurmd.log to reveal something like this:

[2020-12-17T14:21:42.676] [384.0] task_p_pre_launch: Using sched_affinity for tasks
[2020-12-17T14:21:42.677] [384.0] sched_setaffinity(18992,128,0x2) failed: Invalid argument
[2020-12-17T14:21:42.677] [384.0] debug: task_g_pre_launch: task/affinity: Unspecified error
[2020-12-17T14:21:42.677] [384.0] error: Failed to invoke task plugins: task_p_pre_launch error
This is what I get with debug set to debug2.

-Paul Edmon-

Jan 13 13:32:52 seasmicro26 slurmd[9074]: debug2: _insert_job_state: we already have a job state for job 13106940. No big deal, just an FYI.
Jan 13 13:32:52 seasmicro26 slurmd[9074]: debug2: Finish processing RPC: REQUEST_LAUNCH_PROLOG
Jan 13 13:32:52 seasmicro26 slurmd[9074]: debug2: Start processing RPC: REQUEST_LAUNCH_TASKS
Jan 13 13:32:52 seasmicro26 slurmd[9074]: debug2: Processing RPC: REQUEST_LAUNCH_TASKS
Jan 13 13:32:52 seasmicro26 slurmd[9074]: launch task StepId=13106940.0 request from UID:0 GID:0 HOST:10.31.142.192 PORT:41444
Jan 13 13:32:52 seasmicro26 slurmd[9074]: debug: Checking credential with 404 bytes of sig data
Jan 13 13:32:52 seasmicro26 slurmd[9074]: debug: task/affinity: task_p_slurmd_launch_request: task affinity : before lllp distribution cpu bind method is '(null type)' ((null))
Jan 13 13:32:52 seasmicro26 slurmd[9074]: debug: task/affinity: lllp_distribution: binding tasks:1 to nodes:0 sockets:0:1 cores:1:0 threads:1
Jan 13 13:32:52 seasmicro26 slurmd[9074]: task/affinity: lllp_distribution: JobId=13106940 implicit auto binding: cores,one_thread, dist 8192
Jan 13 13:32:52 seasmicro26 slurmd[9074]: debug2: task/affinity: lllp_distribution: JobId=13106940 will use lllp_cyclic because of SelectTypeParameters
Jan 13 13:32:52 seasmicro26 slurmd[9074]: task/affinity: _task_layout_lllp_cyclic: _task_layout_lllp_cyclic
Jan 13 13:32:52 seasmicro26 slurmd[9074]: task/affinity: _lllp_generate_cpu_bind: _lllp_generate_cpu_bind jobid [13106940]: mask_cpu,one_thread, 0x002000
Jan 13 13:32:52 seasmicro26 slurmd[9074]: debug: task/affinity: task_p_slurmd_launch_request: task affinity : after lllp distribution cpu bind method is 'mask_cpu,one_thread' (0x002000)
Jan 13 13:32:52 seasmicro26 slurmd[9074]: debug: Waiting for job 13106940's prolog to complete
Jan 13 13:32:52 seasmicro26 slurmd[9074]: debug: Finished wait for job 13106940's prolog to complete
Jan 13 13:32:52 seasmicro26 slurmd[9074]: debug2: Finish processing RPC: REQUEST_LAUNCH_TASKS
Jan 13 13:32:53 seasmicro26 slurmd[9074]: debug2: Start processing RPC: REQUEST_TERMINATE_JOB
Jan 13 13:32:53 seasmicro26 slurmd[9074]: debug2: Processing RPC: REQUEST_TERMINATE_JOB
Jan 13 13:32:53 seasmicro26 slurmd[9074]: debug: _rpc_terminate_job, uid = 57812 JobId=13106940
Jan 13 13:32:53 seasmicro26 slurmd[9074]: debug: credential for job 13106940 revoked
Jan 13 13:32:53 seasmicro26 slurmd[9074]: debug2: container signal 999 to StepId=13106940.extern
Jan 13 13:32:53 seasmicro26 slurmd[9074]: debug2: container signal 18 to StepId=13106940.extern
Jan 13 13:32:53 seasmicro26 slurmd[9074]: debug2: container signal 15 to StepId=13106940.extern
Jan 13 13:32:53 seasmicro26 slurmd[9074]: debug2: set revoke expiration for jobid 13106940 to 1610562893 UTS
Jan 13 13:32:53 seasmicro26 slurmd[9074]: debug: Waiting for job 13106940's prolog to complete
Jan 13 13:32:53 seasmicro26 slurmd[9074]: debug: Finished wait for job 13106940's prolog to complete
Jan 13 13:32:53 seasmicro26 slurmd[9074]: debug: [job 13106940] attempting to run epilog [/usr/local/bin/slurm_epilog]
Jan 13 13:32:53 seasmicro26 slurm_epilog[9947]: Starting epilog for job 13106940 by root...
Jan 13 13:32:53 seasmicro26 slurm_epilog[9951]: Epilog for job 13106940 by root complete.
Jan 13 13:32:53 seasmicro26 slurmd[9074]: debug: completed epilog for jobid 13106940
Jan 13 13:32:53 seasmicro26 slurmd[9074]: debug: JobId=13106940: sent epilog complete msg: rc = 0
Jan 13 13:32:53 seasmicro26 slurmd[9074]: debug2: Finish processing RPC: REQUEST_TERMINATE_JOB
Where are you getting this log output from? Is this a log aggregated by systemd or something? You should be seeing step logs ([13106940.0] ...), so it seems the log output you are using doesn't include this. You should be accessing the actual slurmd.log file for the node to get the full logs.
We have everything logged via syslog, so it should end up in journalctl, which is then dumped to /var/log/messages. We don't have a separate file for logging. When I pull from journalctl:

Jan 13 14:20:25 seasmicro26.rc.fas.harvard.edu slurmd[23560]: launch task StepId=13109913.0 request from UID:0 GID:0 HOST:10.31.142.192 PORT:41912
Jan 13 14:20:25 seasmicro26.rc.fas.harvard.edu slurmd[23560]: debug: Checking credential with 404 bytes of sig data
Jan 13 14:20:25 seasmicro26.rc.fas.harvard.edu slurmd[23560]: debug: task/affinity: task_p_slurmd_launch_request: task affinity : before lllp distribution cpu bind method is '(null type)' ((null))
Jan 13 14:20:25 seasmicro26.rc.fas.harvard.edu slurmd[23560]: debug: task/affinity: lllp_distribution: binding tasks:1 to nodes:0 sockets:0:1 cores:1:0 threads:1
Jan 13 14:20:25 seasmicro26.rc.fas.harvard.edu slurmd[23560]: task/affinity: lllp_distribution: JobId=13109913 implicit auto binding: cores,one_thread, dist 8192
Jan 13 14:20:25 seasmicro26.rc.fas.harvard.edu slurmd[23560]: debug2: task/affinity: lllp_distribution: JobId=13109913 will use lllp_cyclic because of SelectTypeParameters
Jan 13 14:20:25 seasmicro26.rc.fas.harvard.edu slurmd[23560]: task/affinity: _task_layout_lllp_cyclic: _task_layout_lllp_cyclic
Jan 13 14:20:25 seasmicro26.rc.fas.harvard.edu slurmd[23560]: task/affinity: _lllp_generate_cpu_bind: _lllp_generate_cpu_bind jobid [13109913]: mask_cpu,one_thread, 0x001000
Jan 13 14:20:25 seasmicro26.rc.fas.harvard.edu slurmd[23560]: debug: task/affinity: task_p_slurmd_launch_request: task affinity : after lllp distribution cpu bind method is 'mask_cpu,one_thread' (0x001000)
Jan 13 14:20:25 seasmicro26.rc.fas.harvard.edu slurmd[23560]: debug: Waiting for job 13109913's prolog to complete
Jan 13 14:20:25 seasmicro26.rc.fas.harvard.edu slurmd[23560]: debug: Finished wait for job 13109913's prolog to complete
Jan 13 14:20:25 seasmicro26.rc.fas.harvard.edu slurmd[23560]: debug2: Finish processing RPC: REQUEST_LAUNCH_TASKS
Jan 13 14:20:25 seasmicro26.rc.fas.harvard.edu [23841]: [13109913.0]: pam_unix(slurm:session): session opened for user root by (uid=0)
Jan 13 14:20:25 seasmicro26.rc.fas.harvard.edu [23841]: [13109913.0]: pam_unix(slurm:session): session closed for user root
Jan 13 14:20:25 seasmicro26.rc.fas.harvard.edu slurmd[23560]: debug2: Start processing RPC: REQUEST_TERMINATE_JOB
Jan 13 14:20:25 seasmicro26.rc.fas.harvard.edu slurmd[23560]: debug2: Processing RPC: REQUEST_TERMINATE_JOB
Jan 13 14:20:25 seasmicro26.rc.fas.harvard.edu slurmd[23560]: debug: _rpc_terminate_job, uid = 57812 JobId=13109913
Jan 13 14:20:25 seasmicro26.rc.fas.harvard.edu slurmd[23560]: debug: credential for job 13109913 revoked
Jan 13 14:20:25 seasmicro26.rc.fas.harvard.edu slurmd[23560]: debug2: container signal 999 to StepId=13109913.extern
Jan 13 14:20:25 seasmicro26.rc.fas.harvard.edu slurmd[23560]: debug: _step_connect: connect() failed for /var/slurmd/spool/slurmd/seasmicro26_13109913.0: No such file or directory
Jan 13 14:20:25 seasmicro26.rc.fas.harvard.edu slurmd[23560]: debug: signal for nonexistent StepId=13109913.0 stepd_connect failed: No such file or directory
Jan 13 14:20:25 seasmicro26.rc.fas.harvard.edu slurmd[23560]: debug2: container signal 18 to StepId=13109913.extern
Jan 13 14:20:25 seasmicro26.rc.fas.harvard.edu slurmd[23560]: debug2: container signal 15 to StepId=13109913.extern
Jan 13 14:20:25 seasmicro26.rc.fas.harvard.edu slurmd[23560]: debug2: set revoke expiration for jobid 13109913 to 1610565745 UTS
Jan 13 14:20:25 seasmicro26.rc.fas.harvard.edu slurmd[23560]: debug: Waiting for job 13109913's prolog to complete
Jan 13 14:20:25 seasmicro26.rc.fas.harvard.edu slurmd[23560]: debug: Finished wait for job 13109913's prolog to complete
Jan 13 14:20:25 seasmicro26.rc.fas.harvard.edu slurmd[23560]: debug: [job 13109913] attempting to run epilog [/usr/local/bin/slurm_epilog]

So that contains the step information.
-Paul Edmon-
Just a comment, it would be awesome if this info could just be part of the normal log stream rather than having to jump through hoops.

-Paul Edmon-
(In reply to Paul Edmon from comment #8)
> We have everything log via syslog so it should end up in journalctl
> which is then dumped to /var/log/messages. We don't have a separate
> file for logging. When I pull from journalctl:
> ...
> So that contains the step information.

It seems to me that syslog or whatever is grepping for "slurm". At any rate, these logs do not have the complete stepd log output, so we can't proceed until we have that. Do you know the location of the slurmd.log file for the node the job ran on?
(In reply to Paul Edmon from comment #9)
> Just a comment, it would be awesome if this info could just be part of
> the normal log stream rather than having to jump through hoops.

Feel free to open up an enhancement request. Maybe there is something we can improve in this area that makes sense.
Created attachment 17462 [details] slurmd.log

As noted previously, we did not have SlurmdLogFile set in our slurm.conf, as we assumed that if it wasn't set everything would be routed to syslog via systemd. If that isn't true it should be fixed, as that isn't the correct behavior. Anyways, I've set:

### Logging
SlurmdSyslogDebug=debug2
SlurmdDebug=debug2
SlurmdLogFile=/var/log/slurmd.log

in the slurm.conf on seasmicro26 and run a job. Here is the result.
Looks like this time I caught the error you were looking for:

[2021-01-13T14:38:58.292] [13110853.0] debug: Sending launch resp rc=0
[2021-01-13T14:38:58.293] [13110853.0] debug: mpi type = (null)
[2021-01-13T14:38:58.293] [13110853.0] debug: mpi/none: p_mpi_hook_slurmstepd_task: Using mpi/none
[2021-01-13T14:38:58.293] [13110853.0] debug: task/affinity: task_p_pre_launch: affinity StepId=13110853.0, task:0 bind:8448
[2021-01-13T14:38:58.294] [13110853.0] task/affinity: slurm_setaffinity: sched_setaffinity(28695,128,0x2) failed: Invalid argument
[2021-01-13T14:38:58.294] [13110853.0] debug: task_g_pre_launch: task/affinity: Unspecified error
[2021-01-13T14:38:58.294] [13110853.0] error: Failed to invoke task plugins: task_p_pre_launch error
(In reply to Paul Edmon from comment #12)
> As noted previously we did not have SlurmdLogFile set in our slurm.conf as
> we assumed that if that wasn't set everything would be routed to syslog via
> systemd. If that isn't true it should be fixed as that isn't the correct
> behavior.

Hmm, our docs do say that everything will be logged to syslog. My guess is that there is a step between syslog and the journald output you shared in comment 8 that greps syslog for "slurm" (maybe this is a bad assumption on the part of journald?).

If we do not log everything to syslog like we promise, that sounds like a bug. Would you mind opening up a separate ticket for this issue?
Yup, already did: https://bugs.schedmd.com/show_bug.cgi?id=10625

-Paul Edmon-
(In reply to Paul Edmon from comment #13)
> Looks like this time I caught the error you were looking for:
>
> [2021-01-13T14:38:58.292] [13110853.0] debug: Sending launch resp rc=0
> [2021-01-13T14:38:58.293] [13110853.0] debug: mpi type = (null)
> [2021-01-13T14:38:58.293] [13110853.0] debug: mpi/none: p_mpi_hook_slurmstepd_task: Using mpi/none
> [2021-01-13T14:38:58.293] [13110853.0] debug: task/affinity: task_p_pre_launch: affinity StepId=13110853.0, task:0 bind:8448
> [2021-01-13T14:38:58.294] [13110853.0] task/affinity: slurm_setaffinity: sched_setaffinity(28695,128,0x2) failed: Invalid argument
> [2021-01-13T14:38:58.294] [13110853.0] debug: task_g_pre_launch: task/affinity: Unspecified error
> [2021-01-13T14:38:58.294] [13110853.0] error: Failed to invoke task plugins: task_p_pre_launch error

Excellent, thank you. Basically, this error is saying that the task/affinity plugin is trying to bind (sched_setaffinity()) to CPU 1 (CPU mask 0x2) but is unable to, likely because that CPU isn't in Slurm's cpuset cgroup.

We have another site that ran into this issue, and for them the problem was that an HPC file system called Weka created a cgroup that reserved CPU 1, preventing Slurm from using it.

On the node in question, could you please run this command and post the output?

find /sys/fs/cgroup/ -name cpuset.cpus -exec echo '{}' \; -exec cat '{}' \;

Thanks
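As an aside, the hex mask in those sched_setaffinity() lines is a plain CPU bitmask, so it can be decoded by hand. A small bash helper (hypothetical, written here just for reading the logs, not part of Slurm) makes the translation explicit:

```shell
# Decode a hex CPU affinity mask, as printed by the task/affinity plugin,
# into the CPU ids it selects. mask_to_cpus is a hypothetical helper for
# interpreting the log lines above, not a Slurm tool.
mask_to_cpus() {
  local mask=$(( $1 )) cpu=0 out=""
  while [ "$mask" -ne 0 ]; do
    if [ $(( mask & 1 )) -eq 1 ]; then
      out="$out $cpu"       # this bit is set, so this CPU id is selected
    fi
    mask=$(( mask >> 1 ))
    cpu=$(( cpu + 1 ))
  done
  echo "${out# }"
}

mask_to_cpus 0x2       # the failing bind above: prints "1" (CPU 1 only)
mask_to_cpus 0x002000  # the bind mask from the debug2 logs: prints "13"
```

So the failed call was asking for CPU 1 alone, which is why the cpuset hypothesis focuses on whether CPU 1 is in the cgroup's cpuset.cpus range.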
Interesting. Here is the result:

[root@seasmicro26 ~]# find /sys/fs/cgroup/ -name cpuset.cpus -exec echo '{}' \; -exec cat '{}' \;
/sys/fs/cgroup/cpuset/slurm/cpuset.cpus
0-23
/sys/fs/cgroup/cpuset/cpuset.cpus
0-23
Something interesting I noticed: this host is defined in slurm.conf as not having hyperthreading on:

[root@seasmicro26 ~]# scontrol show node seasmicro26
NodeName=seasmicro26 Arch=x86_64 CoresPerSocket=6 CPUAlloc=0 CPUTot=12 CPULoad=0.03
AvailableFeatures=intel,haswell,avx,avx2
ActiveFeatures=intel,haswell,avx,avx2
Gres=(null)
NodeAddr=seasmicro26 NodeHostName=seasmicro26 Version=20.11.2
OS=Linux 3.10.0-1127.18.2.el7.x86_64 #1 SMP Sun Jul 26 15:27:06 UTC 2020
RealMemory=128666 AllocMem=0 FreeMem=118873 Sockets=2 Boards=1
MemSpecLimit=4096
State=IDLE+DRAIN ThreadsPerCore=1 TmpDisk=869376 Weight=1 Owner=N/A MCS_label=N/A
Partitions=fdoyle_compute,emergency,serial_requeue
BootTime=2021-01-11T23:57:49 SlurmdStartTime=2021-01-13T14:38:34
CfgTRES=cpu=12,mem=128666M,billing=21
AllocTRES=
CapWatts=n/a
CurrentWatts=0 AveWatts=0 ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
Reason=plugin issues [root@2021-01-13T14:41:34]
Comment=(null)

Yet the cpuset above and this lscpu output show it on:

[root@seasmicro26 ~]# lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                24
On-line CPU(s) list:   0-23
Thread(s) per core:    2
Core(s) per socket:    6
Socket(s):             2
NUMA node(s):          2
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 63
Model name:            Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40GHz
Stepping:              2
CPU MHz:               1200.000
CPU max MHz:           3200.0000
CPU min MHz:           1200.0000
BogoMIPS:              4800.22
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              15360K
NUMA node0 CPU(s):     0-5,12-17
NUMA node1 CPU(s):     6-11,18-23
Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer xsave avx f16c rdrand lahf_lm abm epb invpcid_single intel_ppin ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm xsaveopt cqm_llc cqm_occup_llc dtherm ida arat pln pts md_clear spec_ctrl intel_stibp flush_l1d

I wonder if that is the source of the problem.

-Paul Edmon-
I was right. The mismatch between the slurm.conf and the actual hardware config caused the problem. When I turned off HT and then reran:

[root@holy7c22501 ~]# srun -n 1 --pty -p serial_requeue -w seasmicro26 /bin/bash
srun: Required node not available (down, drained or reserved)
srun: job 13116974 queued and waiting for resources
srun: job 13116974 has been allocated resources
[root@seasmicro26 ~]# exit
exit

Now it works correctly. This is the downside of going with config_overrides; I will have to get rid of that. Anyways, what this means is that I need to track down all the nodes that have hyperthreading on and turn it off.
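To sketch that sweep (a hedged example: the awk helper is hypothetical, and the remote-execution layer — ssh, pdsh, or a health checker — is deliberately left out), each node's lscpu output already says whether SMT is actually on:

```shell
# Report whether hyperthreading (SMT) is enabled on a node, judging from
# lscpu's "Thread(s) per core" line. ht_enabled is a hypothetical helper
# meant to be run per node; any node that prints "on" while slurm.conf
# says ThreadsPerCore=1 is one of the mismatched nodes to fix.
ht_enabled() {
  # expects lscpu text on stdin; prints "on" or "off"
  awk -F: '/^Thread\(s\) per core/ { gsub(/ /, "", $2); print ($2 > 1 ? "on" : "off") }'
}

# On a live node (guarded in case lscpu is unavailable):
command -v lscpu >/dev/null && lscpu | ht_enabled
```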
(In reply to Paul Edmon from comment #19)
> I was right. The mismatch in the slurm.conf to the actual config caused the
> problem. When I turned off HT and then reran... Now it works correctly.
> This is the downside of going with config_override. I will have to get rid
> of that. Anyways what this means is that I need to track down all the nodes
> that hyperthreading on and turn that off.

Excellent! I think you are correct.

I believe we made a change in 20.11 that inadvertently made this more strict (i.e. Slurm stopped believing the config and started looking at the CPUs actually on the machine, regardless of config_overrides). We're looking into it to see if that change was the right one. However, would you say that the outcome here is ultimately positive? Or did you not want Slurm to be aware of hyperthreading?

We usually do not recommend running with config_overrides in production, since it can mask hardware misconfigurations like this. So I'm glad you are moving away from that.
My intention was to actually remove config_overrides sooner, but it was too chatty in the logs so I didn't. It has now been demoted in terms of severity, so it doesn't spam our logs anymore.

Something I did just notice, though, is that once I removed config_overrides it did not complain about or close the nodes that had hyperthreading on but did not have that indicated in the slurm.conf. That should be fixed: Slurm should close nodes that have that sort of mismatch. I get it for Memory and Disk, as underlisting memory isn't really a problem (and keeping those completely accurate is almost impossible). However, as we just discovered, a core mismatch is more serious. Apparently you need to get the number of cores and threads right or you will hit this problem. So if there is a fix to put into Slurm about this, that would be it: it should close nodes that do not have the proper number of cores/threads, regardless of whether it is over- or under-listing.

-Paul Edmon-
(In reply to Paul Edmon from comment #21)
> Something I did just notice though is that once I removed
> config_overrides it did not complain or close the nodes that had
> hyperthreading on but did not have the indicated in the slurm.conf.
> That should be fixed. Slurm should close nodes that have that sort of
> mismatch. ... It should close nodes that do not have the proper number
> of cores/threads regardless of it is over or under listing.

What do you mean by "close" nodes? Set them to drain? I agree that we should not let nodes start up if there is a CPU mismatch that we can detect.
Yeah, set them to drain, a la what happens in these situations:

[root@holy7c22501 ~]# scontrol show node holy2a10315
NodeName=holy2a10315 Arch=x86_64 CoresPerSocket=16
   CPUAlloc=0 CPUTot=32 CPULoad=0.01
   AvailableFeatures=intel,holyib,skylake,avx,avx2,avx512
   ActiveFeatures=intel,holyib,skylake,avx,avx2,avx512
   Gres=(null)
   NodeAddr=holy2a10315 NodeHostName=holy2a10315 Version=20.11.2
   OS=Linux 3.10.0-1127.18.2.el7.x86_64 #1 SMP Sun Jul 26 15:27:06 UTC 2020
   RealMemory=191898 AllocMem=0 FreeMem=184760 Sockets=2 Boards=1
   MemSpecLimit=4096
   State=IDLE+DRAIN ThreadsPerCore=1 TmpDisk=136000 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=kozinsky,emergency,serial_requeue
   BootTime=2021-01-12T00:03:03 SlurmdStartTime=2021-01-12T10:01:09
   CfgTRES=cpu=32,mem=191898M,billing=39
   AllocTRES=
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0 ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
   Reason=Low RealMemory [slurm@2021-01-13T15:43:05]
   Comment=(null)

You can see Slurm closed the node due to a memory mismatch between what was defined in slurm.conf and what was on the node itself.

I've thrown together a quick script for it:

[root@aagk80gpu22 ~]# cat ./core.sh
#!/bin/bash

# CPUs as configured in slurm.conf (what the controller believes)
score=$(scontrol --oneliner show node "$(hostname -s)" | grep -o 'CPUTot=[^ ]*' | cut -d "=" -f 2)
# CPUs the hardware actually reports
pcore=$(nproc --all)

echo "$score"
echo "$pcore"

if [ "$score" != "$pcore" ]
then
    echo "$score Not Equal $pcore"
fi

I'm going to roll that out in our node health checker, but it would be even better if Slurm did this check natively.

-Paul Edmon-
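For reference, the comparison in the script above can be sketched as a small function suitable for a node-health-check hook. This is a sketch, not an official Slurm facility: the `check_cores` function name is mine, and the commented-out `scontrol`/`nproc` lines show how the inputs would be gathered on a real node, exactly as in the script above.

```shell
#!/bin/bash
# check_cores: compare the CPU count Slurm has configured for a node
# (CPUTot from "scontrol show node") against the CPU count the kernel
# actually sees (nproc). Prints a diagnostic and returns nonzero on
# mismatch, so a health checker can act on the exit status.
check_cores() {
    local slurm_cpus=$1 hw_cpus=$2
    # Arithmetic comparison; the original script's string != also works
    # for plain integers, but -ne is the idiomatic numeric test.
    if [ "$slurm_cpus" -ne "$hw_cpus" ]; then
        echo "MISMATCH: slurm.conf says $slurm_cpus CPUs, hardware has $hw_cpus"
        return 1
    fi
    echo "OK: $slurm_cpus CPUs"
}

# On a real node the inputs would come from (as in the script above):
#   slurm_cpus=$(scontrol --oneliner show node "$(hostname -s)" \
#                | grep -o 'CPUTot=[^ ]*' | cut -d= -f2)
#   hw_cpus=$(nproc --all)
# Here we exercise the function with literal values:
check_cores 32 32           # prints "OK: 32 CPUs"
check_cores 12 24 || true   # prints the MISMATCH diagnostic
```

A health checker could key off the nonzero return to drain the node itself, though as noted it would be better for Slurm to detect this natively.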
Paul,

(In reply to Michael Hinton from comment #20)
> I believe we made a change in 20.11 that inadvertently made this more
> strict (i.e. Slurm stopped believing the config and started looking at
> the CPUs actually on the machine, regardless of config_overrides).
> We're looking into it to see if that change was the right one.

We have concluded that this change I referred to in 20.11 is the source of these task_p_pre_launch errors. That change will be reverted and will hopefully make it into 20.11.3. If you want, I can give you the patch now, or you can wait for it to land (likely within the next day or two) and apply it yourself, or you can just wait for 20.11.3.

With the patch, you can probably go back to using config_overrides, which should help with bug 10631 (though we ultimately recommend syncing the config with the actual hardware and removing config_overrides in production).

Thanks,
-Michael
That's the goal on this end. I'd like to operate without config_overrides, but the logs are too chatty to sustain that until I can dial them back. I've managed to close off all the bad nodes on this end and I'm fixing them, so I won't need a patch. Frankly, I'm finding all sorts of misconfigured nodes that I didn't know about, so in a way this is a win for me.

-Paul Edmon-
Paul, This has been fixed in commit 25d74cc45f and will make it into 20.11.3. See https://github.com/SchedMD/slurm/commit/25d74cc45fe18d3f2cfce5e7d5a312f9cd8f0b72. Thanks! -Michael