Created attachment 17437 [details] runlog for busted run

We upgraded to 20.11.2 (commit bde072c607) on Monday as part of our normal upgrade pattern. The upgrade has for the most part gone well, but users have been noticing the following error:

[root@holy7c22501 ~]# srun -n 1 --pty -p serial_requeue -w seasmicro26 /bin/bash
srun: job 12914079 queued and waiting for resources
srun: job 12914079 has been allocated resources
slurmstepd: error: Failed to invoke task plugins: task_p_pre_launch error
srun: error: seasmicro26: task 0: Exited with exit code 1

It doesn't happen all the time, and it only happens on certain nodes. I haven't figured out what is going on. I've attached a debug run for you to look over. Let me know if you have any ideas as to what is happening.

-Paul Edmon-
Hi Paul,

We've seen this error crop up in 20.11 with a few customers and are trying to get to the bottom of it. Could you attach your current slurm.conf?

Thanks,
-Michael
Created attachment 17446 [details] slurm.conf
Hey Paul,

When you run slurmd in the foreground (slurmd -D), it actually does not pick up all the logs (the stepd logs do not get printed). So could you attach the slurmd.log snippet that corresponds to the busted run? The stepd logs should tell us more about what caused the task_p_pre_launch error.

Thanks,
Michael
So the slurm log does not show any slurmstepd stuff:

Jan 12 14:14:23 seasmicro26 slurmd[31823]: Node configuration differs from hardware: CPUs=12:24(hw) Boards=1:1(hw) SocketsPerBoard=2:2(hw) CoresPerSocket=6:$
Jan 12 14:14:23 seasmicro26 slurmd[31823]: Can not stat gres.conf file (/etc/slurm/gres.conf), using slurm.conf data
Jan 12 14:14:23 seasmicro26 slurmd[31823]: topology/tree: init: topology tree plugin loaded
Jan 12 14:14:23 seasmicro26 slurmd[31823]: route/topology: init: route topology plugin loaded
Jan 12 14:14:23 seasmicro26 slurmd[31823]: Resource spec: system cgroup memory limit set to 4096 MB
Jan 12 14:14:23 seasmicro26 slurmd[31823]: task/affinity: init: task affinity plugin loaded with CPU mask 0xffffff
Jan 12 14:14:23 seasmicro26 slurmd[31823]: cred/munge: init: Munge credential signature plugin loaded
Jan 12 14:14:23 seasmicro26 slurmd[31823]: slurmd version 20.11.2 started
Jan 12 14:14:23 seasmicro26 slurmd[31823]: slurmd started on Tue, 12 Jan 2021 14:14:23 -0500
Jan 12 14:14:24 seasmicro26 slurmd[31823]: CPUs=12 Boards=1 Sockets=2 Cores=6 Threads=1 Memory=128664 TmpDisk=901823 Uptime=51395 CPUSpecList=(null) Feature$
Jan 12 14:14:31 seasmicro26 slurmd[31823]: launch task StepId=12913850.0 request from UID:0 GID:0 HOST:10.31.142.192 PORT:59652
Jan 12 14:14:31 seasmicro26 slurmd[31823]: task/affinity: lllp_distribution: JobId=12913850 implicit auto binding: cores,one_thread, dist 8192
Jan 12 14:14:31 seasmicro26 slurmd[31823]: task/affinity: _task_layout_lllp_cyclic: _task_layout_lllp_cyclic
Jan 12 14:14:31 seasmicro26 slurmd[31823]: task/affinity: _lllp_generate_cpu_bind: _lllp_generate_cpu_bind jobid [12913850]: mask_cpu,one_thread, 0x001000
Jan 12 14:14:34 seasmicro26 slurm_epilog[31924]: Starting epilog for job 12913850 by root...
Jan 12 14:14:34 seasmicro26 slurm_epilog[31929]: Epilog for job 12913850 by root complete.
Jan 12 14:14:37 seasmicro26 slurmd[31823]: got shutdown request

That's the log for that job.
Odds are I don't have the debug flags set correctly. I'm happy to run another test job, just let me know what debug flags you want.
(In reply to Paul Edmon from comment #4)
> That's the log for that job. Odds are I don't have the debug flags set
> correctly. I'm happy to run another test job, just let me know what debug
> flags you want.

Hmm, you should see some logs from the step. Set the slurmd log level to debug2, restart the slurmd, and reproduce the problem. I'm expecting the slurmd.log to reveal something like this:

[2020-12-17T14:21:42.676] [384.0] task_p_pre_launch: Using sched_affinity for tasks
[2020-12-17T14:21:42.677] [384.0] sched_setaffinity(18992,128,0x2) failed: Invalid argument
[2020-12-17T14:21:42.677] [384.0] debug: task_g_pre_launch: task/affinity: Unspecified error
[2020-12-17T14:21:42.677] [384.0] error: Failed to invoke task plugins: task_p_pre_launch error
This is what I get with debug set to debug2.

-Paul Edmon-

Jan 13 13:32:52 seasmicro26 slurmd[9074]: debug2: _insert_job_state: we already have a job state for job 13106940. No big deal, just an FYI.
Jan 13 13:32:52 seasmicro26 slurmd[9074]: debug2: Finish processing RPC: REQUEST_LAUNCH_PROLOG
Jan 13 13:32:52 seasmicro26 slurmd[9074]: debug2: Start processing RPC: REQUEST_LAUNCH_TASKS
Jan 13 13:32:52 seasmicro26 slurmd[9074]: debug2: Processing RPC: REQUEST_LAUNCH_TASKS
Jan 13 13:32:52 seasmicro26 slurmd[9074]: launch task StepId=13106940.0 request from UID:0 GID:0 HOST:10.31.142.192 PORT:41444
Jan 13 13:32:52 seasmicro26 slurmd[9074]: debug: Checking credential with 404 bytes of sig data
Jan 13 13:32:52 seasmicro26 slurmd[9074]: debug: task/affinity: task_p_slurmd_launch_request: task affinity : before lllp distribution cpu bind method is '(null type)' ((null))
Jan 13 13:32:52 seasmicro26 slurmd[9074]: debug: task/affinity: lllp_distribution: binding tasks:1 to nodes:0 sockets:0:1 cores:1:0 threads:1
Jan 13 13:32:52 seasmicro26 slurmd[9074]: task/affinity: lllp_distribution: JobId=13106940 implicit auto binding: cores,one_thread, dist 8192
Jan 13 13:32:52 seasmicro26 slurmd[9074]: debug2: task/affinity: lllp_distribution: JobId=13106940 will use lllp_cyclic because of SelectTypeParameters
Jan 13 13:32:52 seasmicro26 slurmd[9074]: task/affinity: _task_layout_lllp_cyclic: _task_layout_lllp_cyclic
Jan 13 13:32:52 seasmicro26 slurmd[9074]: task/affinity: _lllp_generate_cpu_bind: _lllp_generate_cpu_bind jobid [13106940]: mask_cpu,one_thread, 0x002000
Jan 13 13:32:52 seasmicro26 slurmd[9074]: debug: task/affinity: task_p_slurmd_launch_request: task affinity : after lllp distribution cpu bind method is 'mask_cpu,one_thread' (0x002000)
Jan 13 13:32:52 seasmicro26 slurmd[9074]: debug: Waiting for job 13106940's prolog to complete
Jan 13 13:32:52 seasmicro26 slurmd[9074]: debug: Finished wait for job 13106940's prolog to complete
Jan 13 13:32:52 seasmicro26 slurmd[9074]: debug2: Finish processing RPC: REQUEST_LAUNCH_TASKS
Jan 13 13:32:53 seasmicro26 slurmd[9074]: debug2: Start processing RPC: REQUEST_TERMINATE_JOB
Jan 13 13:32:53 seasmicro26 slurmd[9074]: debug2: Processing RPC: REQUEST_TERMINATE_JOB
Jan 13 13:32:53 seasmicro26 slurmd[9074]: debug: _rpc_terminate_job, uid = 57812 JobId=13106940
Jan 13 13:32:53 seasmicro26 slurmd[9074]: debug: credential for job 13106940 revoked
Jan 13 13:32:53 seasmicro26 slurmd[9074]: debug2: container signal 999 to StepId=13106940.extern
Jan 13 13:32:53 seasmicro26 slurmd[9074]: debug2: container signal 18 to StepId=13106940.extern
Jan 13 13:32:53 seasmicro26 slurmd[9074]: debug2: container signal 15 to StepId=13106940.extern
Jan 13 13:32:53 seasmicro26 slurmd[9074]: debug2: set revoke expiration for jobid 13106940 to 1610562893 UTS
Jan 13 13:32:53 seasmicro26 slurmd[9074]: debug: Waiting for job 13106940's prolog to complete
Jan 13 13:32:53 seasmicro26 slurmd[9074]: debug: Finished wait for job 13106940's prolog to complete
Jan 13 13:32:53 seasmicro26 slurmd[9074]: debug: [job 13106940] attempting to run epilog [/usr/local/bin/slurm_epilog]
Jan 13 13:32:53 seasmicro26 slurm_epilog[9947]: Starting epilog for job 13106940 by root...
Jan 13 13:32:53 seasmicro26 slurm_epilog[9951]: Epilog for job 13106940 by root complete.
Jan 13 13:32:53 seasmicro26 slurmd[9074]: debug: completed epilog for jobid 13106940
Jan 13 13:32:53 seasmicro26 slurmd[9074]: debug: JobId=13106940: sent epilog complete msg: rc = 0
Jan 13 13:32:53 seasmicro26 slurmd[9074]: debug2: Finish processing RPC: REQUEST_TERMINATE_JOB
Where are you getting this log output from? Is this a log aggregated by systemd or something? You should be seeing step logs ([13106940.0] ...), so it seems the log output you are using doesn't include this. You should be accessing the actual slurmd.log file for the node to get the full logs.
We have everything logged via syslog, so it should end up in journalctl, which is then dumped to /var/log/messages. We don't have a separate file for logging. When I pull from journalctl:

Jan 13 14:20:25 seasmicro26.rc.fas.harvard.edu slurmd[23560]: launch task StepId=13109913.0 request from UID:0 GID:0 HOST:10.31.142.192 PORT:41912
Jan 13 14:20:25 seasmicro26.rc.fas.harvard.edu slurmd[23560]: debug: Checking credential with 404 bytes of sig data
Jan 13 14:20:25 seasmicro26.rc.fas.harvard.edu slurmd[23560]: debug: task/affinity: task_p_slurmd_launch_request: task affinity : before lllp distribution cpu bind method is '(null type)' ((null))
Jan 13 14:20:25 seasmicro26.rc.fas.harvard.edu slurmd[23560]: debug: task/affinity: lllp_distribution: binding tasks:1 to nodes:0 sockets:0:1 cores:1:0 threads:1
Jan 13 14:20:25 seasmicro26.rc.fas.harvard.edu slurmd[23560]: task/affinity: lllp_distribution: JobId=13109913 implicit auto binding: cores,one_thread, dist 8192
Jan 13 14:20:25 seasmicro26.rc.fas.harvard.edu slurmd[23560]: debug2: task/affinity: lllp_distribution: JobId=13109913 will use lllp_cyclic because of SelectTypeParameters
Jan 13 14:20:25 seasmicro26.rc.fas.harvard.edu slurmd[23560]: task/affinity: _task_layout_lllp_cyclic: _task_layout_lllp_cyclic
Jan 13 14:20:25 seasmicro26.rc.fas.harvard.edu slurmd[23560]: task/affinity: _lllp_generate_cpu_bind: _lllp_generate_cpu_bind jobid [13109913]: mask_cpu,one_thread, 0x001000
Jan 13 14:20:25 seasmicro26.rc.fas.harvard.edu slurmd[23560]: debug: task/affinity: task_p_slurmd_launch_request: task affinity : after lllp distribution cpu bind method is 'mask_cpu,one_thread' (0x001000)
Jan 13 14:20:25 seasmicro26.rc.fas.harvard.edu slurmd[23560]: debug: Waiting for job 13109913's prolog to complete
Jan 13 14:20:25 seasmicro26.rc.fas.harvard.edu slurmd[23560]: debug: Finished wait for job 13109913's prolog to complete
Jan 13 14:20:25 seasmicro26.rc.fas.harvard.edu slurmd[23560]: debug2: Finish processing RPC: REQUEST_LAUNCH_TASKS
Jan 13 14:20:25 seasmicro26.rc.fas.harvard.edu [23841]: [13109913.0]: pam_unix(slurm:session): session opened for user root by (uid=0)
Jan 13 14:20:25 seasmicro26.rc.fas.harvard.edu [23841]: [13109913.0]: pam_unix(slurm:session): session closed for user root
Jan 13 14:20:25 seasmicro26.rc.fas.harvard.edu slurmd[23560]: debug2: Start processing RPC: REQUEST_TERMINATE_JOB
Jan 13 14:20:25 seasmicro26.rc.fas.harvard.edu slurmd[23560]: debug2: Processing RPC: REQUEST_TERMINATE_JOB
Jan 13 14:20:25 seasmicro26.rc.fas.harvard.edu slurmd[23560]: debug: _rpc_terminate_job, uid = 57812 JobId=13109913
Jan 13 14:20:25 seasmicro26.rc.fas.harvard.edu slurmd[23560]: debug: credential for job 13109913 revoked
Jan 13 14:20:25 seasmicro26.rc.fas.harvard.edu slurmd[23560]: debug2: container signal 999 to StepId=13109913.extern
Jan 13 14:20:25 seasmicro26.rc.fas.harvard.edu slurmd[23560]: debug: _step_connect: connect() failed for /var/slurmd/spool/slurmd/seasmicro26_13109913.0: No such file or directory
Jan 13 14:20:25 seasmicro26.rc.fas.harvard.edu slurmd[23560]: debug: signal for nonexistent StepId=13109913.0 stepd_connect failed: No such file or directory
Jan 13 14:20:25 seasmicro26.rc.fas.harvard.edu slurmd[23560]: debug2: container signal 18 to StepId=13109913.extern
Jan 13 14:20:25 seasmicro26.rc.fas.harvard.edu slurmd[23560]: debug2: container signal 15 to StepId=13109913.extern
Jan 13 14:20:25 seasmicro26.rc.fas.harvard.edu slurmd[23560]: debug2: set revoke expiration for jobid 13109913 to 1610565745 UTS
Jan 13 14:20:25 seasmicro26.rc.fas.harvard.edu slurmd[23560]: debug: Waiting for job 13109913's prolog to complete
Jan 13 14:20:25 seasmicro26.rc.fas.harvard.edu slurmd[23560]: debug: Finished wait for job 13109913's prolog to complete
Jan 13 14:20:25 seasmicro26.rc.fas.harvard.edu slurmd[23560]: debug: [job 13109913] attempting to run epilog [/usr/local/bin/slurm_epilog]

So that contains the step information.
-Paul Edmon-
Just a comment, it would be awesome if this info could just be part of the normal log stream rather than having to jump through hoops.

-Paul Edmon-
(In reply to Paul Edmon from comment #8)
> We have everything log via syslog so it should end up in journalctl
> which is then dumped to /var/log/messages. We don't have a separate
> file for logging. When I pull from journalctl:
> ...
> So that contains the step information.

It seems to me that syslog or whatever is grepping for "slurm". At any rate, these logs do not have the complete stepd log output, so we can't proceed until we have that. Do you know the location of the slurmd.log file for the node the job ran on?
(In reply to Paul Edmon from comment #9)
> Just a comment, it would be awesome if this info could just be part of
> the normal log stream rather than having to jump through hoops.

Feel free to open up an enhancement request. Maybe there is something we can improve in this area that makes sense.
Created attachment 17462 [details] slurmd.log

As noted previously, we did not have SlurmdLogFile set in our slurm.conf, as we assumed that if it wasn't set everything would be routed to syslog via systemd. If that isn't true it should be fixed, as that isn't the correct behavior. Anyways, I've set:

### Logging
SlurmdSyslogDebug=debug2
SlurmdDebug=debug2
SlurmdLogFile=/var/log/slurmd.log

in the slurm.conf on seasmicro26 and run a job. Here is the result.
Looks like this time I caught the error you were looking for:

[2021-01-13T14:38:58.292] [13110853.0] debug: Sending launch resp rc=0
[2021-01-13T14:38:58.293] [13110853.0] debug: mpi type = (null)
[2021-01-13T14:38:58.293] [13110853.0] debug: mpi/none: p_mpi_hook_slurmstepd_task: Using mpi/none
[2021-01-13T14:38:58.293] [13110853.0] debug: task/affinity: task_p_pre_launch: affinity StepId=13110853.0, task:0 bind:8448
[2021-01-13T14:38:58.294] [13110853.0] task/affinity: slurm_setaffinity: sched_setaffinity(28695,128,0x2) failed: Invalid argument
[2021-01-13T14:38:58.294] [13110853.0] debug: task_g_pre_launch: task/affinity: Unspecified error
[2021-01-13T14:38:58.294] [13110853.0] error: Failed to invoke task plugins: task_p_pre_launch error
(In reply to Paul Edmon from comment #12)
> As noted previously we did not have SlurmdLogFile set in our slurm.conf as
> we assumed that if that wasn't set everything would be routed to syslog via
> systemd. If that isn't true it should be fixed as that isn't the correct
> behavior.

Hmm, our docs do say that everything will be logged to syslog. My guess is that there is a step between syslog and the journald output you shared in comment 8 that greps syslog for "slurm" (maybe this is a bad assumption on the part of journald?).

If we do not log everything to syslog like we promise, that sounds like a bug. Would you mind opening up a separate ticket for this issue?
Yup, already did: https://bugs.schedmd.com/show_bug.cgi?id=10625

-Paul Edmon-
(In reply to Paul Edmon from comment #13)
> Looks like this time I caught the error you were looking for:
>
> [2021-01-13T14:38:58.292] [13110853.0] debug: Sending launch resp rc=0
> [2021-01-13T14:38:58.293] [13110853.0] debug: mpi type = (null)
> [2021-01-13T14:38:58.293] [13110853.0] debug: mpi/none: p_mpi_hook_slurmstepd_task: Using mpi/none
> [2021-01-13T14:38:58.293] [13110853.0] debug: task/affinity: task_p_pre_launch: affinity StepId=13110853.0, task:0 bind:8448
> [2021-01-13T14:38:58.294] [13110853.0] task/affinity: slurm_setaffinity: sched_setaffinity(28695,128,0x2) failed: Invalid argument
> [2021-01-13T14:38:58.294] [13110853.0] debug: task_g_pre_launch: task/affinity: Unspecified error
> [2021-01-13T14:38:58.294] [13110853.0] error: Failed to invoke task plugins: task_p_pre_launch error

Excellent, thank you. Basically, this error is saying that the task/affinity plugin is trying to bind (sched_setaffinity()) to CPU 1 (CPU mask 0x2) but is unable to, likely because that CPU isn't in Slurm's cpuset cgroup.

We have another site that ran into this issue, and for them the problem was that an HPC file system called Weka created a cgroup that reserved CPU 1, preventing Slurm from using it.

On the node in question, could you please run this command and post the output?

find /sys/fs/cgroup/ -name cpuset.cpus -exec echo '{}' \; -exec cat '{}' \;

Thanks
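As an aside, the hex mask in those sched_setaffinity() lines is a plain CPU bitmask, so it can be decoded by hand. A small bash helper (hypothetical, written here just for reading the logs, not part of Slurm) makes the translation explicit:

```shell
# Decode a hex CPU affinity mask, as printed by the task/affinity plugin,
# into the CPU ids it selects. mask_to_cpus is a hypothetical helper for
# interpreting the log lines above, not a Slurm tool.
mask_to_cpus() {
  local mask=$(( $1 )) cpu=0 out=""
  while [ "$mask" -ne 0 ]; do
    if [ $(( mask & 1 )) -eq 1 ]; then
      out="$out $cpu"       # this bit is set, so this CPU id is selected
    fi
    mask=$(( mask >> 1 ))
    cpu=$(( cpu + 1 ))
  done
  echo "${out# }"
}

mask_to_cpus 0x2       # the failing bind above: prints "1" (CPU 1 only)
mask_to_cpus 0x002000  # the bind mask from the debug2 logs: prints "13"
```

So the failed call was asking for CPU 1 alone, which is why the cpuset hypothesis focuses on whether CPU 1 is in the cgroup's cpuset.cpus range.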
Interesting. Here is the result:

[root@seasmicro26 ~]# find /sys/fs/cgroup/ -name cpuset.cpus -exec echo '{}' \; -exec cat '{}' \;
/sys/fs/cgroup/cpuset/slurm/cpuset.cpus
0-23
/sys/fs/cgroup/cpuset/cpuset.cpus
0-23
Something interesting I noticed: this host is defined in slurm.conf as not having hyperthreading on:

[root@seasmicro26 ~]# scontrol show node seasmicro26
NodeName=seasmicro26 Arch=x86_64 CoresPerSocket=6 CPUAlloc=0 CPUTot=12 CPULoad=0.03
AvailableFeatures=intel,haswell,avx,avx2
ActiveFeatures=intel,haswell,avx,avx2
Gres=(null)
NodeAddr=seasmicro26 NodeHostName=seasmicro26 Version=20.11.2
OS=Linux 3.10.0-1127.18.2.el7.x86_64 #1 SMP Sun Jul 26 15:27:06 UTC 2020
RealMemory=128666 AllocMem=0 FreeMem=118873 Sockets=2 Boards=1
MemSpecLimit=4096
State=IDLE+DRAIN ThreadsPerCore=1 TmpDisk=869376 Weight=1 Owner=N/A MCS_label=N/A
Partitions=fdoyle_compute,emergency,serial_requeue
BootTime=2021-01-11T23:57:49 SlurmdStartTime=2021-01-13T14:38:34
CfgTRES=cpu=12,mem=128666M,billing=21
AllocTRES=
CapWatts=n/a
CurrentWatts=0 AveWatts=0 ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
Reason=plugin issues [root@2021-01-13T14:41:34]
Comment=(null)

Yet the cpuset above and this lscpu output show it on:

[root@seasmicro26 ~]# lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                24
On-line CPU(s) list:   0-23
Thread(s) per core:    2
Core(s) per socket:    6
Socket(s):             2
NUMA node(s):          2
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 63
Model name:            Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40GHz
Stepping:              2
CPU MHz:               1200.000
CPU max MHz:           3200.0000
CPU min MHz:           1200.0000
BogoMIPS:              4800.22
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              15360K
NUMA node0 CPU(s):     0-5,12-17
NUMA node1 CPU(s):     6-11,18-23
Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer xsave avx f16c rdrand lahf_lm abm epb invpcid_single intel_ppin ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm xsaveopt cqm_llc cqm_occup_llc dtherm ida arat pln pts md_clear spec_ctrl intel_stibp flush_l1d

I wonder if that is the source of the problem.

-Paul Edmon-
I was right. The mismatch between the slurm.conf and the actual hardware config caused the problem. When I turned off HT and then reran:

[root@holy7c22501 ~]# srun -n 1 --pty -p serial_requeue -w seasmicro26 /bin/bash
srun: Required node not available (down, drained or reserved)
srun: job 13116974 queued and waiting for resources
srun: job 13116974 has been allocated resources
[root@seasmicro26 ~]# exit
exit

Now it works correctly. This is the downside of going with config_overrides; I will have to get rid of that. Anyways, what this means is that I need to track down all the nodes that have hyperthreading on and turn it off.
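To sketch that sweep (a hedged example: the awk helper is hypothetical, and the remote-execution layer — ssh, pdsh, or a health checker — is deliberately left out), each node's lscpu output already says whether SMT is actually on:

```shell
# Report whether hyperthreading (SMT) is enabled on a node, judging from
# lscpu's "Thread(s) per core" line. ht_enabled is a hypothetical helper
# meant to be run per node; any node that prints "on" while slurm.conf
# says ThreadsPerCore=1 is one of the mismatched nodes to fix.
ht_enabled() {
  # expects lscpu text on stdin; prints "on" or "off"
  awk -F: '/^Thread\(s\) per core/ { gsub(/ /, "", $2); print ($2 > 1 ? "on" : "off") }'
}

# On a live node (guarded in case lscpu is unavailable):
command -v lscpu >/dev/null && lscpu | ht_enabled
```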
(In reply to Paul Edmon from comment #19)
> I was right. The mismatch in the slurm.conf to the actual config caused the
> problem. When I turned off HT and then reran... Now it works correctly.
> This is the downside of going with config_override. I will have to get rid
> of that. Anyways what this means is that I need to track down all the nodes
> that hyperthreading on and turn that off.

Excellent! I think you are correct.

I believe we made a change in 20.11 that inadvertently made this more strict (i.e. Slurm stopped believing the config and started looking at the CPUs actually on the machine, regardless of config_overrides). We're looking into it to see if that change was the right one. However, would you say that the outcome here is ultimately positive? Or did you not want Slurm to be aware of hyperthreading?

We usually do not recommend running with config_overrides in production, since it can mask hardware misconfigurations like this. So I'm glad you are moving away from that.
My intention was to actually remove config_overrides sooner, but it was too chatty in the logs so I didn't. It has now been demoted in terms of severity, so it doesn't spam our logs anymore.

Something I did just notice, though, is that once I removed config_overrides it did not complain about or close the nodes that had hyperthreading on but did not have that indicated in the slurm.conf. That should be fixed: Slurm should close nodes that have that sort of mismatch. I get it for Memory and Disk, as underlisting memory isn't really a problem (and keeping those completely accurate is almost impossible). However, as we just discovered, a core mismatch is more serious. Apparently you need to get the number of cores and threads right or you will hit this problem. So if there is a fix to put into Slurm about this, that would be it: it should close nodes that do not have the proper number of cores/threads, regardless of whether it is over- or under-listing.

-Paul Edmon-
(In reply to Paul Edmon from comment #21)
> Something I did just notice though is that once I removed
> config_overrides it did not complain or close the nodes that had
> hyperthreading on but did not have the indicated in the slurm.conf.
> That should be fixed. Slurm should close nodes that have that sort of
> mismatch. ... It should close nodes that do not have the proper number
> of cores/threads regardless of it is over or under listing.

What do you mean by "close" nodes? Set them to drain? I agree that we should not let nodes start up if there is a CPU mismatch that we can detect.
Yeah, set them to drain, a la what happens in these situations:

[root@holy7c22501 ~]# scontrol show node holy2a10315
NodeName=holy2a10315 Arch=x86_64 CoresPerSocket=16
   CPUAlloc=0 CPUTot=32 CPULoad=0.01
   AvailableFeatures=intel,holyib,skylake,avx,avx2,avx512
   ActiveFeatures=intel,holyib,skylake,avx,avx2,avx512
   Gres=(null)
   NodeAddr=holy2a10315 NodeHostName=holy2a10315 Version=20.11.2
   OS=Linux 3.10.0-1127.18.2.el7.x86_64 #1 SMP Sun Jul 26 15:27:06 UTC 2020
   RealMemory=191898 AllocMem=0 FreeMem=184760 Sockets=2 Boards=1
   MemSpecLimit=4096
   State=IDLE+DRAIN ThreadsPerCore=1 TmpDisk=136000 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=kozinsky,emergency,serial_requeue
   BootTime=2021-01-12T00:03:03 SlurmdStartTime=2021-01-12T10:01:09
   CfgTRES=cpu=32,mem=191898M,billing=39
   AllocTRES=
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0 ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
   Reason=Low RealMemory [slurm@2021-01-13T15:43:05]
   Comment=(null)

You can see Slurm closed the node due to a memory mismatch between what was defined in slurm.conf and what was on the node itself.

I've thrown together a quick script for it:

[root@aagk80gpu22 ~]# cat ./core.sh
#!/bin/bash

# CPUs as configured in slurm.conf (what the controller believes)
score=$(scontrol --oneliner show node "$(hostname -s)" | grep -o 'CPUTot=[^ ]*' | cut -d "=" -f 2)
# CPUs the hardware actually reports
pcore=$(nproc --all)

echo "$score"
echo "$pcore"

if [ "$score" != "$pcore" ]
then
    echo "$score Not Equal $pcore"
fi

I'm going to roll that out in our node health checker, but it would be even better if Slurm did this check natively.

-Paul Edmon-
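For reference, the comparison in the script above can be sketched as a small function suitable for a node-health-check hook. This is a sketch, not an official Slurm facility: the `check_cores` function name is mine, and the commented-out `scontrol`/`nproc` lines show how the inputs would be gathered on a real node, exactly as in the script above.

```shell
#!/bin/bash
# check_cores: compare the CPU count Slurm has configured for a node
# (CPUTot from "scontrol show node") against the CPU count the kernel
# actually sees (nproc). Prints a diagnostic and returns nonzero on
# mismatch, so a health checker can act on the exit status.
check_cores() {
    local slurm_cpus=$1 hw_cpus=$2
    # Arithmetic comparison; the original script's string != also works
    # for plain integers, but -ne is the idiomatic numeric test.
    if [ "$slurm_cpus" -ne "$hw_cpus" ]; then
        echo "MISMATCH: slurm.conf says $slurm_cpus CPUs, hardware has $hw_cpus"
        return 1
    fi
    echo "OK: $slurm_cpus CPUs"
}

# On a real node the inputs would come from (as in the script above):
#   slurm_cpus=$(scontrol --oneliner show node "$(hostname -s)" \
#                | grep -o 'CPUTot=[^ ]*' | cut -d= -f2)
#   hw_cpus=$(nproc --all)
# Here we exercise the function with literal values:
check_cores 32 32           # prints "OK: 32 CPUs"
check_cores 12 24 || true   # prints the MISMATCH diagnostic
```

A health checker could key off the nonzero return to drain the node itself, though as noted it would be better for Slurm to detect this natively.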
Paul,

(In reply to Michael Hinton from comment #20)
> I believe we made a change in 20.11 that inadvertently made this more
> strict (i.e. Slurm stopped believing the config and started looking at
> the CPUs actually on the machine, regardless of config_overrides).
> We're looking into it to see if that change was the right one.

We have concluded that this change I referred to in 20.11 is the source of these task_p_pre_launch errors. That change will be reverted and will hopefully make it into 20.11.3. If you want, I can give you the patch now, or you can wait for it to land (likely within the next day or two) and apply it yourself, or you can just wait for 20.11.3.

With the patch, you can probably go back to using config_overrides, which should help with bug 10631 (though we ultimately recommend syncing the config with the actual hardware and removing config_overrides in production).

Thanks,
-Michael
That's the goal on this end. I'd like to operate without config_overrides, but the logs are too chatty to sustain that until I can dial them back. I've managed to close off all the bad nodes on this end and I'm fixing them, so I won't need a patch. Frankly, I'm finding all sorts of misconfigured nodes that I didn't know about, so in a way this is a win for me.

-Paul Edmon-
Paul, This has been fixed in commit 25d74cc45f and will make it into 20.11.3. See https://github.com/SchedMD/slurm/commit/25d74cc45fe18d3f2cfce5e7d5a312f9cd8f0b72. Thanks! -Michael