Created attachment 16938 [details]
slurmd log

Probably something missed in the configuration for 20.11.0, but there are a couple of oddities that are worth sending your way.

Launching a job sees it start to run but then do nothing, being reported as:

JOBID USER     ACCOUNT    NAME           EXEC_HOST ST REASON START_TIME END_TIME TIME_LEFT NODES PRIORITY
10779 kbuckley pawsey0001 boinc-kbuckley nid00013  R  Prolog 10:58:06   11:58:06 59:41     1     10056

After some time, the job enters a CG state, which it never leaves:

JOBID USER     ACCOUNT    NAME           EXEC_HOST ST REASON START_TIME END_TIME TIME_LEFT NODES PRIORITY
10779 kbuckley pawsey0001 boinc-kbuckley nid00013  CG Prolog 11:27:39   12:27:39 1:00:00   1     10076

One of the oddities seen in the log (have it running at debug5) is:

[2020-12-03T10:58:06.401] [10779.extern] debug3: xcgroup_set_param: parameter 'cpuset.cpus' set to '0-39' for '/sys/fs/cgroup/cpuset/slurm/uid_20480/job_10779/step_extern'
[2020-12-03T10:58:06.401] [10779.extern] error: _file_write_content: unable to open '/sys/fs/cgroup/cpuset/slurm/uid_20480/job_10779/step_extern/expected_usage_in_bytes' for writing: No such file or directory
[2020-12-03T10:58:06.401] [10779.extern] debug2: xcgroup_set_param: unable to set parameter 'expected_usage_in_bytes' to '62914560000' for '/sys/fs/cgroup/cpuset/slurm/uid_20480/job_10779/step_extern'

although, whilst the job hangs around, we see that the file does exist:

nid00013:~ # ls -o /sys/fs/cgroup/cpuset/slurm/uid_20480/job_10779/step_extern/
total 0
-rw-r--r-- 1 root 0 Dec 3 11:04 cgroup.clone_children
-rw-r--r-- 1 root 0 Dec 3 10:58 cgroup.procs
-rw-r--r-- 1 root 0 Dec 3 11:04 cpuset.cpu_exclusive
-rw-r--r-- 1 root 0 Dec 3 10:58 cpuset.cpus
-r--r--r-- 1 root 0 Dec 3 11:04 cpuset.effective_cpus
-r--r--r-- 1 root 0 Dec 3 11:04 cpuset.effective_mems
-rw-r--r-- 1 root 0 Dec 3 11:04 cpuset.expected_usage_in_bytes
-rw-r--r-- 1 root 0 Dec 3 11:04 cpuset.mem_exclusive
-rw-r--r-- 1 root 0 Dec 3 11:04 cpuset.mem_hardwall
-rw-r--r-- 1 root 0 Dec 3 11:04 cpuset.memory_migrate
-r--r--r-- 1 root 0 Dec 3 11:04 cpuset.memory_pressure
-rw-r--r-- 1 root 0 Dec 3 11:04 cpuset.memory_spread_page
-rw-r--r-- 1 root 0 Dec 3 11:04 cpuset.memory_spread_slab
-rw-r--r-- 1 root 0 Dec 3 10:58 cpuset.mems
-rw-r--r-- 1 root 0 Dec 3 11:04 cpuset.sched_load_balance
-rw-r--r-- 1 root 0 Dec 3 11:04 cpuset.sched_relax_domain_level
-rw-r--r-- 1 root 0 Dec 3 10:58 notify_on_release
-rw-r--r-- 1 root 0 Dec 3 11:04 tasks

Full log from the node and config file attached.

FWIW, this was our 20.02.5 config with two changes, made based on the content of the 20.11.0 RELEASE_NOTES:

-- Removed the SallocDefaultCommand option.
-- The acct_gather_energy/cray_aries plugin has been renamed to acct_gather_energy/pm_counters.

Sure I must have missed something: would be good to know what!

Kevin
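For what it's worth, the mismatch above (the write to "expected_usage_in_bytes" failing while "cpuset.expected_usage_in_bytes" later shows up in the directory) can be probed by hand. A minimal shell sketch, not Slurm code; the `probe_param` helper is hypothetical and a temp dir stands in for the real step_extern path, since that only exists on the node while the job is around:

```shell
#!/bin/sh
# Diagnostic sketch: given a cgroup directory and a parameter name,
# report which variant of the file actually exists on this kernel.
probe_param() {
    dir=$1; param=$2
    if [ -e "$dir/$param" ]; then
        echo "found: $param"
    elif [ -e "$dir/cpuset.$param" ]; then
        echo "found: cpuset.$param"
    else
        echo "missing: $param"
    fi
}

# Simulate the step_extern directory seen in the ls output above.
demo=$(mktemp -d)
touch "$demo/cpuset.expected_usage_in_bytes"
probe_param "$demo" expected_usage_in_bytes
rm -rf "$demo"
```

On the node itself, the same check against the real /sys/fs/cgroup/cpuset/slurm/uid_20480/job_10779/step_extern/ directory would show whether only the "cpuset."-prefixed name is present.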
Created attachment 16939 [details] slurm.conf
Created attachment 16940 [details]
Patch from schedmd bug 8473 as applied to 20.11.0

Nearly forgot to say that, on our TestDevSystem, we have been applying the patch associated with schedmd bug 8473; however, because of the change in the g_job struct, we modified that patch to take account of the changes. The patch as applied is therefore attached.

FYI, the AE in the filename is our very own Andrew Elwell, who wanted to try stuff out that requires the patch.
There's a suggestion that we might be seeing the same issue as described in #10275. I will deploy that patch and see what happens.
Still seeing jobs enter a Reason:Prolog state.
Went back and backed out the 8473 patch for the InfluxDB stuff, applying just the patch from 10275. Have also changed AcctGatherProfileType=acct_gather_profile/influx back to AcctGatherProfileType=acct_gather_profile/none.

This now appears to be working.

The suggestion is that the mods I made to the 8473 patch for the InfluxDB stuff were correct enough to allow compilation but may have buggered something else. Interested to hear your thoughts, especially as to the "correctness" of the patch attached earlier.

There have been a few too many variables in all of this to say that that's where the blame lies, but at least we have a 20.11.0 that other folk here can have a play with.
Hi Kevin - I am marking this as a duplicate of bug#10275.

In regard to your patch, I ask that you keep that conversation going through bug#8473, by attaching your patch there or by opening a new bug with it, so that we can keep the flow of contributions away from support issues.

In regard to your comments about the cgroups: this xcgroup_set_param call is incorrect on more recent versions of cgroup, where the file is "cpuset.expected_usage_in_bytes". Previously, it seems, it was called without the "cpuset." prefix; see https://bugs.schedmd.com/show_bug.cgi?id=3154#c15

task_cgroup_cpuset.c:

#ifdef HAVE_NATIVE_CRAY
	/*
	 * on Cray systems, set the expected usage in bytes.
	 * This is used by the Cray OOM killer
	 */
	snprintf(expected_usage, sizeof(expected_usage), "%"PRIu64,
		 (uint64_t)job->step_mem * 1024 * 1024);
	xcgroup_set_param(&step_cpuset_cg, "expected_usage_in_bytes",
			  expected_usage);
#endif

We could do a couple of tries: first with cpuset.expected_usage_in_bytes, and if that doesn't work, with the shortest form. But that too would need its own bug report, so that we do not muddle this bug up with too many disjointed issues.

*** This ticket has been marked as a duplicate of ticket 10275 ***
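The "couple of tries" idea could be sketched as follows. This is purely illustrative, not the actual Slurm patch, and the `set_expected_usage` helper is hypothetical; a temp dir again stands in for the real cgroup directory:

```shell
#!/bin/sh
# Sketch of the two-try write suggested above: attempt the name with the
# "cpuset." prefix first (newer kernels), then fall back to the short
# form (older kernels). Returns non-zero if neither file is writable.
set_expected_usage() {
    dir=$1; bytes=$2
    for name in cpuset.expected_usage_in_bytes expected_usage_in_bytes; do
        if [ -w "$dir/$name" ]; then
            echo "$bytes" > "$dir/$name"
            echo "wrote $name"
            return 0
        fi
    done
    echo "no writable expected_usage file in $dir" >&2
    return 1
}

# Simulate the modern layout, as in the ls output earlier in this bug.
demo=$(mktemp -d)
touch "$demo/cpuset.expected_usage_in_bytes"
set_expected_usage "$demo" 62914560000
rm -rf "$demo"
```

In Slurm itself the equivalent would be a second xcgroup_set_param call guarded on the first one failing, which is why it deserves its own bug rather than being folded into this one.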