Ticket 15925 - slurmctld core dump: srun --ntasks=1 + MinNodes>1
Summary: slurmctld core dump: srun --ntasks=1 + MinNodes>1
Status: RESOLVED DUPLICATE of ticket 15857
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmctld
Version: 22.05.5
Hardware: Linux
Importance: --- 4 - Minor Issue
Assignee: Chad Vizino
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2023-02-01 03:43 MST by Pascal Schuhmacher
Modified: 2023-02-08 16:46 MST

See Also:
Site: KIT
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed:
Target Release: ---
DevPrio: ---


Description Pascal Schuhmacher 2023-02-01 03:43:30 MST
Hello,

The slurmctld on our cluster crashed with the following error:

traps: srvcn[821278] trap divide error ip:4d5336 sp:7f4de13d2840 error:0 in slurmctld[400000+110000]
Started Process Core Dump (PID 821279/UID 0).
Process 2683604 (slurmctld) of user 973 dumped core.
systemd-coredump@1-821279-0.service: Succeeded.

addr2line -e /usr/sbin/slurmctld 4d5336
/root/rpmbuild/BUILD/slurm-22.05.5/src/slurmctld/step_mgr.c:2121

2121:	cpus_per_task = cpu_cnt / task_cnt;

The slurm version is slurm 22.05.5

I was able to reproduce it with the following job submit:

[zs0402@uccn998 slurm]$ srun --partition=multiple --ntasks=1 --pty bash
srun: job 20716 queued and waiting for resources
srun: job 20716 has been allocated resources
srun: error: Unable to create step for job 20716: Zero Bytes were transmitted or received

The Partition multiple is configured as follows:
PartitionName=multiple OverSubscribe=EXCLUSIVE Nodes=uccn[458-460] DefMemPerCPU=1125 MaxMemPerNode=90000 MaxCPUsPerNode=80 DefaultTime=30 Maxtime=48:00:00 MinNodes=2 MaxNodes=4

I think the problem is the combination of --ntasks=1 via srun and MinNodes=2 on the partition.

This seems like a bug to me. Is it already fixed in newer versions?

Best Regards,
Pascal
Comment 2 Chad Vizino 2023-02-01 11:45:35 MST
Hi. I am looking into this.
Comment 3 Chad Vizino 2023-02-07 13:11:45 MST
Just an update: The division you note in step_mgr.c:2121 looks like it's going to be fixed by a proposed patch for bug 15857 in an upcoming release. I'm keeping my eye on that one to see how it lands and will provide an update on this ticket after that to see if there's any work left to be done with a fix.
Comment 4 Chad Vizino 2023-02-08 16:46:49 MST
(In reply to Chad Vizino from comment #3)
> Just an update: The division you note in step_mgr.c:2121 looks like it's
> going to be fixed by a proposed patch for bug 15857 in an upcoming release.
> I'm keeping my eye on that one to see how it lands and will provide an
> update on this ticket after that to see if there's any work left to be done
> with a fix.
The fix for that one was just checked in to the master branch:

>https://github.com/schedmd/slurm/commit/aee252a45e

So I'll close this ticket for now as a partial duplicate of bug 15857.

*** This ticket has been marked as a duplicate of ticket 15857 ***