Summary: | Advice needed: keeping cores free on sockets for GPUs when using both MaxCPUsPerNode for partitions and Cores= in gres.conf | ||
---|---|---|---|
Product: | Slurm | Reporter: | Christopher Samuel <chris> |
Component: | Scheduling | Assignee: | Marcin Stolarek <cinek> |
Status: | RESOLVED DUPLICATE | QA Contact: | |
Severity: | 5 - Enhancement | ||
Priority: | --- | CC: | altenkort, bart, cinek, robin.humble+slurm, rsmith, tanderson |
Version: | 17.11.2 | ||
Hardware: | Linux | ||
OS: | Linux | ||
See Also: | https://bugs.schedmd.com/show_bug.cgi?id=14015 | ||
Site: | Swinburne | ||
Description
Christopher Samuel
2018-01-31 22:50:36 MST
Hi

Sorry for the late response. We currently have a couple of gres-related bugs and I didn't want to give you an obsolete answer.
For now I don't see any available option that makes this work properly. I will try to implement MaxCPUsPerSocket, but this won't be available until 18.08.

Dominik

On 10/02/18 04:17, bugs@schedmd.com wrote:
> Hi

Hi Dominik,

> Sorry for the late response.
> We currently have a couple of gres-related bugs and I didn't want to give you
> an obsolete answer.

Not a problem, thanks for getting back to me.

> For now I don't see any available option that makes this work properly.
> I will try to implement MaxCPUsPerSocket, but this won't be available until
> 18.08.

Understood, best of luck with that work!

All the best,
Chris

On 10/02/18 04:17, bugs@schedmd.com wrote:
> For now I don't see any available option that makes this work properly.
> I will try to implement MaxCPUsPerSocket, but this won't be available until
> 18.08.

The only alternative I can think of is perhaps a ReserveNumCores= option in gres.conf, meaning that number of cores can only be used by jobs that request the particular gres. It might be cleaner to keep that in gres.conf, as then you don't need to muck around with multiple partitions & submit filters as you currently must to achieve this.

So consider this for a 16 core node with 2 P100 GPUs:

    Name=gpu Type=p100 File=/dev/nvidia0 Cores=0-7 ReserveNumCores=2
    Name=gpu Type=p100 File=/dev/nvidia1 Cores=8-15 ReserveNumCores=2

GPU and non-GPU jobs could use the first 6 cores on either socket, but the last 2 would be kept free for jobs that request the gres.

I was thinking that you could have ReserveCores= but then it's ambiguous whether the cores you reserve have to intersect with the Cores= directive or not.
Plus if a site didn't care about locality they could drop the Cores= directive and just have:

    Name=gpu Type=p100 File=/dev/nvidia0 ReserveNumCores=2
    Name=gpu Type=p100 File=/dev/nvidia1 ReserveNumCores=2

...and then the code would just keep 2 cores free for each GPU but not care about topology.

I think this might be a better way of doing it, as it satisfies what we want without having to grow multiple partitions and thus is transparent to the user.

How does that sound?

All the best,
Chris

Hi

I will discuss this idea with the rest of the team.
MaxCPUsPerSocket looks straightforward but it requires changes in API/protocol. This can be done only in 18.08.

Dominik

On Monday, 12 February 2018 10:12:24 PM AEDT bugs@schedmd.com wrote:
> I will discuss this idea with the rest of the team.
> MaxCPUsPerSocket looks straightforward but it requires changes in
> API/protocol. This can be done only in 18.08.

Thanks Dominik, appreciate you checking on this and on the fact that this will need to wait for the next major release.

Hi

After talking with the development team, we decided that we won’t add new features to cons_res.

We can change the severity of this bug to sev-5 and we will keep this issue in mind for future design/development.

Dominik

On Tuesday, 20 February 2018 3:01:41 AM AEDT bugs@schedmd.com wrote:
> Hi

Hi Dominik

> After talking with the development team, we decided that
> we won’t add new features to cons_res.
>
> We can change the severity of this bug to sev-5 and we will keep this issue
> in mind for future design/development.

Oh that's disappointing. Thanks for letting me know.

Chris,

Here is a bit more background information.

Our current plan is to develop a completely new select/cons_res plugin in 2018. This new version will provide a great deal more support for managing GPUs. We plan to release a beta version of the plugin in November at SC18 (compatible with version 18.08) and include it in version 19.05 (in parallel with the current select/cons_res plugin).
We can include the MaxCPUsPerSocket functionality in that release.

On 21/02/18 01:17, bugs@schedmd.com wrote:
> Chris,

Hi Moe!

> Here is a bit more background information.

Thanks for that, that sounds a lot more promising! I really appreciate that.

All the best,
Chris

One related question based on what appears to be conflicting information in the manual pages.

The manual page for gres.conf says:

    # If specified, then only the identified cores can be allocated with
    # each generic resource; an attempt to use other cores will not be honored.

but the manual page for sbatch says:

    # --gres-flags=enforce-binding
    # If set, the only CPUs available to the job will be those bound
    # to the selected GRES (i.e. the CPUs identified in the gres.conf
    # file will be strictly enforced rather than advisory).

So is the Cores= spec in gres.conf enforced by default or not?

The reason I raise it in this bug is that if it is just advisory by default then we can specify both for now and hopefully get the best of both worlds (until the machine gets fragmented by jobs).

All the best,
Chris

(In reply to Christopher Samuel from comment #18)
> One related question based on what appears to be conflicting information in
> the manual pages.
>
> The manual page for gres.conf says:
>
>     # If specified, then only the identified cores can be allocated with
>     # each generic resource; an attempt to use other cores will not be honored.
>
> but the manual page for sbatch says:
>
>     # --gres-flags=enforce-binding
>     # If set, the only CPUs available to the job will be those bound
>     # to the selected GRES (i.e. the CPUs identified in the gres.conf
>     # file will be strictly enforced rather than advisory).
>
> So is the Cores= spec in gres.conf enforced by default or not?
>
> The reason I raise it in this bug is that if it is just advisory by default
> then we can specify both for now and hopefully get the best of both worlds
> (until the machine gets fragmented by jobs).
Thanks for reporting this inconsistency.

The information in gres.conf is advisory unless the job specifies --gres-flags=enforce-binding, which requires matching cores/GPUs. I've just updated the documentation:
https://github.com/SchedMD/slurm/commit/f85659560193b918e439356982acf4806930a9bc

Hi Moe,

On 15/03/18 02:09, bugs@schedmd.com wrote:
> Thanks for reporting this inconsistency.
>
> The information in gres.conf is advisory unless the job specifies
> --gres-flags=enforce-binding, which requires matching cores/GPUs. I've
> just updated the documentation:

That's great, thanks so much for the clarification!

That helps us a lot, as I was reading it as preventing access from other cores, which was the root cause of this bug report (wanting to specify that and wanting to have some reserved cores on a partition, and thinking you could end up with reserved cores that couldn't be used due to the gres.conf config).

All the best,
Chris

(In reply to Moe Jette from comment #15)
> Chris,
>
> Here is a bit more background information.
>
> Our current plan is to develop a completely new select/cons_res plugin in
> 2018. This new version will provide a great deal more support for managing
> GPUs. We plan to release a beta version of the plugin in November at SC18
> (compatible with version 18.08) and include it in version 19.05 (in parallel
> with the current select/cons_res plugin). We can include the
> MaxCPUsPerSocket functionality in that release.

I wanted to let you know that MaxCPUsPerSocket has not been implemented yet and will not be available in version 19.05. Sorry...

Hello,

unfortunately it seems that in 20.02 this is also not yet implemented. Do you have any update on when this feature could become available? It is highly requested by our users.

Thanks & best regards
Luis

Hi Moe and Dominik,

as an update, we've been going ok with MaxCPUsPerNode plus some hackery in job_submit.lua.
but now that we have cons_tres, is it possible to revisit the possibility of implementing MaxCPUsPerSocket for a partition or a QOS? or some other solution that would let us keep a few optimal cores free for gpu jobs.

we're not tied to MaxCPUsPerSocket - it's just an idea. indeed with nodes that have 2 sockets and 1 gpu, even MaxCPUsPerSocket wouldn't do the right thing. it almost needs to be a vector, eg.

    MaxCPUsPerSocket=16,18

might keep 2 cores free on socket0 (where the gpu is attached), and 0 free on socket1 (no gpu).

or perhaps even more generally (as cpus seem to be increasing in numa complexity) it could be a numa node vector

    MaxCPUsPerNumaNode=6,8,8,8,8,8,8,8

or something like that.

what do you think?

cheers,
robin
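The per-socket vector idea in the last comment can be sketched as below, in Python purely for illustration. MaxCPUsPerSocket and MaxCPUsPerNumaNode are proposals from this thread, not existing Slurm options. The function names, the 18-cores-per-socket layout and the 8-cores-per-NUMA-node layout are assumptions, chosen only so that the numbers match the examples given above.

```python
from typing import List, Sequence


def reserved_per_domain(cores: Sequence[int], caps: Sequence[int]) -> List[int]:
    """Cores left free in each socket (or NUMA node) for jobs exempt from the caps.

    `cores` holds the physical core count of each domain; `caps` is the
    hypothetical MaxCPUsPerSocket / MaxCPUsPerNumaNode vector from the thread.
    """
    if len(cores) != len(caps):
        raise ValueError("need exactly one cap per socket/NUMA node")
    reserved = []
    for total, cap in zip(cores, caps):
        if not 0 <= cap <= total:
            raise ValueError(f"cap {cap} is outside 0..{total}")
        reserved.append(total - cap)
    return reserved


def fits_under_caps(in_use: Sequence[int], caps: Sequence[int],
                    domain: int, ncores: int) -> bool:
    """True if a capped (e.g. non-GPU) job may take `ncores` more cores on `domain`."""
    return in_use[domain] + ncores <= caps[domain]


# Assumed layout: two 18-core sockets, GPU attached to socket 0.
print(reserved_per_domain([18, 18], [16, 18]))      # [2, 0]

# Assumed layout: eight 8-core NUMA nodes, GPU attached to NUMA node 0.
print(reserved_per_domain([8] * 8, [6] + [8] * 7))  # [2, 0, 0, 0, 0, 0, 0, 0]

# With 15 capped cores already in use on socket 0, a 2-core non-GPU job
# does not fit there, but it would fit on socket 1.
print(fits_under_caps([15, 0], [16, 18], domain=0, ncores=2))  # False
print(fits_under_caps([15, 0], [16, 18], domain=1, ncores=2))  # True
```

A real scheduler would also have to decide which jobs are exempt from the caps (for example, jobs that request the gres); that policy is outside this sketch.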