Summary: | srun gets io error with slurmstepd | ||
---|---|---|---|
Product: | Slurm | Reporter: | David Bigagli <david> |
Component: | Other | Assignee: | David Bigagli <david> |
Status: | RESOLVED FIXED | QA Contact: | |
Severity: | 3 - Medium Impact | ||
Priority: | --- | CC: | da |
Version: | 2.6.x | ||
Hardware: | Linux | ||
OS: | Linux | ||
Site: | SchedMD | Alineos Sites: | --- |
Atos/Eviden Sites: | --- | Confidential Site: | --- |
Coreweave sites: | --- | Cray Sites: | --- |
DS9 clusters: | --- | HPCnow Sites: | --- |
HPE Sites: | --- | IBM Sites: | --- |
NOAA SIte: | --- | OCF Sites: | --- |
Recursion Pharma Sites: | --- | SFW Sites: | --- |
SNIC sites: | --- | Linux Distro: | --- |
Machine Name: | CLE Version: | ||
Version Fixed: | Target Release: | --- | |
DevPrio: | --- | Emory-Cloud Sites: | --- |
Description
David Bigagli
2013-12-06 07:51:08 MST
We need a step with at least 2 tasks, each running on a different node. When one task ends, srun receives the end-of-file message on the stdout and stderr of the task in question and marks the eio object as no longer suitable for reading. However, the eio object can still be used for write operations. At this stage that is not correct: all tasks on the node in question have exited, and so has slurmstepd; indeed, the TCP connection is in the CLOSE_WAIT state. The next write operation to the socket still succeeds, and the one after that causes poll() to report POLLERR, which causes the error in question. The correct behaviour is to shut down the eio object on the srun side after all remote eio objects have terminated.
David
Fixed in commit a52825028.
David
Done.
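The behaviour described as correct can be sketched roughly as follows. This is an illustration only, not the code from the fix commit: the `NodeConnection` class, its `on_eof` and `write` methods, and the `is_shutdown` flag are hypothetical names, and a local socket pair stands in for the TCP connection between srun and slurmstepd. The idea is to count the remote stdout/stderr streams still open on a node and stop writing once the last one has reported end of file.

```python
import socket


class NodeConnection:
    """Hypothetical stand-in for the srun-side eio object of one node."""

    def __init__(self, sock, task_ids):
        self.sock = sock
        # Every task on the node has two remote streams that must reach EOF.
        self.open_streams = {
            (task_id, stream)
            for task_id in task_ids
            for stream in ("stdout", "stderr")
        }
        self.is_shutdown = False

    def on_eof(self, task_id, stream):
        """Record an EOF message; shut down once no remote stream is left."""
        self.open_streams.discard((task_id, stream))
        if not self.open_streams and not self.is_shutdown:
            # All remote streams have terminated, so nothing on the other
            # side will read again: close instead of keeping the socket
            # available for writes.
            self.sock.close()
            self.is_shutdown = True

    def write(self, data):
        """Forward data to the node unless the connection was shut down."""
        if self.is_shutdown:
            return 0  # drop the data rather than write to a dead peer
        return self.sock.send(data)
```

With one task on the node, a write succeeds while either stream is still open; after EOF has been seen on both stdout and stderr, `write` returns 0 without touching the socket, so the later POLLERR is never provoked.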