Bug 544 - srun gets io error with slurmstepd
Summary: srun gets io error with slurmstepd
Status: RESOLVED FIXED
Alias: None
Product: Slurm
Classification: Unclassified
Component: Other
Version: 2.6.x
Hardware: Linux
Importance: --- 3 - Medium Impact
Assignee: David Bigagli
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2013-12-06 07:51 MST by David Bigagli
Modified: 2013-12-13 09:36 MST

See Also:
Site: SchedMD
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA SIte: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed:
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Description David Bigagli 2013-12-06 07:51:08 MST
Sometimes srun shows this error:

srun: error: step_launch_notify_io_failure: aborting, io error with slurmstepd on node 1

This is the sequence of events:

$->srun -w arturo1,arturo2,arturo3,arturo4 cat
srun: WARNING: MessageTimeout is too high for effective fault-tolerance
srun: David _read_io_init_msg Validated IO connection from 127.0.0.1, node rank 0, sd=20
srun: David _read_io_init_msg Validated IO connection from 127.0.0.1, node rank 3, sd=15
srun: David _read_io_init_msg Validated IO connection from 127.0.0.1, node rank 2, sd=16
srun: David _read_io_init_msg Validated IO connection from 127.0.0.1, node rank 1, sd=17
aa
srun: David _server_read entering node_id 3 fd 15
srun: David _server_read entering node_id 2 fd 16
srun: David _server_read entering node_id 1 fd 17
aa
srun: David _server_read entering node_id 0 fd 20
aa
aa
aa
srun: David _server_read entering node_id 3 fd 15
srun: David _server_read got eof-stdout msg on _server_read header
srun: David _handle_msg received task exit
srun: David _exit_handler entering
srun: David _exit_handler task 3 done
srun: David _task_finish received task exit for 1 task (status=0x000f) host arturo4.
srun: error: arturo4: task 3: Terminated
srun: David _exit_handler pthread cond broadcast
srun: David _exit_handler exiting
srun: David slurm_step_launch_wait_finish awake exited 1 requested 4
srun: David _server_read entering node_id 3 fd 15
srun: David _server_read got eof-stderr msg on _server_read header
zz
srun: David _server_read entering node_id 0 fd 20
srun: David _server_read entering node_id 1 fd 17
zz
srun: David _server_read entering node_id 2 fd 16
zz
zz
srun: David _server_read entering node_id 2 fd 16
srun: David _server_read got eof-stdout msg on _server_read header
srun: David _handle_msg received task exit
srun: David _exit_handler entering
srun: David _exit_handler task 2 done
srun: David _task_finish received task exit for 1 task (status=0x000f) host arturo3.
srun: error: arturo3: task 2: Terminated
srun: David _exit_handler pthread cond broadcast
srun: David _exit_handler exiting
srun: David slurm_step_launch_wait_finish awake exited 2 requested 4
srun: David _server_read entering node_id 2 fd 16
srun: David _server_read got eof-stderr msg on _server_read header
xx
srun: David _server_read entering node_id 3 fd 15
srun: error: step_launch_notify_io_failure: aborting, io error with slurmstepd on node 3
^^^^^^^^^^^^^^^^^^ i/o error on node_id 3, whose tasks had already exited
srun: David _server_read entering node_id 0 fd 20
xx
srun: David _server_read entering node_id 1 fd 17
xx
srun: David _server_read entering node_id 1 fd 17
srun: David _server_read got eof-stdout msg on _server_read header
srun: David _handle_msg received task exit
srun: David _exit_handler entering
srun: David _exit_handler task 1 done
srun: David _task_finish received task exit for 1 task (status=0x000f) host arturo2.
srun: error: arturo2: task 1: Terminated
srun: David _exit_handler pthread cond broadcast
srun: David _exit_handler exiting
srun: David slurm_step_launch_wait_finish awake exited 3 requested 4
srun: David _server_read entering node_id 1 fd 17
srun: David _server_read got eof-stderr msg on _server_read header
bb
srun: David _server_read entering node_id 2 fd 16
srun: error: step_launch_notify_io_failure: aborting, io error with slurmstepd on node 2
^^^^^^^^^^^^^^^^ i/o error on node_id 2, whose tasks had already exited
srun: David _server_read entering node_id 0 fd 20
bb
srun: David _server_read entering node_id 1 fd 17
srun: error: step_launch_notify_io_failure: aborting, io error with slurmstepd on node 1
srun: David _server_read entering node_id 0 fd 20
srun: David _server_read got eof-stdout msg on _server_read header
srun: David _handle_msg received task exit
srun: David _exit_handler entering
srun: David _exit_handler task 0 done
srun: David _task_finish received task exit for 1 task (status=0x0000) host arturo1.
srun: David _exit_handler pthread cond broadcast
srun: David _exit_handler exiting
srun: David slurm_step_launch_wait_finish awake exited 4 requested 4
srun: David slurm_step_launch_wait_finish done...
srun: Force Terminated job step 11172.0
srun: David srun
srun: David fini_srun 
srun: David slurm_complete_job REQUEST_COMPLETE_JOB_ALLOCATION

David
Comment 1 David Bigagli 2013-12-12 10:25:29 MST
We need a step with at least 2 tasks, each running on a different node.
When one task ends, srun marks the corresponding eio object as no longer
suitable for reading upon receiving the end-of-file messages on stdout
and stderr for the task in question. However, the eio object can still
be used for write operations. At this stage that is not correct: all
tasks on the node in question have exited, and so has slurmstepd;
indeed, the tcp connection is marked CLOSE_WAIT. The next write
operation to the socket still succeeds, and the one after that causes
poll() to report POLLERR, which produces the error in question.
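
The behaviour can be reproduced outside of Slurm. Below is a minimal
standalone sketch using plain POSIX tcp sockets over loopback (not
Slurm code; names and error handling are illustrative, and most error
checking is omitted for brevity): the "stepd" side closes, the "srun"
side sits in CLOSE_WAIT, the first write still succeeds, and only the
write after the peer's RST makes poll() report POLLERR.

#include <stdio.h>
#include <unistd.h>
#include <poll.h>
#include <signal.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>

int main(void)
{
    struct sockaddr_in sa = { .sin_family = AF_INET,
                              .sin_addr.s_addr = htonl(INADDR_LOOPBACK) };
    socklen_t len = sizeof(sa);
    struct pollfd pfd;
    int lfd, cfd, sfd;

    signal(SIGPIPE, SIG_IGN);               /* get EPIPE instead of dying */

    lfd = socket(AF_INET, SOCK_STREAM, 0);  /* stand-in for slurmstepd */
    bind(lfd, (struct sockaddr *)&sa, sizeof(sa));
    getsockname(lfd, (struct sockaddr *)&sa, &len);
    listen(lfd, 1);

    cfd = socket(AF_INET, SOCK_STREAM, 0);  /* stand-in for srun */
    connect(cfd, (struct sockaddr *)&sa, sizeof(sa));
    sfd = accept(lfd, NULL, NULL);

    close(sfd);         /* "stepd" exits; cfd is now in CLOSE_WAIT */

    /* First write succeeds: the kernel queues the data, the dead peer
     * answers with an RST. */
    printf("write 1: %zd\n", write(cfd, "x", 1));
    usleep(100000);     /* give the RST time to arrive */

    /* Second write fails with EPIPE, and poll() now reports POLLERR:
     * the point at which srun calls step_launch_notify_io_failure. */
    printf("write 2: %zd\n", write(cfd, "x", 1));

    pfd.fd = cfd;
    pfd.events = POLLIN | POLLOUT;
    poll(&pfd, 1, 0);
    printf("POLLERR set: %d\n", !!(pfd.revents & POLLERR));
    return 0;
}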

The correct behaviour is to shut down the eio object on the srun
side after all remote eio objects have terminated.
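
For illustration, a generic sketch of that idea (hypothetical types,
not the actual Slurm eio API): once srun has seen EOF on both stdout
and stderr for a node, the connection should be closed outright rather
than left open for writes.

#include <stdbool.h>
#include <unistd.h>

struct node_io {
    int fd;              /* tcp connection to this node's slurmstepd */
    bool stdout_eof;
    bool stderr_eof;
};

static void handle_eof(struct node_io *n, bool is_stderr)
{
    if (is_stderr)
        n->stderr_eof = true;
    else
        n->stdout_eof = true;

    /* Before the fix only reads were disabled at this point; the
     * connection stayed writable while sitting in CLOSE_WAIT. */
    if (n->stdout_eof && n->stderr_eof) {
        close(n->fd);
        n->fd = -1;      /* never polled or written again */
    }
}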

David
Comment 2 David Bigagli 2013-12-13 09:34:12 MST
Fixed in commit a52825028.

David
Comment 3 David Bigagli 2013-12-13 09:36:08 MST
Done.