sacct state is not updating for many jobs that have dispatched, even after waiting some period of time. sacct still lists jobs as PENDING after the job has dispatched, or even completed.

# date; squeue -j 25845876; sacct -j 25845876
Tue Jun 25 12:21:21 EDT 2019
     JOBID PARTITION     NAME     USER ST   TIME  NODES NODELIST(REASON)
  25845876   general myjob.sl dbcortes  R   1:08      1 c0828
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
25845876       myjob.sl    general rc_akshau+          1    PENDING      0:0

# date; squeue -j 25845876; sleep 3; sacct -j 25845876
Tue Jun 25 12:21:38 EDT 2019
     JOBID PARTITION     NAME     USER ST   TIME  NODES NODELIST(REASON)
  25845876   general myjob.sl dbcortes  R   1:24      1 c0828
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
25845876       myjob.sl    general rc_akshau+          1    PENDING      0:0
After 15 minutes sacct has still not updated:

# date; sacct -j 25845876; squeue -j 25845876
Tue Jun 25 12:39:47 EDT 2019
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
25845876       myjob.sl    general rc_akshau+          1    PENDING      0:0
     JOBID PARTITION     NAME     USER ST   TIME  NODES NODELIST(REASON)
  25845876   general myjob.sl dbcortes  R  19:33      1 c0828
(I wrote this before your comment 1 update.)

Can you upload the output of sdiag? Specifically, we're looking for the value of "DBD Agent queue size". If this is growing and not decreasing, that is a sign of a bigger problem.

If the DBD Agent queue size is zero or close to zero, then check the output of sacct again for jobs that have started. It takes some non-zero amount of time for the information to get to the database. How long it takes varies widely depending on hardware setup, InnoDB configuration, Slurm configuration, how busy the system is, etc.

(Post comment 1)

This is a bit more concerning. Please upload the output of sdiag and upload your slurmdbd and slurmctld log files for the last few hours.
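For reference, a quick way to watch just that counter (a minimal sketch; the two-second interval is arbitrary):

# Print only the DBD agent queue depth from sdiag
sdiag | grep -i 'DBD Agent queue size'

# Re-check every 2 seconds to see whether it is growing or draining
watch -n 2 "sdiag | grep -i 'DBD Agent queue size'"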
Should I close the partitions?

I am also seeing jobs terminated for elapsed time that seemingly only ran for moments:

# sacct -j 25801569 --format=timelimit,elapsed,nodelist
 Timelimit    Elapsed        NodeList
---------- ---------- ---------------
1-00:00:00 1-00:00:10           t0601
           1-00:00:12           t0601
           1-00:00:12           t0601

[2019-06-24T10:56:02.206] _run_prolog: prolog with lock for job 25801569 ran for 0 seconds
[2019-06-24T10:56:02.237] [25801569.extern] task/cgroup: /slurm/uid_275171/job_25801569: alloc=307200MB mem.limit=307200MB memsw.limit=unlimited
[2019-06-24T10:56:02.249] [25801569.extern] task/cgroup: /slurm/uid_275171/job_25801569/step_extern: alloc=307200MB mem.limit=307200MB memsw.limit=unlimited
[2019-06-24T10:56:04.664] launch task 25801569.0 request from UID:275171 GID:100 HOST:172.26.113.2 PORT:54925
[2019-06-24T10:56:04.686] [25801569.0] task/cgroup: /slurm/uid_275171/job_25801569: alloc=307200MB mem.limit=307200MB memsw.limit=unlimited
[2019-06-24T10:56:04.700] [25801569.0] task/cgroup: /slurm/uid_275171/job_25801569/step_0: alloc=307200MB mem.limit=307200MB memsw.limit=unlimited
[2019-06-24T10:56:04.729] [25801569.0] in _window_manager
[2019-06-25T10:56:12.967] [25801569.0] error: *** STEP 25801569.0 ON t0601 CANCELLED AT 2019-06-25T10:56:12 DUE TO TIME LIMIT ***
[2019-06-25T10:56:13.132] [25801569.extern] _oom_event_monitor: oom-kill event count: 1
[2019-06-25T10:56:14.334] [25801569.0] done with job
[2019-06-25T10:56:14.334] [25801569.extern] done with job
First upload the sdiag output and the slurmctld and slurmdbd log files, and a current slurm.conf. I'd like to focus on the database issue for the moment, but I'll also look for potential other problems in the logs.
Created attachment 10700 [details] slurmctld log
Created attachment 10701 [details] slurmdbd log
slurmdbd.log has:

[2019-06-25T13:14:15.353] error: There is no reservation by id 150, time_start 1561474890, and cluster 'longleaf'

We've seen this issue in the past. It can cause lost database records if not fixed. I'll look for the other bugs with this issue and come back with a way to get around it. Please upload the output of sdiag.
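To see how often this is happening, you can grep the slurmdbd log for that error (a sketch; the log path is an assumption, use whatever LogFile is set to in slurmdbd.conf):

# Count and list occurrences of the missing-reservation error
grep -c "There is no reservation by id" /var/log/slurm/slurmdbd.log
grep "There is no reservation by id" /var/log/slurm/slurmdbd.log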
# sdiag
*******************************************************
sdiag output at Tue Jun 25 13:10:36 2019 (1561482636)
Data since      Tue Jun 25 11:13:09 2019 (1561475589)
*******************************************************
Server thread count:  3
Agent queue size:     0
Agent count:          0
DBD Agent queue size: 74785

Jobs submitted: 15654
Jobs started:   11298
Jobs completed: 10295
Jobs canceled:  356
Jobs failed:    0

Job states ts:  Tue Jun 25 13:10:20 2019 (1561482620)
Jobs pending:   12078
Jobs running:   7298

Main schedule statistics (microseconds):
    Last cycle:   32184
    Max cycle:    305196
    Total cycles: 420
    Mean cycle:   65003
    Mean depth cycle:  1048
    Cycles per minute: 3
    Last queue length: 9305

Backfilling stats
    Total backfilled jobs (since last slurm start): 5568
    Total backfilled jobs (since last stats cycle start): 5568
    Total backfilled heterogeneous job components: 0
    Total cycles: 115
    Last cycle when: Tue Jun 25 13:10:02 2019 (1561482602)
    Last cycle: 20061760
    Max cycle:  20210859
    Mean cycle: 19558606
    Last depth cycle: 368
    Last depth cycle (try sched): 368
    Depth Mean: 2502
    Depth Mean (try depth): 475
    Last queue length: 9349
    Queue length mean: 8770

Latency for gettimeofday() (x1000): 15 microseconds

Remote Procedure Call statistics by message type
    REQUEST_PARTITION_INFO           ( 2009) count:15738 ave_time:125    total_time:1969355
    REQUEST_UPDATE_JOB               ( 3001) count:14159 ave_time:4058   total_time:57459883
    REQUEST_SUBMIT_BATCH_JOB         ( 4003) count:13459 ave_time:18931  total_time:254800919
    REQUEST_STEP_COMPLETE            ( 5016) count:11355 ave_time:50720  total_time:575932246
    MESSAGE_EPILOG_COMPLETE          ( 6012) count:11334 ave_time:108111 total_time:1225333677
    REQUEST_COMPLETE_BATCH_SCRIPT    ( 5018) count:11302 ave_time:172254 total_time:1946822114
    REQUEST_COMPLETE_PROLOG          ( 6018) count:11298 ave_time:118845 total_time:1342720014
    REQUEST_FED_INFO                 ( 2049) count:7713  ave_time:92     total_time:713246
    REQUEST_JOB_USER_INFO            ( 2039) count:7568  ave_time:179424 total_time:1357886133
    REQUEST_NODE_INFO                ( 2007) count:7482  ave_time:51497  total_time:385303984
    REQUEST_JOB_INFO                 ( 2003) count:639   ave_time:92941  total_time:59389860
    MESSAGE_NODE_REGISTRATION_STATUS ( 1002) count:394   ave_time:81152  total_time:31974181
    REQUEST_KILL_JOB                 ( 5032) count:371   ave_time:10293  total_time:3818810
    REQUEST_JOB_INFO_SINGLE          ( 2021) count:144   ave_time:96857  total_time:13947501
    REQUEST_COMPLETE_JOB_ALLOCATION  ( 5017) count:84    ave_time:109883 total_time:9230173
    REQUEST_STATS_INFO               ( 2035) count:63    ave_time:97     total_time:6136
    REQUEST_JOB_READY                ( 4019) count:54    ave_time:18593  total_time:1004062
    REQUEST_RESOURCE_ALLOCATION      ( 4001) count:48    ave_time:222277 total_time:10669328
    REQUEST_RESERVATION_INFO         ( 2024) count:31    ave_time:132192 total_time:4097961
    REQUEST_UPDATE_PARTITION         ( 3005) count:31    ave_time:26442  total_time:819720
    REQUEST_JOB_ALLOCATION_INFO      ( 4014) count:25    ave_time:104509 total_time:2612727
    REQUEST_JOB_STEP_CREATE          ( 5001) count:24    ave_time:283    total_time:6801
    REQUEST_CANCEL_JOB_STEP          ( 5005) count:16    ave_time:57433  total_time:918942
    REQUEST_SHARE_INFO               ( 2022) count:12    ave_time:1453   total_time:17445
    REQUEST_NODE_INFO_SINGLE         ( 2040) count:7     ave_time:97634  total_time:683444
    REQUEST_JOB_PACK_ALLOC_INFO      ( 4027) count:2     ave_time:92     total_time:185
    REQUEST_UPDATE_NODE              ( 3002) count:2     ave_time:516    total_time:1032
    REQUEST_PING                     ( 1008) count:1     ave_time:57     total_time:57
    REQUEST_BUILD_INFO               ( 2001) count:1     ave_time:385    total_time:385
    REQUEST_CREATE_RESERVATION       ( 3006) count:1     ave_time:48550  total_time:48550

Remote Procedure Call statistics by user
    root            (       0) count:63687 ave_time:82333  total_time:5243577792
    hepperla        (  214234) count:9676  ave_time:17485  total_time:169190111
    rujin           (  236960) count:7039  ave_time:58996  total_time:415273906
    zabbix          (   57384) count:6310  ave_time:26776  total_time:168962503
    jitong          (  236926) count:6203  ave_time:22984  total_time:142571774
    kamoser         (  302527) count:5971  ave_time:110991 total_time:662728612
    markreed        (   84588) count:3910  ave_time:23155  total_time:90537503
    slurmmon        (     979) count:3525  ave_time:42683  total_time:150458575
    nhanph          (  282984) count:1450  ave_time:13104  total_time:19000953
    erovenko        (  290553) count:1118  ave_time:1251   total_time:1399509
    bpk             (  238366) count:1087  ave_time:33276  total_time:36171979
    pllittle        (  139508) count:563   ave_time:41185  total_time:23187538
    fenqiang        (  303630) count:411   ave_time:9388   total_time:3858493
    vicbucha        (  244261) count:215   ave_time:18056  total_time:3882152
    jackmag         (  238144) count:212   ave_time:47815  total_time:10136853
    nil1            (  297111) count:142   ave_time:49405  total_time:7015605
    danliang        (  238535) count:116   ave_time:57715  total_time:6694959
    jraab           (  209642) count:96    ave_time:64674  total_time:6208706
    mizhou          (  239103) count:95    ave_time:28214  total_time:2680387
    huihanc         (  288136) count:89    ave_time:122310 total_time:10885618
    jmsimon         (  158261) count:70    ave_time:19912  total_time:1393905
    mashner         (  295352) count:70    ave_time:70842  total_time:4958944
    rc_swans.svc    (  238383) count:57    ave_time:180238 total_time:10273578
    zmozent         (  256972) count:56    ave_time:75511  total_time:4228658
    wuys            (  301497) count:54    ave_time:122758 total_time:6628982
    jxenakis        (  189218) count:53    ave_time:41432  total_time:2195913
    weitangc        (  275171) count:51    ave_time:56547  total_time:2883909
    swijesi         (  302521) count:50    ave_time:188310 total_time:9415526
    tongqiu         (  259043) count:47    ave_time:71161  total_time:3344576
    dyang           (  145983) count:43    ave_time:156766 total_time:6740943
    weifangl        (  296908) count:42    ave_time:137466 total_time:5773607
    kalaineb        (  302600) count:42    ave_time:57631  total_time:2420510
    dbcortes        (  259981) count:42    ave_time:40417  total_time:1697537
    ble             (  190370) count:38    ave_time:82527  total_time:3136041
    chanijo         (  236728) count:35    ave_time:89127  total_time:3119448
    zubatyuk        (  305224) count:34    ave_time:19122  total_time:650151
    jiawei          (  238237) count:33    ave_time:1440   total_time:47528
    hfalan          (  256369) count:31    ave_time:73944  total_time:2292271
    esdavis         (  217058) count:30    ave_time:35248  total_time:1057447
    cseppan         (   83739) count:24    ave_time:131726 total_time:3161431
    sahoon          (  285205) count:24    ave_time:61618  total_time:1478833
    arterca         (  239277) count:23    ave_time:77949  total_time:1792836
    kyoung13        (  208524) count:23    ave_time:83041  total_time:1909965
    noahs           (  108340) count:23    ave_time:56853  total_time:1307626
    roachjm         (   84937) count:22    ave_time:33241  total_time:731312
    migraff         (  173391) count:22    ave_time:3186   total_time:70092
    milove          (  272499) count:21    ave_time:50760  total_time:1065969
    jcwong          (  238779) count:21    ave_time:162326 total_time:3408848
    jblocher        (   32545) count:20    ave_time:82787  total_time:1655754
    bekahb          (  244677) count:20    ave_time:54204  total_time:1084097
    zln3            (  247310) count:19    ave_time:55693  total_time:1058177
    hannaho         (  312325) count:19    ave_time:54108  total_time:1028065
    ranekj          (  297151) count:18    ave_time:65697  total_time:1182555
    pzivich         (  262033) count:17    ave_time:108814 total_time:1849839
    meichen         (  236962) count:17    ave_time:137117 total_time:2331004
    anju            (  302408) count:16    ave_time:11261  total_time:180182
    btemple         (   45943) count:14    ave_time:170885 total_time:2392400
    awella          (  247585) count:12    ave_time:77689  total_time:932268
    psioda          (  199710) count:11    ave_time:8724   total_time:95964
    schauer         (   50562) count:11    ave_time:40051  total_time:440565
    hubenxia        (  315580) count:10    ave_time:61669  total_time:616696
    wancen          (  294745) count:10    ave_time:70027  total_time:700272
    mshvets         (  282607) count:9     ave_time:96341  total_time:867073
    guoshili        (  238620) count:9     ave_time:11706  total_time:105361
    benb3           (  260674) count:8     ave_time:4525   total_time:36206
    liuwei1         (  282044) count:8     ave_time:277614 total_time:2220912
    tptacek         (  287668) count:7     ave_time:3493   total_time:24455
    bsmelton        (  238256) count:7     ave_time:2622   total_time:18358
    evanbure        (  227460) count:7     ave_time:128816 total_time:901715
    halina          (  309107) count:6     ave_time:206    total_time:1237
    zengsi          (  297128) count:6     ave_time:1712   total_time:10275
    ccornea         (  253025) count:6     ave_time:122958 total_time:737752
    rc_htsfdata.svc (  238384) count:6     ave_time:1657   total_time:9947
    esh             (  238456) count:6     ave_time:133    total_time:802
    nanam           (  310488) count:5     ave_time:27283  total_time:136415
    znasipak        (  238001) count:5     ave_time:13287  total_time:66439
    anqizhu         (  227868) count:4     ave_time:252069 total_time:1008279
    pfsulliv        (   84298) count:4     ave_time:133408 total_time:533634
    angelvv         (  252144) count:4     ave_time:153390 total_time:613561
    deep            (   22492) count:3     ave_time:216    total_time:648
    cmaillie        (  317505) count:3     ave_time:643    total_time:1929
    anessa1         (  240331) count:3     ave_time:325    total_time:977
    trhenry         (  213418) count:3     ave_time:809    total_time:2428
    hunycho         (  258380) count:3     ave_time:327    total_time:983
    methan          (  257945) count:3     ave_time:61322  total_time:183967
    mikepab         (  238053) count:3     ave_time:577    total_time:1732
    tigran          (  258377) count:3     ave_time:331592 total_time:994776
    beibo           (  238167) count:3     ave_time:336874 total_time:1010624
    mwaldron        (  151063) count:3     ave_time:198100 total_time:594301
    baldoni         (  236917) count:3     ave_time:247    total_time:743
    bryce38         (  278179) count:2     ave_time:219016 total_time:438033
    nfb             (  237549) count:2     ave_time:194    total_time:389
    stonerm         (  227901) count:1     ave_time:19003  total_time:19003
    mwfoster        (  135399) count:1     ave_time:36424  total_time:36424
    sgonell         (  285710) count:1     ave_time:15629  total_time:15629
    kojiya          (  301812) count:1     ave_time:436112 total_time:436112

Pending RPC statistics
    No pending RPCs
I had already restarted each of slurmctld and slurmdbd in that order with about 10 min in between before sending the bug report.
Restarting the daemons doesn't help. You'll need to manually modify the database. We will have a way to automatically fix this issue in Slurm 20.02. Instructions to manually modify the database are in bug 2741 comment 11.

First, please down the partitions (scontrol update partitionname=<name> state=down, or set the states in slurm.conf and reconfigure) so that no more jobs will run and so you don't lose accounting records. Then, back up your database.

I'll give you updated instructions for 18.08 in a bit.
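For reference, a minimal sketch of those two steps (the partition names, MySQL credentials, and the database name slurm_acct_db are assumptions; use whatever StorageLoc is set to in slurmdbd.conf):

# Stop new jobs from starting by downing each partition
scontrol update partitionname=general state=down
scontrol update partitionname=bigmem state=down

# Back up the accounting database before any manual changes
# (assumes the default database name slurm_acct_db)
mysqldump --single-transaction slurm_acct_db > slurm_acct_db.backup.sql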
Partitions are down. Backing things up now.
Created attachment 10705 [details] slurmdbd - add invalid reservation instead of giving up

Jenny,

Can you please apply this patch, rebuild, and restart slurmdbd? Rather than you having to manually modify the database, this patch will insert the invalid reservation for you, and then the DBD Agent queue size should begin to go down.

Can you let us know if the DBD Agent queue size is decreasing? Also, can you watch the slurmdbd log for messages similar to "There is no reservation by id"? This patch should fix those.

I just applied this patch to 18.08.6, so it should build for you. Let me know if there's a problem building.

Thanks,
Marshall
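A rough sketch of applying it (the source path, patch filename, and service name are assumptions; adjust to however you normally build and deploy Slurm):

# From the top of the 18.08.6 source tree used to build your installation
cd /usr/local/src/slurm-18.08.6
patch -p1 < /tmp/slurmdbd-add-invalid-reservation.patch

# Rebuild and reinstall, then restart the database daemon
make && make install
systemctl restart slurmdbd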
Things appear much improved. Is the other issue I reported, where the time limit was given as the exit reason, also related?
Okay, good. If the DBD agent queue is going down, please bring your partitions back up. Can you let me know when your partitions are running again and the DBD agent queue is down to 0 (or close to 0) and stays there? I'll change the severity back to a sev-4 when that happens, but want to keep it at sev-2 until we know it's running.

About the timelimit issue - I think the information is accurate. Looking closely, I see the job started on June 24th and was killed almost exactly one day later on June 25th. No worries if you missed that detail - I missed it the first time, too.

slurmctld logs:
[2019-06-24T10:56:02.111] sched: Allocate JobId=25801569 NodeList=t0601 #CPUs=1 Partition=bigmem
[2019-06-25T10:56:12.835] Time limit exhausted for JobId=25801569

the slurmd/slurmstepd logs you uploaded:
[2019-06-24T10:56:02.206] _run_prolog: prolog with lock for job 25801569 ran for 0 seconds
[2019-06-25T10:56:12.967] [25801569.0] error: *** STEP 25801569.0 ON t0601 CANCELLED AT 2019-06-25T10:56:12 DUE TO TIME LIMIT ***

I assume that was just an oversight (not seeing the change in date), and it was correctly killed for exceeding its time limit of one day. Do you agree?
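For reference, once records are flowing again, the accounting itself will show the one-day runtime, and you can watch the queue drain after reopening the partitions (a sketch; Start, End, Timelimit, and Elapsed are standard sacct fields, and the partition name is an example):

# Show when the job actually started and ended alongside its limit
sacct -j 25801569 --format=JobID,Start,End,Timelimit,Elapsed,State

# Bring a partition back online and keep an eye on the DBD queue
scontrol update partitionname=general state=up
sdiag | grep -i 'DBD Agent queue size'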
DBD Agent queue size: 0

Queues are back up and jobs are dispatching.
You are correct! I didn't see the date stamp. Thanks!
Awesome. In that case, I'm changing this bug back to a sev-4.

The patch I provided you was taken from a SchedMD internal bug. It is currently pending review for QA, so it may not be the final fix. It probably won't go into 18.08; I'm not sure about 19.05. Since it is working, please keep this patch in place for now. It's somewhat of a band-aid patch in that it doesn't fix the underlying cause, but it does repair the database by replacing the missing reservation record. We're also investigating what might cause this situation in the first place.

Can you let me know if you see logs in slurmdbd like this?

"Creating reservation by id #, time_start #, and cluster '<name>' instead of modifying."

(This log is in the patch I gave you.)

info("%s: Creating reservation by id %u, time_start %ld, and cluster '%s' instead of modifying.",

Also, thanks for the prompt uploads of logs, sdiag, etc. I'm glad we got this fixed before you lost database records.
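A sketch of one way to watch for it (the log path is an assumption; use your slurmdbd LogFile setting):

# Follow the slurmdbd log for the new message added by the patch
tail -F /var/log/slurm/slurmdbd.log | grep --line-buffered "Creating reservation by id"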
[2019-06-25T14:51:40.281] as_mysql_modify_resv: Creating reservation by id 150, time_start 1561474890, and cluster 'longleaf' instead of modifying.
How many times did you see that log message? Is it still happening?
One time.
Okay, I hoped to see it one time - the time it got fixed. If it happens again with the same or another reservation id, please let me know.
I'm investigating what might have caused this bug. I noticed:

> slurmctld restart:
>
> [2019-06-25T11:12:19.760] slurmctld version 18.08.6-2 started on cluster longleaf

> slurmdbd errors right after:
>
> [2019-06-25T11:13:39.069] error: There is no reservation by id 150, time_start 1561474890, and cluster 'longleaf'

In comment 9 you said:

> I had already restarted each of slurmctld and slurmdbd in that order with about 10 min in between before sending the bug report.

Do you recall if this was happening before you restarted slurmctld, or only after? I'm wondering if a restart of slurmctld may be part of the problem. The slurmctld has these logs:

[2019-06-25T11:10:49.331] Terminate signal (SIGINT or SIGTERM) received
[2019-06-25T11:12:19.760] error: chdir(/pine/EX/root/slurm-log): Permission denied
[2019-06-25T11:12:19.760] slurmctld version 18.08.6-2 started on cluster longleaf

I suspect an unclean shutdown of slurmctld may be a problem, and we have a couple of internal bugs about that.

What is StateSaveLocation in slurm.conf? In our records for UNC, it is /var/spool/slurmd. Is that still correct? What kind of filesystem (network, local?) and storage (SSD, HDD?) is that?

I also noticed that the DBD doesn't have any messages between June 21 and June 25:

[2019-06-21T08:46:27.095] error: We have more time than is possible (46483200+748800+0)(47232000) > 46483200 for cluster longleaf(12912) from 2019-06-10T08:00:00 - 2019-06-10T09:00:00 tres 5
[2019-06-25T11:13:39.069] error: There is no reservation by id 150, time_start 1561474890, and cluster 'longleaf'

Were there perhaps missing messages that didn't make it into the slurmdbd log that you uploaded? Or is that it?
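For reference, the running controller will report the current settings, and df can show what is backing that path (a sketch; the path here is taken from the chdir error above and may not be the actual StateSaveLocation):

# Ask the running slurmctld what it is actually configured with
scontrol show config | grep -Ei 'StateSaveLocation|SlurmctldLogFile'

# Check the filesystem type and free space behind that location
df -hT /pine/EX/root/slurm-log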
I'd just like to follow up on my last comment - when you have a chance, can you respond to those questions?
Q: Do you recall if this was happening before you restarted slurmctld, or only after? I'm wondering if a restart of slurmctld may be part of the problem.

A: The restarts were an attempt to resolve the issue, thinking the processes might have been in an awkward state - the restarts were after the problem was spotted.

Q: I suspect an unclean shutdown of slurmctld may be a problem, and we have a couple of internal bugs about that. What is StateSaveLocation in slurm.conf? In our records for UNC, it is /var/spool/slurmd. Is that still correct? What kind of filesystem (network, local?) and storage (SSD, HDD?) is that?

A: StateSaveLocation = /pine/EX/root/slurm-log/slurmctld
This was done because the local /var was too small to ever allow debugging to be used, given our volume of job throughput at times. The chdir error is expected - we have the settings for Slurm controlled by salt: there is a base salt formula for slurm, and exceptions are in a salt pillar. That message only occurs when slurm.conf is edited via salt and the changes are pushed out, which then triggers a slurmctld restart. This has been the case since that StateSaveLocation was changed, which was many months ago.

Q: I also noticed that the DBD doesn't have any messages between June 21 and June 25.

A: Well, that's disappointing.
Thanks for the additional information. Since we already have other internal tickets tracking this bug, I'm going to close this bug as resolved/infogiven. The patch I gave you is waiting to be reviewed and hasn't gone upstream yet - please see bug 7155 for updates on that. I don't know what version of Slurm it will be in, but am confident that it won't be in 18.08, so please continue running with this local patch as long as you're running 18.08.