Summary: | JobCompType content for jobcomp/filetxt vs slurmdbd | ||
---|---|---|---|
Product: | Slurm | Reporter: | Brian Haymore <brian.haymore> |
Component: | Accounting | Assignee: | Moe Jette <jette> |
Status: | RESOLVED FIXED | QA Contact: | |
Severity: | 5 - Enhancement | ||
Priority: | --- | CC: | alex |
Version: | 15.08.10 | ||
Hardware: | Linux | ||
OS: | Linux | ||
Site: | University of Utah | Alineos Sites: | --- |
Atos/Eviden Sites: | --- | Confidential Site: | --- |
Coreweave sites: | --- | Cray Sites: | --- |
DS9 clusters: | --- | HPCnow Sites: | --- |
HPE Sites: | --- | IBM Sites: | --- |
NOAA SIte: | --- | OCF Sites: | --- |
Recursion Pharma Sites: | --- | SFW Sites: | --- |
SNIC sites: | --- | Linux Distro: | --- |
Machine Name: | CLE Version: | ||
Version Fixed: | 17.11.0-pre1 | Target Release: | --- |
DevPrio: | 4 - Medium | Emory-Cloud Sites: | --- |
Description
Brian Haymore
2016-11-01 12:48:52 MDT
Comment # 1 on bug 3229 from Alejandro Sanchez
(Notification sent Wednesday, November 02, 2016 7:12 AM. Changes: CC added alex@schedmd.com; Assignee changed from support@schedmd.com to alex@schedmd.com.)

Hi Brian.

The jobcomp/filetxt plugin currently records these fields: JobId, UserId=name(uid), GroupId=name(gid), Name, JobState, Partition, TimeLimit, StartTime, EndTime, NodeList, NodeCnt, ProcCnt, WorkDir.

Other jobcomp plugins (e.g. elasticsearch) record up to 36+ fields, including job_record->gres_req and job_record->gres_alloc, which you mentioned you were interested in having recorded.

I can change the bug to a sev-5 (enhancement request) for this request to widen the set of fields recorded by jobcomp/filetxt. Unfortunately our development schedule is pretty full at the moment, and absent an external patch or a customer wanting to sponsor this work as a priority, I can't say how soon we'd be able to get to it. (If that's something University of Utah may be interested in pursuing, let me know and we can have that discussion separately.)

Comment # 2 on bug 3229 from Brian Haymore

So help me understand a bit. Can I just change my setting from jobcomp/filetxt to elasticsearch and have all 36 of those fields land in the same text file I have now?

I'm happy to change to a feature request status if there is reason to do so to accomplish this, but if there is an existing knob that does it, I'm happy to explore that too. Guide me on this please.

--
Brian D. Haymore
University of Utah
Center for High Performance Computing
155 South 1452 East RM 405
Salt Lake City, Ut 84112
Phone: 801-558-1150, Fax: 801-585-5366
http://bit.ly/1HO1N2C

Comment # 3 on bug 3229 from Alejandro Sanchez
(Notification sent Wednesday, November 02, 2016 1:28 PM.)

(In reply to Brian Haymore from comment #2)
> So help me understand a bit. Can I just change my setting from
> jobcomp/filetxt to elasticsearch and have all 36 of those fields land in the
> same text file I have now?

No. If you change the JobCompType value from jobcomp/filetxt to jobcomp/elasticsearch, the job completion information will be indexed into the Elasticsearch server[1] at the address and port specified by the JobCompLoc parameter.

> I'm happy to change to a feature request status if there is reason to do so to
> accomplish this, but if there is an existing knob that does it I'm happy to
> explore that too. Guide me on this please.

So if the accounting done by slurmdbd is not enough, and you want to use a Job Completion plugin in parallel with slurmdbd, there are currently 4 JobCompType plugins available: filetxt, mysql, script and elasticsearch. As I said, the filetxt plugin only stores the fields mentioned in my previous comment. The elasticsearch plugin stores 36+ fields, but instead of writing them to a file, it indexes the records in an Elasticsearch server.

Anyhow, for most customers, doing the accounting with slurmdbd is more than enough. Also, we've done some tests where slurmdbd scales better than the JobCompType plugins in heavy HTC environments.

The elasticsearch plugin stores these 38 fields: jobid, username, user_id, groupname, group_id, @start, @end, elapsed, partition, alloc_node, nodes, total_cpus, total_nodes, derived_exitcode, exitcode, state, cpu_hours, array_job_id, array_task_id, @submit, queue_wait, work_dir, std_err, std_in, std_out, cluster, qos, ntasks, ntasks_per_node, cpus_per_task, orig_dependency, excluded_nodes, time_limit, reservation_name, gres_req, gres_alloc, account, script, parent_accounts.

Note: parent_accounts is the account hierarchy from the job account up to the root, in the format /root/accountA/subaccountB/.../subaccountC.

[1] https://www.elastic.co/products/elasticsearch

Comment # 4 on bug 3229 from Brian Haymore

OK, that helps me understand better. At the root of this is that our current setup using slurmdbd has proven to be susceptible to things that end up losing partial or full job records. So the interest in extending jobcomp/filetxt is mostly to give us a fallback plan, so that we can reconstruct missing info in slurmdbd.

I'm still not sure whether David Richardson has opened the ticket with you all about the issues we have observed with slurmdbd, to see if there are already measures we can take there to improve things, so right now it's hard to give an overall opinion. I guess my feeling would be to request that we flag this one as a feature extension request, understanding that your current queue and load mean this is a ways out before it will be looked at. Then between now and then, as we look at what David has reported or will report, we can see if we should rethink the feature request. How does that sound to you?

--
Brian D. Haymore

(In reply to Brian Haymore from comment #4)
> OK that helps me understand better. At the root of this is that our
> current setup using slurmdbd has proven to be susceptible to things that end up
> losing partial or full job records. So the interest in extending the
> jobcomp/filetxt is much to give us a fall back plan that we can reconstruct
> missing info in slurmdbd. I'm still not sure if David Richardson has yet
> opened the ticket with you all on the front of the issues we have observed
> with slurmdbd yet to see if there are already measures we can take there to
> improve things so right now it's hard to give an overall opinion. I guess
> my feeling would be to request that we flag this one as a feature extension
> request understanding your current queue and load means this is a ways out
> before it will be looked at. Then between here and then as we look at what
> David has or will report we can see if we should rethink the feature
> request. How does that sound to you?

I see 4 bugs from David Richardson related to slurmdbd:

Bug 2602 - resolved/infogiven.
Bug 2828 - resolved/fixed.
Bug 2888 - unconfirmed (we're working on it).
Bug 2889 - resolved/infogiven.

I don't know if you are experiencing further issues, but if so, just open a new bug for them.

That sounds good to me; I'm going to mark this bug as a sev-5. I'd also encourage you to upgrade to the latest 16.05 version, as a lot of issues have been fixed since 15.08.

I've added quite a few fields to the jobcomp/filetxt plugin. It's not every job field, but it seems to be pretty much everything I would imagine would be useful. Here's a list of the added fields: ArrayJobId, ArrayTaskId, ReservationName, Gres, Account, QOS, WcKey, Cluster, SubmitTime, EligibleTime, DerivedExitCode and ExitCode.

Here's the commit with the change:
https://github.com/SchedMD/slurm/commit/41f2c4745929b83e8ef3d3fe577d266aa5b81b0f

This will be available in Slurm version 17.11, but should apply cleanly as a patch to version 17.02 if desired.
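
For anyone using the jobcomp/filetxt log as a fallback for reconstructing records, the following is a minimal Python sketch of reading one record and checking which of the fields named in this thread it carries. It assumes each job is written as a single line of space-separated Key=Value pairs; that layout, the sample line and its values, and the helper names `parse_record` and `missing_fields` are illustrative assumptions, not something confirmed in this bug, so check them against your own log file before relying on it.

```python
import re

# Field names as listed in this thread for the original plugin output.
BASE_FIELDS = ["JobId", "UserId", "GroupId", "Name", "JobState", "Partition",
               "TimeLimit", "StartTime", "EndTime", "NodeList", "NodeCnt",
               "ProcCnt", "WorkDir"]
# Field names listed in this thread as added by the commit.
ADDED_FIELDS = ["ArrayJobId", "ArrayTaskId", "ReservationName", "Gres",
                "Account", "QOS", "WcKey", "Cluster", "SubmitTime",
                "EligibleTime", "DerivedExitCode", "ExitCode"]

# Only a known key starts a new field, so values may contain spaces.
KEY_RE = re.compile(r"(?:^|\s)(" + "|".join(BASE_FIELDS + ADDED_FIELDS) + r")=")


def parse_record(line):
    """Parse one 'Key=Value Key=Value ...' record into a dict (assumed layout)."""
    matches = list(KEY_RE.finditer(line))
    record = {}
    for i, match in enumerate(matches):
        # A value runs up to the whitespace that precedes the next known key.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(line)
        record[match.group(1)] = line[match.end():end].strip()
    return record


def missing_fields(record, wanted):
    """Return the wanted field names that the record does not contain."""
    return [name for name in wanted if name not in record]


# Hypothetical sample line in the pre-change format; all values are made up.
SAMPLE = ("JobId=1234 UserId=alice(1001) GroupId=users(100) Name=my job "
          "JobState=COMPLETED Partition=general TimeLimit=60 "
          "StartTime=2016-11-01T12:00:00 EndTime=2016-11-01T12:30:00 "
          "NodeList=node[01-02] NodeCnt=2 ProcCnt=32 WorkDir=/home/alice/run 1")

if __name__ == "__main__":
    rec = parse_record(SAMPLE)
    print(rec["UserId"], rec["WorkDir"])
    print("missing added fields:", missing_fields(rec, ADDED_FIELDS))
```

Restricting the key pattern to known field names is a deliberate choice here: job names and working directories can contain spaces, and a generic `\w+=` pattern would split such values incorrectly.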