Ticket 377 - sh5series, a program to extract all values of one item from one series for all samples on all nodes
Summary: sh5series, a program to extract all values of one item from one series for all samples on all nodes
Status: RESOLVED FIXED
Alias: None
Product: Slurm
Classification: Unclassified
Component: Other
Version: 2.6.x
Hardware: Linux
Priority: ---
Severity: 5 - Enhancement
Assignee: Moe Jette
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2013-07-22 05:03 MDT by Rod Schultz
Modified: 2013-08-02 08:54 MDT
3 users

See Also:
Site: Atos/Eviden Sites
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA Site: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed:
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Attachments
Patch to implement sh5series (32.98 KB, application/octet-stream)
2013-07-22 05:03 MDT, Rod Schultz
Output file from sh5series (9.45 KB, text/plain)
2013-07-22 05:06 MDT, Rod Schultz
Move series extraction to sh5util (26.36 KB, patch)
2013-07-23 03:03 MDT, Rod Schultz
Patch to remove empty output file (842 bytes, patch)
2013-07-25 02:24 MDT, Rod Schultz
ItemExtract with tasks (31.69 KB, patch)
2013-07-26 05:00 MDT, Rod Schultz
Fix seg fault on series name and use hdf5 labels for data names (38.86 KB, patch)
2013-07-29 05:32 MDT, Rod Schultz
sh5util - rework info,debug,error,fatal (53.95 KB, patch)
2013-07-30 03:14 MDT, Rod Schultz
Fix merge (53.96 KB, patch)
2013-07-30 04:26 MDT, Rod Schultz

Description Rod Schultz 2013-07-22 05:03:35 MDT
Created attachment 351 [details]
Patch to implement sh5series

Dresden requested that we provide the maximum amount of energy used by a job from the profile data.

This is hard to get from the HDF5 file, and even extracting the data from the csv file is difficult.

The problem with the csv file is that when you get a series from all the nodes, the data is written serially in the csv file, so you have to manually cut and paste in a spreadsheet to align all the samples from all the nodes. This may be tolerable for a one-time investigation but becomes increasingly difficult with jobs running on many nodes.

The extract feature of sh5util is more like a raw file dump.

I have implemented a new program, sh5series, that extracts one data item from one series from all samples on all nodes and writes the values to a csv file. In addition, for each sample in the series it computes the min, ave, max, and accumulated value, and identifies the nodes on which the min and max occurred.

Finally, it reports on the command line the max value from all samples.
For example, 

sh5series --series=Energy --data=power -j 42
     [scotty] (slurm) job> sh5series --series=Energy --data=power -j 26
     sh5series: Extracting 'power' from 'Energy' data from ./job_26.h5 into ./Energy_power_26.csv
         Step 0 Maximum accumlated power Value (2898.000000) occurred at 2013-06-29 12:46:51 (Elapsed Time=576) Ave Node 181.125000
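
The per-sample aggregation is roughly this (a minimal sketch in C with made-up names, not the actual patch code):

#include <stdio.h>

/* Sketch: per-sample stats across nodes.
 * values[n][s] = the item value for node n at sample s. */
static void write_sample_stats(FILE *fp, double **values, char **node_name,
                               int nnodes, int nsamples)
{
        for (int s = 0; s < nsamples; s++) {
                double min = values[0][s], max = values[0][s], sum = 0.0;
                int nmin = 0, nmax = 0;
                for (int n = 0; n < nnodes; n++) {
                        double v = values[n][s];
                        if (v < min) { min = v; nmin = n; }
                        if (v > max) { max = v; nmax = n; }
                        sum += v;
                }
                /* sample, min, min node, ave, max, max node, accumulated */
                fprintf(fp, "%d,%f,%s,%f,%f,%s,%f\n", s, min, node_name[nmin],
                        sum / nnodes, max, node_name[nmax], sum);
        }
}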

Also attached is a sample csv file as well as a patch implementing this program.
(It looks like I can only do one attachment at a time. I'll follow up with the csv file.)
Comment 1 Rod Schultz 2013-07-22 05:06:39 MDT
Created attachment 352 [details]
Output file from sh5series
Comment 2 Danny Auble 2013-07-22 05:33:16 MDT
Instead of a new program, is there a reason this can't be added to sh5util?
Comment 3 Danny Auble 2013-07-22 06:02:05 MDT
Please don't reply directly to this email as it will not get logged in the system.

So I have a few more questions on the matter.

What exactly are they looking for?  It appears your process does two things: (1) make a csv file and (2) report the max data.

From the requirement it sounds like they only want the latter, though perhaps they want both.

I haven't looked at the code yet, but I would expect reading in the h5 file and looking for the information to be easy.  What you say is the opposite, though.

I think I would need more information about what they asked for and what they really want.  I am fairly confident we can come up with command-line arguments that make sense in sh5util instead of making a new program.
Comment 4 Rod Schultz 2013-07-22 07:34:22 MDT
I'll try to resist the natural instinct to reply by email so my comments can be tracked. I also just lost 30 minutes of work responding to the above message because the content was lost when I went to review the sh5util man page.

You are right, Dresden did only ask for one thing, but they are happy we delivered two. In my experience, when you get the first data point you have just started the conversation. Then you want to know if it was an outlier, whether it was a one-time peak or part of a longer duration, etc. Providing the entire series in csv form means that inside a spreadsheet you can quickly sort on any column and do a wide variety of other things.

The HDFViewer is useful for looking at one time series but not so useful for looking at more than one simultaneously.

I first started with the csv file we currently export and found it is useful if you want to look at all the data from one node, but not very useful if you want to do correlations between nodes. (In that respect, it has some of the same limitations as HDFView.)

You can't just open the hdf5 file and look for stuff. You have to navigate through the structure. This means you have to know the structure. You have to open (and close) groups. You have to create memory and file objects to retrieve the data in a structure, and then you can look at it. This program and the original extract feature are attempts to let a user get data for further analysis without having to program hdf5.
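
To give a feel for how much ceremony is involved, here is a minimal sketch of reading just one flat dataset of doubles with the HDF5 C API. The file name and path are made up, and the real profile series are compound datasets, which additionally need a memory type built with H5Tcreate()/H5Tinsert() before H5Dread():

#include <stdio.h>
#include <stdlib.h>
#include <hdf5.h>

int main(void)
{
        hid_t fid = H5Fopen("job_42.h5", H5F_ACC_RDONLY, H5P_DEFAULT);
        if (fid < 0)
                return 1;
        /* Hypothetical path; the real layout must be known in advance. */
        hid_t did = H5Dopen2(fid, "/Step_0/node1/Energy/power", H5P_DEFAULT);
        if (did < 0) {
                H5Fclose(fid);
                return 1;
        }
        hid_t sid = H5Dget_space(did);
        hssize_t n = H5Sget_simple_extent_npoints(sid);
        double *buf = malloc(n * sizeof(*buf));
        if (buf && H5Dread(did, H5T_NATIVE_DOUBLE, H5S_ALL, H5S_ALL,
                           H5P_DEFAULT, buf) >= 0) {
                for (hssize_t i = 0; i < n; i++)
                        printf("%f\n", buf[i]);
        }
        free(buf);
        H5Sclose(sid);
        H5Dclose(did);
        H5Fclose(fid);
        return 0;
}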

Maybe we could define another mode in sh5util to provide this table. It could be called something like --item-extract. It would require the --series parameter and a new --data one.
Comment 5 Rod Schultz 2013-07-23 03:03:02 MDT
Created attachment 353 [details]
Move series extraction to sh5util

I've moved the functionality that extracts one item from one series from all samples on all nodes into sh5util.

I created a new option, -I (--item-extract).
It takes the --series argument and a new one, --data.

For example,

sh5util -j 42 -I -s Energy -d power

The attached patch to 2.6 entirely replaces the previous implementation.
Comment 6 David Bigagli 2013-07-24 12:14:15 MDT
Hi Rod, this is David. I would like to ask you a couple of questions about the patch to make sure I understand it correctly.

1) In order to extract the data from a series, the individual node-step files must be merged into one. Is this correct?

   The man page says:

-I, --item-extract
        Instead of merging node-step files into a job file  extract  one
        data  item from all samples of one data series from all nodes in
        a job file.
 
  but when I tried to run the code without creating the merged file first 
  I got errors:

>sh5util -I -s Task-1 -d CPU -j 4439
sh5util: Extracting 'CPU' from 'Task-1' data from ./job_4439.h5 into ./Task-1_CPU_4439.csv

HDF5-DIAG: Error detected in HDF5 (1.8.5-patch1) thread 0:
  #000: ../../src/H5F.c line 1509 in H5Fopen(): unable to open file
    major: File accessability
    minor: Unable to open file

 the first message seems to tell me I have to merge first.

2) I merged and ran it again:

sh5util -I -s Task-1 -d RSS -j 4441
sh5util: Extracting 'RSS' from 'Task-1' data from ./job_4441.h5 into ./Task-1_RSS_4441.csv

sh5util: No values RSS for series Task-1 found in step 0
sh5util: No values RSS for series Task-1 found in step 1

 however in the hdfview viewer I see RSS used by all tasks. 
 Is my syntax right?

 Upon this failure empty files are left behind:

-rw-rw-r-- 1 david david      0 Jul 24 17:11 Task-1_RSS_4441.csv

 Should we not create those in case of error or if the data is not found?

Thanks,
        David
Comment 7 Rod Schultz 2013-07-25 00:40:24 MDT
David,

The -I option works against a merged job file.

The name of the task series has an underscore ("_"), not a dash ("-").

You are right, we should not leave an empty file.

Do you want me to make a patch for that?
Comment 8 Rod Schultz 2013-07-25 02:24:18 MDT
Created attachment 356 [details]
Patch to remove empty output file

David,

All modes of sh5util can potentially leave an empty file.

The attached patch checks for an empty file at the end and deletes it.
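
The check is essentially this (a sketch of the idea, not the literal patch):

#include <sys/stat.h>
#include <unistd.h>

/* Sketch: at exit, remove the output file if nothing was written to it. */
static void remove_empty_output(const char *path)
{
        struct stat sb;

        if (stat(path, &sb) == 0 && sb.st_size == 0)
                (void) unlink(path);
}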
Comment 9 David Bigagli 2013-07-25 04:49:35 MDT
Rod,
    could you please check item 2) of my previous message.

If I specify: sh5util -I -s Task-1 -d RSS -j 4441
an empty file gets created while I know the RSS data are there 
because I see them with the hdfview.

 David
Comment 10 Rod Schultz 2013-07-25 05:24:00 MDT
Try 

sh5util -I -s Task_1 -d RSS -j 4441

with "_" instead of with "-" 

sh5util -I -s Task-1 -d RSS -j 4441
Comment 11 Rod Schultz 2013-07-25 05:50:09 MDT
David,

I just had a segfault running the -I option.

I have a large job file from profiling task data every second for a day, and I got the fault processing it.

HDFView seems to have some trouble with it also. 

I did run sh5util with --extract on task data and that worked, so the problem is probably in the new code.

I'm looking at it.
Comment 12 David Bigagli 2013-07-25 06:21:02 MDT
Same result.

->sh5util -I -s Task-1 -d RSS -j 5642
Jul 25 10:58:03.776368 14549 0x7fb837d15700 sh5util: Extracting 'RSS' from 'Task-1' data from ./job_5642.h5 into ./Task-1_RSS_5642.csv

Jul 25 10:58:03.777125 14549 0x7fb837d15700 sh5util: No values RSS for series Task-1 found in step 0
Jul 25 10:58:03.777278 14549 0x7fb837d15700 sh5util: No values RSS for series Task-1 found in step 1

I should mention that in hdfview I see the time series under the path:

/job/step/node/node_name/time series/task_1/task_1 data.

 David
Comment 13 Rod Schultz 2013-07-25 07:17:06 MDT
David,

That still looks like a dash rather than an underscore.

However, extracting the task data is seriously flawed. 

The intent of this mode is to extract one data item from one series from all samples on all nodes. A task series (Task_1) is inherently on only one node.

I think what we want for tasks is one data item from all tasks, from all samples.

I will work on a patch implementing that. 

Would you like a job file with some energy data? I'm sure it's too big to attach here, but I can email you one.

Or you can wait for my next patch. I will probably have something tomorrow.

Rod
Comment 14 David Bigagli 2013-07-25 07:36:27 MDT
I will wait for your next patch. 
Good luck.

 David
Comment 15 Rod Schultz 2013-07-26 05:00:24 MDT
Created attachment 358 [details]
ItemExtract with tasks

David,

I've fixed -s task to extract one data item from all tasks.

The patch is a complete replacement of the sh5util implementation.

If you would like one based on the accumulated patches, I can make one.
Comment 16 David Bigagli 2013-07-26 05:10:43 MDT
Rod,
    if you can make just one big patch that includes everything, it would be better, so I can apply it to a clean master branch.

David
Comment 17 Rod Schultz 2013-07-26 05:15:48 MDT
Comment on attachment 358 [details]
ItemExtract with tasks

The patch labeled ItemExtract with Tasks is a complete patch. You can apply it to a clean master.
Comment 18 David Bigagli 2013-07-26 06:20:07 MDT
Rod, I still have problems running the code:

->sh5util -j 5642 -I -s Task-1 -d RSS
sh5util: Extracting 'rss' from 'Task-1' data from ./job_5642.h5 into ./Task-1_rss_5642.csv

sh5util: No values rss for series Task-1 found in step 0
sh5util: No values rss for series Task-1 found in step 1
sh5util: error: Failed to stat ./Task-1_rss_5642.csv: No such file or directory

I see the code returning NULL in _get_series_data() in this block:

	gid_series = get_group(gid_level, series);
	if (gid_series < 0) {
		// This is okay, may not have ran long enough for
		// a sample (srun hostname)
		H5Gclose(gid_level);
		return NULL;
	}

My job ran one minute. I have a little program that calls malloc(), does I/O and some computation so it uses resources for a minute, and I launch it on 8 cores.

Am I using sh5util correctly? I am sending you my aggregate file, gzipped, in a separate email. It is just 6K.

Thanks,   David
Comment 19 Rod Schultz 2013-07-26 08:21:49 MDT
You no longer give a task number.

Since tasks only run on one node, it made no sense to build a table of all samples of one data item for a single task.

Just do this.

sh5util -I -s Task -d rss -j 5642


There does seem to be a minor bug in that it reports on two steps. It must be similar to the missing step problem I found in the merge. I'll look at it Monday.
Comment 20 David Bigagli 2013-07-26 10:56:04 MDT
OK, it works now.

Why do you think that reporting on both steps is not right. My understanding 
was the customer wanted the data for the whole job.

We don't document the valid names for -d, and they are different from what is in the hdfview table view. We should print the valid keywords in the help and also document them in the man page, I think.

Specifying an incorrect key name (I had to look at the code to get them right), sh5util gets a segmentation fault at this instruction:

1362	fprintf(fp,",%f",smp_series[ix]);

because:

dybagme-> p ix
$1 = 0
dybagme->p smp_series
$2 = (double *) 0x0
dybagme->

Have a nice weekend.

David
Comment 21 Rod Schultz 2013-07-29 01:24:21 MDT
You are right, without a step parameter we do want to extract all steps.

I was confused because there is a bug: the csv file contains only one step. I will fix that.

I will fix -d so that it accepts the titles in hdf5view. It currently accepts the names in the data structure, and it should be the hdf5 labels.

I will also fix the help and man pages.

I will test for correct series names to address the seg fault.
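
The guard will look something like this (a sketch against the crashing loop; the names are placeholders, not necessarily the real variables):

/* Sketch: bail out before the write loop dereferences a NULL buffer.
 * smp_series comes back NULL when the -d name matches no data item. */
if (smp_series == NULL) {
        error("Invalid data item '%s' for series '%s'",
              params.data_item, params.series);
        return;
}
for (int ix = 0; ix < nsamples; ix++)
        fprintf(fp, ",%f", smp_series[ix]);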
Comment 22 Rod Schultz 2013-07-29 05:32:30 MDT
Created attachment 361 [details]
Fix seg fault on series name and use hdf5 labels for data names

David,

The attached patch addresses the issues you identified.

I added a list of data items by series to the man page, but in the help output I only referenced the man page. I thought listing them all would make the help too verbose.

This patch replaces all previous ones and should be applied to a clean master.

Rod.
Comment 23 David Bigagli 2013-07-29 07:20:21 MDT
Rod,
    OK, it works now. I would like to propose a few more changes before closing this bug.

1) The use of the fatal() function does not allow for cleaning up data before exiting.

>sh5util -I -j 5642 --series=Task --data rod
sh5util: Extracting 'rod' from 'Task' data from ./job_5642.h5 into ./Task_rod_5642.csv

sh5util: PROFILE: rod is invalid data item for task data
sh5util: fatal: No data item rod

this will leave an empty file Task_rod_5642.csv behind.

2) info() messages should be debug() in my opinion, as users just want to run the command; some info() calls should be error(), like:

	if (nsg_node < 0) {
		info("Node-Step is not HDF5 object");
		return;
	}

Basically, reduce the verbosity of the command, just like Unix cp or mv, unless the user asks for explicit output or there is an error.
On the same note, I would remove this message:

  error("No data in %s (it has been removed)", params.output);

as it is redundant.

3) I think we should detect that the input file does not exist, whether or not
 the input option is specified, and print a one-line error message like
 
 "Input file zumzum: no such file or directory"
 
rather than printing a stack of hdf5 messages to stderr (a sketch of this
follows the example below).

>sh5util -I -j 5642 --series=Task --data rss --input /opt/slurm/master/linux/gather/david
sh5util: Extracting 'rss' from 'Task' data from /opt/slurm/master/linux/gather/david into ./Task_rss_5642.csv

HDF5-DIAG: Error detected in HDF5 (1.8.5-patch1) thread 0:
  #000: ../../src/H5F.c line 1509 in H5Fopen(): unable to open file
    major: File accessability
    minor: Unable to open file
  #001: ../../src/H5F.c line 1300 in H5F_open(): unable to read superblock
    major: File accessability
    minor: Read failed
  #002: ../../src/H5Fsuper.c line 307 in H5F_super_read(): unable to find file signature
    major: File accessability
    minor: Not an HDF5 file
  #003: ../../src/H5Fsuper.c line 144 in H5F_locate_signature(): unable to read file signature
    major: Low-level I/O
    minor: Unable to initialize object
  #004: ../../src/H5FDint.c line 142 in H5FD_read(): driver read request failed
    major: Virtual File Layer
    minor: Read failed
  #005: ../../src/H5FDsec2.c line 770 in H5FD_sec2_read(): file read failed: time = Mon Jul 29 12:01:21 2013
, filename = '/opt/slurm/master/linux/gather/david', file descriptor = 4, errno = 21, error message = 'Is a directory', buf = 0x7fff4366d0b0, size = 8, offset = 0
    major: Low-level I/O
    minor: Read failed
sh5util: error: Failed to open /opt/slurm/master/linux/gather/david
sh5util: error: No data in ./Task_rss_5642.csv (it has been removed)
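
Points 1) and 3) could share a fix: validate the input up front, silence HDF5's automatic error stack with H5Eset_auto2(), and report a single line ourselves. A sketch with made-up names:

#include <stdio.h>
#include <sys/stat.h>
#include <hdf5.h>

/* Sketch: fail fast with one line instead of fatal() plus an HDF5
 * diagnostic stack. */
static int check_input(const char *path)
{
        struct stat sb;

        /* Suppress HDF5's automatic error printing; we report ourselves. */
        H5Eset_auto2(H5E_DEFAULT, NULL, NULL);

        if (stat(path, &sb) == -1 || !S_ISREG(sb.st_mode)) {
                fprintf(stderr, "Input file %s: no such file or directory\n",
                        path);
                return -1;
        }
        return 0;
}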

I see you caught the compile warning! :-)

Thanks,
        David
Comment 24 Rod Schultz 2013-07-30 03:14:57 MDT
Created attachment 362 [details]
sh5util - rework info,debug,error,fatal

David,

I reworked all the info, debug, error, and fatal calls (even in existing sh5util functions).

This patch replaces all previous ones and should be applied to a clean master.

Rod
Comment 25 Rod Schultz 2013-07-30 04:17:43 MDT
David,

It looks like I broke the merge part of sh5util.

Rod.
Comment 26 Rod Schultz 2013-07-30 04:26:54 MDT
Created attachment 363 [details]
Fix merge

David,

I edited the patch file. 

If you've already applied the patch, the problem is at line 367 of sh5util.

It should be

	if (params.input && stat(params.input, &sb) == -1) {

I did a quick and dirty test of -E and it at least passed the smoke test.

Rod.
Comment 27 David Bigagli 2013-08-02 08:54:03 MDT
Fix committed.

David