By Dave Turner of K-State. Contact him at daveturner@ksu.edu with any comments or questions.
Dave Turner has developed a cluster monitoring tool called kstat over the past ten years. A Perl script runs on each compute node, collecting data from the Slurm scontrol command, mining the /proc file system for memory and utilization information on running jobs, and parsing nvidia-smi output when NVIDIA GPUs are present. This information is stored in a PostgreSQL database once per minute. When a user types kstat, the stored data is retrieved for the requested jobs and hosts, combined with information from the Slurm squeue, sacct, and scontrol commands, and presented in a user-friendly format. The output is colorized, with warnings on a yellow background and errors on a red background, to help users identify when host nodes are down or their jobs are running inefficiently.
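To give a feel for the /proc mining step, here is a minimal sketch in Python. It is not kstat's code (kstat's collector is a Perl script), and the function name parse_proc_status and the sample text are hypothetical. It assumes only the standard Linux /proc/<pid>/status layout, where VmRSS reports the current resident memory and VmHWM the peak, both in kB.

```python
import re


def parse_proc_status(text):
    """Extract current and peak resident memory (kB) from /proc/<pid>/status text.

    Hypothetical helper for illustration; not part of kstat.
    """
    fields = {}
    for line in text.splitlines():
        # VmRSS = current resident set size, VmHWM = peak ("high water mark")
        match = re.match(r"(VmRSS|VmHWM):\s+(\d+)\s+kB", line)
        if match:
            fields[match.group(1)] = int(match.group(2))
    # Kernel threads have no Vm* lines, so either value may be None
    return {"current_kb": fields.get("VmRSS"), "peak_kb": fields.get("VmHWM")}


# Made-up sample in the format of a /proc/<pid>/status file
SAMPLE = """Name:\tpython3
VmHWM:\t  204800 kB
VmRSS:\t  102400 kB
Threads:\t4
"""

usage = parse_proc_status(SAMPLE)
print(usage)
```

On a Linux compute node the same function could be fed the contents of a real /proc/<pid>/status file for each process belonging to a job, with the results summed per job before being stored.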
This software is installed at both Kansas State University and Wichita State University. It has proven invaluable in giving our users more in-depth information about their jobs, such as current and maximum memory usage and CPU and GPU utilization, without their having to ssh into a compute node. It can also dump this data in table or graph form to show how performance changes over time.
If you are interested in giving kstat a try, please download the code from the GitHub link below. Feel free to contact Dave if you need help with installation; any feedback would be very welcome.
https://github.com/DrDaveTurner/kstat

