jtop is a monitoring tool used to track the real-time resource usage of jobs running on the cluster, including CPU load, memory consumption, and GPU status. It is particularly useful for analyzing the efficiency of jobs across different nodes.
Usage
jtop [options]
Options
The following flags can be used to filter or modify the output:
- -n <node>: Queries jobs on a specific target node.
- -u <user>: Filters the job list by a specific username.
- -p <partition>: Filters jobs by a specific partition or queue.
- -l: Shows only the jobs running on the local node.
- -h: Displays the help message and usage details.
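When jtop is called from a script, it can help to assemble the command line from the filters listed above. The following is a minimal sketch, not part of jtop itself: the helper name build_jtop_command is hypothetical, the default path /usr/bin/jtop is taken from the examples on this page, and only the -n plus -u combination is shown in those examples, so other flag combinations are an assumption.

```python
import shlex


def build_jtop_command(node=None, user=None, partition=None, local=False,
                       executable="/usr/bin/jtop"):
    """Return an argv list for jtop using only the flags documented on this page.

    Hypothetical helper: combining flags other than -n with -u is an
    assumption, since only that pairing appears in the examples.
    """
    argv = [executable]
    if node:
        argv += ["-n", node]
    if user:
        argv += ["-u", user]
    if partition:
        argv += ["-p", partition]
    if local:
        argv.append("-l")
    return argv


cmd = build_jtop_command(node="ai01", user="valar")
# Print the command as it would be typed in a shell.
print(shlex.join(cmd))  # /usr/bin/jtop -n ai01 -u valar
```

The returned list can be passed directly to subprocess.run on a node where jtop is installed.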
Output Columns
When you run jtop, the output contains the following information:
- JOBID: The unique ID assigned to the job by Slurm.
- USER: The owner of the running job.
- ELAPSED: The total wall-clock time the job has been running (Format: Days-Hours:Minutes:Seconds).
- CPU: The total number of CPU cores allocated to the job.
- RUN: The current CPU utilization, expressed in cores (e.g., 0.99 indicates 99% usage of one core).
- D: Processes in an uninterruptible sleep state (D), which usually means they are waiting on disk I/O.
- RSS(MB): The current Resident Set Size (physical memory in use), measured in megabytes.
- GPU: The GPU resources allocated to the job and their types.
- NODE: The compute node where the job is executing.
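To compare or sort jobs by runtime, the ELAPSED value can be converted to seconds. The sketch below assumes the Days-Hours:Minutes:Seconds format described above. It also assumes that the day prefix may be omitted for jobs shorter than a day, which is not confirmed for jtop. The function name elapsed_to_seconds is hypothetical.

```python
def elapsed_to_seconds(elapsed):
    """Convert 'D-HH:MM:SS' to seconds.

    Assumption: shorter forms such as 'HH:MM:SS' or 'MM:SS' may also occur,
    so missing leading fields are treated as zero.
    """
    days = 0
    if "-" in elapsed:
        day_part, elapsed = elapsed.split("-", 1)
        days = int(day_part)
    parts = [int(p) for p in elapsed.split(":")]
    # Pad on the left so that we always have hours, minutes, seconds.
    while len(parts) < 3:
        parts.insert(0, 0)
    hours, minutes, seconds = parts
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds


print(elapsed_to_seconds("1-02:03:04"))  # 93784
```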
Examples
List all jobs in the 'ai' partition:
/usr/bin/jtop -p ai
List all jobs for a specific user:
/usr/bin/jtop -u valar
List jobs for a specific user on a specific node:
/usr/bin/jtop -n ai01 -u valar
Show only the jobs running on the local node:
/usr/bin/jtop -l
Note: jtop provides a snapshot of active processes at the moment it is run. Compare the RUN column against CPU, and watch RSS(MB) against the memory you requested, to confirm that your jobs are using their allocated resources effectively without hitting limits.
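One way to act on this is to flag jobs whose CPU usage is low relative to their allocation. The sketch below assumes RUN is reported in cores, so that RUN divided by CPU gives the fraction of allocated cores that are busy. The sample rows are made up for illustration, the 0.5 threshold is arbitrary, and the function name find_inefficient_jobs is hypothetical. Parsing jtop's text output is left out because its exact layout is not specified here.

```python
def find_inefficient_jobs(rows, min_ratio=0.5):
    """Return (JOBID, ratio) pairs for jobs using less than min_ratio of their cores.

    rows: iterable of dicts with 'JOBID', 'CPU', and 'RUN' keys (string values).
    Assumption: RUN / CPU approximates the busy fraction of allocated cores.
    """
    flagged = []
    for row in rows:
        cores = int(row["CPU"])
        busy = float(row["RUN"])
        if cores > 0 and busy / cores < min_ratio:
            flagged.append((row["JOBID"], round(busy / cores, 2)))
    return flagged


# Hypothetical sample values, not real jtop output.
sample = [
    {"JOBID": "1001", "CPU": "8", "RUN": "7.90"},
    {"JOBID": "1002", "CPU": "16", "RUN": "1.02"},
]
print(find_inefficient_jobs(sample))  # [('1002', 0.06)]
```

Because jtop shows a single snapshot, a job flagged once may simply be in an I/O or startup phase; checking it several times gives a more reliable picture.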