hw-smi

hw-smi is a command-line visualization tool that provides real-time telemetry for the hardware components of a compute node. It shows CPU and GPU performance metrics together in a single dashboard in the terminal.

Activation

To use the utility, you must first load the corresponding module into your environment:

module load hw-smi 

Usage

Launch the real-time monitor by running the following command:

hw-smi 

CPU Metrics

The upper section of the dashboard displays detailed information regarding the central processor:

  • Model: Identifies the specific CPU model (e.g., Intel Xeon Gold 6430).

  • Core Utilization: A visual bar graph representing the real-time load on individual CPU cores.

  • Usage: The aggregated percentage of total CPU resources currently in use.

  • RAM: Physical memory consumption, shown as used versus total capacity in megabytes (MB).

  • Clock: Displays the current operating frequency against the maximum possible clock speed in MHz.

  • Temp: A thermal gauge indicating the current operating temperature of the processor.

  • PCIe BW: Real-time monitoring of PCIe bus bandwidth usage (MB/s).
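
As a rough illustration of how the Usage and RAM readings relate to the underlying values, the sketch below computes an aggregate CPU percentage from per-core loads and a RAM percentage from used and total megabytes. The function names and sample numbers are hypothetical, and treating the aggregate as a simple mean of the per-core loads is an assumption made for illustration; hw-smi may compute it differently.

```python
from statistics import mean


def aggregate_usage(core_loads):
    """Mean of per-core loads in percent (assumed aggregation, not confirmed for hw-smi)."""
    if not core_loads:
        raise ValueError("core_loads must not be empty")
    return mean(core_loads)


def percent_of(used, total):
    """Share of `total` taken up by `used`, in percent."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * used / total


# Hypothetical sample values, not real hw-smi output.
core_loads = [100.0, 100.0, 0.0, 0.0]
print(f"Usage: {aggregate_usage(core_loads):.0f}%")  # Usage: 50%
print(f"RAM: {percent_of(16384, 65536):.0f}%")       # RAM: 25%
```

In this example an aggregate of 50% hides the fact that two cores are saturated while two are idle, which is why the per-core bars are worth reading alongside the Usage figure.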

GPU Metrics

The lower section provides telemetry for the installed graphics accelerators (e.g., NVIDIA RTX A4000):

  • Usage: Real-time GPU core utilization percentage.

  • VRAM BW: Video memory bandwidth utilization.

  • VRAM: Dedicated video memory usage, showing the current footprint against the total capacity (MB).

  • Temp: Current GPU temperature relative to its thermal limit.

  • Power: Real-time power consumption in watts (W) against the maximum Thermal Design Power (TDP).

  • Fan: Cooling fan speed in revolutions per minute (RPM).

  • Clock / Mem Clk: The current core and memory clock speeds in MHz.

  • PCIe BW: Dedicated PCIe bandwidth usage for the GPU interface.

Note: hw-smi is particularly useful for identifying performance bottlenecks, such as thermal throttling or unbalanced core loads, because it shows this telemetry live while a heavy workload is running.
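
The two bottlenecks named in the note can be expressed as simple rules of thumb over the values shown on the dashboard. The sketch below is only an illustration: the function names, thresholds, and sample readings are hypothetical, and hw-smi itself does not perform these checks as far as this page describes.

```python
def looks_thermally_throttled(temp_c, temp_limit_c, clock_mhz, max_clock_mhz,
                              temp_margin_c=5.0, clock_ratio=0.8):
    """Return True if the temperature is within temp_margin_c of the limit
    while the clock is below clock_ratio of its maximum.
    Both thresholds are hypothetical defaults."""
    near_limit = temp_c >= temp_limit_c - temp_margin_c
    clock_reduced = clock_mhz < clock_ratio * max_clock_mhz
    return near_limit and clock_reduced


def core_load_spread(core_loads):
    """Difference between the busiest and the least busy core, in percentage points."""
    return max(core_loads) - min(core_loads)


# Hypothetical readings taken from the Temp, Clock, and Core Utilization fields.
print(looks_thermally_throttled(92.0, 95.0, 1400.0, 2100.0))  # True
print(core_load_spread([98.0, 97.0, 5.0, 3.0]))               # 95.0
```

A reduced clock on its own is not a sign of throttling, since processors commonly lower their frequency when idle; the rule above only fires when the low clock coincides with a temperature near the limit. A large spread between cores suggests the workload is not using all cores evenly.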