hw-smi is a command-line visualization tool that displays real-time telemetry for the hardware components of a compute node. It shows CPU and GPU performance metrics together in a single dashboard, in the manner of other system monitoring tools.
Activation
To use the utility, you must first load the corresponding module into your environment:
module load hw-smi
Usage
Launch the real-time monitor by running the following command:
hw-smi
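In a job script or a login profile, the two steps above can be combined with a guard that confirms the command is on the PATH before launching it. The following is a minimal sketch, assuming a POSIX shell and an environment-modules style `module` command; the helper name `require_cmd` is hypothetical and not part of hw-smi.

```shell
#!/bin/sh
# Hypothetical helper (not part of hw-smi): report whether a command is on PATH.
require_cmd() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "found: $1"
    else
        echo "missing: $1" >&2
        return 1
    fi
}

# Typical use on a compute node (left commented out so the sketch has no side effects):
# module load hw-smi && require_cmd hw-smi && hw-smi
```

If the module failed to load, the guard prints a short message instead of letting the shell fail with a "command not found" error.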
CPU Metrics
The upper section of the dashboard displays detailed information regarding the central processor:
- Model: The CPU model name (e.g., Intel Xeon Gold 6430).
- Core Utilization: A bar graph showing the real-time load on each CPU core.
- Usage: The aggregate percentage of total CPU resources currently in use.
- RAM: Physical memory consumption, shown as used versus total capacity in megabytes (MB).
- Clock: The current operating frequency against the maximum clock speed, in MHz.
- Temp: A thermal gauge showing the current processor temperature.
- PCIe BW: Real-time PCIe bus bandwidth usage, in MB/s.
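Several of these fields pair a current value with a maximum (RAM used versus total, clock versus maximum clock). When noting readings down by hand, it can help to reduce each pair to a percentage. The sketch below assumes `awk` is available; the helper name `pct_of_max` and the sample numbers are hypothetical and are not hw-smi output.

```shell
#!/bin/sh
# Hypothetical helper: express a "current vs. maximum" reading as a percentage.
pct_of_max() {
    awk -v cur="$1" -v max="$2" 'BEGIN {
        if (max + 0 <= 0) { print "n/a"; exit 1 }   # guard against a zero or missing maximum
        printf "%.1f%%\n", 100 * cur / max
    }'
}

# Illustrative values only (not real readings):
pct_of_max 64000 256000   # RAM used vs. total, in MB   -> 25.0%
pct_of_max 2100 3400      # clock vs. max clock, in MHz -> 61.8%
```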
GPU Metrics
The lower section provides telemetry for the installed graphics accelerators (e.g., NVIDIA RTX A4000):
- Usage: Real-time GPU core utilization, as a percentage.
- VRAM BW: Video memory bandwidth utilization.
- VRAM: Dedicated video memory usage, shown as the current footprint against total capacity (MB).
- Temp: Current GPU temperature relative to its thermal limit.
- Power: Real-time power draw in watts (W) against the card's Thermal Design Power (TDP).
- Fan: Cooling fan speed in revolutions per minute (RPM).
- Clock / Mem Clk: The current core and memory clock speeds, in MHz.
- PCIe BW: PCIe bandwidth usage for the GPU interface.
Note: hw-smi is particularly useful for spotting performance bottlenecks during heavy workloads, such as thermal throttling (the clock dropping as temperature rises) or unbalanced core loads (a few cores saturated while the rest sit idle).
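To make "unbalanced core loads" concrete, the sketch below summarises a set of per-core utilization percentages and reports the spread between the busiest and the idlest core. It is only an illustration: the helper name `core_spread` is hypothetical, the input is typed by hand rather than read from hw-smi, and `awk` is assumed to be available.

```shell
#!/bin/sh
# Hypothetical helper: read per-core utilization percentages (whitespace-separated)
# from stdin and print the core count, min, max, mean, and max-min spread.
core_spread() {
    awk '
        {
            for (i = 1; i <= NF; i++) {
                v = $i + 0
                if (n == 0 || v < min) min = v
                if (n == 0 || v > max) max = v
                sum += v
                n++
            }
        }
        END {
            if (n == 0) { print "no data"; exit 1 }
            printf "cores=%d min=%.1f max=%.1f mean=%.1f spread=%.1f\n", n, min, max, sum / n, max - min
        }'
}

# Illustrative values: two saturated cores and two nearly idle ones.
echo "98 97 3 2" | core_spread
# prints: cores=4 min=2.0 max=98.0 mean=50.0 spread=96.0
```

A large spread alongside a moderate mean is the pattern the per-core bar graph makes visible at a glance: the aggregate Usage figure looks healthy while the work is concentrated on a few cores.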