SLURM – Job Scheduling and Resource Management

At KUACC and DC2, we use SLURM (Simple Linux Utility for Resource Management) to efficiently manage computing resources like CPUs, memory, and GPUs, and to schedule jobs on the clusters. SLURM is a widely adopted open-source tool in the HPC community, powering around 65% of the world’s top 500 supercomputers as of 2025.

Key SLURM Concepts

Cluster: The entire set of connected computers working together to run your computing jobs. (For example, KUACC HPC cluster)

Node: A single computer or server within the cluster where your job runs. (For example, ai01, ai02)

Partition: A group of nodes with shared settings and resource limits. Nodes can belong to multiple partitions. Think of partitions as different “queues” with their own rules. (Examples: users, ai, starlet)

Account: A user group that manages resource limits and job priorities. Each user belongs to an account.

QoS (Quality of Service): A way to assign job priorities, limits, and special features. Jobs can request QoS to influence scheduling.

Task: The smallest unit of work in SLURM, usually a single process. By default, one task runs on one CPU core.

Job: A collection of one or more tasks submitted to SLURM. Each job has a unique job ID assigned by SLURM and a name chosen by the user.
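These concepts can be inspected directly from the command line once you are logged in to the cluster. The following is a minimal sketch using standard SLURM query commands; the node name ai01 is simply the example used above, and the exact output columns depend on the site configuration:

    # List partitions, their limits, and the state of the nodes in each
    sinfo

    # Show detailed information for a single node, e.g. ai01
    scontrol show node ai01

    # Show the accounts and QoS values your user is allowed to submit with
    sacctmgr show assoc user=$USER format=account,qos

    # List your own pending and running jobs (job ID, name, partition, state)
    squeue -u $USER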

Batch and Interactive Jobs

All users must submit jobs in order to reserve resources and run their code on the cluster.

There are two main types of jobs in SLURM: batch jobs and interactive jobs.

Batch jobs are submitted to the scheduler and run in the background without user interaction. These jobs are ideal for long or unattended computations.
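
For illustration, a minimal batch script might look like the sketch below. The partition name, resource amounts, and file names are placeholders rather than KUACC or DC2 defaults; consult the submission pages for the values that apply to your account:

    #!/bin/bash
    # Job name shown in squeue, and the partition (queue) to submit to
    #SBATCH --job-name=example_job
    #SBATCH --partition=users
    # Resources: one task, 4 CPU cores, 8 GB memory, 2-hour time limit
    #SBATCH --ntasks=1
    #SBATCH --cpus-per-task=4
    #SBATCH --mem=8G
    #SBATCH --time=02:00:00
    # Write output to a file named after the job name (%x) and job ID (%j)
    #SBATCH --output=%x_%j.out

    # Commands below run on the allocated compute node
    echo "Running on $(hostname)"

The script is submitted with sbatch, for example sbatch example_job.sh (the file name here is only illustrative); SLURM then assigns the job ID and places it in the queue.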

Interactive jobs allow users to access a compute node directly for real-time testing, debugging, or exploratory work.
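
As a sketch, an interactive session can typically be started with srun and the --pty option; the partition name and resource amounts below are illustrative only:

    # Request one task with 4 cores and 8 GB of memory for one hour,
    # and open an interactive shell on the allocated compute node
    srun --partition=users --ntasks=1 --cpus-per-task=4 --mem=8G --time=01:00:00 --pty bash

When the allocation is granted, the shell prompt moves to the compute node; exiting the shell releases the resources.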

For details on how to submit each type of job, please refer to the dedicated pages on batch and interactive job submission.
