How do I interact with Jobs in Real Time?
Interactive Jobs
Batch jobs are submitted to the Slurm queuing system and run when the requested resources become available. However, they are not suitable for testing and troubleshooting code in real time. Interactive jobs let users interact with applications in real time: they can run graphical user interface (GUI) applications, execute scripts, or run other commands directly on a compute node.
Using srun command:
srun submits your resource request to the queue. When the resources are available, a new bash session starts on the reserved compute node. srun accepts the same Slurm flags used for batch jobs.
Example:
## For KUACC
srun -N 1 -n 4 -A users -p short --qos=users --gres=gpu:1 --mem=64G --time 1:00:00 --constraint=tesla_v100 --pty bash
With this command, Slurm reserves 1 node, 4 cores, 64GB RAM, and 1 GPU in the short queue with a 1-hour time limit; the --constraint flag limits the GPU type to Tesla V100. Slurm then opens a terminal on the compute node. If that terminal is closed, the job is killed and removed from the queue.
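Once the interactive shell opens, you can confirm that the allocation matches your request. The commands below are a minimal sketch and assume the NVIDIA driver tools (nvidia-smi) are installed on the GPU node.
## Run these inside the interactive shell on the compute node
hostname              # name of the allocated compute node
echo $SLURM_JOB_ID    # job ID of the interactive job
nvidia-smi            # list the GPU(s) assigned to the job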
## For VALAR
srun -p ai --gres=gpu:1 --mem=20G --cpus-per-task=4 --time=02:00:00 --pty bash
With this command, Slurm reserves resources on the ai partition: 1 GPU of any type (T4 nodes by default, if available), 4 CPU cores, and 20GB RAM for 2 hours, and then opens a terminal on a compute node. If the terminal is closed, the job is killed and removed from the queue.
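If you are unsure which GPU types are available on the ai partition, you can query the node features and generic resources before submitting. This is a minimal sketch, assuming the partition name ai used above.
sinfo -p ai -o "%N %G %f"   # list node names, GRES (GPU types/counts), and feature tags per node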
## For VALAR with constraint
srun -p ai --gres=gpu:tesla_v100:1 --mem=30G --cpus-per-task=6 --time=04:00:00 --pty bash
With this command, Slurm reserves resources on the ai partition: 1 Tesla V100 GPU, 6 CPU cores, and 30GB RAM for 4 hours, and then opens a terminal on a compute node. Closing the terminal will terminate the job.
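srun can also run a single command under the same resource request instead of opening an interactive shell, which is convenient for quick checks. The sketch below reuses the VALAR request above with a short time limit and simply prints the GPU visible to the job.
srun -p ai --gres=gpu:tesla_v100:1 --mem=30G --cpus-per-task=6 --time=00:10:00 nvidia-smi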
Using salloc command:
salloc works in the same way as srun: it submits your resource request to the queue. When the resources are available, it opens a shell on the login node that holds the allocation, and you are allowed to ssh to the reserved compute node. You can also start a shell directly on the allocated node with:
srun --pty bash
Example (same flags as with srun):
## For KUACC
salloc -N 1 -n 4 -A users -p short --qos=users --gres=gpu:1 --mem=64G --time 1:00:00 --constraint=tesla_v100
By this command, Slurm reserves 1 node on the short partition with 4 CPU tasks, 64GB RAM, and 1 GPU. The --constraint=tesla_v100 flag restricts the allocation to nodes with Tesla V100 GPUs. The --qos=users and -A users options ensure the job runs under the users account and QoS. The time limit is set to 1 hour. After allocation, the user can start processes on the reserved node (for example with srun --pty bash). If the session is closed, the job is terminated in the queue.
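Inside the shell returned by salloc, srun launches commands within the existing allocation (the job ID is exported as SLURM_JOB_ID), so you can mix one-off commands and interactive shells. The lines below are a minimal sketch.
## Run these in the shell returned by salloc
echo $SLURM_JOB_ID   # job ID of the allocation
srun hostname        # run a single command on the allocated compute node
srun --pty bash      # or open an interactive shell on the allocated node
exit                 # leaving the salloc shell releases the allocation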
## For VALAR
salloc -p ai --gres=gpu:ampere_a40:1 --cpus-per-task=4 --mem=20G --time=03:00:00
With this command, Slurm allocates resources on the ai partition with 1 GPU of type NVIDIA A40, 4 CPU cores, and 20GB RAM for 3 hours. After allocation, the user can start a shell on the compute node with srun --pty bash.
To find out which compute node was allocated to your job, check the queue:
squeue -u username
or
kuacc-queue | grep username
Then connect to the allocated node:
ssh username@computenode_name
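When you are finished, close the ssh session and release the allocation; otherwise it stays in the queue until the time limit expires. A minimal sketch, using the job ID shown by squeue above:
exit                 # leave the compute node
scancel jobid        # replace jobid with your job ID to cancel the interactive job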