
VALAR HPC (DC2) Cluster

Koç University is gradually building a new state-of-the-art HPC cluster, named VALAR, whose installation began in 2024. The installation and growth of this cluster is proceeding in stages, and several Koç University research groups have already started working on it. As of January 2025, the VALAR HPC system consists of the following components:

  • Login Nodes

  • Compute Nodes

  • Service Nodes - Load Balancer, Headnode, MariaDB database, Zabbix, Rocky IDM

  • Network Infrastructure - High Speed Network (InfiniBand)

  • Storage - Faster, Scalable, Parallel File System (BeeGFS)

Login Nodes

Login nodes are shared by all users, so no resource-intensive processes may be run on them. The purpose of a login node is to prepare to run a program (e.g., moving/editing files, compiling).

There are currently two virtual login nodes on the VALAR HPC Cluster. When a user connects to the "login" address via SSH, the connection is forwarded to one of the login nodes, login01 or login02.

| Hostname | Specifications |
| --- | --- |
| login[01-02] | 2 x Intel(R) Xeon(R) Gold 6430, 2 x 8 cores (16 cores), 128 GB RAM, 1 x Nvidia RTX A4000 |
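The page does not state the cluster's SSH endpoint, but an `~/.ssh/config` entry along the following lines is typical; the host name and user name below are placeholders, not confirmed values:

```
# ~/.ssh/config entry (host name and user are placeholders, not confirmed values)
Host valar
    HostName login.valar.ku.edu.tr   # placeholder; use the address given by the HPC team
    User kuusername                  # placeholder; your Koç University user name
```

With such an entry in place, `ssh valar` would land on either login01 or login02.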

Compute Nodes

Compute nodes are designated for running programs and performing computations. Any process that uses a significant amount of computational or memory resources must be executed on a compute node. The VALAR HPC Cluster consists of 54 compute nodes, all donated by different research groups.

Research Group Donated Nodes

  • These nodes are donated by research groups, and priority rules apply.

  • Members of the corresponding research group have priority access to these nodes.

  • Interruption Rule: If a research group member submits a job to their donated nodes and insufficient resources are available:

    • The workload manager checks the node’s resources.

    • It cancels and requeues the lowest-priority running job(s) to free up resources for the higher-priority job.
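This page does not name the workload manager, but on Slurm-based clusters (a common choice in HPC) a job that tolerates this cancel-and-requeue behavior is typically marked as requeueable in its batch script. A sketch, with hypothetical partition and program names:

```
#!/bin/bash
#SBATCH --job-name=example
#SBATCH --partition=donated   # hypothetical partition name
#SBATCH --requeue             # let the scheduler cancel and requeue this job
#SBATCH --ntasks=1
#SBATCH --time=01:00:00

srun ./my_program             # hypothetical executable
```

A requeued job is restarted from the beginning, so long-running programs benefit from writing periodic checkpoints.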

The valar-nodes command provides detailed information about the nodes and their resources, including available CPUs, memory, GPUs, and the current status of each node.

| Hostname | Specifications |
| --- | --- |
| ai[01-10] | 2 x Intel(R) Xeon(R) Gold 6248 @ 2.50GHz, 2 x 20 cores (40 cores), 512 GB RAM, 8 x Tesla T4 |
| ai[11-14] | 2 x Intel(R) Xeon(R) Gold 6248 @ 2.50GHz, 2 x 20 cores (40 cores), 512 GB RAM, 8 x Tesla V100 NVLink |
| ai15 | 2 x Intel(R) Xeon(R) Gold 6342 @ 2.80GHz, 2 x 24 cores (48 cores), 512 GB RAM, 4 x RTX A6000 |
| ai16 | 2 x Intel(R) Xeon(R) Gold 6342 @ 2.80GHz, 2 x 24 cores (48 cores), 512 GB RAM, 8 x RTX A6000 |
| ai17 | 2 x Intel(R) Xeon(R) Gold 5317 @ 3.00GHz, 2 x 12 cores (24 cores), 512 GB RAM, 4 x RTX A6000 |
| ai[18-26] | 2 x AMD EPYC 9224 24-Core Processor, 2 x 24 cores (48 cores), 512 GB RAM, 8 x Nvidia A40 |
| ai[27-34] | 2 x Intel(R) Xeon(R) Gold 5418Y, 2 x 24 cores (48 cores), 512 GB RAM, 4 x Nvidia L40S |
| gm01 | 2 x AMD EPYC 9534 64-Core Processor, 2 x 64 cores (128 cores), 1536 GB RAM |
| star[01-04] | 2 x AMD EPYC 9654 96-Core Processor, 2 x 96 cores (192 cores), 768 GB RAM, 2 x Nvidia L40S |
| star[05-06] | 2 x AMD EPYC 9654 96-Core Processor, 2 x 96 cores (192 cores), 768 GB RAM |
| kuttam[01-02] | 2 x Intel(R) Xeon(R) Silver 4514Y, 2 x 16 cores (32 cores), 256 GB RAM |
| ag01 | 2 x Intel(R) Xeon(R) E5-2637 v4 @ 3.50GHz, 2 x 4 cores (8 cores), 128 GB RAM, 2 x GTX 1080 Ti |
| ag04 | 2 x Intel(R) Xeon(R) Silver 4215R @ 3.20GHz, 2 x 4 cores (8 cores), 256 GB RAM, 1 x Tesla T4 |
| rk01 | 2 x Intel(R) Xeon(R) E5-2695 v4 @ 2.10GHz, 2 x 18 cores (36 cores), 512 GB RAM |
| rk02 | 2 x AMD EPYC 7742 64-Core Processor, 2 x 64 cores (128 cores), 512 GB RAM, 8 x Tesla A100 |
| it[01-04] | 2 x Intel(R) Xeon(R) Gold 6148 @ 2.40GHz, 2 x 20 cores (40 cores), 512 GB RAM, 1 x Tesla V100 |
| comiecs01 | 2 x AMD EPYC 9554 64-Core Processor, 2 x 64 cores (128 cores), 512 GB RAM |
| cyberiad01 | 2 x AMD EPYC 9455 48-Core Processor, 2 x 48 cores (96 cores), 768 GB RAM |
| ai35 | 2 x Intel(R) Xeon(R) Gold 5418Y, 2 x 24 cores (48 cores), 512 GB RAM |

Service & Virtualization Nodes

HPC services run on Proxmox Virtual Environment management nodes. Users are not allowed to log in to these nodes.

Network Infrastructure

To build a high-performance computing system, servers and storage systems must be connected by high-speed (high-performance) networks.

The VALAR HPC Cluster uses a Mellanox HDR 200 Gb/s InfiniBand switch, which provides very high throughput and very low latency.

Parallel File System

Also known as a clustered or distributed file system, a parallel file system separates data and metadata into separate services, allowing HPC clients to communicate directly with the storage servers.
The VALAR HPC cluster uses an all-NVMe BeeGFS parallel file system for both home and scratch space. It uses buddy mirroring, which provides high availability.

File Storage

Inactive data in the cluster is stored on slower, lower-cost storage units compared to the parallel file system.

The VALAR HPC cluster uses Dell and NetApp enterprise storage units. The NetApp storage hosts the /datasets and /userfiles areas, while the Dell storage hosts the /frozen area, where archive data is stored.

There is NO BACKUP on HPC systems. Users are responsible for backing up their files!
