Installing a Slurm Cluster on Linux

Installation Plan

SLURM (Simple Linux Utility for Resource Management) is an open-source, high-performance, scalable cluster management and job scheduling system, widely used on large compute clusters and supercomputers. It manages the cluster's compute resources (CPUs, memory, GPUs, etc.) and schedules jobs according to user requirements, improving overall cluster utilization.

  • master (control) node:
    • 172.16.45.29 (920)
  • compute nodes:
    • 172.16.45.2 (920)
    • 172.16.45.4 (920)

This walkthrough uses CentOS 8 (and other RPM-based Linux distributions) as the example.

Create Accounts

#! Remove the database
yum remove mariadb-server mariadb-devel -y
#! Remove Slurm and Munge
yum remove slurm munge munge-libs munge-devel -y
#! Remove users
userdel -r slurm
userdel -r munge
#! Create users
export MUNGEUSER=1051
groupadd -g $MUNGEUSER munge
useradd -m -c "MUNGE Uid 'N' Gid Emporium" -d /var/lib/munge -u $MUNGEUSER -g munge -s /sbin/nologin munge
export SLURMUSER=1052
groupadd -g $SLURMUSER slurm
useradd -m -c "SLURM workload manager" -d /var/lib/slurm -u $SLURMUSER -g slurm -s /bin/bash slurm
#! Passwordless SSH login
# Run on the control node: https://builtin.com/articles/ssh-without-password
ssh-keygen
#! Copy the key to the compute nodes
ssh-copy-id 172.16.45.2
ssh-copy-id 172.16.45.4
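
Munge and Slurm expect the munge and slurm accounts to carry the same UID/GID on every host, which is why fixed IDs (1051/1052) are used above. A quick consistency check from the control node, assuming the passwordless SSH just set up and the compute-node IPs of this cluster:

for h in 172.16.45.2 172.16.45.4; do
    echo "== $h =="
    ssh "$h" 'id munge; id slurm'   # UID/GID on each compute node
done
id munge; id slurm                  # values on the control node, for comparison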

Munge

Munge is an authentication service for creating and validating user credentials, used mainly in large high-performance computing (HPC) clusters. It is designed to be highly scalable and to provide secure, reliable authentication in complex cluster environments.

https://github.com/dun/munge

What Munge Does

  • Authentication

Munge allows a process to authenticate another local or remote process within a group of hosts that share common users (UIDs) and groups (GIDs). These hosts form a security realm that shares a secret key.

  • Security realms

Munge manages trust relationships between hosts by defining security realms. Hosts within the same realm trust one another, while hosts in different realms require additional authentication.

  • Simplified identity management

Munge simplifies identity management in an HPC cluster: with Munge, administrators can avoid configuring complex SSH keys or Kerberos on every node.

How Munge Works

Munge performs authentication by generating and validating credentials. When a process needs to access another process, it requests a credential from the Munge service. Munge verifies the requester's identity and issues a credential containing the requester's UID, GID and some additional metadata. The target process then validates this credential to confirm the requester's identity.
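
Once the munge packages are installed and munged is running (see the installation steps below), this round trip can be exercised from the shell; the UID/GID and TTL fields in the decoded credential are what the receiving side uses for authentication:

munge -n | unmunge    # encode a credential locally, then decode it and print STATUS, UID, GID, TTL, ...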

Advantages of Munge

  • High performance: Munge is designed to handle a large volume of authentication requests.
  • Scalability: Munge scales easily to large clusters.
  • Security: Munge provides several security mechanisms to prevent unauthorized access.
  • Ease of use: Munge is relatively simple to configure and manage.

Installation

#! All nodes
yum install epel-release -y
yum install munge munge-libs munge-devel -y

Generate the secret key on the management node

yum install rng-tools -y
rngd -r /dev/urandom
/usr/sbin/create-munge-key -r
dd if=/dev/urandom bs=1 count=1024 > /etc/munge/munge.key
chown munge: /etc/munge/munge.key
chmod 400 /etc/munge/munge.key
scp /etc/munge/munge.key root@172.16.45.2:/etc/munge
scp /etc/munge/munge.key root@172.16.45.4:/etc/munge
#! All nodes
chown -R munge: /etc/munge/ /var/log/munge/
chmod 0700 /etc/munge/ /var/log/munge/
systemctl enable munge
systemctl start munge
#! Test on the master node
# munge -n
MUNGE:AwQFAAD9xUgg77lK2Ts72xayqCe4IETD9sp4ZEJD8ZTCbDekcojBef1fveBK8YweUi/7ImJMUdw3rO+gl3P02K5cHJAJX0Xq74rhW+1EgZgJZcIxHy4Z3qmsPWk4rVzhJfKGgUQ=:
# munge -n | munge
MUNGE:AwQFAACLbOsTGZWeENLUthY0WyyVWQ1HVEBbGIWEAobpAaLI2T1oMbHKjMO6zOvCTIKZcEPB/0CBhYxbpekFQwK7jeN7RMIxuZ+9dZFUF6jLEh0gbiLIpvgL1z3kGGwZNR+FMR6D/b1pUFPL4Mt9QQd4zjAIOvVnWCoXyE3XTfI64ZIbGJCZypMRj6nD7G2zgEVQ+v23vSPb81mnfC7ne1FaLIdNu9Iy8ZsESaxXJDrVoKFf/3Nax+Iw/LvauIbjF/Ps/Ok6aDcIAoPbOFWfbO7L2rovQzHt/3ABwwzH4yOGDdj9aWyqcyuqegDp/d8l6iJ7TIg=:
# munge -n | ssh 172.16.45.2 unmunge
Authorized users only. All activities may be monitored and reported.
STATUS:          Success (0)
ENCODE_HOST:     ??? (172.16.45.29)
ENCODE_TIME:     2024-12-10 16:16:55 +0800 (1733818615)
DECODE_TIME:     2024-12-10 16:16:52 +0800 (1733818612)
TTL:             300
CIPHER:          aes128 (4)
MAC:             sha256 (5)
ZIP:             none (0)
UID:             root (0)
GID:             root (0)
LENGTH:          0

Install Slurm

#! All nodes
yum install openssl openssl-devel pam-devel numactl numactl-devel hwloc hwloc-devel lua lua-devel readline-devel rrdtool-devel ncurses-devel libibmad libibumad perl-ExtUtils-MakeMaker perl-devel gcc mariadb-devel pam-devel rpm-build -y
wget https://download.schedmd.com/slurm/slurm-24.05.4.tar.bz2
rpmbuild -ta slurm-24.05.4.tar.bz2
cd /root/rpmbuild/RPMS/aarch64/
yum --nogpgcheck localinstall * -y
mkdir -p /var/log/slurm/
chown slurm: /var/log/slurm/
# vi /etc/slurm/slurm.conf
#
# Example slurm.conf file. Please run configurator.html
# (in doc/html) to build a configuration file customized
# for your environment.
#
#
# slurm.conf file generated by configurator.html.
# Put this file on all nodes of your cluster.
# See the slurm.conf man page for more information.
#
ClusterName=cluster
SlurmctldHost=Donau(172.16.45.29)
#SlurmctldHost=
#
#DisableRootJobs=NO
#EnforcePartLimits=NO
#Epilog=
#EpilogSlurmctld=
#FirstJobId=1
#MaxJobId=67043328
#GresTypes=
#GroupUpdateForce=0
#GroupUpdateTime=600
#JobFileAppend=0
#JobRequeue=1
#JobSubmitPlugins=lua
#KillOnBadExit=0
#LaunchType=launch/slurm
#Licenses=foo*4,bar
#MailProg=/bin/mail
#MaxJobCount=10000
#MaxStepCount=40000
#MaxTasksPerNode=512
MpiDefault=none
#MpiParams=ports=#-#
#PluginDir=
#PlugStackConfig=
#PrivateData=jobs
ProctrackType=proctrack/cgroup
#Prolog=
#PrologFlags=
#PrologSlurmctld=
#PropagatePrioProcess=0
#PropagateResourceLimits=
#PropagateResourceLimitsExcept=
#RebootProgram=
ReturnToService=1
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurmd
SlurmUser=slurm
#SlurmdUser=root
#SrunEpilog=
#SrunProlog=
StateSaveLocation=/var/spool/slurmctld
SwitchType=switch/none
#TaskEpilog=
TaskPlugin=task/affinity
#TaskProlog=
#TopologyPlugin=topology/tree
#TmpFS=/tmp
#TrackWCKey=no
#TreeWidth=
#UnkillableStepProgram=
#UsePAM=0
#
#
# TIMERS
#BatchStartTimeout=10
#CompleteWait=0
#EpilogMsgTime=2000
#GetEnvTimeout=2
#HealthCheckInterval=0
#HealthCheckProgram=
InactiveLimit=0
KillWait=30
#MessageTimeout=10
#ResvOverRun=0
MinJobAge=300
#OverTimeLimit=0
SlurmctldTimeout=120
SlurmdTimeout=300
#UnkillableStepTimeout=60
#VSizeFactor=0
Waittime=0
#
#
# SCHEDULING
#DefMemPerCPU=0
#MaxMemPerCPU=0
#SchedulerTimeSlice=30
SchedulerType=sched/backfill
SelectType=select/cons_tres
#
#
# JOB PRIORITY
#PriorityFlags=
#PriorityType=priority/multifactor
#PriorityDecayHalfLife=
#PriorityCalcPeriod=
#PriorityFavorSmall=
#PriorityMaxAge=
#PriorityUsageResetPeriod=
#PriorityWeightAge=
#PriorityWeightFairshare=
#PriorityWeightJobSize=
#PriorityWeightPartition=
#PriorityWeightQOS=
#
#
# LOGGING AND ACCOUNTING
#AccountingStorageEnforce=0
#AccountingStorageHost=
#AccountingStoragePass=
#AccountingStoragePort=
AccountingStorageType=accounting_storage/none
#AccountingStorageUser=
#AccountingStoreFlags=
#JobCompHost=
#JobCompLoc=
#JobCompPass=
#JobCompPort=
JobCompType=jobcomp/none
#JobCompUser=
#JobContainerType=
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/none
SlurmctldDebug=info
SlurmctldLogFile=/var/log/slurm/slurmctld.log
SlurmdDebug=info
SlurmdLogFile=/var/log/slurm/slurmd.log
#SlurmSchedLogFile=
#SlurmSchedLogLevel=
#DebugFlags=
#
#
# POWER SAVE SUPPORT FOR IDLE NODES (optional)
#SuspendProgram=
#ResumeProgram=
#SuspendTimeout=
#ResumeTimeout=
#ResumeRate=
#SuspendExcNodes=
#SuspendExcParts=
#SuspendRate=
#SuspendTime=
#
#
# COMPUTE NODES
NodeName=rabbitmq-node1 NodeAddr=172.16.45.2 CPUs=128 State=UNKNOWN
NodeName=gczxagenta2 NodeAddr=172.16.45.4 CPUs=128 State=UNKNOWN
PartitionName=debug Nodes=ALL Default=YES MaxTime=INFINITE State=UP
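
As the header comment notes, slurm.conf must be identical on every node. A minimal way to push the file from the control node to the compute nodes listed above (paths and IPs are those used in this cluster):

scp /etc/slurm/slurm.conf root@172.16.45.2:/etc/slurm/
scp /etc/slurm/slurm.conf root@172.16.45.4:/etc/slurm/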

Control node

mkdir /var/spool/slurmctld
chown slurm: /var/spool/slurmctld
chmod 755 /var/spool/slurmctld
touch /var/log/slurm/slurmctld.log
chown slurm: /var/log/slurm/slurmctld.log
touch /var/log/slurm/slurm_jobacct.log /var/log/slurm/slurm_jobcomp.log
chown slurm: /var/log/slurm/slurm_jobacct.log /var/log/slurm/slurm_jobcomp.log

Compute nodes

mkdir /var/spool/slurm
chown slurm: /var/spool/slurm
chmod 755 /var/spool/slurm
touch /var/log/slurm/slurmd.log
chown slurm: /var/log/slurm/slurmd.log

Test the configuration on all nodes:

# slurmd -C    # confirm there are no errors
NodeName=rabbitmq-node1 CPUs=128 Boards=1 SocketsPerBoard=128 CoresPerSocket=1 ThreadsPerCore=1 RealMemory=514413 UpTime=12-07:19:32
# yum install ntp -y
# chkconfig ntpd on
# ntpdate pool.ntp.org
# systemctl start ntpd
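
Note: on CentOS 8 the classic ntp package may not be available from the default repositories; chrony is the stock time-sync daemon there. A sketch of the equivalent setup, if the ntp commands above fail:

yum install chrony -y
systemctl enable --now chronyd
chronyc sources    # verify the node is syncing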

Compute nodes

systemctl enable slurmd.service
systemctl start slurmd.service
systemctl status slurmd.service
# The control node is not running yet at this point, so errors here are expected.
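
If slurmd still refuses to start once the control node is up, the log file configured in slurm.conf (SlurmdLogFile) and the unit journal usually explain why:

tail -n 20 /var/log/slurm/slurmd.log
journalctl -u slurmd --no-pager | tail -n 20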

References

  • https://github.com/Artlands/Install-Slurm

Install MariaDB on the Master Node

yum install mariadb-server mariadb-devel -y
systemctl enable mariadb
systemctl start mariadb
systemctl status mariadb
mysql
MariaDB [(none)]> GRANT ALL ON slurm_acct_db.* TO 'slurm'@'localhost' IDENTIFIED BY '1234' with grant option;
MariaDB [(none)]> SHOW VARIABLES LIKE 'have_innodb';
MariaDB [(none)]> FLUSH PRIVILEGES;
MariaDB [(none)]> CREATE DATABASE slurm_acct_db;
MariaDB [(none)]> quit;
# vi /etc/my.cnf.d/innodb.cnf
[mysqld]
innodb_buffer_pool_size=1024M
innodb_log_file_size=64M
innodb_lock_wait_timeout=900
# systemctl stop mariadb
mv /var/lib/mysql/ib_logfile? /tmp/
systemctl start mariadb
# vim /etc/slurm/slurmdbd.conf
#
# Example slurmdbd.conf file.
#
# See the slurmdbd.conf man page for more information.
#
# Archive info
#ArchiveJobs=yes
#ArchiveDir="/tmp"
#ArchiveSteps=yes
#ArchiveScript=
#JobPurge=12
#StepPurge=1
#
# Authentication info
AuthType=auth/munge
#AuthInfo=/var/run/munge/munge.socket.2
#
# slurmDBD info
DbdAddr=localhost
DbdHost=localhost
#DbdPort=7031
SlurmUser=slurm
#MessageTimeout=300
DebugLevel=verbose
#DefaultQOS=normal,standby
LogFile=/var/log/slurm/slurmdbd.log
PidFile=/var/run/slurmdbd.pid
#PluginDir=/usr/lib/slurm
#PrivateData=accounts,users,usage,jobs
#TrackWCKey=yes
#
# Database info
StorageType=accounting_storage/mysql
#StorageHost=localhost
#StoragePort=1234
DbdPort=6819
StoragePass=1234
StorageLoc=slurm_acct_db
# chown slurm: /etc/slurm/slurmdbd.conf
chmod 600 /etc/slurm/slurmdbd.conf
touch /var/log/slurmdbd.log
chown slurm: /var/log/slurmdbd.log
systemctl enable slurmdbd
systemctl start slurmdbd
systemctl status slurmdbd
systemctl enable slurmctld.service
systemctl start slurmctld.service
systemctl status slurmctld.service
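
Note that the slurm.conf built earlier still has AccountingStorageType=accounting_storage/none, so slurmctld will not record anything into slurmdbd yet. A sketch of the change needed if accounting should go through slurmdbd (AccountingStorageHost=localhost assumes slurmdbd runs on the control node; adjust as needed and keep the file identical on all nodes):

# vi /etc/slurm/slurm.conf   (every node)
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageHost=localhost
# then restart the controller so it picks up the change
systemctl restart slurmctld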

Verification

# sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
debug*       up   infinite      2   idle gczxagenta2,rabbitmq-node1
# srun -N2 -l /bin/hostname
0: gczxagenta2
1: rabbitmq-node1
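
Besides srun, a small batch job makes a convenient end-to-end check; the script name, job name and output pattern below are arbitrary:

# cat hello.sh
#!/bin/bash
#SBATCH --job-name=hello
#SBATCH --nodes=2
#SBATCH --output=hello_%j.out
srun /bin/hostname
# sbatch hello.sh
# squeue             # the job should appear, then finish quickly
# cat hello_*.out    # one hostname line per node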

Pitfalls

  • fatal error: EXTERN.h: running yum -y install perl-devel usually resolves this
  • Do not deploy the control node and a compute node on the same machine
  • munged: Error: Logfile is insecure: group-writable permissions set on "/var/log"
    munged is sometimes strict about log file and directory permissions at startup; 755, for example, is acceptable
  • error: auth_p_get_host: Lookup failed for 172.16.45.34

It is recommended to add IP-to-hostname mappings to the hosts file, for example:

127.0.0.1    localhost localhost.localdomain localhost4 localhost4.localdomain4
::1          localhost localhost.localdomain localhost6 localhost6.localdomain6
172.16.45.29 Donau
172.16.45.18 rabbitmq-node2
172.16.45.2  rabbitmq-node1
172.16.45.34 Donau2
172.16.45.4  gczxagenta2
  • error: Configured MailProg is invalid: this error can be ignored
  • _read_slurm_cgroup_conf: No cgroup.conf file (/etc/slurm/cgroup.conf), using defaults: this message can be ignored
  • srun: error: Task launch for StepId=12.0 failed on node : Invalid node

Check whether any node IPs or hostnames are duplicated.
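
To see which name and address Slurm has registered for each node (handy for spotting duplicates or stale entries), query scontrol directly:

scontrol show nodes | grep -E 'NodeName=|NodeAddr='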
