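I run into this problem when running srun. I have configured a GPU, but the error keeps appearing, and nothing I found by searching online has fixed it.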
srun: error: Unable to allocate resources: Invalid node name specified
Below are the Slurm configuration files.
/etc/slurm-llnl/slurm.conf
ClusterName=cool
ControlMachine=aim116-MS-7B51
#ControlAddr=
#BackupController=
#BackupAddr=
#
MailProg=/usr/bin/s-nail
SlurmUser=root
#SlurmdUser=root
SlurmctldPort=6817
GresTypes=gpu
SlurmdPort=6818
AuthType=auth/munge
#JobCredentialPrivateKey=
#JobCredentialPublicCertificate=
StateSaveLocation=/var/spool/slurmctld
SlurmdSpoolDir=/var/spool/slurmd
SwitchType=switch/none
MpiDefault=none
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmdPidFile=/var/run/slurmd.pid
ProctrackType=proctrack/pgid
#PluginDir=
#FirstJobId=
ReturnToService=0
#MaxJobCount=
#PlugStackConfig=
#PropagatePrioProcess=
#PropagateResourceLimits=
#PropagateResourceLimitsExcept=
#Prolog=
#Epilog=
#SrunProlog=
#SrunEpilog=
#TaskProlog=
#TaskEpilog=
#TaskPlugin=
#TrackWCKey=no
#TreeWidth=50
#TmpFS=
#UsePAM=
#
#TIMERS
SlurmctldTimeout=300
SlurmdTimeout=300
InactiveLimit=0
MinJobAge=300
KillWait=30
Waittime=0
#
#SCHEDULING
SchedulerType=sched/backfill
#SchedulerAuth=
#SelectType=select/linear
#PriorityType=priority/multifactor
#PriorityDecayHalfLife=14-0
#PriorityUsageResetPeriod=14-0
#PriorityWeightFairshare=100000
#PriorityWeightAge=1000
#PriorityWeightPartition=10000
#PriorityWeightJobSize=1000
#PriorityMaxAge=1-0
#
#LOGGING
SlurmctldDebug=info
SlurmctldLogFile=/var/log/slurm-llnl/slurmctld.log
SlurmdDebug=info
SlurmdLogFile=/var/log/slurm-llnl/slurmd.log
JobCompType=jobcomp/none
#JobCompLoc=
#
#ACCOUNTING
#JobAcctGatherType=jobacct_gather/linux
#JobAcctGatherFrequency=30
#
#AccountingStorageType=accounting_storage/slurmdbd
#AccountingStorageHost=
#AccountingStorageLoc=
#AccountingStoragePass=
#AccountingStorageUser=
#
#COMPUTE NODES
PartitionName=aim116-MS-7B51 Nodes=aim116-MS-7B51 Default=NO MaxTime=INFINITE State=UP
#NodeName=aim116-MS-7B51 State=UNKNOWN
#NodeName=aim116-MS-7B51 Sockets=1 CoresPerSocket=8 ThreadsPerCore=1 State=UNKNOWN
NodeName=aim116-MS-7B51 Gres=gpu:1 CPUs=8 RealMemory=64257 Sockets=1 CoresPerSocket=8 State=UNKNOWN
/etc/slurm-llnl/gres.conf
NodeName=aim116-MS-7B51 Name=gpu File=/dev/nvidia0
$ sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
aim116-MS-7B51 up infinite 1 idle aim116-MS-7B51
$ scontrol show node aim116-MS-7B51
NodeName=aim116-MS-7B51 Arch=x86_64 CoresPerSocket=8
CPUAlloc=0 CPUTot=8 CPULoad=3.14
AvailableFeatures=(null)
ActiveFeatures=(null)
Gres=gpu:1
NodeAddr=aim116-MS-7B51 NodeHostName=aim116-MS-7B51 Version=19.05.5
OS=Linux 5.11.0-38-generic #42~20.04.1-Ubuntu SMP Tue Sep 28 20:41:07 UTC 2021
RealMemory=64257 AllocMem=0 FreeMem=8792 Sockets=1 Boards=1
State=IDLE ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
Partitions=aim116-MS-7B51
BootTime=2022-09-28T16:44:47 SlurmdStartTime=2022-12-25T18:55:30
CfgTRES=cpu=8,mem=64257M,billing=8
AllocTRES=
CapWatts=n/a
CurrentWatts=0 AveWatts=0
ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
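
One thing that may be worth ruling out for the "Invalid node name specified" error is a mismatch between the machine's short host name and the NodeName entries in slurm.conf, since Slurm typically expects NodeName to be what `hostname -s` returns unless NodeHostname is set. The following is only a rough sketch of such a check, not a confirmed fix. The function names node_names and node_is_defined are made up for illustration, the path is the one from this post, and bracketed ranges such as node[01-04] are not handled.

```shell
#!/bin/sh
# Sketch: list the NodeName entries in a slurm.conf-style file and report
# whether a given host name is one of them.
# Assumption: node names are plain strings, not ranges like node[01-04].

node_names() {
    # $1: path to the config file
    # Print the value of every uncommented line that starts with NodeName=.
    sed -n 's/^NodeName=\([^[:space:]]*\).*/\1/p' "$1"
}

node_is_defined() {
    # $1: path to the config file, $2: host name to look for
    # Succeeds only on an exact, whole-line match.
    node_names "$1" | grep -qxF -- "$2"
}

# Path taken from this post; SLURM_CONF overrides it if set.
conf="${SLURM_CONF:-/etc/slurm-llnl/slurm.conf}"
# Fall back to uname -n if hostname is not available.
host="$(hostname -s 2>/dev/null || uname -n)"

if [ -r "$conf" ]; then
    if node_is_defined "$conf" "$host"; then
        echo "OK: $host is defined as a NodeName in $conf"
    else
        echo "MISMATCH: $host is not a NodeName in $conf"
    fi
fi
```

If this reports OK, the node definition is probably not the cause, and the next thing to compare would be the node or partition name passed on the srun command line against the names shown by sinfo.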