Backup and restore jobs fail with timeout error
When resources on the NetBackup Snapshot Manager server run low, backup and restore jobs fail because they continuously wait for memory, and other services may also fail with the timeout error. This can happen when more jobs run at the same time than the host can handle. On a cluster setup, jobs may fail to schedule on nodes because of the maximum pods per node setting. Backup or restore jobs may fail if the maximum pods per node is set lower than the value recommended for the node's capability.
Workaround:
To resolve this issue, manually configure the maximum number of jobs that can run on a single node at a time, on either of the following:
host, using the /cloudpoint/flexsnap.conf file
Or
cluster, using the flexsnap-conf config map
[capability_limit]
max_jobs = <num>
where <num> is the maximum number of jobs that can run at a time on a node.
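For a host setup, the edit can also be scripted. The following is a minimal sketch, not a supported tool: it assumes the configuration file uses plain INI syntax that Python's configparser can read, and the function name set_max_jobs and the sample [agent] section are hypothetical and used only for illustration. Note that configparser discards comments when it rewrites the text, so back up the original file first. The sketch works on the file contents as a string and does not cover the cluster config map.

```python
import configparser
import io


def set_max_jobs(conf_text: str, max_jobs: int) -> str:
    """Return conf_text with [capability_limit] max_jobs set to max_jobs.

    Hypothetical helper: assumes the file is plain INI syntax.
    """
    if max_jobs < 1:
        raise ValueError("max_jobs must be a positive integer")

    parser = configparser.ConfigParser()
    parser.optionxform = str  # keep key names exactly as written
    parser.read_string(conf_text)

    # Create the section if the file does not have it yet.
    if not parser.has_section("capability_limit"):
        parser.add_section("capability_limit")
    parser.set("capability_limit", "max_jobs", str(max_jobs))

    out = io.StringIO()
    parser.write(out)
    return out.getvalue()


if __name__ == "__main__":
    # Sample contents are made up for illustration only.
    sample = "[agent]\nid = example\n"
    print(set_max_jobs(sample, 4))
```

To apply the result, read the contents of the configuration file, pass them to the function with the chosen limit, and write the returned text back to the same file.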
If multiple jobs run in parallel and any service fails because resources are unavailable, reduce the number of parallel jobs allowed on that node type.