Quiesce Kubernetes Workloads during Backups

Quiescing involves pausing an application to achieve a consistent state in preparation for a backup.

Cohesity supports quiescing stateful Kubernetes workloads in a namespace before capturing volume snapshots of Persistent Volume Claims (PVCs) during a backup. After the snapshots are taken, the workloads are automatically unquiesced. This ensures consistent and reliable snapshots for stateful applications running in Kubernetes.

A protection job for a namespace runs pre-snapshot and post-snapshot hooks. Before a snapshot is taken of any PVC attached to a Kubernetes pod within an application, the pre-snapshot hook quiesces the application so that it is in a consistent state. After the snapshot is taken, the post-snapshot hook unquiesces the application, allowing it to resume operations.

You can define actions to be executed before and after taking a snapshot. Hooks allow specific scripts to run within a group of pods identified by key-value pairs known as label selectors (each pod has its own labels). You select the pods that make up the application to be quiesced by specifying their labels. Velero executes the designated script within the pod's container and reports the exit status of the script.
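
To make the label matching concrete, here is a minimal sketch of equality-based selection, where a pod is selected only if every key-value pair in the selector appears among its labels. The function name labels_match and the sample labels are hypothetical and only illustrate the matching rule; they are not part of the product.

```shell
#!/bin/sh
# Hypothetical helper: returns 0 when every key=value pair in the selector
# is present in the pod's labels. Both arguments are comma-separated lists.
labels_match() {
    pod_labels=",$1,"
    selector=$2
    old_ifs=$IFS
    IFS=,
    for pair in $selector; do
        case "$pod_labels" in
            *",$pair,"*) ;;                  # pair present, keep checking
            *) IFS=$old_ifs; return 1 ;;     # pair missing, pod is not selected
        esac
    done
    IFS=$old_ifs
    return 0
}

# Illustrative labels, not taken from a real cluster.
if labels_match "app=wordpress,tier=frontend" "app=wordpress"; then
    echo "pod selected"
fi
```

A selector with several pairs is stricter than one with a single pair, so adding keys to a rule narrows the set of pods the scripts run in.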

Each script has a timeout feature, with a default of 30 seconds. You can extend this duration if necessary. The scripts are executed at the container level, running on each container within a pod. You can specify a container name to execute a script on specific containers within the same pod.
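
The interaction between a script's timeout and its exit status can be sketched as follows. This is an illustration of the semantics only, not the mechanism the product uses: the run_hook function is hypothetical, and it relies on the coreutils timeout utility, which exits with status 124 when it has to stop a command that ran too long.

```shell
#!/bin/sh
# Hypothetical wrapper illustrating hook timeout semantics.
# $1 = command to run, $2 = timeout in seconds (defaults to 30,
# mirroring the documented default).
run_hook() {
    hook_cmd=$1
    hook_timeout=${2:-30}
    status=0
    timeout "$hook_timeout" sh -c "$hook_cmd" || status=$?
    if [ "$status" -eq 124 ]; then
        echo "hook timed out after ${hook_timeout}s"
    elif [ "$status" -ne 0 ]; then
        echo "hook failed with exit status $status"
    else
        echo "hook succeeded"
    fi
    return "$status"
}

run_hook "exit 0"
run_hook "sleep 3" 1 || echo "exit status: $?"
```

If a quiesce command is known to take longer than the default, for example flushing a large database, raising the timeout avoids a failure that has nothing to do with the command itself.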

To quiesce Kubernetes workloads during backups:

  1. Register the Kubernetes source.

  2. In the Kubernetes source, select Namespaces > Quiesce Rules to view the different quiesce modes.

  3. Choose a Quiesce Mode to add different rules. The following quiesce modes are available:

    • Apply following rules together

    • Apply following rules independently

    • Apply following rules sequentially

    Application quiescing involves defining one or more rules (for example, Rule 1, Rule 2, and Rule 3) to pause or quiesce an application so that snapshots can be taken. Once the snapshots have been captured, the application resumes and the backup process continues. The quiesce mode determines how these rules are executed: in parallel, where rules run simultaneously (Together and Independent modes), or sequentially, where rules run one after another (Sequential mode).

    • Together mode: This mode operates using a single volume group, allowing all processes to run independently and in parallel. Quiescing occurs simultaneously, with all rules executed before a snapshot is taken. After applying the rules, snapshots are created, followed by unquiescing, and then the backup is performed. This approach results in faster backup speeds compared to sequential mode.

    • Independent mode: This mode operates with multiple volume groups, allowing for parallel and independent processes. Quiescing happens simultaneously; for example, while volume group 1 is quiescing for Rules 1 and 2, volume group 2 may have already completed the quiescing and begun the snapshot process. As a result, this mode provides the fastest backup speed.

    • Sequential mode: This mode operates with a single volume group, where all processes are interdependent and executed in a specific sequence. Quiescing occurs in stages, following defined rules. For example, Rule 1 is executed first, then Rule 2, followed by Rule 3, and so on. After applying these rules, snapshots are created, the system is unquiesced, and the backup process is performed. This sequential approach applies to both the quiescing and unquiescing processes. Consequently, the backup speed in this mode is the slowest compared to the other two options.

      In sequential mode, rules are executed one after another. If a rule (for example, Rule 2) encounters an issue, you can choose to either continue or stop all backups at the namespace level. If you choose to stop, an error in any rule causes the backup to fail for all rules. This mechanism lets you control how backups behave when a quiescing rule fails.
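
The ordering and failure handling of sequential mode can be sketched as a simple loop. This is a simulation under stated assumptions, not the product's implementation: run_rules_sequentially is a hypothetical function, each rule is represented by a placeholder command, and the first argument stands in for the choice between continuing and stopping on failure.

```shell
#!/bin/sh
# Simulation of sequential rule execution.
# $1 = "true" to continue after a failed rule, anything else to stop.
# Remaining arguments = one placeholder command per rule, run in order.
run_rules_sequentially() {
    continue_on_failure=$1
    shift
    rule_number=0
    for rule_cmd in "$@"; do
        rule_number=$((rule_number + 1))
        if sh -c "$rule_cmd"; then
            echo "Rule $rule_number: ok"
        elif [ "$continue_on_failure" = "true" ]; then
            echo "Rule $rule_number: failed, continuing"
        else
            echo "Rule $rule_number: failed, stopping backup"
            return 1
        fi
    done
    echo "all rules applied, ready for snapshot"
}

# Rule 2 fails and the run stops, so Rule 3 is never executed.
run_rules_sequentially false "true" "false" "true" || echo "backup failed"
```

With the first argument set to true, the same call would report the failure of Rule 2, run Rule 3, and proceed to the snapshot.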

  4. Click Add Rule to add a rule. You can add multiple rules by clicking Add Rule.

  5. Add the Key value under the Pod Selector Labels for the rules to run on.

  6. Select the Pre Snapshot Scripts.
  7. Enter the container name, the command that needs to be run, and the timeout value in seconds. You can add multiple scripts by clicking Add Scripts for a particular rule selected under Pre Snapshot Scripts.

    Enter the command sh -c pre_script.sh in the command prompt.



    Example script on the source:

    #!/bin/sh
    # Pre-snapshot hook: put WordPress into maintenance mode.
    LOG_FILE="/tmp/pre_script.log"

    echo "$(date) : Starting WordPress maintenance mode ENABLE" >> "$LOG_FILE"

    # A non-zero exit status tells the caller that quiescing failed.
    if wp maintenance-mode activate >> "$LOG_FILE" 2>&1; then
        echo "$(date) : Maintenance mode ENABLE completed successfully" >> "$LOG_FILE"
    else
        echo "$(date) : ERROR - Maintenance mode ENABLE failed" >> "$LOG_FILE"
        exit 1
    fi
  8. Select the Post Snapshot Scripts.
  9. Enter the container name, the command that needs to be run, and the timeout value in seconds. You can add multiple scripts by clicking Add Scripts for a particular rule selected under Post Snapshot Scripts.

    Enter the command sh -c post_script.sh in the command prompt.

    Example script on the source:

    #!/bin/sh
    # Post-snapshot hook: take WordPress out of maintenance mode.
    LOG_FILE="/tmp/post_script.log"

    echo "$(date) : Starting WordPress maintenance mode DISABLE" >> "$LOG_FILE"

    # A non-zero exit status tells the caller that unquiescing failed.
    if wp maintenance-mode deactivate >> "$LOG_FILE" 2>&1; then
        echo "$(date) : Maintenance mode DISABLE completed successfully" >> "$LOG_FILE"
    else
        echo "$(date) : ERROR - Maintenance mode DISABLE failed" >> "$LOG_FILE"
        exit 1
    fi

  10. Under Continue Backup when Script fails, select True or False to indicate whether the backup should continue or fail when a script fails. If you select True, the protection job still backs up the PVC even if the script fails. If you select False, the protection job fails when the script fails.
  11. Run the scripts. Ensure that the specified script is located at the correct path within the pod before you execute it.

    The pre-snapshot hook is triggered before snapshots of the PVCs are taken. After the PVC snapshots are taken, the post-snapshot hook is triggered to unquiesce the application. The Backup Task Activity for quiescing, as shown in the Pulse logs, provides details about the execution of both the pre-snapshot and post-snapshot hooks.
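
One way to confirm the script location before running a protection job is a small check executed inside the target container. The sketch below is an assumption-laden example rather than a product feature: check_hook_script is a hypothetical helper. It relies on the fact that a command given as a bare name, as in sh -c pre_script.sh, is resolved through PATH, so the script must be on PATH and executable for the hook to find it.

```shell
#!/bin/sh
# Hypothetical pre-flight check to run inside the target container.
# `command -v` succeeds only if the name resolves to something runnable;
# for a bare script name, that means a file found through PATH.
check_hook_script() {
    if resolved=$(command -v "$1"); then
        echo "found: $resolved"
    else
        echo "not found or not executable: $1"
        return 1
    fi
}

check_hook_script pre_script.sh || echo "fix the script location before running the job"
```

If the check fails, either move the script into a directory on the container's PATH or give the full path to the script in the command.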