Sitecore Managed Cloud Standard — Disaster Recovery Automation


Overview

This article describes the functions performed by the automation tasks within Sitecore’s Disaster Recovery (DR) solution. Managed Cloud Standard Disaster Recovery for PaaS uses Azure Automation account runbooks to perform disaster recovery tasks.

The relevant services such as the automation account, runbooks, and runbook schedulers are created in the DR Control Resource Group during a DR setup.

There are three types of automation:

Snapshot

The Snapshot runbook backs up various settings and services of the primary Sitecore environment. It is configured to perform the actions listed below every hour (except for the action "Web App backup scheduling"). The configuration JSON files and Web App backups mentioned in the table below are stored in a storage account in the Control Resource Group.

Action | Description
Backup Web App plan SKU | Web App plan details such as Tier, Size, and Capacity of the primary Sitecore environment are collected and stored as a blob (WebAppSpecs.json). Only web apps that are provisioned in the primary region are included.
Backup Web App plan settings | Web App plan settings of the primary Sitecore environment are collected and stored as a blob (WebAppSetting.json). Only web apps that are provisioned in the primary region are included.
Creating Web App backup filter | A backup filter is applied to the primary Sitecore roles to exclude paths that do not need to be restored to the secondary environment. See the section "List of files that are excluded while performing backup for Disaster recovery" in KB0133741 for more details.
Web App backup scheduling | Backup scheduling for Web App contents is configured for the primary Sitecore roles. The backup schedulers are configured to run every 12 hours. The retention period for backups is 3 days. Note that the interval and retention period configuration is stored as a blob (SchedulerConfig.json) and can be changed to suit backup scheduling needs.
Web App administrative operation alert creation | An alert called "Administrative operation has failed" is configured for backup failures. Note that a partial backup failure also raises this alert, because it is reported as a failed administrative operation.
SQL Server database specification backup | Sitecore’s SQL Server database specifications such as Edition, MaxSizeBytes, and Capacity are collected from the primary environment and stored in the Control Resource Group storage account as a blob (DatabaseSpecs.json).
Azure Search properties backup (applicable only for Sitecore with Azure Search) | Azure Search specifications such as replica count and partition count are collected from the primary environment and stored in the Control Resource Group storage account as a blob (SearchSpecs.json).
Redis Cache properties backup | Redis Cache specifications such as Size, SKU, and shard count are collected from the primary environment and stored in the Control Resource Group storage account as a blob (CacheSpecs.json).
Web App autoscale settings backup | Web App autoscale settings of the primary Sitecore environment are collected and stored in the Control Resource Group storage account as a blob (ScaleSetting.json). Only web apps that are provisioned in the primary region are included.
Web App backup storage account blob URL copy | The blob URL of the Web App backup path is saved to the Key Vault in the Control Resource Group.
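
The backup interval and retention period above imply a simple calculation of how many scheduled backups are kept at any time. The following is a minimal Python sketch, assuming a hypothetical layout for SchedulerConfig.json: the key names backup_interval_hours and retention_days are illustrative and are not taken from the actual blob. Only the values (12 hours, 3 days) come from this article.

```python
import json
from datetime import timedelta

# Hypothetical stand-in for SchedulerConfig.json. Only the values (12 hours,
# 3 days) come from the article; the key names are assumptions.
SCHEDULER_CONFIG_JSON = '{"backup_interval_hours": 12, "retention_days": 3}'


def backups_within_retention(config: dict) -> int:
    """Return how many scheduled backups fit inside the retention window."""
    interval = timedelta(hours=config["backup_interval_hours"])
    retention = timedelta(days=config["retention_days"])
    # Dividing one timedelta by another with // yields a whole number.
    return retention // interval


config = json.loads(SCHEDULER_CONFIG_JSON)
print(backups_within_retention(config))  # 6
```

Under these assumptions, the default settings keep roughly six scheduled backups per web app. Shortening the interval or extending the retention period increases that number and the storage it consumes.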

Synchronize

The Synchronize runbook restores or updates relevant resources in the secondary environment based on backups performed by the Snapshot runbook. It is configured to perform the actions listed below every 3 hours.

Action | Description
SQL database synchronization | Synchronizes the relevant databases that are to be included in the failover group. The database specifications are also updated from the backup taken by the Snapshot runbook (DatabaseSpecs.json).
Restore Web App settings (applicable only for DR Managed Hot Standby) | Secondary web apps are updated with the settings stored in the storage account by the Snapshot runbook (WebAppSetting.json).
Restore Web App plan SKU (applicable only for DR Managed Hot Standby) | Secondary web apps are updated based on the settings stored in the storage account by the Snapshot runbook (WebAppSpecs.json).
Restore Redis Cache properties (applicable only for DR Managed Hot Standby) | Secondary Redis Cache is updated with the settings stored in the storage account by the Snapshot runbook (CacheSpecs.json).
Restore Azure Search properties (applicable only for DR Managed Hot Standby with Azure Search) | Secondary Azure Search is updated with the settings stored in the storage account by the Snapshot runbook (SearchSpecs.json).
Restore Web App autoscale settings (applicable only for DR Managed Hot Standby) | Secondary web apps are updated with the autoscale settings stored in the storage account by the Snapshot runbook (ScaleSetting.json).
Restore web apps (applicable only for DR Managed Hot Standby) | Secondary web app contents are updated with the backups that the Web App backup scheduler created and stored in the storage account. The Synchronize runbook triggers the restore webjob in the secondary web apps to perform the restoration. Note: The restore webjob is created in the secondary roles during a DR setup.

Note: Most of these actions are applicable for DR Managed Hot Standby. For DR Basic Cold Standby, the listed actions are performed during a failover.
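
To illustrate how a snapshot can drive an update of the secondary environment, the following minimal Python sketch compares a snapshot specification with the current state of a secondary resource and returns only the properties that need to change. This is an assumption about one reasonable approach, not the runbook's actual implementation. The function name settings_to_update and all values are hypothetical; the property names follow the database specifications described for DatabaseSpecs.json.

```python
def settings_to_update(snapshot: dict, secondary: dict) -> dict:
    """Return the snapshot values that differ from the secondary resource."""
    return {
        key: value
        for key, value in snapshot.items()
        if secondary.get(key) != value
    }


# Illustrative values only. The property names follow the article's
# description of the database specifications; the values are made up.
snapshot_spec = {"Edition": "Standard", "MaxSizeBytes": 268435456000, "Capacity": 50}
secondary_spec = {"Edition": "Standard", "MaxSizeBytes": 268435456000, "Capacity": 20}

print(settings_to_update(snapshot_spec, secondary_spec))  # {'Capacity': 50}
```

Applying only the differences keeps a scheduled run cheap when nothing has changed in the primary environment since the previous snapshot.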

State Manager

The State Manager runbook is only applicable for DR Managed Hot Standby, as it is required to perform automatic failover to the secondary environment and failback to the primary environment. The State Manager periodically checks the primary environment health status in order to perform either the failover or failback actions when required. The criteria are based on the health status of the Content Delivery role and the Sitecore Core database availability. The following resources are checked sequentially every 5 minutes at intervals of 30 seconds to determine whether a failover or a failback is required.

Action | Description
Failover
Automation runbooks (Snapshot and Synchronize) are stopped | Stops the backup and restore operations in the secondary environment.
Starts the web apps in the secondary environment | The failover starts the Sitecore roles in the secondary environment before performing other actions to reduce the RTO.
SQL Server failover group switching/failover | A forced failover of the SQL Server failover group to the secondary environment is performed. The forced failover immediately switches the secondary SQL Server to assume the role of the primary SQL Server without waiting indefinitely for recent changes to propagate from the now-defunct primary SQL Server. However, because data is replicated asynchronously to the secondary SQL Server, this operation might result in data loss.
Updating the Shard tables in the primary SQL Server (applicable only for XP) | Ensures the Shard tables are configured to point to the secondary databases.
Content indexing (applicable only for Azure Search) | Ensures the up-to-date data is available after the SQL Server switchover.
Rebuilding the xDB index (applicable only for XP with Azure Search) | Ensures the up-to-date data is available after the SQL Server switchover.
Failback
Stops the web apps in the secondary environment | Ensures that only one instance of each Sitecore role is running, namely the one in the primary environment.
SQL Server failover group switching/failover | Switches the SQL Server failover group to the primary environment.
Updating the Shard tables in the primary SQL Server (applicable only for XP) | Ensures the Shard tables are configured to point to the primary databases.
Content indexing (applicable only for Azure Search) | Ensures the up-to-date data is available after the SQL Server switchover.
Rebuilding the xDB index (applicable only for XP with Azure Search) | Ensures the up-to-date data is available after the SQL Server switchover.
Automation runbooks (Snapshot and Synchronize) are started | Resumes backup and restore operations in the secondary environment.
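
The decision logic and step ordering described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the State Manager's actual code. It assumes that a failover is triggered when the primary environment is active and either health check fails, and that a failback is triggered when the secondary environment is active and both checks pass again; the article only states that the criteria are based on these two checks. All names (ActiveEnvironment, decide, steps_for) are hypothetical.

```python
from enum import Enum
from typing import List, Optional


class ActiveEnvironment(Enum):
    PRIMARY = "primary"
    SECONDARY = "secondary"


# Step order as listed in the table above, with conditional steps noted.
FAILOVER_STEPS = [
    "stop Snapshot and Synchronize runbooks",
    "start secondary web apps",
    "force SQL Server failover group to secondary",
    "update Shard tables (XP only)",
    "content indexing (Azure Search only)",
    "rebuild xDB index (XP with Azure Search only)",
]
FAILBACK_STEPS = [
    "stop secondary web apps",
    "switch SQL Server failover group to primary",
    "update Shard tables (XP only)",
    "content indexing (Azure Search only)",
    "rebuild xDB index (XP with Azure Search only)",
    "start Snapshot and Synchronize runbooks",
]


def decide(active: ActiveEnvironment, cd_healthy: bool,
           core_db_available: bool) -> Optional[str]:
    """Return 'failover', 'failback', or None (assumed trigger rules)."""
    primary_healthy = cd_healthy and core_db_available
    if active is ActiveEnvironment.PRIMARY and not primary_healthy:
        return "failover"
    if active is ActiveEnvironment.SECONDARY and primary_healthy:
        return "failback"
    return None


def steps_for(action: Optional[str]) -> List[str]:
    """Return the ordered steps for an action, or an empty list for None."""
    if action == "failover":
        return FAILOVER_STEPS
    if action == "failback":
        return FAILBACK_STEPS
    return []


action = decide(ActiveEnvironment.PRIMARY, cd_healthy=True, core_db_available=False)
print(action, steps_for(action)[:2])
```

Note that the two sequences are not mirror images: failover starts the secondary web apps before switching SQL Server, while failback stops them first so that only the primary roles are serving traffic.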

Note: The related article can be found here: Sitecore Managed Cloud Standard — Disaster Recovery.