Sitecore Managed Cloud Standard (MCS) PaaS 1.0 - Disaster Recovery


Migration of Disaster Recovery types for the existing customers

The following Disaster Recovery (DR) types will be changed to:

Note that support for the new terminology is still being rolled out. The old DR types will be deprecated once the existing customers' DR environments have been migrated.

The table below shows the mapping between the old and new DR Types.

Old                      New                       Notes
Basic                    DR Basic Cold Standby     Upgrade
Basic Geo-Replication    DR Basic Cold Standby     Exact match
Hot-Warm                 DR Managed Hot Standby    Upgrade
Hot Manual               DR Managed Hot Standby    Upgrade
Hot Auto                 DR Managed Hot Standby    Exact match
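
For teams that track DR types in their own inventory or tooling, the mapping above can be expressed as a simple lookup. The following is a minimal sketch; the names `DR_TYPE_MIGRATION` and `migrate_dr_type` are hypothetical and are not part of any Sitecore tooling.

```python
# Mapping copied from the table above: old DR type -> (new DR type, note).
DR_TYPE_MIGRATION = {
    "Basic": ("DR Basic Cold Standby", "Upgrade"),
    "Basic Geo-Replication": ("DR Basic Cold Standby", "Exact match"),
    "Hot-Warm": ("DR Managed Hot Standby", "Upgrade"),
    "Hot Manual": ("DR Managed Hot Standby", "Upgrade"),
    "Hot Auto": ("DR Managed Hot Standby", "Exact match"),
}


def migrate_dr_type(old_type: str) -> str:
    """Return the new DR type name for an old one (hypothetical helper)."""
    try:
        new_type, _note = DR_TYPE_MIGRATION[old_type]
    except KeyError:
        raise ValueError(f"Unknown old DR type: {old_type!r}") from None
    return new_type
```

For example, `migrate_dr_type("Hot-Warm")` returns `"DR Managed Hot Standby"`.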

Description

The Sitecore Managed Cloud Disaster Recovery feature allows customers to maintain or quickly resume mission-critical functions following a disaster, therefore supporting the customer's business continuity plan. When a disaster occurs in a region containing the production environment (primary), the Disaster Recovery tool allows the environment to be recovered into another region (secondary) or a disaster recovery site.

Sitecore currently provides two disaster recovery options:

    • DR Basic Cold Standby
    • DR Managed Hot Standby

This article provides information on the disaster recovery configurations, workflows, and architectural aspects to be aware of.

Overview

When a disaster happens, Sitecore must receive an alert within 15 minutes. Sitecore validates the authenticity of the alert, creates a support ticket to investigate the issue, and informs the customer about the initial investigation. If the issue turns out to be caused by a disaster that makes the primary resource group temporarily unrecoverable, Sitecore starts the failover process, with or without the customer's approval depending on the service type. The failover process provides the customer with a secondary environment in which they can continue business-critical activities until the primary environment becomes available again.
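
The approval rule can be sketched as a small decision function. This is an illustration only, based on the process descriptions later in this article (DR Basic Cold Standby waits for customer approval through a support case, DR Managed Hot Standby fails over automatically). The names `DrType`, `failover_requires_approval`, and `next_action` are hypothetical.

```python
from enum import Enum


class DrType(Enum):
    BASIC_COLD_STANDBY = "DR Basic Cold Standby"
    MANAGED_HOT_STANDBY = "DR Managed Hot Standby"


def failover_requires_approval(dr_type: DrType) -> bool:
    """Cold Standby waits for the customer; Hot Standby fails over automatically."""
    return dr_type is DrType.BASIC_COLD_STANDBY


def next_action(dr_type: DrType, primary_recoverable: bool,
                customer_approved: bool = False) -> str:
    """Return the next step after an alert has been validated (illustrative labels)."""
    if primary_recoverable:
        # Not a disaster that blocks the primary resource group: no failover.
        return "continue investigation"
    if failover_requires_approval(dr_type) and not customer_approved:
        return "wait for customer approval"
    return "start failover"
```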

Notes:

Prerequisites

The following prerequisites are common for all the disaster recovery options:

  1. The customer's Sitecore Managed Cloud solution must be compliant with the compatibility requirements described in Sitecore Managed Cloud Standard – compatibility tables.
  2. The customer is eligible to request the Disaster Recovery feature only if it is purchased within the Managed Cloud contract.
  3. The customer must have a valid Sitecore license file, Sitecore certificate, and password when requesting the disaster recovery setup from Sitecore Support.
  4. Sitecore Solution running on Azure with:
    • Supported Versions: 9.1.0, 9.1.1, 9.2.0, 9.3.0, 10.0.0, 10.0.1, 10.0.2, 10.0.3, 10.1.0, 10.1.1, 10.1.2, 10.1.3, 10.2.0, 10.2.1, 10.3.0, 10.3.1.
    • Supported Deployment Sizes: Extra Small, Small, Medium, Large, and Extra Large.
    • Supported topologies: XM, XP.
    • Supported Search Services: Azure Search or SearchStax (Solr).
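
A quick self-check against the supported values listed above might look like the following sketch. The function name `unmet_prerequisites` and the exact strings used for sizes and search services are assumptions made for illustration. Contract and license prerequisites are not covered.

```python
from typing import List

# Values copied from the prerequisites list above.
SUPPORTED_VERSIONS = {
    "9.1.0", "9.1.1", "9.2.0", "9.3.0",
    "10.0.0", "10.0.1", "10.0.2", "10.0.3",
    "10.1.0", "10.1.1", "10.1.2", "10.1.3",
    "10.2.0", "10.2.1", "10.3.0", "10.3.1",
}
SUPPORTED_SIZES = {"Extra Small", "Small", "Medium", "Large", "Extra Large"}
SUPPORTED_TOPOLOGIES = {"XM", "XP"}
SUPPORTED_SEARCH = {"Azure Search", "SearchStax"}


def unmet_prerequisites(version: str, size: str, topology: str, search: str) -> List[str]:
    """Return a list of human-readable problems; an empty list means all checks pass."""
    problems = []
    if version not in SUPPORTED_VERSIONS:
        problems.append(f"Sitecore version {version} is not supported")
    if size not in SUPPORTED_SIZES:
        problems.append(f"Deployment size {size} is not supported")
    if topology not in SUPPORTED_TOPOLOGIES:
        problems.append(f"Topology {topology} is not supported")
    if search not in SUPPORTED_SEARCH:
        problems.append(f"Search service {search} is not supported")
    return problems
```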

Technical prerequisites

  1. Connection Strings and App Settings:
    • Web apps have database connection strings in ConnectionStrings.config in App_Config:
      • XP topology must contain connection strings in the "cd", "cm", "cortex-processing", "cortex-reporting", "ma-ops", "ma-rep", "prc", "rep", "xc-collect", "xc-refdata", and "xc-search" roles.
      • XM topology must contain connection strings in the "cd" and "cm" roles.
    • The cortex-processing role for Sitecore XP version 9.1.0 and above contains ConnectionStrings.config and AppSettings.config in App_Data/jobs/continuous/ProcessingEngine/App_Config.
    • The ma-ops role for Sitecore XP version 9.1.0 and above contains ConnectionStrings.config and AppSettings.config in App_Data/jobs/continuous/AutomationEngine/App_Config.
    • The xc-search role for Sitecore XP version 9.1.0 and above contains ConnectionStrings.config and AppSettings.config in App_Data/jobs/continuous/IndexWorker/App_Config.
    • XP topology web apps have AppSettings.config in App_Config for the "cortex-processing", "cortex-reporting", "ma-ops", "xc-collect", "xc-refdata", and "xc-search" roles.
  2. Identity Server
    • SI role for Sitecore 9.1.0 and above contains Sitecore.IdentityServer.Host.xml in Config/production.
    • CM and CD roles for Sitecore 9.1.0 and above contain Sitecore.Owin.Authentication.IdentityServer.config in App_Config/Sitecore/Owin.Authentication.IdentityServer.
  3. xDB config
    • CM role for Sitecore 9.0.0 and 9.0.1 contains Sitecore.Xdb.Remote.Client.CM.config in App_Config/Sitecore/Marketing.xDB.
    • CM role for Sitecore 9.0.2 and above contains Sitecore.Xdb.Remote.Client.CM.config in App_Config/Sitecore/Azure.
  4. CD WebApp has license.xml in App_Data.
  5. Ensure that SQL Server Geo-Replication is NOT enabled in primary.
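
The file-location prerequisites in item 1 can be checked with a short script before raising the setup request. The sketch below assumes a local copy of a role's wwwroot folder (for example, downloaded through Kudu or FTP) and uses the file name ConnectionStrings.config, as in the examples later in this article. The helper names `expected_files` and `missing_files` are hypothetical.

```python
from pathlib import Path
from typing import List

# WebJob folder per role, as listed in the technical prerequisites (9.1.0 and above).
WEBJOB_FOLDERS = {
    "cortex-processing": "ProcessingEngine",
    "ma-ops": "AutomationEngine",
    "xc-search": "IndexWorker",
}
# Roles that must also have AppSettings.config in App_Config (XP topology).
APPSETTINGS_ROLES = {
    "cortex-processing", "cortex-reporting", "ma-ops",
    "xc-collect", "xc-refdata", "xc-search",
}


def expected_files(role: str) -> List[str]:
    """Return config file paths, relative to wwwroot, expected for a role."""
    files = ["App_Config/ConnectionStrings.config"]
    if role in APPSETTINGS_ROLES:
        files.append("App_Config/AppSettings.config")
    job = WEBJOB_FOLDERS.get(role)
    if job:
        base = f"App_Data/jobs/continuous/{job}/App_Config"
        files += [f"{base}/ConnectionStrings.config", f"{base}/AppSettings.config"]
    return files


def missing_files(wwwroot: Path, role: str) -> List[str]:
    """Return the expected files that do not exist under the given wwwroot copy."""
    return [f for f in expected_files(role) if not (wwwroot / f).is_file()]
```

An empty result from `missing_files` for every role in the topology suggests that the connection string and app settings prerequisites are met. The Identity Server, xDB, and license checks would need similar entries.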

Disaster Recovery features

Sitecore has two Disaster Recovery features:

You can raise a support query for detailed information on specific Disaster Recovery features on the Sitecore Support Portal.

Considerations

Disaster recovery introduces some new considerations when you build a Sitecore XP/XM solution. This section addresses the most common ones.

SQL Server geo-replication and failover group

Disaster Recovery uses Azure's Geo-replication for SQL Server. There is a limitation where databases cannot be added to multiple failover groups. Therefore, the existing failover groups need to be removed.

This is applicable for customers with a Sitecore environment that uses failover groups for their primary SQL server databases before setting up the disaster recovery.

Make sure you have completed and verified the following steps before proceeding with the Disaster Recovery setup:

  1. If the failover group endpoint is used as the connection endpoint for apps/services (for example, in the ConnectionStrings.config file), change it to refer to the SQL server instead.
  2. Remove the existing failover groups configured in the primary SQL Server.

Choosing your Azure Region

Azure organizes its data centers into regions, each within a latency-defined perimeter and connected through a dedicated regional low-latency network. When choosing a secondary data center, we recommend choosing one in the same region as the primary, to ensure fast backups and consistent customer delivery speeds. To find compatible regions, see the article here.

Azure Region for Control Resource Group

Note that there is no strict or specific rule to select the region for the Control Resource Group. There are considerations as below:

Third-party service APIs

If the Sitecore implementation is using any third-party service APIs that limit access based on IP, then it is essential to register the IPs of the secondary data center with the service. Failure to register the IPs could result in a delay to bring the secondary Sitecore environment online.
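
A simple way to spot gaps is to compare the outbound IP addresses of the secondary web apps with the addresses already registered at the third-party service. The sketch below assumes both lists have been collected manually. The function name `unregistered_ips` is hypothetical, and the addresses in the usage note come from the documentation-only range 203.0.113.0/24.

```python
from ipaddress import ip_address
from typing import Iterable, List


def unregistered_ips(secondary_outbound_ips: Iterable[str],
                     registered_ips: Iterable[str]) -> List[str]:
    """Return secondary IPs that the third-party service does not know yet."""
    # Parsing with ip_address() validates the input and normalizes formatting.
    registered = {ip_address(ip) for ip in registered_ips}
    missing = {ip_address(ip) for ip in secondary_outbound_ips} - registered
    return [str(ip) for ip in sorted(missing)]
```

For example, `unregistered_ips(["203.0.113.10", "203.0.113.11"], ["203.0.113.10"])` returns `["203.0.113.11"]`, which is the address that still needs to be registered.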

Outage page

Managed Cloud uses Azure Functions to serve an outage page in case of an outage. Using Azure Functions means the outage page will return a 503 code to indicate the service is unavailable. We recommend that the outage page only contains the necessary information to assure customers that the site will be back online soon, for example:

By default, the outage page contains only basic text. To customize it based on your requirements, you can request temporary access to the Outage app by creating a support case.
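
To illustrate what a minimal outage response looks like, the sketch below builds the status code, headers, and body as plain Python values. It does not use the Azure Functions SDK and is not the code that Managed Cloud deploys. The function name `build_outage_response` is hypothetical, and the optional Retry-After header is a standard HTTP header added here as an assumption, not something this article specifies.

```python
import html
from typing import Dict, Optional, Tuple


def build_outage_response(message: str,
                          retry_after_seconds: Optional[int] = None
                          ) -> Tuple[int, Dict[str, str], str]:
    """Return (status, headers, body) for a minimal outage page."""
    headers = {"Content-Type": "text/html; charset=utf-8"}
    if retry_after_seconds is not None:
        # Optional hint for clients and crawlers; an assumption, not from the article.
        headers["Retry-After"] = str(retry_after_seconds)
    body = (
        "<html><body>"
        "<h1>Service temporarily unavailable</h1>"
        f"<p>{html.escape(message)}</p>"
        "</body></html>"
    )
    # 503 tells clients that the service is unavailable.
    return 503, headers, body
```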

Access

We can grant temporary access to the Traffic Manager, Function App (serves the outage page), and Secondary Resource Group if you would like to update your resources with a custom configuration, custom domains, and so on.

Limitations

This section describes limitations to the Disaster Recovery options provided by Managed Cloud.

No removal of Control Resource Group

The Control Resource Group contains all resources used to restore the Sitecore XP/XM environment successfully in a secondary data center. Deleting the Control Resource Group or its resources can lead to the inability to perform successful recovery.

List of files that are excluded while performing backup for Disaster recovery

DR setup configures a backup process to back up all the web apps to meet DR failover needs. In order to achieve this, there are certain files in the primary web apps that are excluded from the backup. The following table describes the exclusion (applicable for Sitecore XP/XM 9.1 Initial Release and above):

File                                             Topology  Roles  Details
\site\wwwroot\App_Data\logs                      *         *      Temp/log files. Not backed up because logs are usually large, which increases the backup duration and cost.
\site\wwwroot\App_Data\debug
\site\wwwroot\App_Data\diagnostics
\site\wwwroot\App_Data\MediaCache
\site\wwwroot\App_Data\packages
\site\wwwroot\App_Data\viewstate
\site\wwwroot\App_Data\DeviceDetection
\site\wwwroot\temp

\site\wwwroot\bin\Feature.HADR_PublishAPI.dll    *         CM     HADR-related API files.
\site\wwwroot\bin\Foundation.HADR_WebApi.dll
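
Azure App Service backups read exclusions from a _backup.filter file, with one path per line relative to the home directory. The sketch below shows how the paths in the table could be rendered into that format. Whether the DR setup writes the file in exactly this way is an assumption, and the helper name `render_backup_filter` is hypothetical.

```python
from typing import Iterable

# Paths copied from the exclusion table above (temp/log entries only).
EXCLUDED_PATHS = [
    r"\site\wwwroot\App_Data\logs",
    r"\site\wwwroot\App_Data\debug",
    r"\site\wwwroot\App_Data\diagnostics",
    r"\site\wwwroot\App_Data\MediaCache",
    r"\site\wwwroot\App_Data\packages",
    r"\site\wwwroot\App_Data\viewstate",
    r"\site\wwwroot\App_Data\DeviceDetection",
    r"\site\wwwroot\temp",
]


def render_backup_filter(paths: Iterable[str]) -> str:
    """Render exclusion paths as _backup.filter content, one path per line."""
    lines = []
    for path in paths:
        # Normalize separators and make sure each path starts with one backslash.
        normalized = "\\" + path.replace("/", "\\").strip("\\")
        if normalized not in lines:  # drop duplicates, keep the original order
            lines.append(normalized)
    return "\n".join(lines) + "\n"
```

Calling `render_backup_filter(EXCLUDED_PATHS)` produces eight lines, one for each temp/log path in the table.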

 

xDB is excluded while considering the recovery time

The recovery time for the failover process does not include the xDB rebuild, because the rebuild can take a significant amount of time for a large content database. If the analytics indexes are not rebuilt, this should only affect functionality that depends on lists (for example, EXM) and should not affect the frontend site.

No recovery of certain infrastructure customizations during failover

Additional Azure resources or services that are added to the environment and are not part of the standard Managed Cloud topology are not recovered during the failover process. These components must be added separately to the secondary environment after the failover process has been completed.

xConnect Search Indexer

Sitecore XP can only have one active xConnect Search Indexer WebJob across a solution. In case of any failover and restoration of service, the indexer must be shut down.

Azure requirements and cost considerations

All disaster recovery options are dependent on Azure WebApp Backup and Traffic Manager, which require a minimum of the Standard Tier for WebApps.
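
The tier requirement can be expressed as a small check. The sketch below is illustrative only: the tier names and their ordering are assumptions based on common App Service plan tiers, and the function name `meets_minimum_tier` is hypothetical.

```python
# Assumed ordering of App Service plan tiers, lowest to highest.
TIER_ORDER = [
    "Free", "Shared", "Basic", "Standard",
    "Premium", "PremiumV2", "PremiumV3", "Isolated",
]


def meets_minimum_tier(tier: str, minimum: str = "Standard") -> bool:
    """Return True if the given tier is at or above the minimum tier."""
    if tier not in TIER_ORDER or minimum not in TIER_ORDER:
        raise ValueError(f"Unknown tier: {tier!r} or {minimum!r}")
    return TIER_ORDER.index(tier) >= TIER_ORDER.index(minimum)
```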

Failover situations that are not supported

There is a small set of situations where it might not be possible to restore a production site into the secondary data center. For example, when a global Azure service such as authentication or Traffic Manager is down.

Azure Service Bus is not supported

HADR does not support Azure Service Bus Synchronization, Backup/Restore, or Replication. This is applicable for Sitecore XP/XM 9.2.0 and Sitecore XP/XM 9.3.0 only.

Customized resources are not supported

Custom resources and resource configurations other than standard Sitecore Managed Cloud Standard resources are not supported. This includes WAF, AFD, Traffic Manager, Storage Account, and SQL elastic pool in the primary resource group. Custom synchronization, Backup, Restore, or Replication are not supported.

Custom domains or SSL binding are not updated

While performing failover, HADR does not update the custom domains or SSL binding on the Outage app or on the secondary web app.

Disaster recovery testing is not supported

Sitecore does not support the testing of Disaster Recovery scenarios for customer implementations at this time.

Sitecore additional modules and custom configurations

Sitecore in Managed Cloud provides options to include modules and configure additional services. However, the Sitecore Disaster Recovery solution does not fully support all modules out of the box. The following sections describe the level of disaster recovery support for modules and configurations.

Sitecore Experience Accelerator (SXA)

SXA is supported by default by Disaster Recovery. SXA-related configurations are synchronized to the secondary environment.

Sitecore JavaScript Services Server (JSS)

JSS is supported by default by Disaster Recovery. JSS-related binaries, configurations, and items are synchronized to the secondary environment.

Other modules

The modules listed below are not supported by default and need to be configured manually for the secondary environment.
If these modules were installed as part of the primary Sitecore XP/XM installation (that is, the customer's primary environment), they cannot be recovered during the failover process. They must be added separately after the failover process has been completed.
Ensuring the functionality of custom configurations and installed modules is not in the scope of the disaster recovery service. The custom solution should be designed to tolerate and handle the disaster recovery service steps that are described in the documentation.

Reverse Proxy

Reverse proxy is partially supported because our disaster recovery solution synchronizes most of the configuration from the primary environment.
To enable the reverse proxy on the secondary CD role, create \home\site\applicationHost.xdt in the secondary environment, similar to the file that was created on the primary CD role.

Additional support

Custom web app inclusion in backup/restore

Custom web apps are web apps that are not Sitecore roles and are not provisioned by default, that is, web apps that are created by customers or external vendors.
HADR will manage auto-scale settings, the web app service plan SKU, Azure-level application settings and connection strings, and web content synchronization (similar to Sitecore roles).
The content synchronization will not modify environment-based settings such as ConnectionStrings.config, AppSettings.config, and so on.
The user will have to modify these settings after the failover. The user can also modify _backup.filter after setup to exclude files from the backup.

Prerequisites:

Custom Database Names

By default, DR supports the database names provisioned via Azure Marketplace or Sitecore Azure Toolkit with the predefined names.
In addition, DR also supports databases with custom names that follow the restriction provided by Azure at:

Resource naming restrictions - Azure Resource Manager | Microsoft Docs

Process description

DR setup

 

  1. Action: Customer checks prerequisites, considerations, limitations, and additional support.
     Description:
    • Check if the primary environment meets the prerequisites mentioned.
    • Understand the areas in the Considerations section that will be required when performing setup, failover, and failback.
    • Understand the limitations listed in the document. If any limitation has an impact on the environment, prepare an action plan.
    • Review the additional support and how it will impact your customization. Note that customizations done on the primary environment that are not listed in this document are not supported by DR.

  2. Action: Customer requests DR setup.
     Description: Create a "DR Basic ColdStandby Setup" service request or a "DR Managed HotStandby Setup" service request.

  3. Action: CloudOps performs the prerequisite and limitation check.
     Description: CloudOps verifies the primary environment against the prerequisites, considerations, and limitations before provisioning DR.

  4. Action: CloudOps performs DR setup.
     Description: CloudOps provisions DR.

  5. Action: CloudOps notifies the customer of the DR setup status.
     Description: CloudOps communicates and provides updates before, during, and after DR setup.

  6. Action: Customer performs the relevant DNS configuration (DNS provider and Traffic Manager).
     Description:
    • As mentioned in the FAQ, the customer configures the custom domain of the CD instance to point to the DNS name of the Traffic Manager using a DNS CNAME record.
    • As mentioned in the "Access" item in the Considerations section, CloudOps will assist in providing temporary access to the relevant resources.
    • CloudOps may assist in the configuration and verification process.

  7. Action: Configure Outage App content.
     Description:
    • As mentioned in the FAQ, configure the outage page according to the customer's specifications.
    • As mentioned in the "Access" item in the Considerations section, CloudOps will assist in providing temporary access to the relevant resources.

 

DR Failover for DR Basic Cold Standby

 

 

  1. Action: Sitecore Support receives the primary region unavailability/disaster alert.
     Description: Our monitoring service in the control resource group generates an alert for Sitecore Support to take action.

  2. Action: Sitecore Support notifies the customer of the disaster.
     Description: The customer receives a notification with details of the disaster.

  3. Action: Customer requests failover when the customer considers that a failover is required.
     Description: The customer provides approval for the failover activity by creating a support case.

  4. Action: Sitecore Support performs failover and notifies the customer of the failover status.
     Description: The status update is provided via the created support case.

  5. Action: Optionally, the customer applies custom configurations or provisioning in the secondary environment.
     Description:
    • Understand the areas in the Considerations section that will be required after performing failover.
    • Understand the limitations listed in the document. If the limitations have an impact on the environment, execute the action plan prepared prior to DR setup.
    • Review the additional support and how it will impact your customization. Note that customizations done on the primary environment that are not listed in this document are not supported.

 

DR Failback for DR Basic Cold Standby

 

  1. Action: Sitecore Support receives the primary region availability/recovery alert.
     Description: Our monitoring service in the control resource group generates an alert for Sitecore Support to take action.

  2. Action: Sitecore Support notifies the customer of the recovery and requests approval for the failback activity.
     Description: The notification is sent via the support case created during the failover activity.

  3. Action: Customer provides approval for failback.
     Description: Approval is provided via the support case created during the failover activity.

  4. Action: Sitecore Support performs failback and notifies the customer of the failback status.
     Description: The status update is provided via the created support case.

 

DR Failover for DR Managed Hot Standby

 

  1. Action: CloudOps receives the primary region unavailability/disaster alert.
     Description: Our monitoring service in the control resource group generates an alert for CloudOps to monitor the failover and take action when required.

  2. Action: Failover is executed automatically.

  3. Action: CloudOps notifies the customer of the failover status.

  4. Action: Optionally, the customer applies custom configurations or provisioning in the secondary environment.
     Description:
    • Understand the areas in the Considerations section that will be required after performing the failover.
    • Understand the limitations listed in the document. If any limitation has an impact on the environment, execute the action plan prepared prior to DR setup.
    • Review the additional support and how it will impact your customization. Note that customizations done on the primary environment that are not listed in this document are not supported.

 

DR Failback for DR Managed Hot Standby

 

  1. Action: CloudOps receives the primary region availability/recovery alert.
     Description: Our monitoring service in the control resource group generates an alert for CloudOps to monitor the failback and take action when required.

  2. Action: Failback is executed automatically.

  3. Action: CloudOps notifies the customer of the failback status.

 

Changes done to the primary Sitecore environment after DR Setup

These are the changes that will be applied to the Primary Sitecore environment during DR Setup. They are required to enable smooth DR operations such as failover and failback.

Database connection strings in Sitecore web apps

The data source of the connection string will be changed from the primary SQL server to the Failover Group endpoint during DR setup. This is used to enable the capability to perform failover.

Sitecore roles related changes

Database connection strings are found in App_Config\ConnectionStrings.config.

For example,

<add name="security" connectionString="Data Source=primary-sql.database.windows.net" />
will be changed to
<add name="security" connectionString="Data Source=primary-fg.database.windows.net" />
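
The change can be reproduced on a local copy of the file for review purposes. The sketch below swaps the host name inside every Data Source value using Python's standard XML parser. It is not the tooling used during DR setup. The function name `point_to_endpoint` is hypothetical, and the sketch assumes connection strings in which semicolons only separate key/value pairs.

```python
import xml.etree.ElementTree as ET


def point_to_endpoint(config_xml: str, old_host: str, new_host: str) -> str:
    """Replace old_host with new_host in the Data Source of every <add> element."""
    root = ET.fromstring(config_xml)
    for add in root.iter("add"):
        parts = []
        for part in add.get("connectionString", "").split(";"):
            key, sep, value = part.partition("=")
            if sep and key.strip().lower() == "data source":
                # Only the host changes; prefixes such as "tcp:" and ports are kept.
                part = f"{key}={value.replace(old_host, new_host)}"
            parts.append(part)
        add.set("connectionString", ";".join(parts))
    # Note: ElementTree drops XML comments and the XML declaration.
    return ET.tostring(root, encoding="unicode")
```

Called with `"primary-sql.database.windows.net"` and `"primary-fg.database.windows.net"`, it produces the change shown above. Swapping the two arguments gives the reverse change, which is what the prerequisite step on removing existing failover groups asks for.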

Identity Server

For the Identity Server role in both XP and XM topologies, the connection string in the Config\production\Sitecore.IdentityServer.Host.xml file is updated at \Settings\Sitecore\IdentityServer\SitecoreMembershipOptions\ConnectionString.

Additional update

Database connection strings in the cortex-processing, ma-ops, and xc-search roles in Sitecore version 9.1.0 and above will be updated. These changes are applied to App_Data\jobs\continuous\<specific-name>\App_Config\ConnectionStrings.config, where <specific-name> is:
    • ProcessingEngine for cortex-processing
    • AutomationEngine for ma-ops
    • IndexWorker for xc-search

Azure Search

The connection string will consist of the primary and secondary Azure Search URLs.

The connection string for Azure Search (cloud.search) will get updated in App_Config\ConnectionStrings.config.

For example,

<add name="cloud.search" connectionString="serviceUrl=https://primary-as.search.windows.net;apiVersion=2017-11-11;apiKey=abc123" />
will be updated to
<add name="cloud.search" connectionString="serviceUrl=https://primary-as.search.windows.net;apiVersion=2017-11-11;apiKey=abc123|serviceUrl=https://secondary-as.search.windows.net;apiVersion=2017-11-11;apiKey=abc1234" />
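
The combined value is two ordinary Azure Search connection strings joined with a pipe character. The following sketch builds and parses that format for inspection. It is an illustration based only on the example above, and the function names are hypothetical.

```python
from typing import Dict, List


def combine_search_connection_strings(primary: str, secondary: str) -> str:
    """Join the primary and secondary connection strings with a pipe."""
    return f"{primary}|{secondary}"


def parse_search_connection_string(value: str) -> List[Dict[str, str]]:
    """Split a combined connection string into one settings dict per endpoint."""
    endpoints = []
    for segment in value.split("|"):
        settings = {}
        for pair in segment.split(";"):
            if not pair:
                continue
            # Split on the first "=" only, so values may contain "=" themselves.
            key, _, val = pair.partition("=")
            settings[key] = val
        endpoints.append(settings)
    return endpoints
```

Parsing the updated example above yields two dictionaries, the first with the primary `serviceUrl` and the second with the secondary one.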

Hotfix Patching

The CD, CM, rep, and prc roles in Sitecore versions 9.1.0, 9.1.1, and 9.2.0 will receive a hotfix patch (Sitecore.ContentSearch.Azure.dll) in site\wwwroot\bin.

IndexAPI config

The following files will be uploaded to the primary CM role:

WebApi.config will be uploaded in site\wwwroot\App_Config\Include\Sitecore.ContentIndexing.WebApi.

Sitecore.ContentIndexing.WebApi.dll will be uploaded in site\wwwroot\bin. This is for DR use only.

FAQ

How do customers request the Managed Cloud Disaster Recovery (DR) feature for Sitecore Managed Cloud environments?
The customer can ask to set up Disaster Recovery for their XP/XM environment through the Sitecore regional office or Sitecore sales team.

What actions do customers need to take after the DR setup has been done?
After the DR setup has been completed, customers are requested to perform the following actions:

The instructions for how to do so are provided by Sitecore engineers after the provision of the DR setup. Alternatively, the customer can raise a support query for detailed information on the Sitecore Support Portal.

What are the new resources that are introduced once the DR setup has been done?
After the DR setup has been provisioned, the customer is able to see the following resource groups according to the chosen DR type:

Do customers have limited access rights to the DR resources?
Sitecore provides limited access to customers on additional resource groups (Control and Secondary). This helps Sitecore to prevent any changes to the configurations related to backup policies and automation.

How is the paired region chosen for the DR setup?
Sitecore chooses the best-paired region for our customers that complies with Microsoft's standards. More detailed descriptions are provided here.

Can I update the default outage page?

Yes, you can request temporary access to the Outage function app (with a Sitecore support case) and update your default outage page.

Will everything from my Primary resource group be available after the Failover?

No, we will restore only Standard Managed Cloud resources (Webapps, SQL DB, Search service). Review the limitation section for the current limitations of HADR.

Do the custom domains and SSL bindings replicate from the Primary environment during the failover?
No, all the required custom domains and SSL binding must be added by the customer to the secondary resources by requesting temporary access (with a Sitecore support case).

What is the procedure of enabling DR for SolrCloud?

To provide DR availability for both, Sitecore follows these procedures when enabling the Disaster Recovery setup for Managed Cloud customers who have purchased Managed Cloud instances with SolrCloud.

Why are secondary web apps stopped after DR is provisioned?

Sitecore has underlying services that would access and update information in the databases. Because primary Sitecore roles are alive and performing these actions, we want to avoid secondary roles attempting any processing and/or updating information in the databases.