Sitecore Managed Cloud Containers Monitoring


Description

Monitoring of the Managed Cloud solution consists of the following services:

Find the Monitoring model on the following infrastructure diagram:

Authentication in Grafana

Grafana is integrated with Azure Active Directory, and Basic Authentication is disabled. To sign in, choose the "Sign in with Microsoft" authentication option and use your Microsoft work account.
Note: Access to Grafana is granted to a particular Active Directory group, and the account used for authentication must be a member of this group. To have an account added to the group, submit a service request.

Finding Dashboards

Navigate to the search menu to view all dashboards:

Grafana dashboards

| Dashboard | Description |
| --- | --- |
| Container overview | Lists all containers with their Namespace and Pod. Shows the status of each container and the total number of healthy, unhealthy, and stopped containers. |
| Host Disk Overview (Linux only) | Exposes node filesystem and disk I/O metrics, such as read/write time spent and available filesystem space. |
| Host Disk Overview (Windows only) | Available filesystem space. |
| Ingress Overview | Ingress metrics for each Sitecore role and for Grafana. |
| Kubernetes Cluster | A high-level overview of the cluster. |
| Kubernetes Pod Overview | Exposes memory and CPU requests, limits, and utilization per Pod for all namespaces, including system namespaces. Also shows live logs. |
| Linux Node Overview | Detailed information about memory, CPU, and disk utilization for each Linux node. |
| MsSql Elastic Pool | Detailed information about MsSql Elastic Pool utilization. |
| Redis Server Overview | Exposes general Redis metrics, the same as the native Redis "INFO" command. |
| Windows Node Overview | Detailed information about memory, CPU, and disk utilization for each Windows node. |
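The dashboards can also be listed programmatically. The following is a minimal sketch, assuming that your Grafana instance exposes the standard Grafana HTTP API search endpoint (`/api/search`) and that you have been issued an API or service account token. Neither assumption is confirmed by this document, and `grafana.example.com` and the token value are placeholders.

```python
import json
import urllib.parse
import urllib.request


def build_search_request(base_url, token):
    """Build a GET request for Grafana's dashboard search endpoint."""
    # type=dash-db restricts the search to dashboards (no folders).
    query = urllib.parse.urlencode({"type": "dash-db"})
    url = base_url.rstrip("/") + "/api/search?" + query
    # Assumption: token-based API access is enabled for your instance.
    return urllib.request.Request(url, headers={"Authorization": "Bearer " + token})


def dashboard_titles(search_results):
    """Return the sorted titles of all dashboards in a search response."""
    return sorted(
        item["title"] for item in search_results if item.get("type") == "dash-db"
    )


def list_dashboards(base_url, token, timeout=10.0):
    """Fetch the dashboard titles from Grafana (performs a network call)."""
    request = build_search_request(base_url, token)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return dashboard_titles(json.load(response))
```

A call such as `list_dashboards("https://grafana.example.com", "<token>")` would return the dashboard titles, provided the token is accepted by your instance.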

Alerts

Node statistic

| Description | Condition | Resource | Period |
| --- | --- | --- | --- |
| Memory percentage is > 95% | Node memory utilization is > 95% | K8s node | 10m |
| CPU percentage is > 95% | CPU load is > 95% | K8s node | 10m |
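Each alert pairs a condition with a period. This document does not define exactly how the period is evaluated, so the following is only a sketch under one common interpretation: the alert fires when every sample collected during the period breaches the threshold. The function name, the sample format, and the history check are assumptions made for illustration.

```python
from datetime import datetime, timedelta


def condition_held(samples, threshold, period, now):
    """Return True if the value stayed above `threshold` for the whole `period`.

    `samples` is a list of (timestamp, value) pairs in any order.
    """
    window_start = now - period
    # The history must reach back to the start of the window; otherwise we
    # cannot claim that the condition held for the full period.
    if not any(ts <= window_start for ts, _ in samples):
        return False
    recent = [value for ts, value in samples if window_start <= ts <= now]
    # An empty window is treated as "not firing" in this sketch.
    return bool(recent) and all(value > threshold for value in recent)


now = datetime(2024, 1, 1, 12, 0)
# Node memory utilization in percent, sampled every 3 minutes.
memory = [
    (now - timedelta(minutes=minutes), value)
    for minutes, value in [(12, 90.0), (9, 96.0), (6, 97.5), (3, 99.0), (0, 96.2)]
]
fires = condition_held(memory, threshold=95.0, period=timedelta(minutes=10), now=now)
```

With this data the condition holds, because all four samples inside the last 10 minutes are above 95%. A single sample below the threshold inside the window would reset it.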

Infrastructure

| Description | Condition | Resource | Period |
| --- | --- | --- | --- |
| Pod is not ready for 30m | Pod status != ready | K8s Pod | 30m |
| Kubelet is down | Job "kubelet" is down for the last 15m | K8s Job | 15m |
| Pod is restarting frequently | Pod is restarted at least once every 5 minutes | K8s Pod | 1h |
| Deployment generation mismatch | Deployment has failed but has not been rolled back | K8s deployment | 15m |
| Deployment replicas mismatch | Deployment has not matched the expected number of replicas for longer than an hour | K8s deployment | 1h |
| DaemonSet pods not ready | Not all of the desired pods are scheduled and ready | K8s daemonset | 15m |
| DaemonSet pods not scheduled | Not all of the desired pods are scheduled | K8s daemonset | 10m |
| DaemonSet pods misscheduled | Pods of the DaemonSet are running where they are not supposed to run | K8s daemonset | 1h |
| CPU throttling is high | Pod CPU throttling > 25% | K8s Pod | 15m |
| Warning events occurred | One or more events of type Warning occurred in the namespace | K8s namespace | 1h |
| Node is not ready | Node is not ready | K8s node | 1h |
| Kubernetes version mismatch | Different semantic versions of Kubernetes components are running | K8s | 1h |
| Kubernetes API server client is experiencing errors | More than 1 error in the Kubernetes API server | K8s | 5m |
| Node is running out of pod capacity | Node pod capacity usage > 95% | K8s node | 15m |
| Disk space usage is > 90% | Node disk space usage is > 90% | K8s node | 1h |
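This document does not state how the CPU throttling percentage is computed. A common approach, assumed here, is the share of CPU scheduler (CFS) periods in which a container was throttled, derived from two cumulative counters read at the start and the end of the window (cAdvisor exposes such counters as `container_cpu_cfs_throttled_periods_total` and `container_cpu_cfs_periods_total`). The function names below are hypothetical.

```python
def throttling_percent(throttled_start, throttled_end, periods_start, periods_end):
    """Percentage of scheduler periods that were throttled within a window.

    All four arguments are readings of cumulative counters.
    """
    periods = periods_end - periods_start
    if periods <= 0:
        # No scheduler periods elapsed, so nothing could have been throttled.
        return 0.0
    return 100.0 * (throttled_end - throttled_start) / periods


def is_throttling_high(percent, threshold=25.0):
    """Apply the 'Pod CPU throttling > 25%' condition from the alert list."""
    return percent > threshold


# 300 of the 1000 periods elapsed in the window were throttled.
percent = throttling_percent(100, 400, 1000, 2000)
alerting = is_throttling_high(percent)
```

Using counter differences rather than raw counter values keeps the result tied to the evaluation window instead of the container's whole lifetime.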

Sitecore roles

| Description | Condition | Resource | Period |
| --- | --- | --- | --- |
| HTTP 5xx responses > 10 | More than 10 HTTP 5xx responses | nginx_ingress_controller | 10m |
| Average page response time > 1 second | The average response time is more than 1 second | nginx_ingress_controller | 30m |
| Average page response time > 30 seconds | The average response time is more than 30 seconds | nginx_ingress_controller | 5m |
| Availability tests on /sitecore/service/keepalive.aspx | Availability tests on /sitecore/service/keepalive.aspx failed | Sitecore pod | 5m |
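The availability test can be reproduced by hand when investigating an alert. The following is a minimal sketch, assuming that a successful test means an HTTP 2xx response from the keepalive page; the document does not specify the exact success criteria, and `cm.example.com` is a placeholder host.

```python
import urllib.error
import urllib.request

KEEPALIVE_PATH = "/sitecore/service/keepalive.aspx"


def keepalive_url(base_url):
    """Build the keepalive URL for a Sitecore role's base address."""
    return base_url.rstrip("/") + KEEPALIVE_PATH


def check_keepalive(base_url, timeout=10.0):
    """Return True if the keepalive page answers with a 2xx status."""
    try:
        with urllib.request.urlopen(keepalive_url(base_url), timeout=timeout) as response:
            return 200 <= response.status < 300
    except (urllib.error.URLError, OSError):
        # HTTP errors (4xx/5xx), refused connections, and timeouts all count
        # as a failed availability test in this sketch.
        return False
```

For example, `check_keepalive("https://cm.example.com")` requests `https://cm.example.com/sitecore/service/keepalive.aspx` and reports whether the page responded successfully.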

Redis Cache

| Description | Condition | Resource | Period |
| --- | --- | --- | --- |
| Average number of connected clients is > 80% | Number of connected clients is > 80% of redis_config_maxclients | Redis Cache | 30m |
| Server load is > 95% | Percent processor load is > 95% over the last 30 minutes for Redis | Redis Cache | 30m |
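The connected-clients condition compares the current client count with the configured maximum. A minimal sketch of that arithmetic follows, assuming the client count is read from the text output of the Redis "INFO" command and the maximum is supplied separately (for example, the value reported by the redis_config_maxclients metric). The function names and the sample output are illustrative.

```python
def parse_info(info_text):
    """Parse 'key:value' lines from Redis INFO output into a dict of strings."""
    fields = {}
    for line in info_text.splitlines():
        line = line.strip()
        # Skip blank lines and section headers such as "# Clients".
        if not line or line.startswith("#"):
            continue
        key, separator, value = line.partition(":")
        if separator:
            fields[key] = value
    return fields


def connected_clients_percent(info_text, maxclients):
    """Connected clients as a percentage of the configured maximum."""
    connected = int(parse_info(info_text)["connected_clients"])
    return 100.0 * connected / maxclients


# Illustrative INFO excerpt, not output captured from a real server.
sample_info = "# Clients\r\nconnected_clients:8500\r\nblocked_clients:0\r\n"
percent = connected_clients_percent(sample_info, maxclients=10000)
above_threshold = percent > 80.0
```

In this example 8500 of 10000 allowed clients are connected, which exceeds the 80% threshold from the alert list.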

MSSQL elastic pool

| Description | Condition | Resource | Period |
| --- | --- | --- | --- |
| Database throughput unit (DTU) is > 95% | Average database throughput unit (DTU) usage > 95% |  | 5m |
| Storage percentage is > 75% | Average storage percentage > 75% |  | 5m |
| CPU is > 95% | Average CPU usage > 95% |  | 5m |
| SQL Databases Deadlock | Database is deadlocked |  | - |
| Data IO percentage is > 95% | Average Data IO percentage > 95% |  | 5m |
| Log IO percentage is > 95% | Average Log IO percentage > 95% |  | 5m |
| Workers percentage is > 95% | Maximum workers percentage > 95% |  | 5m |
| Concurrent sessions supported by the DB tier is > 95% | Maximum concurrent sessions supported by the DB tier is > 95% |  | 5m |
| Number of failed database connections > 5 | The database has more than 5 failed connections over the last 5 minutes |  | 5m |
| Average In-Memory OLTP storage > 95% | Average In-Memory OLTP storage > 95% |  | 30m |