Monitoring of the Managed Cloud solution consists of the following services:
Find the Monitoring model on the following infrastructure diagram:
Grafana is integrated with Azure Active Directory, and Basic Authentication is disabled. To sign in, choose the "Sign in with Microsoft" authentication option and use your Microsoft work account.
Note: Access to Grafana is restricted to a particular Active Directory group, so the account used for authentication must be a member of this group. To get access, create a service request asking for the account to be added to the group.
Navigate to the search menu to view all dashboards:

| Dashboard | Description |
|---|---|
| Container overview | Lists all containers with their Namespace and Pod. Shows the status of each container and the total number of healthy, unhealthy, and stopped containers. |
| Host Disk Overview (Linux only) | Exposes node filesystem and disk I/O metrics, such as read/write time spent and available filesystem space. |
| Host Disk Overview (Windows only) | Available filesystem space. |
| Ingress Overview | Ingress metrics for each Sitecore role and Grafana. |
| Kubernetes Cluster | A high-level overview of the cluster. |
| Kubernetes Pod Overview | Exposes memory and CPU requests, limits, and utilization per Pod for all namespaces, including the system namespaces. It also shows live logs. |
| Linux Node Overview | Detailed information about Memory/CPU/Disk utilization for each Linux Node. |
| MsSql Elastic Pool | Detailed information about MsSql Elastic Pool utilization. |
| Redis Server Overview | Exposes general Redis metrics, the same as those returned by the native Redis "INFO" command. |
| Windows Node Overview | Detailed information about Memory/CPU/Disk utilization for each Windows Node. |
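
The dashboard list can also be retrieved programmatically. The following is a minimal sketch, not a description of how this environment is configured: it assumes that a Grafana API token has been issued for the instance (interactive users sign in with Microsoft, so a token may not be available), and the host name and token value are hypothetical placeholders. It uses Grafana's `/api/search` HTTP endpoint and builds the request without sending it.

```python
import json
import urllib.parse
import urllib.request
from typing import List


def build_search_request(base_url: str, token: str, query: str = "") -> urllib.request.Request:
    """Build a GET request for Grafana's dashboard search endpoint."""
    # type=dash-db limits the results to dashboards (folders are excluded).
    params = urllib.parse.urlencode({"type": "dash-db", "query": query})
    url = f"{base_url.rstrip('/')}/api/search?{params}"
    # Assumption: a bearer token exists for this Grafana instance.
    return urllib.request.Request(url, headers={"Authorization": f"Bearer {token}"})


def dashboard_titles(response_body: str) -> List[str]:
    """Extract sorted dashboard titles from the JSON array returned by /api/search."""
    return sorted(item["title"] for item in json.loads(response_body))


# Hypothetical host and token, for illustration only.
request = build_search_request("https://grafana.example.com", "HYPOTHETICAL_TOKEN", "Redis")
print(request.full_url)

# A hand-written sample of the response shape, not real output.
sample_body = '[{"title": "Redis Server Overview", "type": "dash-db"}]'
print(dashboard_titles(sample_body))
```

To run the search against a real instance, pass the request to `urllib.request.urlopen` and feed the decoded response body to `dashboard_titles`.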

| Description | Condition | Resource | Period |
|---|---|---|---|
| Node statistics | | | |
| Memory percentage is > 95% | Node memory utilization is > 95% | K8s node | 10m |
| CPU percentage is > 95% | CPU load is > 95% | K8s node | 10m |
| Infrastructure | | | |
| Pod is not ready for 30m | Pod status != ready | K8s Pod | 30m |
| Kubelet is down | Job "kubelet" is down for the last 15m | K8s Job | 15m |
| Pod is restarting frequently | Pod is restarted at least once every 5 minutes | K8s Pod | 1h |
| Deployment generation mismatch | Deployment has failed but has not been rolled back | K8s deployment | 15m |
| Deployment replicas mismatch | Deployment has not matched the expected number of replicas for longer than an hour | K8s deployment | 1h |
| DaemonSet pods not ready | Not all of the desired pods are scheduled and ready | K8s daemonset | 15m |
| DaemonSet pods not scheduled | Not all of the desired pods are scheduled | K8s daemonset | 10m |
| DaemonSet pods misscheduled | Pods of the DaemonSet are running where they are not supposed to run | K8s daemonset | 1h |
| CPU throttling is high | Pod CPU throttling > 25% | K8s Pod | 15m |
| Warning events occurred | One or more events of type Warning occurred in the namespace | K8s namespace | 1h |
| Node is not ready | Node is not ready | K8s node | 1h |
| Kubernetes version mismatch | Different semantic versions of Kubernetes components are running | K8s | 1h |
| Kubernetes API server client is experiencing errors | > 1 errors in Kubernetes API server | K8s | 5m |
| Node is running out of pod capacity | Node pod capacity usage > 95% | K8s node | 15m |
| Disk space usage is > 90% | Node disk space usage is > 90% | K8s node | 1h |
| Sitecore roles | | | |
| HTTP 5xx responses > 10 | Number of 5xx HTTP responses > 10 | nginx_ingress_controller | 10m |
| Average page response time > 1 second | The average response time is more than 1 second | nginx_ingress_controller | 30m |
| Average page response time > 30 seconds | The average response time is more than 30 seconds | nginx_ingress_controller | 5m |
| Availability tests on /sitecore/service/keepalive.aspx | Availability tests on /sitecore/service/keepalive.aspx failed | Sitecore pod | 5m |
| Redis Cache | | | |
| Average percentage of connected clients is > 80% | Number of connected clients is > 80% of redis_config_maxclients | Redis Cache | 30m |
| The server load is > 95% | Processor load is > 95% over the last 30 minutes for Redis | Redis Cache | 30m |
| MSSQL elastic pool | | | |
| Database throughput unit (DTU) is > 95% | Average database throughput unit (DTU) > 95% | | 5m |
| Storage percentage is > 75% | Average storage percentage > 75% | | 5m |
| CPU is > 95% | Average CPU usage > 95% | | 5m |
| SQL Databases Deadlock | Database is deadlocked | | - |
| Data IO percentage is > 95% | Average Data IO percentage > 95% | | 5m |
| Log IO percentage is > 95% | Average Log IO percentage > 95% | | 5m |
| Workers percentage is > 95% | Maximum workers percentage > 95% | | 5m |
| Concurrent sessions supported by the DB tier is > 95% | Maximum concurrent sessions supported by the DB tier is > 95% | | 5m |
| Number of failed database connections > 5 | The database has more than 5 failed connections over the last 5 minutes | | 5m |
| Average In-Memory OLTP storage > 95% | Average In-Memory OLTP storage > 95% | | 30m |
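
To illustrate how a Condition and a Period can combine, the sketch below evaluates a simple threshold alert. It is an illustration only and not the alerting implementation used by the solution. It assumes that Period means the condition must hold continuously for that long (some rows may instead average a metric over the period), and the names `parse_period` and `should_fire` are hypothetical helpers introduced for this example.

```python
from datetime import datetime, timedelta
from typing import List, Tuple

Sample = Tuple[datetime, float]  # (timestamp, metric value in percent)


def parse_period(text: str) -> timedelta:
    """Convert a Period value such as '10m' or '1h' into a timedelta."""
    units = {"m": "minutes", "h": "hours"}
    value, unit = int(text[:-1]), text[-1]
    return timedelta(**{units[unit]: value})


def should_fire(samples: List[Sample], threshold: float, period: timedelta) -> bool:
    """Return True if the metric stayed above the threshold for the whole period.

    The window ends at the newest sample. The alert does not fire when the
    samples do not cover the full window.
    """
    if not samples:
        return False
    ordered = sorted(samples)
    window_start = ordered[-1][0] - period
    # Without a sample at or before the window start, coverage is incomplete.
    if ordered[0][0] > window_start:
        return False
    window = [value for timestamp, value in ordered if timestamp >= window_start]
    return all(value > threshold for value in window)


# Example: the "Memory percentage is > 95%" row with a 10m period,
# using made-up samples taken once per minute.
start = datetime(2024, 1, 1, 12, 0)
memory = [(start + timedelta(minutes=i), 97.0) for i in range(11)]
print(should_fire(memory, 95.0, parse_period("10m")))
```

A Period of "-", as in the deadlock row, has no duration and is not handled by `parse_period`; such an alert would presumably fire on the first occurrence.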