Sitecore Managed Cloud Containers Monitoring

Description

Monitoring of the Managed Cloud solution consists of the following services:

Metrics Exporters – libraries that export metrics from services and infrastructure to Prometheus.
Prometheus – scrapes metrics from services, aggregates and stores the data, and lets other services such as Grafana query it.
Grafana – queries metrics from Prometheus and provides a flexible way to visualize them.
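
To make the Prometheus-to-consumer step concrete, the following minimal Python sketch builds a request against the Prometheus HTTP API instant-query endpoint (`/api/v1/query`) and reads an instant-vector response. The base URL, the `up{job="kubelet"}` expression, and the sample response are illustrative assumptions and are not taken from the Managed Cloud configuration; no network call is made.

```python
import json
from typing import Dict, List, Tuple
from urllib.parse import urlencode


def build_instant_query_url(base_url: str, promql: str) -> str:
    """Return the URL of a Prometheus instant query (GET /api/v1/query)."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


def parse_instant_vector(payload: dict) -> List[Tuple[Dict[str, str], float]]:
    """Extract (labels, value) pairs from an instant-vector response body."""
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    data = payload["data"]
    if data["resultType"] != "vector":
        raise ValueError(f"expected a vector, got {data['resultType']}")
    # Each sample is {"metric": {labels}, "value": [unix_timestamp, "value as string"]}.
    return [(s["metric"], float(s["value"][1])) for s in data["result"]]


# Hypothetical Prometheus address; replace with the address of your instance.
url = build_instant_query_url("http://prometheus.example:9090", 'up{job="kubelet"}')

# Hand-written sample response, used here instead of a live HTTP call.
sample = json.loads("""
{"status": "success",
 "data": {"resultType": "vector",
          "result": [
            {"metric": {"__name__": "up", "job": "kubelet", "instance": "node-a:10250"},
             "value": [1700000000.0, "1"]},
            {"metric": {"__name__": "up", "job": "kubelet", "instance": "node-b:10250"},
             "value": [1700000000.0, "0"]}
          ]}}
""")

series = parse_instant_vector(sample)
# Targets whose last scrape failed report up == 0.
down = [labels["instance"] for labels, value in series if value == 0.0]
```

Grafana issues queries of this kind through its Prometheus data source, so the same expression can usually be tried in a dashboard panel before being scripted.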

Find the Monitoring model on the following infrastructure diagram:

Authentication in Grafana

Grafana is integrated with Azure Active Directory, and Basic Authentication is disabled. To sign in, choose the "Sign in with Microsoft" authentication option and use your Microsoft work account.
Note: Access to Grafana is configured for a particular Active Directory group, and the account used for authentication must be a member of this group. To get access, create a service request asking for the account to be added to the group.

Finding Dashboards

Open the search menu to see all dashboards.

Grafana dashboards

Dashboard	Description
Container overview	List of all containers with their Namespace and Pod. Shows the status of each container and the total number of healthy, unhealthy, and stopped containers.
Host Disk Overview (Linux only)	Exposes node filesystem and disk I/O metrics like Read-Write time spent, Filesystem available space, and so on.
Host Disk Overview (Windows only)	Filesystem available space.
Ingress Overview	Ingress metrics for each Sitecore role and Grafana.
Kubernetes Cluster	A high-level overview of the cluster.
Kubernetes Pod Overview	Exposes memory and CPU requests, limits, and utilization per Pod for all namespaces, including the system namespaces. It also shows live logs.
Linux Node Overview	Detailed information about Memory/CPU/Disk utilization for each Linux Node.
MsSql Elastic Pool	Detailed information about MsSql Elastic Pool utilization.
Redis Server Overview	Exposes general Redis metrics, the same as those returned by the native Redis "INFO" command.
Windows Node Overview	Detailed information about Memory/CPU/Disk utilization for each Windows Node.

Alerts

Description	Condition	Resource	Period
Node statistics
Memory percentage is >95%	Node memory utilization is > 95%	K8s node	10m
CPU percentage is >95%	CPU load is > 95%	K8s node	10m
Infrastructure
Pod is not ready for 30m	Pod status != ready	K8s Pod	30m
Kubelet is down	Job "kubelet" is down for the last 15m	K8s Job	15m
Pod is restarting frequently	Pod is restarted at least once every 5 minutes	K8s Pod	1h
Deployment generation mismatch	Deployment has failed but has not been rolled back.	K8s deployment	15m
Deployment replicas mismatch	Deployment has not matched the expected number of replicas for longer than an hour.	K8s deployment	1h
DaemonSet pods not ready	Not all of the desired pods are scheduled and ready	K8s daemonset	15m
DaemonSet pods not scheduled	Not all of the desired pods are scheduled	K8s daemonset	10m
DaemonSet pods misscheduled	Pods of DaemonSet are running where they are not supposed to run.	K8s daemonset	1h
CPU throttling is high	Pod CPU throttling > 25%	K8s Pod	15m
Warning events occurred	One or more events of type Warning occurred in the namespace	K8s namespace	1h
Node is not ready	Node is not ready	K8s node	1h
Kubernetes version mismatch	Different semantic versions of Kubernetes components running	K8s	1h
Kubernetes API server client is experiencing errors	> 1 error in the Kubernetes API server client	K8s	5m
Node is running out of pod capacity	Node pod capacity usage is > 95%	K8s node	15m
Disk space usage is > 90%	Node disk space usage is > 90%	K8s node	1h
Sitecore roles
HTTP 5xx responses > 10	More than 10 HTTP 5xx responses	nginx_ingress_controller	10m
Average page response time > 1 second	The average response time is more than 1 sec	nginx_ingress_controller	30m
Average page response time > 30 seconds	The average response time is more than 30 sec	nginx_ingress_controller	5m
Availability tests are on /sitecore/service/keepalive.aspx	Availability Tests on /sitecore/service/keepalive.aspx failed	Sitecore pod	5m
Redis Cache
Average percentage of connected clients is > 80%	Number of connected clients is > 80% of redis_config_maxclients	Redis Cache	30m
The server load is >95%	Redis processor load is > 95% over the last 30 min	Redis Cache	30m
MSSQL elastic pool
Database throughput unit (DTU) is >95%	Average throughput unit (DTU) > 95%		5m
Storage percentage is >75%	Average Storage percentage >75%		5m
CPU is >95%	Average CPU usage > 95%		5m
SQL Databases Deadlock	Database is deadlocked		-
Data IO percentage is >95%	Average Data IO percentage > 95%		5m
Log IO percentage is >95%	Average Log IO percentage > 95%		5m
Workers percentage is >95%	Maximum workers percentage >95%		5m
Concurrent sessions supported by the DB tier is > 95%	Maximum concurrent sessions supported by the DB tier is > 95%		5m
Number of failed database connections > 5	The database has more than 5 failed connections over the last 5 min		5m
Average In-Memory OLTP storage > 95%	Average In-Memory OLTP storage > 95%		30m
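
One way to read the Condition and Period columns is that an alert fires only when its condition has held continuously for the whole period. The Python sketch below models that reading. It is an assumption made for illustration: the source does not state how the periods are evaluated (some rows may use an average over the window instead), and the `AlertRule` and `is_firing` names are hypothetical rather than part of Prometheus or Grafana.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class AlertRule:
    """Hypothetical model of one row of the alerts table."""
    name: str
    threshold: float   # condition is "value > threshold"
    period_s: int      # how long the condition must hold, in seconds


def is_firing(rule: AlertRule, samples: Sequence[Tuple[int, float]], now: int) -> bool:
    """Return True if the value has stayed above the threshold for the full period.

    `samples` are (unix_timestamp_s, value) pairs sorted by timestamp.
    Assumption: any sample at or below the threshold resets the timer.
    """
    breach_start = None
    for ts, value in samples:
        if ts > now:
            break
        if value > rule.threshold:
            if breach_start is None:
                breach_start = ts      # start of the current uninterrupted breach
        else:
            breach_start = None        # condition recovered, reset the timer
    return breach_start is not None and now - breach_start >= rule.period_s


# "Memory percentage is >95%" with a 10m period, taken from the table above.
memory_rule = AlertRule(name="Memory percentage is >95%", threshold=95.0, period_s=600)

# One sample per minute for 10 minutes, all above the threshold.
sustained = [(t, 97.0) for t in range(0, 601, 60)]

# Same series, but utilization dips to 90% at the 5-minute mark.
interrupted = [(t, 90.0 if t == 300 else 97.0) for t in range(0, 601, 60)]

sustained_fires = is_firing(memory_rule, sustained, now=600)
interrupted_fires = is_firing(memory_rule, interrupted, now=600)
```

Under this model a short spike above 95% does not raise an alert, which is the usual reason for pairing a threshold with a period.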
