To monitor the performance of KUMA services and the flow of events and correlations, KUMA collects and stores a large amount of data. The VictoriaMetrics time series database is used to collect, store and analyze the data.
In the KUMA Console, the Metrics section contains dashboards that visualize the key performance indicators of various KUMA services. The collected metrics are visualized using Grafana. Selecting the Metrics section opens an automatically updated Grafana portal that is deployed as part of the KUMA Core installation process.
The KUMA Core service configures VictoriaMetrics and Grafana automatically; no user action is required. Graphs in the Metrics section appear with a delay of approximately 1.5 minutes. If the Metrics section shows "core:<port number>", the metrics were received from the host on which the KUMA Core is installed. In other configurations, the name of the host from which KUMA receives metrics is displayed.
To determine on which host the Core is running, run the following command in the terminal of one of the controllers:
k0s kubectl get pod -n kuma -o wide
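In the command output, the NODE column shows the host on which the Core pod is running. As a sketch, you can narrow the output to the Core pod; the pod-name pattern is an assumption, so adjust it to the pod names in your deployment:
# Show only pods whose names contain "core"; the NODE column is the host running the Core
k0s kubectl get pod -n kuma -o wide | grep -i core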
Collector metrics
IO—metrics related to the service input and output
The number of events per second that match the filter conditions and are sent for processing. If a filter has been added to the collector service configuration, the collector processes only the events that match the filtering criteria.
Number of events per second sent to the destination.
Output Latency
The time in milliseconds that passed while sending an event packet and receiving a response from the destination. The median value is displayed.
Output Errors
The number of errors occurring per second while sending event packets to the destination. Network errors and errors writing to the disk buffer of the destination are displayed separately.
Output Event Loss
Number of events lost per second. Events can be lost due to network errors or errors writing to the disk buffer of the destination. Events are also lost if the destination responds with an error code, for example, in case of an invalid request.
Output Disk Buffer Size
The size of the disk buffer of the collector associated with the destination, in bytes. If a zero value is displayed, no event packets have been placed in the collector's disk buffer and the service is operating correctly.
Write Network BPS
The number of bytes written to the network per second.
The number of active queries sent to the ClickHouse cluster. This metric is displayed for each ClickHouse instance.
QPS
The number of queries per second sent to the ClickHouse cluster.
Failed QPS
The number of failed queries per second sent to the ClickHouse cluster.
Allocated memory
The amount of RAM allocated to the ClickHouse process. The value depends on the server specifications and may be displayed, for example, in MB or GB.
Active parts
The number of active parts.
Active parts are data (files on disk) that are currently being used to process queries.
Detached parts (count)
The number of detached (disconnected) parts.
Detached parts are data that exist on disk but do not participate in file read and write operations.
Detached parts (size)
Disk space taken up by detached parts.
You can specify the maximum size of detached parts in the range from 1% to 90%. The default setting is 1%.
If the size of detached parts exceeds the specified maximum value, KUMA assigns the yellow status to the running storage service in the Active services section.
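The Active parts and Detached parts values can also be cross-checked directly on a storage node through the ClickHouse system tables. A minimal sketch, assuming clickhouse-client access to the KUMA ClickHouse instance (connection parameters are omitted and depend on your deployment):
# Count of parts currently used to process queries
clickhouse-client --query "SELECT count() AS active_parts FROM system.parts WHERE active"
# Count of parts that exist on disk but are detached from tables
clickhouse-client --query "SELECT count() AS detached_parts FROM system.detached_parts"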
The number of events per second inserted into the ClickHouse instance.
Insert QPS
The number of ClickHouse instance insert queries per second sent to the ClickHouse cluster.
If the Insert QPS metric shows a growing queue of queries and its value is greater than 1, we recommend also checking the Batch size metric to adjust the buffering settings in the storage service configuration.
Example:
The Insert QPS metric is greater than 1 and equals 8.
The Batch size metric is 1.2 GB (the metric is reported in bytes).
In this case, you can find the buffer size by multiplying Insert QPS by Batch size:
8 * 1.2 = 9.6 GB.
Round the resulting value of 9.6 GB up to 10 GB and specify it in bytes (10000000000 bytes) as the Buffer size setting on the Advanced settings tab of the storage service configuration. Also specify a Buffer flush interval of 2 seconds. Increasing the buffer size and the buffer flush interval helps relieve the query queue. Normally, the Insert QPS metric should be less than 1.
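The same calculation can be scripted when tuning several storages. A minimal sketch, assuming the example metric values above (insert_qps and batch_size_gb are illustrative shell variables, not KUMA settings):
insert_qps=8            # example Insert QPS value from the Metrics section
batch_size_gb=1.2       # example Batch size value, in GB
awk -v qps="$insert_qps" -v gb="$batch_size_gb" 'BEGIN { printf "%.0f\n", qps * gb * 1000000000 }'
# Prints 9600000000 (bytes); round up, for example to 10000000000 bytes, for the Buffer size setting.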
Failed Insert QPS
The number of failed ClickHouse instance insert queries per second sent to the ClickHouse cluster.
Delayed Insert QPS
The number of delayed ClickHouse instance insert queries per second sent to the ClickHouse cluster. Queries were delayed by the ClickHouse node due to exceeding the soft limit on active merges.
Rejected Insert QPS
The number of rejected ClickHouse instance insert queries per second sent to the ClickHouse cluster. Queries were rejected by the ClickHouse node due to exceeding the hard limit on active merges.
Active Merges
The number of active merges.
Distribution Queue
The number of temporary files with events that could not be inserted into the ClickHouse instance because it was unavailable. These events cannot be found using search.
The number of active connections to the ZooKeeper cluster nodes. In normal operation, this number should be equal to the number of nodes in the ZooKeeper cluster.
Read-only Replicas
The number of read-only replicas of ClickHouse nodes. In normal operation, there should be no read-only replicas of ClickHouse nodes (see the query sketch after this group of replication metrics).
Active Replication Fetches
The number of active processes of downloading data from the ClickHouse node during data replication.
Active Replication Sends
The number of active processes of sending data to the ClickHouse node during data replication.
Active Replication Consistency Checks
The number of active data consistency checks on replicas of ClickHouse nodes during data replication.
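To cross-check the Read-only Replicas metric directly on a storage node, you can query the ClickHouse system.replicas table. A minimal sketch, assuming clickhouse-client access to the KUMA ClickHouse instance (connection parameters are omitted and depend on your deployment):
# Lists replicas that are currently in read-only mode; an empty result is expected in normal operation
clickhouse-client --query "SELECT database, table, replica_name FROM system.replicas WHERE is_readonly"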
The number of lookup procedure requests per second sent to the KUMA Core.
Lookup Latency
The time in milliseconds spent running lookup procedures. The value is displayed for the 99th percentile of lookup procedures; one percent of lookup procedures may take longer to run.
Propose RPS
The number of Raft (SQLite) propose procedure requests per second sent to the KUMA Core.
Propose Latency
The time in milliseconds spent running Raft (SQLite) propose procedures. The value is displayed for the 99th percentile of propose procedures; one percent of propose procedures may take longer to run.
Number of clients connected to the KUMA Core via SSE to receive server messages in real time. This number is normally equal to the number of clients that are using the KUMA web interface.
Errors
Number of errors per second while sending notifications to users.
The size of each disk buffer file on the KUMA Core node used by the tasks for sending audit events, monitoring, and collecting and analyzing data for correlators.
Number of events per second sent to the destination.
Output Latency
The time in milliseconds that passed while sending an event packet and receiving a response from the destination. The median value is displayed.
Output Errors
The number of errors occurring per second while sending event packets to the destination. Network errors and errors writing to the disk buffer of the destination are displayed separately.
Output Event Loss
Number of events lost per second. Events can be lost due to network errors or errors writing to the disk buffer of the destination. Events are also lost if the destination responds with an error code, for example, in case of an invalid request.
Output Disk Buffer Size
The size of the disk buffer of the collector associated with the destination, in bytes. If a zero value is displayed, no event packets have been placed in the collector's disk buffer and the service is operating correctly.
Write Network BPS
The number of bytes written to the network per second.
Number of events per second sent to the destination.
Output Latency
The time in milliseconds that passed while sending an event packet and receiving a response from the destination. The median value is displayed.
Output Errors
The number of errors occurring per second while sending event packets to the destination. Network errors and errors writing to the disk buffer of the destination are displayed separately.
Output Event Loss
Number of events lost per second. Events can be lost due to network errors or errors writing to the disk buffer of the destination. Events are also lost if the destination responds with an error code, for example, in case of an invalid request.
Output Disk Buffer Size
The size of the disk buffer of the collector associated with the destination, in bytes. If a zero value is displayed, no event packets have been placed in the collector's disk buffer and the service is operating correctly.
Write Network BPS
The number of bytes written to the network per second.
Number of events per second received within the tenant.
Metrics storage period
KUMA operation data is saved for 3 months by default. This storage period can be changed.
To change the storage period for KUMA metrics:
Log in to the OS of the server where the KUMA Core is installed.
In the file /etc/systemd/system/multi-user.target.wants/kuma-victoria-metrics.service, in the ExecStart parameter, edit the --retentionPeriod=<metrics storage period, in months> flag by specifying the required period. For example, --retentionPeriod=4 means that the metrics will be stored for 4 months.
Restart KUMA by running the following commands in sequence:
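A likely sequence, assuming the kuma-victoria-metrics.service unit file edited in the previous step (an assumption; verify the exact commands for your KUMA version):
# Reload systemd unit definitions after editing the service file
sudo systemctl daemon-reload
# Restart the VictoriaMetrics service so the new retention period takes effect
sudo systemctl restart kuma-victoria-metrics.service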