To monitor the performance of KUMA services and the flow of events and correlations, KUMA collects and stores a large amount of data. The VictoriaMetrics time series database is used to collect, store and analyze the data.
In the KUMA Console, the Metrics section contains dashboards that visualize the key performance indicators of various KUMA services. The collected metrics are visualized using Grafana. Selecting the Metrics section opens an automatically updated Grafana portal that is deployed as part of the KUMA Core installation process.
The KUMA Core service configures VictoriaMetrics and Grafana automatically; no user action is required. Graphs in the Metrics section appear with a delay of approximately 1.5 minutes. If the Metrics section shows "core:<port number>", the metrics were received from the host on which the KUMA Core is installed. In other configurations, the name of the host from which KUMA receives metrics is displayed.
To determine on which host the Core is running, run the following command in the terminal of one of the controllers:
k0s kubectl get pod -n kuma -o wide
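In the command output, the NODE column shows the host on which the Core pod is running. As a sketch, you can narrow the output to the Core pod; the pod-name pattern is an assumption, so adjust it to the pod names in your deployment:
# Show only pods whose names contain "core"; the NODE column is the host running the Core
k0s kubectl get pod -n kuma -o wide | grep -i core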
Collector metrics
IO—metrics related to the service input and output
The number of events per second that match the filter conditions and are sent for processing. If a filter has been added to the collector service configuration, the collector processes only the events that match the filtering criteria.
Number of events per second sent to the destination.
Output Latency
The time in milliseconds that passed while sending an event packet and receiving a response from the destination. The median value is displayed.
Output Errors
The number of errors occurring per second while sending event packets to the destination. Network errors and errors writing to the disk buffer of the destination are displayed separately.
Output Event Loss
Number of events lost per second. Events can be lost due to network errors or errors writing to the disk buffer of the destination. Events are also lost if the destination responds with an error code, for example, in case of an invalid request.
Output Disk Buffer Size
The size of the disk buffer of the collector associated with the destination, in bytes. If a zero value is displayed, no event packets have been placed in the collector's disk buffer and the service is operating correctly.
Write Network BPS
The number of bytes written to the network per second.
The number of active queries sent to the ClickHouse cluster. This metric is displayed for each ClickHouse instance.
QPS
The number of queries per second sent to the ClickHouse cluster.
Failed QPS
The number of failed queries per second sent to the ClickHouse cluster.
Allocated memory
The amount of RAM allocated to the ClickHouse process. The value depends on the server specifications and may be displayed, for example, in MB or GB.
Active parts
The number of active parts.
Active parts are data (files on disk) that are currently being used to process queries.
Detached parts (count)
The number of detached (disconnected) parts.
Detached parts are data that exist on disk but do not participate in file read and write operations.
Detached parts (size)
Disk space taken up by detached parts.
You can specify the maximum size of detached parts in the range from 1% to 90%. The default setting is 1%.
If the size of detached parts exceeds the specified maximum value, KUMA assigns the yellow status to the running storage service in the Active services section.
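The Active parts and Detached parts values can also be cross-checked directly on a storage node through the ClickHouse system tables. A minimal sketch, assuming clickhouse-client access to the KUMA ClickHouse instance (connection parameters are omitted and depend on your deployment):
# Count of parts currently used to process queries
clickhouse-client --query "SELECT count() AS active_parts FROM system.parts WHERE active"
# Count of parts that exist on disk but are detached from tables
clickhouse-client --query "SELECT count() AS detached_parts FROM system.detached_parts"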
The number of events per second inserted into the ClickHouse instance.
Insert QPS
The number of ClickHouse instance insert queries per second sent to the ClickHouse cluster.
If the Insert QPS metric shows a growing queue of queries and its value is greater than 1, we recommend also checking the Batch size metric to adjust the buffering settings in the storage service configuration.
Example:
The Insert QPS metric is greater than 1 and equals 8.
The Batch size metric is 1.2 GB (the metric is reported in bytes).
In this case, you can find the buffer size by multiplying Insert QPS by Batch size:
8 * 1.2 = 9.6 GB.
Round the resulting value of 9.6 GB up to 10 GB and specify it in bytes (10000000000 bytes) as the Buffer size setting on the Advanced settings tab of the storage service configuration. Also specify a Buffer flush interval of 2 seconds. Increasing the buffer size and the buffer flush interval helps relieve the query queue. Normally, the Insert QPS metric should be less than 1.
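The same calculation can be scripted when tuning several storages. A minimal sketch, assuming the example metric values above (insert_qps and batch_size_gb are illustrative shell variables, not KUMA settings):
insert_qps=8            # example Insert QPS value from the Metrics section
batch_size_gb=1.2       # example Batch size value, in GB
awk -v qps="$insert_qps" -v gb="$batch_size_gb" 'BEGIN { printf "%.0f\n", qps * gb * 1000000000 }'
# Prints 9600000000 (bytes); round up, for example to 10000000000 bytes, for the Buffer size setting.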
Failed Insert QPS
The number of failed ClickHouse instance insert queries per second sent to the ClickHouse cluster.
Delayed Insert QPS
The number of delayed ClickHouse instance insert queries per second sent to the ClickHouse cluster. Queries were delayed by the ClickHouse node due to exceeding the soft limit on active merges.
Rejected Insert QPS
The number of rejected ClickHouse instance insert queries per second sent to the ClickHouse cluster. Queries were rejected by the ClickHouse node due to exceeding the hard limit on active merges.
Active Merges
The number of active merges.
Distribution Queue
The number of temporary files with events that could not be inserted into the ClickHouse instance because it was unavailable. These events cannot be found using search.
The number of active connections to the ZooKeeper cluster nodes. In normal operation, this number should be equal to the number of nodes in the ZooKeeper cluster.
Read-only Replicas
The number of read-only replicas of ClickHouse nodes. In normal operation, there should be no read-only replicas of ClickHouse nodes (see the query sketch after this group of replication metrics).
Active Replication Fetches
The number of active processes of downloading data from the ClickHouse node during data replication.
Active Replication Sends
The number of active processes of sending data to the ClickHouse node during data replication.
Active Replication Consistency Checks
The number of active data consistency checks on replicas of ClickHouse nodes during data replication.
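To cross-check the Read-only Replicas metric directly on a storage node, you can query the ClickHouse system.replicas table. A minimal sketch, assuming clickhouse-client access to the KUMA ClickHouse instance (connection parameters are omitted and depend on your deployment):
# Lists replicas that are currently in read-only mode; an empty result is expected in normal operation
clickhouse-client --query "SELECT database, table, replica_name FROM system.replicas WHERE is_readonly"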
The number of lookup procedure requests per second sent to the KUMA Core.
Lookup Latency
The time in milliseconds spent running lookup procedures. The value is displayed for the 99th percentile of lookup procedures; one percent of lookup procedures may take longer to run.
Propose RPS
The number of Raft (SQLite) propose procedure requests per second sent to the KUMA Core.
Propose Latency
The time in milliseconds spent running Raft (SQLite) propose procedures. The value is displayed for the 99th percentile of propose procedures; one percent of propose procedures may take longer to run.
Number of clients connected to the KUMA Core via SSE to receive server messages in real time. This number is normally equal to the number of clients that are using the KUMA web interface.
Errors
Number of errors per second while sending notifications to users.
The size of each disk buffer file on the KUMA Core node used by the tasks for sending audit events, monitoring, and collecting and analyzing data for correlators.
Number of events per second sent to the destination.
Output Latency
The time in milliseconds that passed while sending an event packet and receiving a response from the destination. The median value is displayed.
Output Errors
The number of errors occurring per second while sending event packets to the destination. Network errors and errors writing to the disk buffer of the destination are displayed separately.
Output Event Loss
Number of events lost per second. Events can be lost due to network errors or errors writing to the disk buffer of the destination. Events are also lost if the destination responds with an error code, for example, in case of an invalid request.
Output Disk Buffer Size
The size of the disk buffer of the collector associated with the destination, in bytes. If a zero value is displayed, no event packets have been placed in the collector's disk buffer and the service is operating correctly.
Write Network BPS
The number of bytes written to the network per second.
Number of events per second sent to the destination.
Output Latency
The time in milliseconds that passed while sending an event packet and receiving a response from the destination. The median value is displayed.
Output Errors
The number of errors occurring per second while sending event packets to the destination. Network errors and errors writing to the disk buffer of the destination are displayed separately.
Output Event Loss
Number of events lost per second. Events can be lost due to network errors or errors writing to the disk buffer of the destination. Events are also lost if the destination responds with an error code, for example, in case of an invalid request.
Output Disk Buffer Size
The size of the disk buffer of the collector associated with the destination, in bytes. If a zero value is displayed, no event packets have been placed in the collector's disk buffer and the service is operating correctly.
Write Network BPS
The number of bytes written to the network per second.
Number of events per second received within the tenant.
Metrics storage period
KUMA operation data is saved for 3 months by default. This storage period can be changed.
To change the storage period for KUMA metrics:
Log in to the OS of the server where the KUMA Core is installed.
In the file /etc/systemd/system/multi-user.target.wants/kuma-victoria-metrics.service, in the ExecStart parameter, edit the --retentionPeriod=<metrics storage period, in months> flag by specifying the required period. For example, --retentionPeriod=4 means that the metrics will be stored for 4 months.
Restart KUMA by running the following commands in sequence:
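A likely sequence, assuming the kuma-victoria-metrics.service unit file edited in the previous step (an assumption; verify the exact commands for your KUMA version):
# Reload systemd unit definitions after editing the service file
sudo systemctl daemon-reload
# Restart the VictoriaMetrics service so the new retention period takes effect
sudo systemctl restart kuma-victoria-metrics.service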