Monitor a node

This module will guide you through the steps to set up a monitoring tool that will allow you to track various metrics of an Octez node.

Introduction​

Until now, the only tool developers had to monitor the behavior of their Tezos node was to look at the logs, adjust the log verbosity, and reconstruct all relevant information from this stream. But getting more insight into a node's performance was tedious and difficult. For instance, the number of connected peers, the number of pending operations, or the number of times the validator switched branches, were not easy to observe continuously.

After a few iterations of different methods to gather node information and statistics that could be easily analyzed, we have recently chosen to include metrics within the node. With Octez Metrics it's simple to get a myriad of statistics about your node -- and quite efficiently so. You can also attach a Grafana dashboard to get a visual representation of how your node is performing. And with Grafazos, you can get customized ready-to-use dashboards for monitoring your Tezos node.

What metrics can we monitor ?

There are two types of metrics on an Octez node that can be monitored:

• hardware metrics
• node metrics

Here is an exhaustive list of Octez node metrics. Hardware metrics depend on your machine characteristics, but we will focus on performance metrics like the CPU usage, RAM, memory, etc.

Overview of monitoring tools​

What is Netdata ?​

Netdata is a light and open-source software that collects and exposes hardware metrics of a physical machine, and is capable of collecting fresh data every second. Netdata is a complete tool that provides various graphics visualization (histogram, tables, gauge, points...), alerting + triggering tools, and many other features...

What is Grafana ? ​

Basically, Grafana is a software that takes JSON in input and makes dashboards that you can display using your browser (the high-level web interface that displays all the metrics). In our case, JSONs are provided by Grafazos.

What is Grafazos ?​

Grafazos is a jsonnet library written by Nomadic Labs, which uses itself grafonet-lib, which is a jsonnet library to write Grafana dashboards as code.

What is jsonnet ? ​

Jsonnet is a programming language that allows to create JSONs easily.

What is Prometheus ? ​

Prometheus server is a toolkit that scrapes and stores time series data by making requests to the nodes. Basically, prometheus is fed by both netdata and tezos-metrics (which is deprecated and not used), and then, grafana displays the data gathered by prometheus.

Light monitoring of an Octez node with Netdata​

Set up the monitoring​

Step 1: Install Netdata on your node host

We recommend to look at "Install on Linux with one-line installer" to make a personnalised installation. Else, copy the following command:

wget -O /tmp/netdata-kickstart.sh https://my-netdata.io/kickstart.sh && sh /tmp/netdata-kickstart.sh
Step 2: Add the collector plugin for Octez metrics
This plugin scrapes the metrics of your octez node to display the metrics with Netdata. You have the choice between bash, python or go plugins. The best performing plugin is the plugin coded in go. However, if you want to modify a plugin to your liking, choose the language that suits you best. The plugin must be placed in a specific directory as described below:

Bash plugin:

Create a file named tezosMetrics.chart.sh in /usr/libexec/netdata/charts.d and paste the following bash code into the file.

#!/bin/bashurl=http://127.0.0.1:9091/metrics #Octez metrics' are avaible on port 9091 by default. You can changer the url variable if you haven't the same port.tezosMetrics_update_every=1tezosMetrics_priority=1lastTimestamp=tezosMetrics_get() {response=$(curl$url 2>/dev/null)#With this following command you get all metrics of your node except octez_version{...}eval "$(echo "$response" | grep -v \# | grep -v octez_version | sed -e 's/[.].*$//g' | sed -e 's/", /_/g'| sed -e 's/{/_/g' | sed -e 's/="/_/g' | sed -e 's/"}//g'| sed -e 's/ /=/g'| sed -e 's/<//g'| sed -e 's/>//g'| sed -e 's/\//_/g')"#this command allows to catch "octez_version" metriceval "$(echo "$response" | grep -v \# | grep octez_version | sed -e 's/, / \n/g' | sed -e 's/{/\n/g'| sed -e 's/}//g' | sed -e 's/ 0.*//g' |tail -n 6 |sed -e 's/^/octez_/')"return 0}#The following function is called once at netdata startup to check the availability of metricstezosMetrics_check() { status=$(curl -o /dev/null -s -w "%{http_code}\n" $url) [$status -eq 200 ] || return 1 return 0}#The following function is called once at netdata startup to create the chartstezosMetrics_create() {response=$(curl$url 2>/dev/null)metrics=$(echo "$response" | grep -v \# | grep -v octez_version | sed -e 's/", /_/g'| sed -e 's/{/_/g' | sed -e 's/="/_/g' |  sed -e 's/"}//g'| sed -e 's/[ ].*$//g'| sed -e 's/<//g'| sed -e 's/>//g'| sed -e 's/\//_/g')#CHART type.id name title units family context charttype priority update_every options plugin module#DIMENSION id name algorithm multiplier divisor options#More details here: https://learn.netdata.cloud/docs/agent/collectors/plugins.dfor metric in$metrics #this loop creates the charts and the dimensions of all metrics except octez_version metric (because of its format is a bit different)docat << EOFCHART tezosMetrics.$metric '' "$metric" "" tezosMetricsFamily tezosMetricsContext line $((tezosMetrics_priority))$tezosMetrics_update_every '' '' 'tezosMetrics'DIMENSION $metric absolute 1 1EOFdone#The code below creates the chart for octez_versioneval "$(echo "$response" | grep -v \# | grep octez_version | sed -e 's/, / \n/g' | sed -e 's/{/\n/g'| sed -e 's/}//g' | sed -e 's/ 0.*//g' |tail -n 6 |sed -e 's/^/octez_/')"cat << EOFCHART tezosMetrics.octez_version '' 'node_version:$octez_version | octez_chain_name: $octez_chain_name | octez_distributed_db_version:$octez_distributed_db_version | octez_p2p_version: $octez_p2p_version | octez_commit_hash:$octez_commit_hash | octez_commit_date: $octez_commit_date' '' tezosMetricsFamily tezosMetricsContext line$((tezosMetrics_priority)) $tezosMetrics_update_every '' '' 'tezosMetrics'DIMENSION 'octez_version' 'node_version:$octez_version | octez_chain_name: $octez_chain_name | octez_distributed_db_version:$octez_distributed_db_version | octez_p2p_version: $octez_p2p_version | octez_commit_hash:$octez_commit_hash | octez_commit_date: $octez_commit_date' absolute 1 1EOFreturn 0}#The following function update the values of the chartstezosMetrics_update() {tezosMetrics_get || return 1for metric in$metricsdocat << VALUESOFBEGIN tezosMetrics.$metric$1SET $metric =$[$metric]ENDVALUESOFdone#The code below updates the octez_version metric chartcat << EOFBEGIN tezosMetrics.octez_version$1SET octez_version = 0ENDEOFreturn 0}
Step 3: Open the metrics port of your node

When starting your node, add the --metrics-addr=:9091 option to open a port for the Octez metrics. (By default the port is 9091, but you can arbitrarily choose any other one, at your convenience. Just make sure to change the port in the plugin code accordingly.).

Here is an example, when launching your Tezos node:

tezos-node run --rpc-addr 127.0.0.1:8732 --log-output tezos.log --metrics-addr=:9091
Step 4: Restart the Netdata server

Once you've gotten the plugin on your machine, restart the Netdata server using:

sudo systemctl restart netdata
Step 5: Open the dashboard

You have in fact two ways to open your dashboard:

• You can access Netdata's dashboard by navigating to http://NODE:19999 in your browser (replace NODE by either localhost or the hostname/IP address of a remote node)

• You can also use Netdata Cloud to create custom dashboards, monitor several nodes, and invite users to watch your dashboard... It's free of charges and your data are never stored in a remote cloud server, but rather in the local disk of your machines.

We recommend to use Netdata Cloud for a more user-friendly usage. The rest of this tutorial is based on Netdata Cloud.

Step 6: Create a custom dashboard with Netdata Cloud!

Netdata provides a really intuitive tool to create custom dashboards. If you wants more details, read Netdata documentation.

Here is an exhaustive list of Octez node metrics. With Netdata, your node metrics will have this format: "tezosMetrics.name_of_your_metric".

For instance you can monitor the following relevant metrics:

• octez_version to get the chain name, version of your node, the Octez commit date, etc.
• octez_validator_chain_is_bootsrtapped to see if your node is bootstrapped (it returns 1 if the node is bootstrapped, 0 otherwise).
• octez_p2p_connections_outgoing which represents the number of outgoing connections.
• octez_validator_chain_last_finished_request_completion_timestamp which is the timestamp at which the latest request handled by the worker was completed.
• octez_p2p_peers_accepted is the number of accepted connections.
• octez_p2p_connections_active is the number of active connections.
• octez_store_invalid_blocks is the number of blocks known to be invalid stored on the disk.
• octez_validator_chain_head_round is the round (in the consensus protocol) of the current node’s head.
• ocaml_gc_allocated_bytes is the total number of bytes allocated since the program was started.
• octez_mempool_pending_applied is the number of pending operations in the mempool, waiting to be applied.

You can also monitor relevant hardware data (metrics names may differ depending on your hardware configuration):

• disk_space._ to get the remaining free space on your disks.
• cpu.cpu_freq to get the frequency of each core of your machine.
• system.ram to get the remaining free RAM and it's overall usage.

Plenty of others metrics are available, depending on your machine and needs.

This is an example of a Demo Dashboard with the following metrics:

• octez_version
• octez_validator_chain_is_bootsrtapped
• Disk Space Usage
• CPU Usage
• RAM Usage
• octez_p2p_connections_active
• octez_p2p_peers_running
• octez_store_last_written_block_size

Monitoring several nodes​

Step 1. Install Netdata on each of the machines hosting your nodes as described in step 1 of "Set up the monitoring" part.

Step 2. On Netdata Cloud click on the button Connect Nodes, and follow the instructions.

Netdata sets up alerts by default for some predefined metrics which may not fit your needs. You can set up personalized ones to be received automatically by email, Slack, or SMS.

How to reach config files?

Go to /etc/netdata and use the command edit-config to edit config files.

For example, to edit the cpu.conf health configuration file, run:

cd /etc/netdatasudo ./edit_config health.d/cpu.conf

To list config files available just use ./edit-config command. Alert file names have this format: health.d/file_name.conf

(Configuration files are stored in /usr/lib/netdata/conf.d and alerts in health.d folder)

Manage existing alerts: Edit health configuration files

You can manage existing alerts by editing health configuration file.

For example, here is the first health entity in health.d/cpu.conf:

template: 10min_cpu_usage #Name of the alarm/template.(required)      on: system.cpu #The chart this alarm should attach to.(required)      os: linux #Which operating systems to run this chart. (optional)   hosts: * #Which hostnames will run this alarm.(optional)  lookup: average -10m unaligned of user,system,softirq,irq,guest #The database lookup to find and process metrics for the chart specified through on.(required)   units: %   every: 1m #The frequency of the alarm.   #"warn" and "crit" expressions evaluating to true or false, and when true, will trigger the alarm.    warn: $this > (($status >= $WARNING) ? (75) : (85)) crit:$this > (($status ==$CRITICAL) ? (85) : (95))   delay: down 15m multiplier 1.5 max 1h #Optional hysteresis settings to prevent floods of notifications.    info: average cpu utilization for the last 10 minutes (excluding iowait, nice and steal)      to: sysadmin

Here is an exhaustive liste of alerts parameters.

Create personnalised alerts: Write a new health entity

To create personalised alerts, create a .conf file in /usr/lib/netdata/conf.d/health.d and edit this new file using the template shown previously:

As an example, here is a warning alert if the metric octez_p2p_connections_active is below 3 connections, and a critical alert if there are no connections at all (equal to 0):

alarm: tezos_node_p2p_connections_active (cannot be chart name, dimension name, family name, or chart variables names.)      on: octez_p2p_connections_active   hosts: *   lookup: average -10m unaligned of user,system,softirq,irq,guest     warn: $this<3 crit:$this==0    info:  "tezos_node_p2p_connections_actives" represents the current number of active p2p connections with your node.      to: sysadmin

(You can access official Netdata documentation here)

Basic documentation for configuring health alarms:
https://learn.netdata.cloud/docs/monitor/configure-alarms

Health configuration reference (detailed documentation):
https://learn.netdata.cloud/docs/agent/health/reference

Manage data retention​

The metrics collected every second by Netdata can be stored in the local memory of your machine for a variable duration of time, depending on the desired history. This allows you to keep track of the activity of your node and to watch its evolution over time.

Get netdata.conf file

You can choose how long data will be stored on your local machine. To do so, you will need the netdata.conf file in the /etc/netdata/ repository. If you haven't this file yet, you can download it using the following command:

wget -O /etc/netdata/netdata.conf http://localhost:19999/netdata.conf

You can view your current configuration at: http://localhost:19999/netdata.conf

Calculate the data base space needed

To calculate the space needed to store your metrics you will need:

• The time in seconds of how long time you want to keep the data
• The number of metrics (you can find it in your localhost dashboard as shown in the image below)
• The Tier of Netdata (by default 0).

For 2000 metrics, collected every second and retained for a month, Tier 0 needs: 1 byte x 2,000 metrics x 3,600 secs per hour x 24 hours per day x 30 days per month = 5,184MB.

Modify netdata.conf file

Now you just have to change the value of "dbengine multihost disk space" in [db] section in netdata.conf file by the value calculated before (in MB).

[db] dbengine multihost disk space MB =5184 dbengine page cache size MB = 32 # update every = 1 # mode = dbengine # dbengine page cache with malloc = no # dbengine disk space MB = 256 # memory deduplication (ksm) = yes # cleanup obsolete charts after secs = 3600 # gap when lost iterations above = 1 # storage tiers = 1 # dbengine page fetch timeout secs = 3 # dbengine page fetch retries = 3 # dbengine page descriptors in file mapped memory = no # cleanup orphan hosts after secs = 3600 # delete obsolete charts files = yes # delete orphan hosts files = yes # enable zero metrics = no # dbengine pages per extent = 64

Full monitoring of an Octez node with octez metrics​

Here is the global picture of a monitoring system, connecting all these tools together:

A Grafazos dashboard looks like this:

As you can immediately see at the top, the dashboard will tell you your node's bootstrap status and whether it's synchronized, followed by tables and graphs of other data points.

Node metrics​

In previous versions of Octez, a separate tool was needed for this task. tezos-metrics exported the metrics which were computed from the result of RPC calls to a running node. However, since the node API changed with each version, it required tezos-metrics to update alongside it, resulting in as many versions of tezos-metrics as Octez itself. Starting with Octez v14, metrics were integrated into the node and can be exported directly, making it simple to set up. Moreover, as the metrics are now generated by the node itself, no additional RPC calls are needed anymore. This is why the monitoring is now considerably more efficient!

Setting up Octez Metrics​

To use Octez Metrics, you just start your node with the metrics server enabled. The node integrates a server that registers the implemented metrics and outputs them for each /metrics HTTP request.

When you start your node you add the --metrics-addr option which takes as a parameter <ADDR:PORT> or <ADDR> or :<PORT>. This option can be used either when starting your node, or in the configuration file (see https://tezos.gitlab.io/user/node-configuration.html).

Your node is now ready to have metrics scraped with requests to the metrics server. For instance, if the node server is configured to expose metrics on port 9932 (the default), then you can scrape the metrics with the request http://localhost:9932/metrics. The result of the request is the list the node metrics described as:

#HELP metric description#TYPE metric typeoctez_metric_name{label_name=label_value} x.x

Note the metrics are implemented to have the lowest possible impact on the node performance, and most of the metrics are only computed when scraping it. So starting the node with the metrics server shouldn't be a cause for concern. More details on Octez Metrics can be found in the Tezos Developer Documentation: see here for further detail on how to setup your monitoring; and here, for the complete list of the metrics scrapped by the Octez node.

Types of metrics​

The available metrics give a full overview of your node, including its characteristics, status, and health. In addition, they can give insight into whether an issue is local to your node, or it is affecting the network at large -- or both.

The metric octez_version delivers the node's main properties through label-value pairs. It provides the node version or the network it is connected to.

Other primary metrics you likely want to see are the chain validator1 ones, which describe the status of your node: octez_validator_chain_is_bootstrapped and octez_validator_chain_synchronisation_status. A healthy node should always have these values set to 1. You can also see information about the head and requests from the chain validator.

There are two other validators, the block validator2 and the peer validator3, which give you insight on how your node is handling the progression of the chain. You can learn more about the validators here

To keep track of pending operations, you can check the octez_mempool metric.

You can get a view of your node's connections with the p2p layer metrics (prefixed with octez_p2p). These metrics allows you to keep track of the connections, peers and points of your node.

The store can also be monitored with metrics on the save-point, checkpoint and caboose level, including the number of invalid blocks stored, the last written block size, and the last store merge time.

Finally, if you use the RPC server of your node, it is likely decisive in the operation of your node. For each RPC called, two metrics are associated: octez_rpc_calls_sum{endpoint="...";method="..."} and octez_rpc_calls_count{endpoint="...";method="..."} (with appropriate label values). call_sum is the sum of the execution times, and call_count is the number of executions.

Note that the metrics described here are those available with Octez v14--it is likely to evolve with future Octez versions.

Dashboards​

While scraping metrics with server requests does give access to node metrics, it is, unfortunately, not enough for useful node monitoring. Since it only gives a single slice into the node's health, you don't really see what's happening over time. Therefore, a more useful way of monitoring your node is to create a time series of metrics.

Indeed, if you liked the poster, why not see the whole movie?

The Prometheus tool is designed for this purpose which collects metric data over time.

In addition, in order to get the most out of your metrics, it should be associated with a visual dashboard. A Grafana dashboard, generated by Grafazos gives a greater view into your node. Once your node is launched, you can provide extracted time series of metrics to Grafana dashboards.

The Grafazos version for Octez v14 provides the following four ready-to-use dashboards:

• octez-compact: A compact dashboard that gives a brief overview of the various node metrics on a single page.
• octez-basic: A basic dashboard with all the node metrics.
• octez-with-logs: Same as basic but also displays the node's logs, with Promtail Promtail (for exporting the logs).
• octez-full: A full dashboard with the logs and hardware data. This dashboard should be used with Netdata (for supporting hardware data) in addition to Promtail.

Note that the last two dashboards require the use of additional (though standard) tools for hardware metrics and logs (Netdata, Loki, and Promatail).

Let's look at the basic dashboard in more detail. The dashboard is divided into several panels. The first one is the node panel, which can be considered the main part of the dashboard. This panel lays out the core information on the node such as its status, characteristics, and statistics on the node's evolution (head level, validation time, operations, invalid blocks, etc.).

The others panels are specific to different parts of the node:

• the p2p layer;
• the workers;
• the RPC server;

along with a miscellaneous section.

Some metrics are self-explanatory, such as P2P total connections, which shows both the connections your node initiated and the number of connections initiated by peers. Another metric you may want to keep an eye on is Invalid blocks history, which should always be 0 -- any other value would indicate something unusual or malicious is going on.

Another useful metric is the Block validation time, which measures the time between when a request is registered in the worker till the worker pops the request and marks it complete. This should generally be under 1 second. If it's persistently longer, that could indicate trouble too.

The P2P connections graph will show you immediately if your node is having trouble connecting to peers, or if there's a drop-off in the number of connections. A healthy node should typically have a few dozen peer connections (depending on how it was configured).

The Peer validator graph shows a number of different metrics including unavailable protocols. An up-to-date, healthy node should see this as a low number. If not it can indicate that your node is running an old version of Octez, or that your node is being fed bad data from peers.

Note again these dashboards are built for Octez v14 and are likely to evolve with the Octez versions.

Working with Grafazos​

Grafazos allows you to set different options when generating the ready-to-use dashboards described above. For instance, you can specify the node instance label, which is useful for a dashboard that aims to monitor several nodes.

Furthermore, you can manually explore the metrics from the Prometheus data source with Grafana and design your own dashboards. Or you can also use Grafazos to import ready-to-use dashboards for your node monitoring. You can find the packages stored here. There is a package for each version of Octez.

Grafana is a relatively user-friendly tool, so play with creating a custom one as you like. You may also want to use the "explore" section of Grafana. Grafazos is also particularly useful in automatic deployment of Tezos nodes via provisioning tools such as Puppet or Ansible.

Conclusion​

We developed Octez Metrics to give Tezos users a better insight into how their node is performing, and to observe the overall network health. This is a product that will continuously improve, and we encourage node operators and bakers to suggest new features and additional metrics that they would like to see. We recognize that the best way to keep your node healthy -- and in turn, to keep the entire Tezos network healthy -- is to provide everyone the tools needed to monitor their setup.

1. The chain validator is responsible for handling valid blocks and selecting the best head for its chain.
2. The block validator validates blocks and notifies the corresponding chain validator.
3. Each peer validator treats new head proposals from its associated peer, retrieving all the operations, and if valid, triggers a validation of the new head.