Our support team is often asked how to organize monitoring for WebRTC streaming. A streamer usually needs to know what is happening on the “other side”: the stream quality, the number of viewers and other parameters. Stream quality is not constant and depends on many factors, such as the server load with or without transcoding, the transport protocol used (TCP or UDP), packet loss and NACK feedback. All the data needed to assess stream quality can be collected manually from various sources.

Stream degradation is a condition of a video/audio stream in which the picture or sound quality is unsatisfactory: there are artifacts, freezes, stuttering, or out-of-sync sound.

[Image: stream with artifacts]

10 Important Metrics

1. CPU

If the CPU load stays above 80% and/or periodically rises to 100%, the server is overloaded: it does not have enough processing power for its work, which will inevitably lead to degradation of streams.

You can estimate the current processor load with htop, top or mpstat. The screenshot below shows htop:

[Screenshot: htop]

Usually, the load average is evaluated: the average system load, displayed as three values for the last 1, 5 and 15 minutes. The lower the values, the better.

load average: 4.55 4.22 4.18

For a quad-core processor, such values mean that it has been working at the limit of its capabilities for at least the last 15 minutes, and prolonged work in this mode will lead to degradation of streams.
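To compare the load average with the number of cores, you can divide one by the other. The sketch below is only an illustration: the function name load_per_core is made up for this example, and the inputs are the values from the text above.

```shell
#!/bin/bash
# load_per_core LOAD CORES: print the load average divided by the core count.
# (Hypothetical helper name, used only for this illustration.)
load_per_core() {
    awk -v load="$1" -v cores="$2" 'BEGIN { printf "%.2f\n", load / cores }'
}

# Values from the example above: 1-minute load 4.55 on a quad-core CPU.
load_per_core 4.55 4

# On a live Linux host, the inputs can be read from the system instead:
# load_per_core "$(cut -d ' ' -f 1 /proc/loadavg)" "$(nproc)"
```

A result above 1.00 means that, on average, there were more runnable tasks than cores; here the script prints 1.14.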

2. Degraded streams percentage

The next metric is directly related to the previous one: degraded_streams_percent, the percentage of degraded streams. A stream is considered degraded if the server cannot process its queue in time.

The server is multithreaded. The video data of a published stream is queued, and the data is sent in a separate thread. Under high CPU load, when there is not enough performance for encoding/decoding, or when internal locks occur, these queues can grow and streams can degrade. Therefore, the percentage of degraded streams is an important metric that signals a loss of quality.

You can find the value of degraded_streams_percent on the statistics page http://demo.flashphoner.com:8081/?action=stat (where demo.flashphoner.com is the address of the WCS server)

[Screenshot: degraded streams on the statistics page]

The following few metrics are relevant to the Java Virtual Machine:

3. Java GC pause

Garbage collection pause time. As with the load average, the lower the number, the better; 100 ms is already very bad. When the Java machine runs the garbage collector, a “stop the world” pause can occur: the application is completely stopped while the garbage collection is running. All server threads stop for a moment, and this is enough to degrade the streams.

4. Java Heap

The amount of Java heap memory used. If it approaches -Xmx and stays there, the garbage collector works constantly, which means high CPU consumption, possible degradation of streams, and a higher likelihood of an “OutOfMemoryError,” which signals an irreversible corruption of the internal server state. If such an error occurs even once, the server is already broken, so it is recommended to restart it.

When an OutOfMemoryError occurs, the Java machine can stop any thread, thereby disrupting server operation, although externally the server will continue to work and accept connections.

5. Physical memory

The -Xmx limit for the heap does not guarantee a limit on resident (RES) physical memory usage. If the heap is given 16 GB, RES can take 32 GB or more, and this is not always a memory leak, especially in heavy mixing and transcoding scenarios. Therefore, the “real” memory must be monitored and must not be exhausted; otherwise, the operating system itself will terminate the server process.
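As a quick manual check, the resident memory of a process can be read from /proc on Linux. This is a minimal sketch, assuming a Linux host; the function name rss_mib is made up for this example, and the script inspects its own shell only to stay self-contained.

```shell
#!/bin/bash
# rss_mib PID: print the resident set size (RES) of a process in MiB.
# Linux only: the value is taken from the VmRSS field of /proc/PID/status.
rss_mib() {
    awk '/^VmRSS:/ { printf "%.0f\n", $2 / 1024 }' "/proc/$1/status"
}

# Demonstration on the current shell; replace $$ with the PID of the
# server's Java process to watch the real consumer.
rss_mib $$
```

Comparing this number with the -Xmx value shows how much memory the process uses beyond the Java heap.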

These metrics can be tracked using Java Mission Control, a powerful tool for monitoring and debugging a Java machine that runs on the user’s (administrator’s) PC. It was bundled with older Oracle JDK releases and is distributed as a separate download for newer ones.

[Screenshot: Java Mission Control]

6. Transcoder queues and transcoding time

Transcoding is a separate scenario that puts a high load on server resources. Section 2 described degraded streams, but that metric reacts late: when transcoders are overloaded, the transcoder queue metrics show the excess first, and only then do the degraded streams react. Transcoder queue metrics show the number of frames accumulated in the decoding/encoding queue. Frames are heavy, and at a frame rate of 30 FPS, 30 accumulated frames mean 1 second of latency.

The decoding/encoding time shows how quickly the corresponding operations run. For example, at 30 FPS, encoding one frame should take no more than 1/30 second ≈ 33 milliseconds. Otherwise, the transcoder cannot deliver frames on time, which ruins the stream.
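The arithmetic behind these two numbers can be sketched as follows. The function names are made up for this illustration; the inputs are the queue size and frame rate from the example above.

```shell
#!/bin/bash
# queue_latency_ms FRAMES FPS: latency added by FRAMES queued frames, in ms.
queue_latency_ms() {
    awk -v frames="$1" -v fps="$2" 'BEGIN { printf "%.0f\n", frames * 1000 / fps }'
}

# frame_budget_ms FPS: maximum time available to process one frame, in ms.
frame_budget_ms() {
    awk -v fps="$1" 'BEGIN { printf "%.1f\n", 1000 / fps }'
}

queue_latency_ms 30 30   # 30 queued frames at 30 FPS
frame_budget_ms 30       # per-frame budget at 30 FPS
```

The script prints 1000 and 33.3: a queue of 30 frames adds one second of latency, and a maximum encoding or decoding time above about 33 ms means the transcoder cannot keep up at 30 FPS.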

Decoding/encoding queue metrics:

transcoding_video_decoding_max_queue_size
transcoding_video_encoding_max_queue_size

Maximum decoding/encoding time metrics:

transcoding_video_decoding_max_time
transcoding_video_encoding_max_time

These metrics are available on the statistics page:

[Screenshot: transcoding metrics on the statistics page]

7. Number of free ports for streaming

Metric: ports_media_free

An important metric for highly loaded servers.
Ports are allocated for each stream. If the ports in the range run out, adding new streams on the server will cause an error.

Information on the number of ports can be found on the statistics page:

[Screenshot: port statistics on the statistics page]

8. Stream audio and video synchronization

Metric: streams_synchronization

An important metric, which is also called lipsync (synchronization of lips and voice in video broadcasts). Out of sync can be caused by a variety of reasons, ranging from poor quality stream source to server performance issues.

With a desynchronization of 100 milliseconds or more, viewers will already notice that something is wrong. The metric shows this in numbers, in milliseconds for each stream: a positive value means that audio is ahead of the video, and a negative value means that audio lags behind the video.

The screenshot below shows normal synchronization of audio and video in streams. The metric shows values from +2 to +79 milliseconds.

[Screenshot: streams_synchronization on the statistics page]
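The sign convention and the 100 ms threshold described above can be expressed as a small check. This is only a sketch: the function name describe_sync and the output wording are made up for this example, and it accepts whole milliseconds only.

```shell
#!/bin/bash
# describe_sync MS: interpret a streams_synchronization value (integer, ms).
# Positive: audio is ahead of video; negative: audio lags behind video.
describe_sync() {
    local ms=$1
    local abs=${ms#-}            # absolute value: strip a leading minus
    local state="ok"
    # 100 ms is the threshold from the text at which viewers notice the drift
    if [ "$abs" -ge 100 ]; then
        state="noticeable"
    fi
    if [ "$ms" -eq 0 ]; then
        echo "audio and video in sync ($state)"
    elif [ "$ms" -gt 0 ]; then
        echo "audio ahead by $abs ms ($state)"
    else
        echo "audio behind by $abs ms ($state)"
    fi
}

describe_sync 79
describe_sync -120
```

For the values +2 to +79 mentioned above, the check reports audio slightly ahead of video but within the acceptable range.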

9. Number of open transcoders

Transcoders, and encoders in particular, have the greatest impact on server load. Therefore, it is important to control their number.
If your scenario does not involve transcoding (for example, streaming without changing the resolution), the number of decoders and encoders should be zero.

Metrics:

native_resources.video_decoders
native_resources.video_encoders

In the screenshot below, transcoding is not used.

[Screenshot: native_resources on the statistics page]

10. Total load on transcoders

The number of transcoders is a good metric, but two encoders can differ by an order of magnitude in CPU utilization.
Compare, for example, encoding a 240p 15 FPS stream and a 1080p 60 FPS stream. Therefore, to get an idea of the load, it is important to consider not only the number of encoders but also their weight.

The encoder weight is w x h x FPS, where w and h are the image width and height in pixels. As a result, the metric reflects the real load that encoders put on the server, and one heavyweight encoder can show a higher load than 10 lightweight ones.
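The weight formula is easy to check with the two streams mentioned above. In this sketch, the function name encoder_weight is made up for the example, and 426x240 and 1920x1080 are assumed frame sizes for 240p and 1080p; the actual sizes depend on the aspect ratio of the stream.

```shell
#!/bin/bash
# encoder_weight WIDTH HEIGHT FPS: weight of one encoder, w x h x FPS.
encoder_weight() {
    echo $(( $1 * $2 * $3 ))
}

# Assumed frame sizes: 426x240 for 240p and 1920x1080 for 1080p.
light=$(encoder_weight 426 240 15)
heavy=$(encoder_weight 1920 1080 60)

echo "240p 15 FPS:  $light"
echo "1080p 60 FPS: $heavy"
echo "ratio:        $(( heavy / light ))"
```

Under these assumptions, the 1080p 60 FPS encoder weighs about 81 times more than the 240p 15 FPS one, which is why counting encoders alone is not enough.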

Metrics:

transcoding_video_decoding_load
transcoding_video_encoding_load

[Screenshot: decoding and encoding load on the statistics page]

The set of metrics on the statistics page is quite extensive, so checking them all by hand in real time is not very convenient, especially when you are running an active stream at the same time.

This is why monitoring systems exist. They collect all the statistics and present them in a convenient form. Besides observing the parameters, a monitoring system helps to detect and eliminate malfunctions during a broadcast in time.

So a monitoring system is an absolute must-have; it only remains to choose a specific “model.” Today there is a huge variety of monitoring systems for every taste and budget. Let’s get acquainted with a relatively young open source monitoring system, Prometheus.

Prometheus is an open source (Apache 2.0) time series DBMS written in Go. At its core, it simply stores your metrics. A notable feature of Prometheus is that it collects metrics from a given set of services itself (a pull model). Prometheus consists of separate components that communicate with each other over HTTP and is configured with YAML files. It is not a plug-and-play solution: Prometheus is a set of tools from which you build the kind of monitoring you need. In this article, we will discuss how to set up monitoring of a WCS server and WebRTC streams in Prometheus.

Preparing for installation

If your system uses a firewall, you need to open the following ports:

  • TCP 9090 — http for Prometheus server;
  • TCP 9093 — http for alert manager;
  • TCP and UDP 9094 — for alert manager;
  • TCP 9100 — for node_exporter;
  • TCP 3000 — http for Grafana server.

 

For example, using iptables:

sudo iptables -I INPUT 1 -p tcp --match multiport --dports 3000,9090,9093,9094,9100 -j ACCEPT
sudo iptables -I INPUT 1 -p udp --dport 9094 -j ACCEPT

If you are using a Red Hat-based operating system, SELinux will probably be enabled by default. You can disable it using the commands

sudo setenforce 0
sudo sed -i 's/^SELINUX=.*/SELINUX=disabled/g' /etc/selinux/config

We are now ready to install Prometheus and Grafana.

Installing and configuring Prometheus monitoring

Prometheus does not have its own repositories for popular distributions, and the official operating system repositories tend to lag behind upstream. Therefore, we will look at manual installation. This method is suitable for any Linux distribution running systemd (Ubuntu, Debian, CentOS, Arch, etc.). You will need to download the release archive, create a user, manually copy the necessary files, assign ownership, and create a unit for autorun.

You can download the archive from the official download page. Copy the link to the Linux package and download it using wget:

wget https://github.com/prometheus/prometheus/releases/download/v2.21.0/prometheus-2.21.0.linux-amd64.tar.gz

Create directories to copy the files for Prometheus:

sudo mkdir /etc/prometheus
sudo mkdir /var/lib/prometheus

Unpack the downloaded archive:

tar zxvf prometheus-*.linux-amd64.tar.gz

Go to the directory with the unpacked files:

cd prometheus-*.linux-amd64

Distribute files into directories:

sudo cp prometheus promtool /usr/local/bin/
sudo cp -r console_libraries consoles prometheus.yml /etc/prometheus

Create a user that will start the monitoring system:

sudo useradd --no-create-home --shell /bin/false prometheus

Set the owner for the directories that we created in the previous step:

sudo chown -R prometheus:prometheus /etc/prometheus /var/lib/prometheus

Set the owner for the copied files:

sudo chown prometheus:prometheus /usr/local/bin/{prometheus,promtool}

Configure Prometheus autorun:

Create a prometheus.service file:

sudo nano /etc/systemd/system/prometheus.service

Place the following text in it:

[Unit]
Description=Prometheus Service
After=network.target

[Service]
User=prometheus
Group=prometheus
Type=simple
ExecStart=/usr/local/bin/prometheus \
--config.file /etc/prometheus/prometheus.yml \
--storage.tsdb.path /var/lib/prometheus/ \
--web.console.templates=/etc/prometheus/consoles \
--web.console.libraries=/etc/prometheus/console_libraries
ExecReload=/bin/kill -HUP $MAINPID
Restart=on-failure

[Install]
WantedBy=multi-user.target

Re-read the systemd configuration:

sudo systemctl daemon-reload

Allow autorun:

sudo systemctl enable prometheus

After creating autorun, start Prometheus as a service:

sudo systemctl start prometheus

In the browser, open http://<server IP address>:9090. If everything is done correctly, you will see the Prometheus monitoring system console. So far it is of little use, but you can already click through the menus.

[Screenshot: Prometheus web interface]

So, let’s connect WCS to monitoring. Open the prometheus.yml settings file for editing

sudo nano /etc/prometheus/prometheus.yml

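Add a job for WCS to the scrape_configs section of the file. The default prometheus.yml already contains this section, so add only the job entry to it rather than a second scrape_configs key: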

scrape_configs:
  - job_name: 'flashphoner'
    metrics_path: '/'
    params:
      action: [stat]
      format: [prometheus]
    static_configs:
      - targets: ['WCS_address:8081']

where:

  • WCS_address – WCS server address;
  • 8081 – WCS server port for displaying statistics.
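To see what Prometheus will actually request with this job, it helps to assemble the URL from the target, metrics_path and params. The helper name scrape_url below is made up for this illustration, and demo.flashphoner.com stands in for your WCS address.

```shell
#!/bin/bash
# scrape_url TARGET PATH PARAM...: build the URL Prometheus requests for a
# job from its target, metrics_path and params (http is the default scheme).
scrape_url() {
    local target=$1 path=$2
    shift 2
    local query
    query=$(IFS='&'; echo "$*")   # join the remaining arguments with '&'
    echo "http://${target}${path}?${query}"
}

scrape_url demo.flashphoner.com:8081 / action=stat format=prometheus

# To look at the raw metrics by hand:
# curl -s "$(scrape_url demo.flashphoner.com:8081 / action=stat format=prometheus)"
```

The printed address is the statistics page with format=prometheus added. Before restarting the service, the edited file can also be validated with the promtool binary copied to /usr/local/bin earlier: promtool check config /etc/prometheus/prometheus.yml.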

 

[Screenshot: prometheus.yml configuration]

After changing the settings of prometheus.yml, restart the service:

sudo systemctl restart prometheus

Return to the Prometheus console, select the “Status => Targets” menu item:

[Screenshot: Status => Targets menu in the Prometheus web interface]

Make sure that Prometheus receives data from WCS (the target is in the “UP” state):

[Screenshot: Targets page in the Prometheus web interface]

This concludes the monitoring setup. Let’s move on to setting up charts.

Installing and configuring Grafana

Grafana is a web interface for various time series DBMSs such as Graphite, InfluxDB and Prometheus. In short, Grafana draws beautiful graphs from the data stored in Prometheus. We saw earlier that Prometheus has its own web interface, but it is extremely minimalistic and rather inconvenient, so even the Prometheus developers themselves recommend using Grafana. Well, who are we to argue with the developers?

Let’s get started.

Like Prometheus, Grafana will be installed manually, this time from a deb package. Find the link to the latest package at https://grafana.com/grafana/download and download it using wget:

wget https://dl.grafana.com/oss/release/grafana_7.2.1_amd64.deb

Install it:

sudo dpkg -i grafana_7.2.1_amd64.deb

Allow autorun:

sudo systemctl enable grafana-server

Run it:

sudo systemctl start grafana-server

Open the web interface at http://<server IP address>:3000. The default login and password are admin/admin. When you first log in, Grafana will immediately prompt you to change the password.

Change the password and add Prometheus as a data source:

Click the “Add data source” button:

[Screenshot: Add data source button]

and select Prometheus:

[Screenshot: selecting Prometheus as a data source]

Specify the parameters for the connection. It is enough to specify the address and port of the Prometheus Web interface. Press the Save & Test button to save:

[Screenshot: data source settings]

After a successful connection, create a new dashboard for the charts. From the menu on the left side of the window, choose Create – Dashboard:

[Screenshot: Create – Dashboard menu]

Click the “Add new panel” button:

[Screenshot: Add new panel button]

On the “Query” tab, select the previously created Prometheus (1) as a data source and select the metrics we are interested in from the list (2):

[Screenshot: selecting the data source and metrics on the Query tab]

Then you can go through the rest of the points by selecting specific parameters and the type of chart. After that, save the panel setting (3).

Set the dashboard name and save:

[Screenshot: saving the dashboard]

In the screenshot above, the graph displays all available metrics for stream_stats.

You can make a graph for a specific metric that interests you. To do this, you need to specify its name in the “Metrics” field. For example, in the screenshot below, we selected the “Number of incoming RTSP streams” metric:

[Screenshot: selecting a specific metric for a graph]

Now let’s create graphs for the key metrics that we discussed at the beginning of this article.

Metric selection lines:

Degraded streams percentage:

degraded_streams_stats{instance="demo.flashphoner.com:8081", job="flashphoner", param="degraded_streams"}
degraded_streams_stats{instance="demo.flashphoner.com:8081", job="flashphoner", param="degraded_streams_percent"}

Java heap:

core_stats{instance="demo.flashphoner.com:8081", job="flashphoner", param="core_heap_memory_used"}

Physical memory usage for Java:

core_stats{instance="demo.flashphoner.com:8081", job="flashphoner", param="core_java_freePhysicalMemorySize"}
core_stats{instance="demo.flashphoner.com:8081", job="flashphoner", param="core_java_totalPhysicalMemorySize"}

Transcoder queues and transcoding time:

transcoding_stats{instance="demo.flashphoner.com:8081", job="flashphoner", param="transcoding_video_decoding_max_queue_size"}
transcoding_stats{instance="demo.flashphoner.com:8081", job="flashphoner", param="transcoding_video_encoding_max_queue_size"}
transcoding_stats{instance="demo.flashphoner.com:8081", job="flashphoner", param="transcoding_video_decoding_max_time"}
transcoding_stats{instance="demo.flashphoner.com:8081", job="flashphoner", param="transcoding_video_encoding_max_time"}

Number of free ports for streaming:

port_stats{instance="demo.flashphoner.com:8081", job="flashphoner", param="ports_media_free"}

Audio and video synchronization in streams:

streams_stats{instance="demo.flashphoner.com:8081", job="flashphoner", name="s1", param="streams_synchronization"}

Number of open transcoders:

native_resources{instance="demo.flashphoner.com:8081", job="flashphoner", param="native_resources.video_decoders"}
native_resources{instance="demo.flashphoner.com:8081", job="flashphoner", param="native_resources.video_encoders"}

Total load on transcoders:

transcoding_stats{instance="demo.flashphoner.com:8081", job="flashphoner", param="transcoding_video_encoding_load"}
transcoding_stats{instance="demo.flashphoner.com:8081", job="flashphoner", param="transcoding_video_decoding_load"}

Dashboard view with charts:

[Screenshot: dashboard with charts]

So far, we have not covered all the metrics we are interested in: there are 10 metrics, and only 8 charts have been built. Let’s add the two missing ones, CPU and Java GC pause.

Operating system metrics

To get metrics from the operating system in which WCS is deployed, install and configure node_exporter. Since in our case the WCS server and the Prometheus server are different machines, TCP port 9100 must be opened on the WCS server.

For example, like this:

sudo firewall-cmd --zone=public --add-port=9100/tcp --permanent
sudo firewall-cmd --reload

The further installation process is the same as for Prometheus.

Download node_exporter from the official releases page:

wget https://github.com/prometheus/node_exporter/releases/download/v1.0.1/node_exporter-1.0.1.linux-amd64.tar.gz

Unpack the downloaded archive:

tar zxvf node_exporter-*.linux-amd64.tar.gz

And go to the directory with the unpacked files:

cd node_exporter-*.linux-amd64

Copy the executable file to /usr/local/bin:

sudo cp node_exporter /usr/local/bin/

Create user nodeusr:

sudo useradd --no-create-home --shell /bin/false nodeusr

Set the owner for the executable file:

sudo chown -R nodeusr:nodeusr /usr/local/bin/node_exporter

To configure autorun in systemd, create a node_exporter.service file:

sudo nano /etc/systemd/system/node_exporter.service

Place the following text in it:

[Unit]
Description=Node Exporter Service
After=network.target

[Service]
User=nodeusr
Group=nodeusr
Type=simple
ExecStart=/usr/local/bin/node_exporter
ExecReload=/bin/kill -HUP $MAINPID
Restart=on-failure

[Install]
WantedBy=multi-user.target

Re-read the systemd configuration:

sudo systemctl daemon-reload

Allow autorun:

sudo systemctl enable node_exporter

Run the service:

sudo systemctl start node_exporter

Open a web browser and go to http://<IP address of the WCS server>:9100/metrics to see the metrics collected by node_exporter:

[Screenshot: node_exporter metrics page]

Now let’s connect node_exporter on WCS to Prometheus server.

On the Prometheus server, open the prometheus.yml configuration file for editing:

sudo nano /etc/prometheus/prometheus.yml

and add a new job to collect data from node_exporter on the WCS server. Be careful with the indentation in the yml file:

  - job_name: 'WCS_node_exporter'
    scrape_interval: 5s
    static_configs:
      - targets: ['demo.flashphoner.com:9100']

[Screenshot: node_exporter job in prometheus.yml]

After changing the settings file prometheus.yml, restart the service:

sudo systemctl restart prometheus

Go to Grafana and add a graph for CPU load to the previously created dashboard:

[Screenshot: selecting the load average metrics in Grafana]

CPU metrics:

node_load1
node_load5
node_load15
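Before building the graph, you can confirm that node_exporter really exposes these three metrics. The sketch below filters them out of the /metrics output; the function name filter_load and the sample values are made up for this example.

```shell
#!/bin/bash
# filter_load: keep only the load average metrics from node_exporter output
# read on standard input.
filter_load() {
    grep -E '^node_load(1|5|15) '
}

# Illustrative sample in the Prometheus text format (values are made up).
sample='# HELP node_load1 1m load average.
# TYPE node_load1 gauge
node_load1 4.55
node_load15 4.18
node_load5 4.22
node_memory_MemFree_bytes 1.2e+09'

printf '%s\n' "$sample" | filter_load

# Against a live exporter:
# curl -s http://<IP address of the WCS server>:9100/metrics | filter_load
```

If the live command prints three node_load lines, Prometheus will be able to collect them from the same address.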

Collecting statistics using external scripts

WCS provides the ability to collect statistics using an external script.

To enable this function, specify the path to the statistics script in the “flashphoner.properties” file. For example, for the custom_stats.sh file located in the /usr/local/FlashphonerWebCallServer/bin directory, the setting looks as follows:

custom_stats_script=custom_stats.sh

If you have a different path to the script file, then you need to specify it in full:

custom_stats_script=/path/to/custom_stats.sh

Let’s write a script that collects pause data for the Java garbage collector:

#!/bin/bash
# Prints the duration of the last Java GC pause in milliseconds as gc_pause=<value>

WCS_HOME="/usr/local/FlashphonerWebCallServer"
# The most recent GC log file
LAST_LOG=$(ls -t "${WCS_HOME}/logs/" | grep gc-core | head -1)
LOG="${WCS_HOME}/logs/${LAST_LOG}"
# Java version string, for example 1.8.0_265 or 11.0.8
JAVA_VER=$(java -version 2>&1 | head -n 1 | awk -F '"' '{print $2}')
# Garbage collector set in the WCS configuration (comments and empty lines are skipped)
TYPE_GC="$(grep -Pv '^(#|$)' "${WCS_HOME}/conf/wcs-core.properties" | grep -oE 'ConcMarkSweepGC|ZGC')"

if [[ $JAVA_VER == 1.* ]]; then
    # Java 8 and older (version string starts with "1."): legacy log format, pause in seconds
    gc_pause=$(grep 'Allocation Failure' "$LOG" | tail -1 | awk -F'->' '{print $3}' | sed -rn 's/([0-9]+)K\(([0-9]+)K\), ([0-9]+.[0-9]+).*/\3/p' | tr , . | awk '{printf "%f\n", $1 * 1000 }')
    echo "gc_pause=$gc_pause"
elif [[ $TYPE_GC == "ConcMarkSweepGC" ]]; then
    # Newer Java with CMS: the pause is logged in milliseconds
    gc_pause=$(grep 'Allocation Failure' "$LOG" | tail -1 | awk '{print $8}' | sed 's/ms$//')
    echo "gc_pause=$gc_pause"
elif [[ $TYPE_GC == "ZGC" ]]; then
    # Newer Java with ZGC: sum of the last three pause phases
    gc_pause=$(grep '.*GC.*Pause' "$LOG" | awk -F 'Pause Mark Start|End|Relocate Start' '{print $2}' | tail -3 | sed 's/ms$//' | awk '{t += $1} END {print t}')
    echo "gc_pause=$gc_pause"
fi

This script checks the Java version installed in the system and, depending on the version and the configured garbage collector (ConcMarkSweepGC or ZGC), finds the lines with garbage collection pause data in the GC log and outputs the pause value (in milliseconds) as the “gc_pause” parameter.
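To see how the first branch of the script (the older log format, where the pause is written in seconds) extracts the value, you can run the same pipeline on a single log line. The line below is an illustrative sample in the JDK 8 -XX:+PrintGCDetails style with made-up numbers, and the function name gc_pause_ms is introduced only for this example.

```shell
#!/bin/bash
# Illustrative JDK 8 style GC log line (the numbers are made up).
line='12.345: [GC (Allocation Failure) 12.345: [ParNew: 279616K->34944K(314560K), 0.0305410 secs] 279616K->40312K(1013632K), 0.0306550 secs] [Times: user=0.10 sys=0.01, real=0.03 secs]'

# gc_pause_ms: read one log line on stdin and print the total pause in ms.
gc_pause_ms() {
    # The text after the second "->" starts with the heap size after
    # collection, followed by the pause duration in seconds.
    awk -F'->' '{print $3}' |
        sed -E -n 's/([0-9]+)K\(([0-9]+)K\), ([0-9]+.[0-9]+).*/\3/p' |
        tr , . |
        awk '{ printf "%f\n", $1 * 1000 }'
}

echo "$line" | gc_pause_ms
```

For this sample, the pipeline picks out 0.0306550 seconds and prints 30.655000, which is the value that would appear as gc_pause on the statistics page.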

The statistics that we have collected using an external script are displayed on the statistics page http://demo.flashphoner.com:8081/?action=stat in the “Custom info” section:

[Screenshot: Custom info section on the statistics page]

The previously configured monitoring system Prometheus takes it from there.

Now let’s add a graph for this metric to the dashboard in Grafana:

[Screenshot: selecting the gc_pause metric in Grafana]

Metric for Java GC pause:

custom_stats{instance="demo.flashphoner.com:8081", job="flashphoner", param="gc_pause"}

General view of a dashboard with graphs for monitoring WCS performance metrics

[Screenshot: dashboard with all WCS performance graphs]

We have collected and built graphs for server performance metrics. Now let’s look at how you can build graphs for the parameters of a specific stream that will be published on the server.

Collecting metrics for streams

If you execute a REST request

/stream/metrics

the result will be a fairly extensive list of metrics. Metrics like “VIDEO_HEIGHT,” “VIDEO_WIDTH,” “VIDEO_CODEC” and “AUDIO_CODEC” are self-explanatory, so let’s look at the other metrics in more detail.

The metrics “AUDIO_SYNC” and “VIDEO_SYNC” are familiar metrics for audio and video synchronization in published streams.

The metric “VIDEO_K_FRAMES” is the number of key frames (I-frames) in the video. A key frame contains all information about the image in the frame and does not depend on other frames. Key frames have the lowest compression ratio. If the browser does not send key frames evenly, this can lead to freezes when playing the stream. The next metric is closely related to this one:

The metric “VIDEO_PLI” is the number of PLI requests for key frames from the browser. A PLI can be sent when the stream receiver has lost a full frame or multiple frames. A large number of PLI requests indicates a bad channel.

The metrics “VIDEO_P_FRAMES” and “VIDEO_B_FRAMES” are the numbers of P-frames (predicted) and B-frames (bidirectional), respectively. A P-frame can refer to blocks of the image in the previous key frame or in previous P-frames. This makes P-frames smaller than key frames: they contain less information about the image. A B-frame does not contain the entire image either, but unlike a P-frame, it can also refer to blocks in subsequent frames, up to the next P-frame, hence its name. B-frames take up even less space than P-frames. Decoding a B-frame is more complicated than decoding a P-frame: first, the frame that follows the B-frame is decoded according to the scheme described above, and then the B-frame itself is decoded using the information from its neighboring frames.

As a result, the more often B-frames appear in a stream, the more resources are required to decode it, which can negatively affect the video quality. Moreover, browsers cannot play B-frames correctly, and if the stream contains bidirectional frames, the image will twitch.

The metrics “AUDIO_RATE” and “VIDEO_RATE” show the bitrate of audio or video.

The metric “VIDEO_FPS” refers to video frame rate.

Three more metrics related to packet loss on communication channels:

“VIDEO_NACK”: the number of NACK requests. A NACK (negative acknowledgement) is a way for the receiving party to indicate that it has not received a particular packet or list of packets.

“VIDEO_LOST”: the number of video packets lost.

“AUDIO_LOST”: the number of audio packets lost.

A large number of NACK requests and lost packets also indicates problems with communication channels. They show up as freezes in the stream, loss of sound and other unpleasant effects.

Let’s connect these metrics to collection in Prometheus. To do this, in the settings file prometheus.yml

sudo nano /etc/prometheus/prometheus.yml

add a new job to collect the metrics for streams. Be careful with the indentation in the yml file:

  - job_name: 'stream_metrics'
    metrics_path: '/'
    params:
      action: [stat]
      format: [prometheus]
      groups: [publish_streams]
    static_configs:
      - targets: ['demo.flashphoner.com:8081']

[Screenshot: stream_metrics job in prometheus.yml]

Save the configuration and restart Prometheus:

sudo systemctl restart prometheus

Now let’s move on to Grafana and create a new dashboard for the stream metric charts. Select Create – Dashboard from the menu on the left side of the window:

[Screenshot: Create – Dashboard menu]

Click the “Add new panel” button:

[Screenshot: Add new panel button]

On the “Query” tab, select Prometheus as a data source and select publish => publish_streams from the list of metrics:

[Screenshot: selecting publish_streams in the list of metrics]

As a result, we get graphs for all available metrics for each published stream. For clarity, you can display the metrics of interest on a separate graph. For example, here is the list of metrics for a stream named “stream1” published on demo.flashphoner.com:

publish_streams{instance="demo.flashphoner.com:8081", job="stream_metrics", name="stream1", param="AUDIO_CODEC"}
publish_streams{instance="demo.flashphoner.com:8081", job="stream_metrics", name="stream1", param="AUDIO_LOST"}
publish_streams{instance="demo.flashphoner.com:8081", job="stream_metrics", name="stream1", param="AUDIO_RATE"}
publish_streams{instance="demo.flashphoner.com:8081", job="stream_metrics", name="stream1", param="AUDIO_SYNC"}
publish_streams{instance="demo.flashphoner.com:8081", job="stream_metrics", name="stream1", param="VIDEO_B_FRAMES"}
publish_streams{instance="demo.flashphoner.com:8081", job="stream_metrics", name="stream1", param="VIDEO_CODEC"}
publish_streams{instance="demo.flashphoner.com:8081", job="stream_metrics", name="stream1", param="VIDEO_FPS"}
publish_streams{instance="demo.flashphoner.com:8081", job="stream_metrics", name="stream1", param="VIDEO_HEIGHT"}
publish_streams{instance="demo.flashphoner.com:8081", job="stream_metrics", name="stream1", param="VIDEO_K_FRAMES"}
publish_streams{instance="demo.flashphoner.com:8081", job="stream_metrics", name="stream1", param="VIDEO_LOST"}
publish_streams{instance="demo.flashphoner.com:8081", job="stream_metrics", name="stream1", param="VIDEO_NACK"}
publish_streams{instance="demo.flashphoner.com:8081", job="stream_metrics", name="stream1", param="VIDEO_PLI"}
publish_streams{instance="demo.flashphoner.com:8081", job="stream_metrics", name="stream1", param="VIDEO_P_FRAMES"}
publish_streams{instance="demo.flashphoner.com:8081", job="stream_metrics", name="stream1", param="VIDEO_RATE"}
publish_streams{instance="demo.flashphoner.com:8081", job="stream_metrics", name="stream1", param="VIDEO_SYNC"}
publish_streams{instance="demo.flashphoner.com:8081", job="stream_metrics", name="stream1", param="VIDEO_WIDTH"}

An example of such a dashboard is below:

[Screenshot: dashboard with graphs for a published stream]

As you can see, Prometheus is an interesting and promising tool. This article does not claim to be a complete description of this monitoring system: we have reviewed only a special case, monitoring of WCS and its streams. For example, Prometheus can also send alerts, which we did not consider. After reading this article, you know how Prometheus and Grafana are used together and can apply them in practice.

Happy streaming!
