'ERROR failed to set scheduler' after restart of voxl-qvio-server
-
When voxl-qvio-server is started, it sets itself to use the FIFO scheduler with high priority. This is done here: https://gitlab.com/voxl-public/voxl-sdk/services/voxl-qvio-server/-/blob/master/server/main.cpp#L973
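For context, this is roughly the standard Linux call involved (a minimal sketch based on my reading of that file, not the actual ModalAI code; the exact priority value is an assumption):

#define _GNU_SOURCE
#include <errno.h>
#include <sched.h>
#include <stdio.h>
#include <string.h>

// Minimal sketch: switch the calling process to the SCHED_FIFO realtime
// scheduler, which is what voxl-qvio-server attempts at startup.
static int set_fifo_scheduler(void)
{
    struct sched_param param;
    memset(&param, 0, sizeof(param));
    // assumed priority; the real server may use a different value
    param.sched_priority = sched_get_priority_max(SCHED_FIFO);

    printf("setting scheduler\n");
    if (sched_setscheduler(0, SCHED_FIFO, &param) != 0) {
        // after a restart this is the path that fails with errno = 1 (EPERM)
        fprintf(stderr, "WARNING Failed to set priority, errno = %d\n", errno);
        fprintf(stderr, "ERROR failed to set scheduler\n");
        return -1;
    }
    printf("set FIFO priority successfully!\n");
    return 0;
}

int main(void)
{
    return (set_fifo_scheduler() == 0) ? 0 : 1;
}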
When the drone boots, this is successful the first time.
journalctl -u voxl-qvio-server
reports:
Jan 01 00:00:08 Drone_201 voxl-qvio-server[2552]: setting scheduler
Jan 01 00:00:08 Drone_201 voxl-qvio-server[2552]: set FIFO priority successfully!
Now when I execute
systemctl restart voxl-qvio-server
it is unable to set itself to use the FIFO scheduler with high priority.
journalctl -u voxl-qvio-server
reports:
Jul 01 08:23:52 Drone_201 voxl-qvio-server[11943]: WARNING Failed to set priority, errno = 1
Jul 01 08:23:52 Drone_201 voxl-qvio-server[11943]: This seems to be a problem with ADB, the scheduler
Jul 01 08:23:52 Drone_201 voxl-qvio-server[11943]: should work properly when this is a background process
Jul 01 08:23:52 Drone_201 voxl-qvio-server[11943]: ERROR failed to set scheduler
errno 1 is EPERM, so the kernel is refusing to let the process switch to a realtime scheduling policy. This seems similar to the issue reported here: https://bugzilla.redhat.com/show_bug.cgi?id=1467919.
What can we do to make sure that voxl-qvio-server is always running with the FIFO scheduler?
Version information:
yocto:~$ opkg list | grep voxl
libvoxl_cutils - 0.0.2 - ModalAI's c utils
libvoxl_io - 0.5.4 - ModalAI library allowing apps processor access to accessory serial ports
voxl-camera-calibration - 0.0.1 - On-board camera calibration for VOXL
voxl-camera-server - 0.8.1 - publishes camera frames over named pipe interface
voxl-cpu-monitor - 0.1.7 - publishes CPU Data over MPA pipe and provides fan tools
voxl-docker-support - 1.1.3 - tools to improve the usability of docker on VOXL
voxl-imu-server - 0.8.1 - VOXL IMU interface for Modal Pipe Architecture
voxl-mavlink - 0.0.2 - mavlink headers
voxl-mpa-tools - 0.2.7 - misc tools for modal pipe architecture
voxl-nodes - 0.1.7 - ROS nodes supported by ModalAI
voxl-portal - 0.1.2
voxl-qvio-server - 0.3.1 - publishes QVIO data over named pipe interface
voxl-qvio-server - 0.3.4
voxl-streamer - 0.2.3 - Gstreamer-based application to handle RTSP streaming
voxl-tag-detector - 0.0.2 - Detect apriltags for MPA
voxl-tflite - 0.0.1 - 64-bit tensorflow lite libraries
voxl-tflite-server - 0.1.1 - client of voxl-camera-server that does deep learning (object detection, monocular depth estimation)
voxl-utils - 0.8.5
voxl-vision-px4 - 0.9.2 - Interface between VOXL's computer vision services and PX4
Things I have already figured out from looking at the link mentioned above:
When the drone starts up the voxl-qvio-server task lives in the root cgroup:
yocto:~$ cat /sys/fs/cgroup/cpu,cpuacct/tasks | grep $(pidof voxl-qvio-server)
2514
Therefore it uses the realtime runtime budget of the root group, which by default is 0.95 seconds of runtime per 1-second period:
yocto:~$ cat /sys/fs/cgroup/cpu,cpuacct/cpu.rt_runtime_us
950000
It needs a realtime runtime budget greater than 0 to be able to set the scheduler to FIFO, so this is good.
Now when I execute
systemctl restart voxl-qvio-server
things change. The task doesn't live in the root cgroup anymore:
yocto:~$ cat /sys/fs/cgroup/cpu,cpuacct/tasks | grep $(pidof voxl-qvio-server)
yocto:~$
But now it lives in a new group:
yocto:~$ cat /sys/fs/cgroup/cpu,cpuacct/system.slice/voxl-qvio-server.service/tasks | grep $(pidof voxl-qvio-server)
13562
but this new group doesn't have any realtime runtime budget:
yocto:~$ cat /sys/fs/cgroup/cpu,cpuacct/system.slice/voxl-qvio-server.service/cpu.rt_runtime_us
0
and therefore it is unable to set the scheduler to FIFO with high priority. This is also mentioned in https://www.kernel.org/doc/Documentation/scheduler/sched-rt-group.txt at section 2.2:
By default all bandwidth is assigned to the root group and new groups get the
period from /proc/sys/kernel/sched_rt_period_us and a run time of 0. If you
want to assign bandwidth to another group, reduce the root group's bandwidth
and assign some or all of the difference to another group.If I manually assign bandwidth/realtime runtime budget to the voxl-qvio-server group it is able to set the scheduler to FIFO with high priority
echo 550000 > /sys/fs/cgroup/cpu,cpuacct/cpu.rt_runtime_us
echo 200000 > /sys/fs/cgroup/cpu,cpuacct/system.slice/cpu.rt_runtime_us
echo 200000 > /sys/fs/cgroup/cpu,cpuacct/system.slice/voxl-qvio-server.service/cpu.rt_runtime_us
I tried to script this and add it to the service, but then it fails at the first startup because
/sys/fs/cgroup/cpu,cpuacct/system.slice/voxl-qvio-server.service/cpu.rt_runtime_us
doesn't exist yet at that point. Maybe it is possible to apply it conditionally, but I wasn't able to get that working robustly yet (a sketch of what I have in mind is below, after my current service file). This link also looks interesting: https://lists.freedesktop.org/archives/systemd-devel/2017-July/039353.html. But at this point I thought it was better to ask on this forum whether you are aware of this problem and maybe already have a solution for it.

My service file is this:
yocto:~$ cat /etc/systemd/system/voxl-qvio-server.service
#
# Copyright (c) 2021 ModalAI, Inc.
#
[Unit]
Description=voxl-qvio-server
SourcePath=/usr/bin/voxl-qvio-server
After=voxl-wait-for-fs.service
Requires=voxl-wait-for-fs.service

[Service]
User=root
Type=simple
PIDFile=/run/voxl-qvio-server.pid
ExecStart=/usr/bin/voxl-qvio-server

[Install]
WantedBy=multi-user.target
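For completeness, the direction I was trying to take (untested beyond the manual experiment above, the drop-in file name is just an example, and the budget values are simply the ones I used by hand) is a drop-in such as /etc/systemd/system/voxl-qvio-server.service.d/rt-budget.conf that assigns the realtime budget right before ExecStart runs:

[Service]
# Assign realtime runtime budget to system.slice and to this service's cgroup
# before the main process starts. The '-' prefix tells systemd to ignore
# failures, e.g. on the first boot when this service's cpu cgroup does not
# exist yet and the task runs in the root cgroup anyway.
ExecStartPre=-/bin/sh -c 'echo 550000 > /sys/fs/cgroup/cpu,cpuacct/cpu.rt_runtime_us'
ExecStartPre=-/bin/sh -c 'echo 200000 > /sys/fs/cgroup/cpu,cpuacct/system.slice/cpu.rt_runtime_us'
ExecStartPre=-/bin/sh -c 'echo 200000 > /sys/fs/cgroup/cpu,cpuacct/system.slice/voxl-qvio-server.service/cpu.rt_runtime_us'

followed by systemctl daemon-reload and a restart. I haven't verified this behaves correctly across reboots, so treat it as a sketch rather than a working fix.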
I hope you can help me out. I think it is important that voxl-qvio-server is always running with the FIFO scheduler and high priority.
-
Thanks for investigating this. I was unable to recreate this over ADB but was able to over SSH. Are you logged in through ADB or SSH?
-
I was logged in through SSH. But we also have our own blowup handler running on the drone which can execute
systemctl restart voxl-qvio-server
when it has detected a blowup. In that case we are not connected through SSH or ADB, but it still fails to set the scheduler to FIFO. I'm also not sure whether voxl-qvio-server is ever restarted by ModalAI software, but if it is, I expect the same result.
-
Is there any update on this issue?
-
Hi,
We're still not sure what caused the scheduler issue. The FIFO scheduling was a new feature we were hoping to integrate into the stack when you found this bug, but we aren't able to right now. In the meantime we've pushed updated packages with these calls removed so that all of the packages can run as intended. If you update the packages via opkg to the latest versions, there should be no issues with the scheduler.