'ERROR failed to set scheduler' after restart of voxl-qvio-server
When voxl-qvio-server is started, it sets itself to use the FIFO scheduler with high priority. This is done here: https://gitlab.com/voxl-public/voxl-sdk/services/voxl-qvio-server/-/blob/master/server/main.cpp#L973
When the drone boots, this is successful the first time.
journalctl -u voxl-qvio-server
reports:
Jan 01 00:00:08 Drone_201 voxl-qvio-server: setting scheduler
Jan 01 00:00:08 Drone_201 voxl-qvio-server: set FIFO priority successfully!
Now when I execute
systemctl restart voxl-qvio-server
it is unable to set itself to use the FIFO scheduler with high priority.
journalctl -u voxl-qvio-server
reports:
Jul 01 08:23:52 Drone_201 voxl-qvio-server: WARNING Failed to set priority, errno = 1
Jul 01 08:23:52 Drone_201 voxl-qvio-server: This seems to be a problem with ADB, the scheduler
Jul 01 08:23:52 Drone_201 voxl-qvio-server: should work properly when this is a background process
Jul 01 08:23:52 Drone_201 voxl-qvio-server: ERROR failed to set scheduler
It seems that this is similar to the issue reported here: https://bugzilla.redhat.com/show_bug.cgi?id=1467919.
What can we do to make sure that voxl-qvio-server is always running with the FIFO scheduler?
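One way to confirm which policy the server actually ended up with, independent of its own log output, is to read fields 40 (rt_priority) and 41 (policy) from /proc/<pid>/stat. The sketch below is only an illustration: the helper names sched_fields and policy_name are made up, the field positions are taken from proc(5), and policy value 1 corresponds to SCHED_FIFO on Linux.

```shell
#!/bin/sh
# Sketch: decode the scheduling policy of a process from its /proc/<pid>/stat line.
# Helper names are hypothetical; field positions follow proc(5).

# sched_fields STAT_LINE -> prints "<policy> <rt_priority>"
sched_fields() {
    # Field 2 (comm) is wrapped in parentheses and may contain spaces,
    # so drop everything up to the last ") " before splitting.
    rest=${1##*") "}
    # $rest starts at field 3 (state), so stat field N is word N-2 here.
    set -- $rest
    [ "$#" -ge 39 ] || return 1
    # After the shift, $1 is field 40 (rt_priority) and $2 is field 41 (policy).
    shift 37
    echo "$2 $1"
}

# policy_name NUMBER -> prints the symbolic name of a Linux scheduling policy
policy_name() {
    case $1 in
        0) echo SCHED_OTHER ;;
        1) echo SCHED_FIFO ;;
        2) echo SCHED_RR ;;
        *) echo "unknown($1)" ;;
    esac
}

# Example use on the drone (not executed here):
#   sched_fields "$(cat /proc/$(pidof voxl-qvio-server)/stat)"
```

If the first number printed is 1, the process is running under SCHED_FIFO; a 0 would mean it fell back to the default policy, as happens after the failed restart described above.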
yocto:~$ opkg list | grep voxl
libvoxl_cutils - 0.0.2 - ModalAI's c utils
libvoxl_io - 0.5.4 - ModalAI library allowing apps processor access to accessory serial ports
voxl-camera-calibration - 0.0.1 - On-board camera calibration for VOXL
voxl-camera-server - 0.8.1 - publishes camera frames over named pipe interface
voxl-cpu-monitor - 0.1.7 - publishes CPU Data over MPA pipe and provides fan tools
voxl-docker-support - 1.1.3 - tools to improve the usability of docker on VOXL
voxl-imu-server - 0.8.1 - VOXL IMU interface for Modal Pipe Architecture
voxl-mavlink - 0.0.2 - mavlink headers
voxl-mpa-tools - 0.2.7 - misc tools for modal pipe architecture
voxl-nodes - 0.1.7 - ROS nodes supported by ModalAI
voxl-portal - 0.1.2
voxl-qvio-server - 0.3.1 - publishes QVIO data over named pipe interface
voxl-qvio-server - 0.3.4
voxl-streamer - 0.2.3 - Gstreamer-based application to handle RTSP streaming
voxl-tag-detector - 0.0.2 - Detect apriltags for MPA
voxl-tflite - 0.0.1 - 64-bit tensorflow lite libraries
voxl-tflite-server - 0.1.1 - client of voxl-camera-server that does deep learning (object detection, monocular depth estimation)
voxl-utils - 0.8.5
voxl-vision-px4 - 0.9.2 - Interface between VOXL's computer vision services and PX4
Here is what I have already figured out from the link mentioned above.
When the drone starts up the voxl-qvio-server task lives in the root cgroup:
yocto:~$ cat /sys/fs/cgroup/cpu,cpuacct/tasks | grep $(pidof voxl-qvio-server)
2514
Therefore it uses the realtime runtime budget of the root. This is 0.95 seconds per second (default values):
yocto:~$ cat /sys/fs/cgroup/cpu,cpuacct/cpu.rt_runtime_us
950000
It needs a realtime runtime budget bigger than 0 to be able to set the scheduler to FIFO so this is good.
Now when I execute
systemctl restart voxl-qvio-server
things change. The task doesn't live in the root cgroup anymore:
yocto:~$ cat /sys/fs/cgroup/cpu,cpuacct/tasks | grep $(pidof voxl-qvio-server)
yocto:~$
But now it lives in a new group:
yocto:~$ cat /sys/fs/cgroup/cpu,cpuacct/system.slice/voxl-qvio-server.service/tasks | grep $(pidof voxl-qvio-server)
13562
but this new group doesn't have any realtime runtime budget:
yocto:~$ cat /sys/fs/cgroup/cpu,cpuacct/system.slice/voxl-qvio-server.service/cpu.rt_runtime_us
0
and therefore it is unable to set the scheduler to FIFO with high priority. This is also mentioned in https://www.kernel.org/doc/Documentation/scheduler/sched-rt-group.txt at section 2.2:
By default all bandwidth is assigned to the root group and new groups get the
period from /proc/sys/kernel/sched_rt_period_us and a run time of 0. If you
want to assign bandwidth to another group, reduce the root group's bandwidth
and assign some or all of the difference to another group.
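The manual checks above can be folded into one helper that looks up the cpu cgroup of a PID and prints that group's cpu.rt_runtime_us. This is only a sketch: it assumes the cgroup v1 layout seen on this drone (cpu controller mounted at /sys/fs/cgroup/cpu,cpuacct) and the "id:controllers:path" line format of /proc/<pid>/cgroup. The function names cpu_cgroup_path and rt_budget_of are hypothetical, and the optional root arguments exist only so the helpers can be tried against a copy of the tree.

```shell
#!/bin/sh
# Sketch: find the realtime runtime budget of the cgroup a process lives in.
# Assumes cgroup v1; function names are hypothetical.

# cpu_cgroup_path PID [PROC_ROOT] -> prints the cpu-controller cgroup path of PID
cpu_cgroup_path() {
    proc_root=${2:-/proc}
    # Lines look like "4:cpu,cpuacct:/system.slice/voxl-qvio-server.service"
    while IFS=: read -r _id controllers path; do
        case ",$controllers," in
            *,cpu,*) echo "$path"; return 0 ;;
        esac
    done < "$proc_root/$1/cgroup"
    return 1
}

# rt_budget_of PID [PROC_ROOT] [CGROUP_ROOT] -> prints cpu.rt_runtime_us of PID's group
rt_budget_of() {
    cg_root=${3:-/sys/fs/cgroup/cpu,cpuacct}
    path=$(cpu_cgroup_path "$1" "${2:-/proc}") || return 1
    cat "$cg_root$path/cpu.rt_runtime_us"
}

# Example use on the drone (not executed here):
#   rt_budget_of "$(pidof voxl-qvio-server)"
```

Based on the outputs shown earlier, this would be expected to print 950000 after boot (root group) and 0 after a systemctl restart (the per-service group).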
If I manually assign bandwidth/realtime runtime budget to the voxl-qvio-server group, it is able to set the scheduler to FIFO with high priority:
echo 550000 > /sys/fs/cgroup/cpu,cpuacct/cpu.rt_runtime_us
echo 200000 > /sys/fs/cgroup/cpu,cpuacct/system.slice/cpu.rt_runtime_us
echo 200000 > /sys/fs/cgroup/cpu,cpuacct/system.slice/voxl-qvio-server.service/cpu.rt_runtime_us
I tried to script this and add it to the service, but then it fails at the first startup because
/sys/fs/cgroup/cpu,cpuacct/system.slice/voxl-qvio-server.service/cpu.rt_runtime_us
doesn't exist yet. Maybe it is possible to add it conditionally, but I wasn't able to get that working robustly yet. This link also looks interesting: https://lists.freedesktop.org/archives/systemd-devel/2017-July/039353.html. But at this point I thought it was better to ask on this forum whether you are aware of this problem and maybe already have a solution for it.
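A conditional version of the three echo commands could look roughly like the sketch below. It is not a tested fix: the function names wait_for_file and assign_rt_budget are made up, the values 550000 and 200000 are simply the ones from the manual experiment, and the cgroup root is a parameter so the logic can be tried against a scratch directory first. The key point is that nothing is written unless the per-service cgroup already exists, so the root budget is never reduced for nothing.

```shell
#!/bin/sh
# Sketch: hand part of the root realtime budget to a service cgroup,
# but only once that cgroup exists. Function names are hypothetical.

# wait_for_file PATH TRIES -> succeeds once PATH exists, polling once per second
wait_for_file() {
    tries=$2
    while [ "$tries" -gt 0 ]; do
        [ -e "$1" ] && return 0
        tries=$((tries - 1))
        [ "$tries" -gt 0 ] && sleep 1
    done
    return 1
}

# assign_rt_budget CGROUP_ROOT UNIT ROOT_US CHILD_US
# Writes the same three files, in the same order, as the manual commands.
assign_rt_budget() {
    cg_root=$1
    slice_dir="$cg_root/system.slice"
    unit_dir="$slice_dir/$2"
    # Bail out before touching anything if the unit's cgroup is not there yet.
    [ -e "$unit_dir/cpu.rt_runtime_us" ] || return 1
    echo "$3" > "$cg_root/cpu.rt_runtime_us" || return 2
    echo "$4" > "$slice_dir/cpu.rt_runtime_us" || return 2
    echo "$4" > "$unit_dir/cpu.rt_runtime_us" || return 2
}

# Example use on the drone (not executed here):
#   cg=/sys/fs/cgroup/cpu,cpuacct
#   unit=voxl-qvio-server.service
#   wait_for_file "$cg/system.slice/$unit/cpu.rt_runtime_us" 5 &&
#       assign_rt_budget "$cg" "$unit" 550000 200000
```

Where to hook this in is left open here. The server sets its scheduler during its own startup, so the budget has to be in place before that call; running the script after the process has started (for example from an ExecStartPost= line) may simply be too late.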
My service file is this:
yocto:~$ cat /etc/systemd/system/voxl-qvio-server.service
#
# Copyright (c) 2021 ModalAI, Inc.
#
[Unit]
Description=voxl-qvio-server
SourcePath=/usr/bin/voxl-qvio-server
After=voxl-wait-for-fs.service
Requires=voxl-wait-for-fs.service
[Service]
User=root
Type=simple
PIDFile=/run/voxl-qvio-server.pid
ExecStart=/usr/bin/voxl-qvio-server
[Install]
WantedBy=multi-user.target
I hope you can help me out. I think it is important that voxl-qvio-server is always running with the FIFO scheduler and high priority.
Thanks for investigating this. I was unable to recreate this over ADB but was able to over SSH. Are you logged in through ADB or SSH?
I was logged in through SSH. But we also have our own blowup handler running on the drone, which can execute
systemctl restart voxl-qvio-server
when it detects a blowup. In that case we are not connected through SSH or ADB, but the server still fails to set the scheduler to FIFO. I'm not sure whether voxl-qvio-server is ever restarted by ModalAI software, but I expect the result would be the same.
Is there any update on this issue?
We're still not sure what caused the scheduler issue. It was a new feature we were hoping to integrate into the stack when you found this bug, but we aren't able to do so right now. In the meantime, we've pushed updated packages with these calls removed so that all of the packages can run as intended. If you update the packages via OPKG to the latest versions, there should be no issues with the scheduler.