'ERROR failed to set scheduler' after restart of voxl-qvio-server

  • When voxl-qvio-server is started, it sets itself to use the FIFO scheduler with high priority. This is done here: https://gitlab.com/voxl-public/voxl-sdk/services/voxl-qvio-server/-/blob/master/server/main.cpp#L973

    When the drone boots, this is successful the first time. journalctl -u voxl-qvio-server reports:

    Jan 01 00:00:08 Drone_201 voxl-qvio-server[2552]: setting scheduler
    Jan 01 00:00:08 Drone_201 voxl-qvio-server[2552]: set FIFO priority successfully!

    Now when I execute systemctl restart voxl-qvio-server it is unable to set itself to use the FIFO scheduler with high priority. journalctl -u voxl-qvio-server reports:

    Jul 01 08:23:52 Drone_201 voxl-qvio-server[11943]: WARNING Failed to set priority, errno = 1
    Jul 01 08:23:52 Drone_201 voxl-qvio-server[11943]: This seems to be a problem with ADB, the scheduler
    Jul 01 08:23:52 Drone_201 voxl-qvio-server[11943]: should work properly when this is a background process
    Jul 01 08:23:52 Drone_201 voxl-qvio-server[11943]: ERROR failed to set scheduler

    It seems that this is similar to the issue reported here: https://bugzilla.redhat.com/show_bug.cgi?id=1467919.

    What can we do to make sure that voxl-qvio-server is always running with the FIFO scheduler?

    Version information:

    yocto:~$ opkg list | grep voxl
    libvoxl_cutils - 0.0.2 - ModalAI's c utils
    libvoxl_io - 0.5.4 - ModalAI library allowing apps processor access to accessory serial ports
    voxl-camera-calibration - 0.0.1 - On-board camera calibration for VOXL
    voxl-camera-server - 0.8.1 - publishes camera frames over named pipe interface
    voxl-cpu-monitor - 0.1.7 - publishes CPU Data over MPA pipe and provides fan tools
    voxl-docker-support - 1.1.3 - tools to improve the usability of docker on VOXL
    voxl-imu-server - 0.8.1 - VOXL IMU interface for Modal Pipe Architecture
    voxl-mavlink - 0.0.2 - mavlink headers
    voxl-mpa-tools - 0.2.7 - misc tools for modal pipe architecture
    voxl-nodes - 0.1.7 - ROS nodes supported by ModalAI
    voxl-portal - 0.1.2
    voxl-qvio-server - 0.3.1 - publishes QVIO data over named pipe interface
    voxl-qvio-server - 0.3.4
    voxl-streamer - 0.2.3 - Gstreamer-based application to handle RTSP streaming
    voxl-tag-detector - 0.0.2 - Detect apriltags for MPA
    voxl-tflite - 0.0.1 - 64-bit tensorflow lite libraries
    voxl-tflite-server - 0.1.1 - client of voxl-camera-server that does deep learning (object detection, monocular depth estimation)
    voxl-utils - 0.8.5
    voxl-vision-px4 - 0.9.2 - Interface between VOXL's computer vision services and PX4

    Here is what I have already figured out from the link mentioned above.

    When the drone starts up the voxl-qvio-server task lives in the root cgroup:

    yocto:~$ cat /sys/fs/cgroup/cpu,cpuacct/tasks | grep $(pidof voxl-qvio-server)

    Therefore it uses the realtime runtime budget of the root. This is 0.95 seconds per second (default values):

    yocto:~$ cat /sys/fs/cgroup/cpu,cpuacct/cpu.rt_runtime_us

    The task needs a realtime runtime budget greater than 0 to be able to set the FIFO scheduler, so this is fine.

    Now when I execute systemctl restart voxl-qvio-server things change. The task doesn't live in the root cgroup anymore:

    yocto:~$ cat /sys/fs/cgroup/cpu,cpuacct/tasks | grep $(pidof voxl-qvio-server)

    But now it lives in a new group:

    yocto:~$ cat /sys/fs/cgroup/cpu,cpuacct/system.slice/voxl-qvio-server.service/tasks | grep $(pidof voxl-qvio-server)

    but this new group doesn't have any realtime runtime budget:

    yocto:~$ cat /sys/fs/cgroup/cpu,cpuacct/system.slice/voxl-qvio-server.service/cpu.rt_runtime_us                     

    and therefore it is unable to set the FIFO scheduler with high priority. This is also mentioned in section 2.2 of https://www.kernel.org/doc/Documentation/scheduler/sched-rt-group.txt:

    By default all bandwidth is assigned to the root group and new groups get the
    period from /proc/sys/kernel/sched_rt_period_us and a run time of 0. If you
    want to assign bandwidth to another group, reduce the root group's bandwidth
    and assign some or all of the difference to another group.

    If I manually assign a realtime runtime budget to the voxl-qvio-server group, it is able to set the FIFO scheduler with high priority:

    echo 550000 > /sys/fs/cgroup/cpu,cpuacct/cpu.rt_runtime_us
    echo 200000 > /sys/fs/cgroup/cpu,cpuacct/system.slice/cpu.rt_runtime_us
    echo 200000 > /sys/fs/cgroup/cpu,cpuacct/system.slice/voxl-qvio-server.service/cpu.rt_runtime_us

    I tried to script this and add it to the service, but then it fails at the first startup because at that point /sys/fs/cgroup/cpu,cpuacct/system.slice/voxl-qvio-server.service/cpu.rt_runtime_us doesn't exist yet. Maybe it is possible to add it conditionally, but I wasn't able to get it working robustly yet. This link also looks interesting: https://lists.freedesktop.org/archives/systemd-devel/2017-July/039353.html. At this point I thought it was better to ask on this forum whether you are aware of this problem and maybe already have a solution for it.
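    One way to make the script idea robust might be to run the echoes from the unit itself: by the time ExecStartPre= runs, systemd has already created the service's cgroup, so the file that is missing at first boot exists. A sketch of a drop-in, reusing the budget values from the manual experiment above (the drop-in name and the budget split are my assumptions, not a ModalAI-endorsed fix):

```ini
# /etc/systemd/system/voxl-qvio-server.service.d/rt-budget.conf
[Service]
# Carve realtime budget out of the root group, then hand part of it down
# through system.slice to this service's own cgroup. The service cgroup
# directory exists here because systemd creates it before ExecStartPre.
ExecStartPre=/bin/sh -c 'echo 550000 > /sys/fs/cgroup/cpu,cpuacct/cpu.rt_runtime_us'
ExecStartPre=/bin/sh -c 'echo 200000 > /sys/fs/cgroup/cpu,cpuacct/system.slice/cpu.rt_runtime_us'
ExecStartPre=/bin/sh -c 'echo 200000 > /sys/fs/cgroup/cpu,cpuacct/system.slice/voxl-qvio-server.service/cpu.rt_runtime_us'
```

    Applying it would take a `systemctl daemon-reload` followed by a restart; note the first write can fail if child groups already hold more than the new root budget, so the ordering of the three lines matters.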

    My service file is this:

    yocto:~$ cat /etc/systemd/system/voxl-qvio-server.service
    # Copyright (c) 2021 ModalAI, Inc.

    I hope you can help me out. I think it is important that voxl-qvio-server always runs with the FIFO scheduler at high priority.

  • Dev Team

    Thanks for investigating this. I was unable to recreate this over ADB but was able to over SSH. Are you logged in through ADB or SSH?

  • I was logged in through SSH. But we also have our own blowup handler running on the drone, which can execute systemctl restart voxl-qvio-server when it detects a blowup. In that case we are not connected through SSH or ADB, but it still fails to set the FIFO scheduler. I'm also not sure whether voxl-qvio-server is ever restarted by ModalAI software, but I expect the result would be the same.

  • Is there any update on this issue?

  • Dev Team

    We're still not sure what caused the scheduler issue; it was a new feature we were hoping to integrate into the stack when you found this bug, but we aren't able to right now. In the meantime we've pushed updated packages with these calls removed so that all of the packages run as intended. If you update the packages to the latest via opkg, there should be no issues with the scheduler.
