DSP Tasks Failing Unless mini-dm is Run

Eric Katzfey

@ejohnson01 You said you are running a modified version of voxl-px4. Does this problem still happen with the unmodified version of voxl-px4 that came with the SDK release?

ejohnson01

@Eric-Katzfey That was a good question. Today I set up my bench to run the px4 version in sdkv1.1.1, and at first I was unable to recreate the problem.

After further testing I noticed the behavior happening when we would upload our QGC param file from the previous version. My co-worker and I spent some time and isolated the parameters that were the problem children. For whatever reason, the parameters below cause the problem when uploaded via a QGC file:

1	1	COM_FLTMODE1	-1	6
1	1	COM_FLTMODE2	1	6
1	1	COM_FLTMODE3	-1	6
1	1	COM_FLTMODE4	2	6
1	1	COM_FLTMODE5	-1	6
1	1	COM_FLTMODE6	2	6

I then set the bench back to the stable sdkv1.1.1 version and created a parameter file with only the parameters above and I was able to see the same problem.
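A minimal sketch of how such a six-parameter file could be generated, assuming the tab-separated layout shown above means vehicle ID, component ID, name, value, and type. The header comment line is an assumption about what QGC writes, and the function name is hypothetical. Type 6 corresponds to MAV_PARAM_TYPE_INT32 in MAVLink.

```python
# Values copied from the post; the column interpretation is an assumption.
FLIGHT_MODE_PARAMS = {
    "COM_FLTMODE1": -1,
    "COM_FLTMODE2": 1,
    "COM_FLTMODE3": -1,
    "COM_FLTMODE4": 2,
    "COM_FLTMODE5": -1,
    "COM_FLTMODE6": 2,
}

MAV_PARAM_TYPE_INT32 = 6  # MAVLink parameter type code used in the post


def render_param_file(params, vehicle_id=1, component_id=1):
    """Return the text of a parameter file with one tab-separated row per entry."""
    # Assumed header; lines starting with '#' are treated as comments here.
    lines = ["# Vehicle-Id Component-Id Name Value Type"]
    for name, value in params.items():
        row = (vehicle_id, component_id, name, value, MAV_PARAM_TYPE_INT32)
        lines.append("\t".join(str(field) for field in row))
    return "\n".join(lines) + "\n"
```

Writing the returned string to a file gives a minimal test case that contains only the suspect parameters.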

Steps to recreate
make sure you run all commands over ssh and do not create an adb session.

load stable version of px4. (voxl-px4_1.14.0-2.0.59_arm64.deb)
remove old parameters by running rm -rf /data/px4/param/*
run voxl-px4 once to set the parameters back to default
power cycle
make sure sbus receiver and controller are powered on and connected.
run voxl-px4
open up QGC and the problem should be visible

I honestly have no idea why uploading these params in particular breaks px4. We were able to upload all the other params in the file just fine.

Workaround
We did discover a workaround. If we upload all our old params except the ones listed above and then configure those via QGC, the problem does not present itself. The super weird thing about this is that I can then download the parameters and diff them against the old ones, and the above parameters show as identical.
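The comparison described above can be sketched roughly as follows, assuming both files use the five-column tab-separated layout from the parameter list and that lines starting with '#' are comments. The function names are hypothetical. Such a diff only compares the values stored in the two files, so an empty result does not rule out differences in how or when the parameters were applied.

```python
def parse_params(text):
    """Map parameter name -> (value, type) for each data row in the file text."""
    params = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split("\t")
        if len(fields) != 5:
            continue  # ignore rows that do not match the assumed layout
        _vehicle, _component, name, value, ptype = fields
        params[name] = (value, ptype)
    return params


def diff_params(old_text, new_text):
    """Return {name: (old_entry, new_entry)} for every parameter that differs."""
    old, new = parse_params(old_text), parse_params(new_text)
    return {
        name: (old.get(name), new.get(name))
        for name in sorted(old.keys() | new.keys())
        if old.get(name) != new.get(name)
    }
```

Values are compared as strings, so "2" and "2.0" would be reported as different. A parameter missing from one file shows up with None on that side.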

Also, I cannot explain why connecting to the voxl with mini-dm once per boot resolves this issue. I did do further testing on it, though. It seems that opening an adb shell also resolves the issue, similar to using mini-dm. I guess mini-dm works via adb, so it's probably the debug link that fixes the problem.

We are able to continue testing using our workaround, but I would be very interested to see what the root of this issue is. Let me know if you all are able to recreate it or if you need more information.

Eric Katzfey

@ejohnson01 Wow, that's an odd one! Thanks for the update. Not sure when I can get to it but now I am also curious as to why that would cause an issue with px4.

ejohnson01

@ejohnson01 I realized I messed up the steps to recreate

Steps to recreate
make sure you run all commands over ssh and do not create an adb session.

load stable version of px4. (voxl-px4_1.14.0-2.0.59_arm64.deb)
remove old parameters by running rm -rf /data/px4/param/*
run voxl-px4 once to set the parameters back to default
power cycle
make sure sbus receiver and controller are powered on and connected.
run voxl-px4.
open up QGC
upload a QGC file with the above listed parameters
restart voxl-px4 and the problem should be present

bhanner-bell

I just wanted to add some more context to hopefully work towards a solution, as @ejohnson01 and I have been trying to overcome this issue for the better part of a week now.

I think that the parameter issue may have been a red herring because we have now been able to reproduce the issue in other ways.

Using px4-uorb top, we found that when things are working there is around 350 kB/s of traffic.

Whenever we plug in a joystick from QGC or power on a transmitter via SBUS things start going haywire (coincidentally right when the uorb message "manual_control_setpoint" starts getting published). The data rate on uorb top tanks by over half, things start becoming unresponsive, and they don't recover until a PX4 restart.

Are there ways to profile the PX4 tasks that are running on the SLPI? The Linux side shows ~70% of a single core and mini-dm is showing ~70% utilization of the SLPI when things are going bad (contrasted by, I believe, ~50% on Linux and ~45% on SLPI when things are working).

I just can't help but think there is some task running either on Linux or on the SLPI that is occupying too many resources, or filling a buffer that can't be cleared, or something along those lines, but we can't find the right tool to profile what's going on.

Eric Katzfey

@bhanner-bell Hmmm, yes, 70% is a bit high. I generally like to see things 60% or lower. There isn't really a good tool to do that kind of profiling. There is a system level profiler included with the Hexagon SDK from Qualcomm that allows you to see how much CPU time each thread is taking. I have used that in the past to help me determine thread priorities. But it would be nice to have a statistical profiler that can tell you which lines of code are running most often. I think it would be possible to create a statistical profiler but it would be a fair amount of work. Without that you will have to use trial and error. Selectively start different drivers and modules to try and determine which ones are consuming the most resources. In normal use the CPU load goes way down after arming the drone and flying. There seems to be a bunch of preflight checking that the commander module does that consumes a lot of time and once the flight starts it no longer runs those checks. A weakness of the DSP is that it doesn't have HW support for floating point numbers so it has to use a software library for floating point operations. Also, the current version of PX4 (our voxl-dev branch) logs CPU load for the DSP. But without a new base DSP system image it will always report 0.1%. If you want to start using that I can provide the debian package that updates the DSP system image with this capability. That way you can see what happens after you arm and fly.

bhanner-bell

Hey @Eric-Katzfey we'd definitely appreciate a new dsp system image to test with.

I'm just going to dump some more info here in hopes that it may spark a thought:

We've been poking around all day and noticed there have been some recent software updates, so I updated another one of our units to the SDK 1.1.1 image with our modified PX4-firmware version. (It adds VTOL apps and some drivers for dsp_sbus and other sensors. We are going to try again with 100% vanilla SDK 1.1.1 to make sure the issue is present there too, but we tried once before and experienced the same issues.)

What I was able to observe when the system starts to become unresponsive is that the RAM consumption starts ballooning. voxl-px4 uses about 92364 kB in htop when running normally, but once the issue crops up the memory starts to increase and keeps growing until the process is stopped, up into the hundreds of MB.

Based on observed system behavior we believe the ballooning ram is due to a queue of uorb messages piling up. We can move the vehicle and observe many seconds later the output on the QGC AHRS indicator. The longer PX4 runs the longer the delay in the input->response loop.
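One way to put numbers on the growth is to sample the resident set size of the voxl-px4 process over time. The sketch below assumes a Linux /proc filesystem and assumes the htop figure corresponds to the VmRSS field; the function names are hypothetical.

```python
import re


def parse_vmrss_kb(status_text):
    """Extract the VmRSS value in kB from the text of /proc/<pid>/status."""
    match = re.search(r"^VmRSS:\s+(\d+)\s+kB", status_text, re.MULTILINE)
    return int(match.group(1)) if match else None


def read_rss_kb(pid):
    """Read the current resident set size of a process in kB."""
    with open(f"/proc/{pid}/status") as status_file:
        return parse_vmrss_kb(status_file.read())


def growth_kb_per_s(samples):
    """Average growth rate over a list of (time_s, rss_kb) samples."""
    (t0, kb0), (t1, kb1) = samples[0], samples[-1]
    if t1 == t0:
        return 0.0
    return (kb1 - kb0) / (t1 - t0)
```

Calling read_rss_kb once per second and passing the collected (time, kB) pairs to growth_kb_per_s gives a rate that can be compared against the uORB traffic figure to check whether a message backlog is a plausible source of the growth.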

We haven't been able to reliably trigger the fault, but it seems significantly harder to trigger if the USB-C cable is plugged into the voxl and a host Linux machine. Interestingly enough, we aren't actually running any adb commands or anything like that; it's just connected. With the cable completely disconnected the fault is pretty easy to reproduce: just power up the board and start voxl-px4. If the radio (or a joystick to QGC) is connected, px4 seems to have problems.

Eric Katzfey

@bhanner-bell I sent the DSP system image debian package to the both of you in email.

Eric Katzfey

@Eric-Katzfey Seems like your email server rejected the email with the attachment.

Eric Katzfey

@Eric-Katzfey Try wget https://storage.googleapis.com/modalai_public/forum/modalai-slpi_1.1-12_arm64.deb

bhanner-bell

We were able to update the dsp image and also build voxl-px4 from the sdk-1.1.2 tag.

We are able to reproduce this issue with just a joystick connected to QGC (meaning RC in mode set to joystick only and voxl-io board disconnected).

When the issue occurs, this is what the cpuload topic shows:

Every 0.1s: px4-listener cpuload                                                                                                                                                    m0054: Thu Mar  2 13:11:09 2023


TOPIC: cpuload 2 instances

Instance 0:
 cpuload
    timestamp: 730351047 (59.056290 seconds ago)
    process_load: 0.86000
    system_load: 0.86000
    ram_usage: 0.00000
    platform: "QURT"



Instance 1:
 cpuload
    timestamp: 788949259 (0.459716 seconds ago)
    process_load: 0.52000
    system_load: 0.12578
    ram_usage: 0.11849
    platform: "POSIX"

Interesting to note: the timestamp starts to fall behind for most topics published by the QURT side of things...
Memory ballooning is still here.
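To track how far each side falls behind without watching the screen, the "seconds ago" field can be pulled out of the listener text. This is a rough sketch that assumes the output format shown above; the function names and the one-second threshold are hypothetical.

```python
import re

# Trimmed copy of the listener output quoted above.
SAMPLE = """\
TOPIC: cpuload 2 instances

Instance 0:
 cpuload
    timestamp: 730351047 (59.056290 seconds ago)
    platform: "QURT"

Instance 1:
 cpuload
    timestamp: 788949259 (0.459716 seconds ago)
    platform: "POSIX"
"""

INSTANCE_RE = re.compile(r"^Instance (\d+):", re.MULTILINE)
AGE_RE = re.compile(r"timestamp:\s+\d+\s+\(([\d.]+) seconds ago\)")
PLATFORM_RE = re.compile(r'platform:\s+"([^"]+)"')


def sample_ages(listener_output):
    """Map platform name -> age in seconds of its most recent sample."""
    ages = {}
    # Splitting on a pattern with one group yields [head, id, body, id, body, ...].
    parts = INSTANCE_RE.split(listener_output)
    for body in parts[2::2]:
        age = AGE_RE.search(body)
        platform = PLATFORM_RE.search(body)
        if age and platform:
            ages[platform.group(1)] = float(age.group(1))
    return ages


def stale_platforms(ages, max_age_s=1.0):
    """Return the platforms whose latest sample is older than max_age_s."""
    return sorted(name for name, age in ages.items() if age > max_age_s)
```

For the output above, the QURT instance is about 59 seconds old while the POSIX instance is under half a second old, so only QURT is reported as stale.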

Eric Katzfey

@bhanner-bell Wow, 86%, that's really high.

ejohnson01

@Eric-Katzfey

I did a lot of trial and error today. I am 95% sure I just found the problem.

First I got my bench into the messed-up state. Then I rolled voxl-px4 back to SDK version 1.0.0 and noticed I was not able to recreate the issue. Immediately afterward I ran dpkg -i on the sdk1.1.2 version and it was broken again.

I then proceeded to binary search test all of the commits between 1.0.0 and 1.1.2. After some time I came to the following commits.

b746ab9434b7e4e71f67c9047ea3d3d49de81c00 - working
41a57bc30a40c42990995d5b4e8fe72389f66902 - broken

I saved these debs so I could switch between them multiple times, and sure enough, every time I switched to 41a57... I would see the link loss issue. As soon as I rolled back to b746a... it was fixed.
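The search over commits can be sketched as below. This is only an illustration of the technique, not the exact procedure used here: it assumes the commits are ordered oldest to newest, that the first is known good and the last is known bad, and that is_bad (a hypothetical callback) builds, installs, and tests one commit. git bisect automates the same search.

```python
def first_bad_commit(commits, is_bad):
    """Return the earliest commit for which is_bad is true.

    Assumes commits[0] is good, commits[-1] is bad, and that every commit
    after the first bad one is also bad.
    """
    good, bad = 0, len(commits) - 1
    while bad - good > 1:
        mid = (good + bad) // 2
        if is_bad(commits[mid]):
            bad = mid   # the regression is at mid or earlier
        else:
            good = mid  # the regression is after mid
    return commits[bad]
```

Each test halves the remaining range, so the number of builds to try grows with the logarithm of the number of commits between the two releases.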

Here is a link to the changes in 41a...
https://github.com/modalai/px4-firmware/commit/41a57bc30a40c42990995d5b4e8fe72389f66902

The problem is caused by the removal of this line:

qshell commander mode manual

I then rolled back to the sdk 1.1.2 commit and added qshell commander mode manual back to the start script, and sure enough I can no longer recreate the link loss issue. I just figured this out and it's almost 6pm on a Friday, so I have not had time to look into why this might break the RC control, but I suspect something is not getting initialized correctly without that call.

I would love to help you all recreate this on your end and get to the true root of the issue. Let me know if there is any other information you would like me to provide.

Eric Katzfey

@ejohnson01 Okay, thanks for tracking that down. The commander used to come up in AUTO_LOITER mode by default which runs the CPU way too high (as you have witnessed) so I had the explicit command in the startup script to put it into manual mode. The commit changes the default mode to MANUAL so that the explicit statement in the startup script is no longer necessary. That seemed to be working for me but doesn't seem to be working in your case for some reason. If you query the commander status when you remove that what mode does it report?

ejohnson01

@Eric-Katzfey

As we have done more testing of the system, we have noticed that one of the symptoms of the issue is that the control inputs become laggy. All inputs from the controller are seen on uorb with an increasing time delay. For instance, sometimes we will let the voxl run for a minute and then look at watch -n0 px4-listener manual_control_setpoint, and the corresponding input will not show up for 30 seconds. The delay gets worse over time. I was under the impression that uorb had a queue size limit; if that is the case, it does not seem to be enforced. If the queue limit were enforced, I would expect the maximum delay to be the time needed to process a full queue. I see 16-message queues in a lot of places in the code, so a task running at 50 Hz should see at most 320 ms of lag (16 * 20 ms). I think this might imply a more systemic issue, as a few minutes ago we saw the cpuload message get over a minute behind in processing.

Does this sound correct to you?

Also, is there any possibility of setting up a call to dig into this issue in more depth?
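The queue-bound reasoning above can be written as a small calculation. This is a sketch under the post's own assumptions: a fixed queue depth (16 here) that is actually enforced, and a consumer that drains one message per cycle at a steady rate. The function name is hypothetical.

```python
def max_queue_lag_s(queue_depth, rate_hz):
    """Worst-case age of a queued message when the queue is bounded.

    With at most queue_depth messages waiting and one message handled per
    cycle, the oldest message waits at most queue_depth cycles.
    """
    if queue_depth < 0 or rate_hz <= 0:
        raise ValueError("queue_depth must be >= 0 and rate_hz must be > 0")
    return queue_depth / rate_hz
```

At 50 Hz the cycle time is 20 ms, so max_queue_lag_s(16, 50.0) comes out to 0.32 s. An observed lag of 30 seconds or more is far outside that bound, which suggests either that the backlog is not held in a bounded queue or that the consumer is running much slower than its nominal rate.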

ejohnson01

@Eric-Katzfey

When I read this a second time I noticed you told me to check the commander mode. I noticed that when I used the gamepad, the commander mode was put into POS_HOLD. When this happens it seems to cause the lag to build. If I command the vehicle to switch to STABILIZED or ACRO, eventually the system seems to catch up: the uorb messages are shown in real time and the HUD updates mostly in real time as well.

I am guessing the issue has more to do with being in a mode that requires position lock. Interestingly enough, when we have brought the aircraft to the field and booted it with a clear view of the sky so that the GPS can get lock, we have seen the aircraft either not have the lag and link loss, or have lag and link loss for a few seconds that then goes away.

You said that you thought AUTO_LOITER mode used too much CPU. Do you happen to have any thoughts on which of the modules are using the CPU or where the problem might be originating?

Eric Katzfey

@ejohnson01 We noticed that AUTO_LOITER mode was taking a lot of CPU sort of by accident and switched it to default to MANUAL. I have not gone back to determine what was taking so much time in AUTO_LOITER.

Eric Katzfey

@ejohnson01 I plan to spend some time looking into why POSCTL is taking so much CPU time. AUTO_LOITER hasn't been a very important mode for us but POSCTL certainly is.

Eric Katzfey

@ejohnson01 I have started to look into this and I can recreate the issue. So I'll dig in and see if I can figure out a fix.

Eric Katzfey

@ejohnson01 Can you try bringing this commit into your build and see if it resolves the issue: https://github.com/modalai/px4-firmware/commit/f945d29064b4ab26617a3fb15fc770f5dcb993e9