bhanner-bell

I just wanted to add some more context to hopefully work towards a solution, as @ejohnson01 and I have been trying to overcome this issue for the better part of a week now.

I think that the parameter issue may have been a red herring because we have now been able to reproduce the issue in other ways.

Using px4-uorb top, we found that uORB traffic is around 350 kB/s when things are working.

Whenever we plug in a joystick via QGC or power on a transmitter via SBUS, things start going haywire (coincidentally, right when the uORB message "manual_control_setpoint" starts getting published). The data rate in uorb top drops by more than half, things become unresponsive, and they don't recover until PX4 is restarted.

Are there ways to profile the PX4 tasks that are running on the SLPI? The Linux side shows ~70% of a single core and mini-dm is showing ~70% utilization of the SLPI when things are going bad (contrasted by, I believe, ~50% on Linux and ~45% on SLPI when things are working).

I just can't help but think there is some task running, either on Linux or on the SLPI, that is occupying too many resources or filling a buffer that can't be cleared, or something along those lines, but we can't find the right tool to profile what's going on.
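For the Linux side at least, per-thread CPU time can be read straight from /proc. The following is a minimal sketch, not a ModalAI or PX4 tool: the function names are made up for illustration, and it says nothing about what is running on the SLPI. It takes two snapshots of each thread's CPU ticks and ranks the threads by how much they consumed in between.

```python
import os


def parse_task_stat(line):
    """Return (thread_name, cpu_ticks) from one /proc/<pid>/task/<tid>/stat line."""
    # The thread name is wrapped in parentheses and may contain spaces,
    # so split on the last ')' instead of on whitespace.
    head, tail = line.rsplit(")", 1)
    name = head.split("(", 1)[1]
    fields = tail.split()
    # fields[0] is the state (field 3 in proc(5)), so utime (field 14)
    # is fields[11] and stime (field 15) is fields[12].
    return name, int(fields[11]) + int(fields[12])


def thread_ticks(pid):
    """Snapshot CPU ticks for every thread of a process, keyed by (tid, name)."""
    ticks = {}
    task_dir = f"/proc/{pid}/task"
    for tid in os.listdir(task_dir):
        try:
            with open(f"{task_dir}/{tid}/stat") as f:
                name, used = parse_task_stat(f.read())
        except FileNotFoundError:
            continue  # the thread exited between listdir and open
        ticks[(int(tid), name)] = used
    return ticks


def busiest(before, after, top_n=5):
    """Rank threads by ticks consumed between two snapshots."""
    delta = {k: after[k] - before[k] for k in after if k in before}
    return sorted(delta.items(), key=lambda kv: kv[1], reverse=True)[:top_n]
```

Usage would be something like calling `thread_ticks(pid)` with the PID of the PX4 process, sleeping a few seconds, calling it again, and passing both snapshots to `busiest`. Comparing the ranking in the good state against the bad state should at least show whether one Linux-side thread accounts for the extra load.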

bhanner-bell

We were able to update the dsp image and also build voxl-px4 from the sdk-1.1.2 tag.

We are able to reproduce this issue with just a joystick connected to QGC (meaning RC in mode set to joystick only and voxl-io board disconnected).

When the issue occurs, this is what the cpuload topic shows:

Every 0.1s: px4-listener cpuload                                                                                                                                                    m0054: Thu Mar  2 13:11:09 2023


TOPIC: cpuload 2 instances

Instance 0:
 cpuload
    timestamp: 730351047 (59.056290 seconds ago)
    process_load: 0.86000
    system_load: 0.86000
    ram_usage: 0.00000
    platform: "QURT"



Instance 1:
 cpuload
    timestamp: 788949259 (0.459716 seconds ago)
    process_load: 0.52000
    system_load: 0.12578
    ram_usage: 0.11849
    platform: "POSIX"

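Interesting to note: the timestamp starts to fall behind for most topics published by the QURT side of things. In the output above, the QURT cpuload instance is about 59 seconds old while the POSIX instance is less than half a second old.
The memory ballooning is still present.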
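To catch the stale timestamps without watching the output by eye, the listener text can be parsed. This is a rough sketch that only assumes the output format pasted above; the 1-second threshold and the function names are arbitrary choices for illustration.

```python
import re

# Matches lines such as "timestamp: 730351047 (59.056290 seconds ago)".
AGE_RE = re.compile(r"timestamp:\s*\d+\s*\(([\d.]+) seconds ago\)")
PLATFORM_RE = re.compile(r'platform:\s*"(\w+)"')


def parse_cpuload(text):
    """Return one (platform, age_seconds) tuple per cpuload instance."""
    ages = [float(a) for a in AGE_RE.findall(text)]
    platforms = PLATFORM_RE.findall(text)
    return list(zip(platforms, ages))


def stale_instances(text, max_age_s=1.0):
    """Return the instances whose last publication is older than max_age_s."""
    return [(p, a) for p, a in parse_cpuload(text) if a > max_age_s]


# Trimmed copy of the listener output from this post.
SAMPLE = """
Instance 0:
 cpuload
    timestamp: 730351047 (59.056290 seconds ago)
    process_load: 0.86000
    platform: "QURT"

Instance 1:
 cpuload
    timestamp: 788949259 (0.459716 seconds ago)
    process_load: 0.52000
    platform: "POSIX"
"""

print(stale_instances(SAMPLE))
```

Run against the sample, only the QURT instance is reported as stale.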

bhanner-bell

Hey @Eric-Katzfey we'd definitely appreciate a new dsp system image to test with.

I'm just going to dump some more info here in hopes that it may spark a thought:

We've been poking around all day and noticed there have been some recent software updates, so I updated another one of our units to the SDK 1.1.1 image with our modified PX4-firmware version (it adds VTOL apps and some drivers for dsp_sbus and other sensors). We are going to try again with 100% vanilla SDK 1.1.1 to make sure the issue is present there too, but we tried that once before and experienced the same issues.

What I was able to observe when the system starts to become unresponsive is that RAM consumption starts ballooning. voxl-px4 uses about 92364 kB in htop when running normally, but once the issue crops up its memory starts to increase and keeps growing until the process is stopped, up into the hundreds of MB.
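To put a number on the growth instead of eyeballing htop, the resident set size can be sampled from /proc. This is a small sketch under the assumption that the process of interest is the Linux-side voxl-px4 process; the helper names are made up for illustration.

```python
import re
import time

# /proc/<pid>/status contains a line such as "VmRSS:     92364 kB".
RSS_RE = re.compile(r"^VmRSS:\s+(\d+)\s+kB", re.MULTILINE)


def parse_rss_kb(status_text):
    """Extract the resident set size in kB from /proc/<pid>/status text."""
    match = RSS_RE.search(status_text)
    if match is None:
        raise ValueError("no VmRSS line found")
    return int(match.group(1))


def read_rss_kb(pid):
    with open(f"/proc/{pid}/status") as f:
        return parse_rss_kb(f.read())


def watch_rss(pid, interval_s=1.0, count=30):
    """Collect (monotonic_time_s, rss_kb) samples for one process."""
    samples = []
    for _ in range(count):
        samples.append((time.monotonic(), read_rss_kb(pid)))
        time.sleep(interval_s)
    return samples


def growth_kb_per_s(samples):
    """Average growth rate between the first and last sample."""
    (t0, r0), (t1, r1) = samples[0], samples[-1]
    return (r1 - r0) / (t1 - t0)
```

Passing the result of `watch_rss(pid)` to `growth_kb_per_s` gives a rate that can be compared between runs, for example with and without the joystick connected.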

Based on the observed system behavior, we believe the ballooning RAM is due to a queue of uORB messages piling up. We can move the vehicle and see the QGC AHRS indicator respond many seconds later. The longer PX4 runs, the longer the delay between input and response.
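A back-of-the-envelope model shows why a backlog would produce exactly this ever-growing delay. The rates below are hypothetical, not measured on our system: if messages arrive faster than they are handled, the backlog grows linearly with time, and so does the wait for each new message.

```python
def backlog_after(t_s, arrival_hz, service_hz):
    """Messages waiting after t_s seconds, starting from an empty queue."""
    return max(0.0, (arrival_hz - service_hz) * t_s)


def delay_after(t_s, arrival_hz, service_hz):
    """Seconds a message arriving at time t_s waits before it is handled."""
    return backlog_after(t_s, arrival_hz, service_hz) / service_hz


# Hypothetical rates: 100 messages/s arriving, 80 messages/s handled.
for t in (10, 60, 300):
    print(t, backlog_after(t, 100.0, 80.0), delay_after(t, 100.0, 80.0))
```

With those made-up rates, the delay grows by a quarter of a second for every second of runtime, which matches the shape of what we see even if the real numbers differ.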

We haven't been able to reliably trigger the fault, but it seems significantly harder to trigger if the USB-C cable is plugged into the VOXL and a host Linux machine. Interestingly enough, we aren't actually running any adb commands or anything like that; it's just connected. With the cable completely disconnected, the fault is pretty easy to reproduce: just power up the board and start voxl-px4. If the radio (or a joystick to QGC) is connected, PX4 seems to have problems.

bhanner-bell