PX4 qmi_error abort
-
@Eric-Katzfey The first distro I see modalai-slpi in is http://voxl-packages.modalai.com/dists/qrb5165/sdk-1.2/binary-arm64/
We install http://voxl-packages.modalai.com/dists/qrb5165/sdk-1.0/binary-arm64/ which is probably why I don't see it in voxl-version.
Is it okay for me to install the latest distro's (http://voxl-packages.modalai.com/dists/qrb5165/sdk-1.4/binary-arm64/) modalai-slpi on my voxl2 alongside the existing older software, or will that break anything?
If modalai-slpi has something to do with PX4 communication with the SLPI, how is it possible that I don't have any version of modalai-slpi installed but PX4 can still run software on the SLPI?
Thank you,
Rowan
-
@Rowan-Dempster The SLPI image used to be part of the main system image. It was then separated out into its own package for easier maintenance. So you have a very old version that is missing many important bug fixes. We've never tried installing the latest modalai-slpi package on an old SDK, but I think it will work. Give it a try and see what happens. But, obviously, it's really hard for us to support you when you use such old software with custom modifications on top of it. You should really try to make any customizations such that they can easily be used with newer versions of VOXL SDK as they come out.
-
The SLPI image used to be part of the main system image. It was then separated out into its own package for easier maintenance.
Gotcha makes sense!
So you have a very old version missing many important bug fixes. We've never tried installing the latest modalai-slpi package on an old SDK but I think it will work. Give it a try and see what happens.
Will do, just wanted to confirm that it "might work" so not totally wasting my time exploring this avenue haha.
But, obviously, it's really hard for us to support you when you use such old software with custom modifications on top of it. You should really try to make any customization such that they can easily be used with newer versions of VOXL SDK as they come out.
Yes, this is something that we constantly run into at Cleo as a small company trying to release a stable product but also keep up with the latest and greatest from Modal and other open source vendors. As a dev at Cleo it feels like two things are pulling in opposite directions:
- Having your base platform constantly updating, which then requires patches for API changes and sometimes more low level incompatibilities that come along with those updates.
- Getting the stability and functional improvements that come along with those base platform updates.
I'm sure these competing forces are felt by others as well, not just at Cleo. It's a conversation that is perhaps worthy of a call between Cleo and Modal devs to get aligned on the best way to get the stability and functional improvements from each vendor (modalai) release while minimizing the Cleo dev time spent on API patches and incompatibilities, or at least forecasting incompatibilities before we spend the time attempting an upgrade and only then find one.
-
Just following up on the testing that Cleo did with @Eric-Katzfey 's suggestion of installing http://voxl-packages.modalai.com/dists/qrb5165/sdk-1.4/binary-arm64/modalai-slpi_1.1.19_arm64.deb :
Before installing the new package:
- qmi_client_send_msg_sync at PX4 startup during boot number 48
- qmi_client_send_msg_sync at PX4 startup during boot number 62
- qmi_client_send_msg_sync at PX4 startup during boot number 95
- qmi_client_send_msg_sync at PX4 startup during boot number 131
- qmi_client_send_msg_sync at PX4 startup during boot number 22
After installing the new package:
- 550 boots in a row without any failures during PX4 startup
Going forward Cleo will be installing http://voxl-packages.modalai.com/dists/qrb5165/sdk-1.4/binary-arm64/modalai-slpi_1.1.19_arm64.deb on all dronuts we build.
Thank you for your help @Eric-Katzfey !
-
@Eric-Katzfey Unfortunately it seems like the specific crash that was happening 12 seconds after power-up, during boot, was only one of the issues. After boot-up has completed we are still seeing PX4 crashes with the same error message at about 77 seconds after power-up:
Mar 02 12:59:17 m0054 voxl-px4[1832]: terminate called after throwing an instance of 'qmi_error'
Mar 02 12:59:17 m0054 voxl-px4[1832]: what(): qmi_client_send_msg_sync() failed, (client_id=)0, result=0: qmi service error (-2)
Mar 02 12:59:17 m0054 voxl-px4[1832]: /usr/bin/voxl-px4: line 140: 1838 Aborted GPS=$GPS RC=$RC OSD=$OSD EXTRA_STEPS=$EXTRA_STEPS px4 $DAEMON -s /usr/bin/voxl-px4-start
Mar 02 12:59:17 m0054 systemd[1]: voxl-px4.service: Main process exited, code=exited, status=134/n/a
Mar 02 12:59:17 m0054 systemd[1]: voxl-px4.service: Failed with result 'exit-code'.
We were also able to get the dmesg from this system boot, and these messages line up with when PX4 crashed and output the qmi_error:
[ 77.107460] Fatal error on slpi!
[ 77.107529] slpi subsystem failure reason: err_qdi.c:1079:PC=e61fc160,SP=317931e8,FP=31793268,LR=e621d784,BADVA=0,CAUSE=7003,TASK=Anonymous.
[ 77.107564] subsys-restart: subsystem_restart_dev(): Restart sequence requested for slpi, restart_level = RELATED.
[ 77.108605] adsprpc: fastrpc_restart_notifier_cb: slpi subsystem is restarting
[ 77.108612] subsys-restart: subsystem_shutdown(): [kworker/u19:0:2966]: Shutting down slpi
[ 77.120971] qcom_rpmh DRV:apps_rsc TCS Busy, retrying RPMH message send: addr=0x30030
[ 77.123099] adsprpc: fastrpc_rpmsg_remove: closed rpmsg channel of slpi
[ 77.123533] adsprpc: fastrpc_restart_notifier_cb: received RAMDUMP notification for slpi
[ 77.123932] coresight-remote-etm soc:ssc_etm0: Connection disconnected between QMI handle and 8 service
[ 77.123941] sysmon-qmi: ssctl_del_server: Connection lost between QMI handle and slpi's SSCTL service
[ 77.124485] subsys-restart: subsystem_powerup(): [kworker/u19:0:2966]: Powering up slpi
[ 77.124863] subsys-pil-tz 5c00000.qcom,ssc: slpi: loading from 0x0000000088c00000 to 0x000000008a600000
[ 77.198746] subsys-pil-tz 5c00000.qcom,ssc: slpi: Brought out of reset
[ 77.254413] subsys-pil-tz 5c00000.qcom,ssc: Subsystem error monitoring/handling services are up
[ 77.254573] subsys-pil-tz 5c00000.qcom,ssc: slpi: Power/Clock ready interrupt received
[ 77.259994] adsprpc: fastrpc_restart_notifier_cb: slpi subsystem is up
[ 77.259999] subsys-restart: subsystem_restart_wq_func(): [kworker/u19:0:2966]: Restart sequence for slpi completed.
[ 77.261053] -1836034584:Entered
[ 77.264781] -1836034584:SMD QRTR driver probed
[ 77.267518] sysmon-qmi: ssctl_new_server: Connection established between QMI handle and slpi's SSCTL service
[ 77.267568] coresight-remote-etm soc:ssc_etm0: Connection established between QMI handle and 8 service
[ 77.268271] adsprpc: fastrpc_rpmsg_probe: opened rpmsg channel for slpi
[ 77.274585] diag: In diag_send_peripheral_buffering_mode, buffering flag not set for 3
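For anyone collecting these across many boots, the register values in the "slpi subsystem failure reason" line can be pulled out with a few lines of Python. This is only a rough sketch: the function name `parse_failure_reason` is made up for illustration, and it assumes the line always has the same KEY=hex layout as the one in this dmesg capture.

```python
import re

# Failure reason line copied from the dmesg capture in this post.
line = ("slpi subsystem failure reason: err_qdi.c:1079:PC=e61fc160,SP=317931e8,"
        "FP=31793268,LR=e621d784,BADVA=0,CAUSE=7003,TASK=Anonymous.")


def parse_failure_reason(text):
    """Return the hex KEY=value fields of a failure reason line as integers.

    Hypothetical helper: assumes the fields are printed as bare hex digits,
    as in the line above.
    """
    fields = {}
    for key, value in re.findall(r"\b(PC|SP|FP|LR|BADVA|CAUSE)=([0-9a-fA-F]+)", text):
        fields[key] = int(value, 16)
    return fields


regs = parse_failure_reason(line)
for name, value in regs.items():
    print(f"{name} = {value:#x}")
```

The PC and LR values are the ones that matter most for locating the crash.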
-
@Rowan-Dempster That program counter (PC=0xe61fc160) indicates that the crash happened in the loaded px4 library.
-
@Rowan-Dempster There is a way to figure out where in the code the crash happened. I can update the modalai-slpi package to add a debug print in mini-dm to show the address where libpx4.so was loaded into memory. I just ran that and it showed that libpx4.so was loaded at address 0xe6120000. Once you know that, you can disassemble libpx4.so to get the address map. For example:
/local/mnt/workspace/Qualcomm/Hexagon_SDK/4.1.0.4/tools/HEXAGON_Tools/8.4.05/Tools/bin/hexagon-llvm-objdump -d build/modalai_voxl2-slpi_default/platforms/qurt/libpx4.so > dsp-image.dis
Take the address you get for the PC in dmesg (in this case 0xe61fc160), subtract off the base address (0xe61fc160 - 0xe6120000 = 0xdc160), then look up that address in the disassembled file. That will show you where it crashed. You also get the LR in the fatal error message, so that can help show where it was called from.
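The subtraction and lookup can be scripted if you need to do it more than once. The following is a minimal sketch, not an official tool: the helper names `image_offset` and `find_in_disassembly` are hypothetical, and the lookup assumes each instruction line in the objdump output starts with the hex address followed by a colon, which may not hold for every toolchain version.

```python
import re

LOAD_BASE = 0xe6120000  # libpx4.so load address reported by the mini-dm debug print
PC = 0xe61fc160         # PC from the dmesg failure reason line
LR = 0xe621d784         # LR from the same line


def image_offset(runtime_addr, load_base):
    """Convert a runtime address into an offset inside the loaded library."""
    if runtime_addr < load_base:
        raise ValueError("address is below the library load base")
    return runtime_addr - load_base


def find_in_disassembly(lines, offset):
    """Return disassembly lines whose leading address equals offset.

    Assumes lines look like '   dc160:  <bytes>  <instruction>'.
    """
    pattern = re.compile(r"^\s*0*%x:" % offset)
    return [ln.rstrip("\n") for ln in lines if pattern.match(ln)]


pc_offset = image_offset(PC, LOAD_BASE)
lr_offset = image_offset(LR, LOAD_BASE)
print(hex(pc_offset))  # 0xdc160
print(hex(lr_offset))  # 0xfd784
```

To search the real file, pass an open handle, for example `find_in_disassembly(open("dsp-image.dis"), pc_offset)`, then scroll up from the match to the nearest function label.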
-
@Eric-Katzfey Thank you for the awesome debugging tools! We are looking into narrowing down the crash using them today.
We also noticed that going from modalai-slpi_1.1.19_arm64.deb to modalai-slpi_1.1.20-202504131441_arm64.deb also decreased the CPU util reported by mini-dm by 10% (36% to 26%), is that expected? What changed from 1.1.19 to 1.1.20? Which should we be using in production?
Thank you!
-
@Rowan-Dempster First of all, the CPU report for DSP is only an estimate. It's really tough to get a good CPU estimate due to the way it "sleeps" and its parallel architecture. But I have seen that before, where there is a difference of 10% between builds, and I don't know what causes that yet. So I wouldn't be too worried about it. v1.1.20 only adds that debug print during startup, so there shouldn't be any extra risk above moving to v1.1.19.
-
@Eric-Katzfey The object dump debugging method was super helpful, thank you! We tracked down the null dereference within 30 minutes with these new tools, something that would have taken weeks, or might not have been solved at all, without them.
-
@Rowan-Dempster No problem! Kind of primitive but effective