@Eric-Katzfey The object dump debugging method was super helpful, thank you! We tracked down the null dereference within 30 minutes with these new tools, something that would have taken weeks, or might never have been solved at all, without them.
Posts made by Rowan Dempster
-
RE: PX4 qmi_error abort
-
RE: PX4 qmi_error abort
@Eric-Katzfey Thank you for the awesome debugging tools! We are looking into narrowing down the crash using them today.
We also noticed that going from modalai-slpi_1.1.19_arm64.deb to modalai-slpi_1.1.20-202504131441_arm64.deb also decreased the CPU utilization reported by mini-dm by 10 percentage points (36% to 26%). Is that expected? What changed from 1.1.19 to 1.1.20? Which should we be using in production?
Thank you!
-
RE: PX4 qmi_error abort
@Eric-Katzfey Unfortunately, it seems the specific crash that was happening 12 seconds after power-on, during boot-up, was only one of the issues. After boot-up completes we are still seeing PX4 crashes with the same error message, about 77 seconds after power-up:
Mar 02 12:59:17 m0054 voxl-px4[1832]: terminate called after throwing an instance of 'qmi_error'
Mar 02 12:59:17 m0054 voxl-px4[1832]: what(): qmi_client_send_msg_sync() failed, (client_id=)0, result=0: qmi service error (-2)
Mar 02 12:59:17 m0054 voxl-px4[1832]: /usr/bin/voxl-px4: line 140: 1838 Aborted GPS=$GPS RC=$RC OSD=$OSD EXTRA_STEPS=$EXTRA_STEPS px4 $DAEMON -s /usr/bin/voxl-px4-start
Mar 02 12:59:17 m0054 systemd[1]: voxl-px4.service: Main process exited, code=exited, status=134/n/a
Mar 02 12:59:17 m0054 systemd[1]: voxl-px4.service: Failed with result 'exit-code'.
We were also able to get the dmesg from this system boot, and these messages line up with when PX4 crashed and output the qmi_error:
[ 77.107460] Fatal error on slpi!
[ 77.107529] slpi subsystem failure reason: err_qdi.c:1079:PC=e61fc160,SP=317931e8,FP=31793268,LR=e621d784,BADVA=0,CAUSE=7003,TASK=Anonymous.
[ 77.107564] subsys-restart: subsystem_restart_dev(): Restart sequence requested for slpi, restart_level = RELATED.
[ 77.108605] adsprpc: fastrpc_restart_notifier_cb: slpi subsystem is restarting
[ 77.108612] subsys-restart: subsystem_shutdown(): [kworker/u19:0:2966]: Shutting down slpi
[ 77.120971] qcom_rpmh DRV:apps_rsc TCS Busy, retrying RPMH message send: addr=0x30030
[ 77.123099] adsprpc: fastrpc_rpmsg_remove: closed rpmsg channel of slpi
[ 77.123533] adsprpc: fastrpc_restart_notifier_cb: received RAMDUMP notification for slpi
[ 77.123932] coresight-remote-etm soc:ssc_etm0: Connection disconnected between QMI handle and 8 service
[ 77.123941] sysmon-qmi: ssctl_del_server: Connection lost between QMI handle and slpi's SSCTL service
[ 77.124485] subsys-restart: subsystem_powerup(): [kworker/u19:0:2966]: Powering up slpi
[ 77.124863] subsys-pil-tz 5c00000.qcom,ssc: slpi: loading from 0x0000000088c00000 to 0x000000008a600000
[ 77.198746] subsys-pil-tz 5c00000.qcom,ssc: slpi: Brought out of reset
[ 77.254413] subsys-pil-tz 5c00000.qcom,ssc: Subsystem error monitoring/handling services are up
[ 77.254573] subsys-pil-tz 5c00000.qcom,ssc: slpi: Power/Clock ready interrupt received
[ 77.259994] adsprpc: fastrpc_restart_notifier_cb: slpi subsystem is up
[ 77.259999] subsys-restart: subsystem_restart_wq_func(): [kworker/u19:0:2966]: Restart sequence for slpi completed.
[ 77.261053] -1836034584:Entered
[ 77.264781] -1836034584:SMD QRTR driver probed
[ 77.267518] sysmon-qmi: ssctl_new_server: Connection established between QMI handle and slpi's SSCTL service
[ 77.267568] coresight-remote-etm soc:ssc_etm0: Connection established between QMI handle and 8 service
[ 77.268271] adsprpc: fastrpc_rpmsg_probe: opened rpmsg channel for slpi
[ 77.274585] diag: In diag_send_peripheral_buffering_mode, buffering flag not set for 3
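For anyone who wants to pull the SLPI failure record out of a dmesg capture automatically, here is a minimal Python sketch. It assumes the capture has one kernel message per line and that the failure line contains the text "slpi subsystem failure reason:", as in the output above. The function name `slpi_failures` and the returned dictionary layout are my own choices for illustration, not part of any ModalAI tool.

```python
import re

# One kernel log entry per line, e.g. "[   77.107460] Fatal error on slpi!"
DMESG_LINE = re.compile(r"^\[\s*(\d+\.\d+)\]\s+(.*)$")
REASON_MARKER = "slpi subsystem failure reason:"


def slpi_failures(dmesg_text):
    """Return one dict per SLPI failure line: seconds since boot plus key=value fields."""
    failures = []
    for raw in dmesg_text.splitlines():
        match = DMESG_LINE.match(raw.strip())
        if not match:
            continue  # not a timestamped kernel message
        seconds, message = float(match.group(1)), match.group(2)
        if REASON_MARKER not in message:
            continue
        # Everything after the marker, without the trailing period.
        detail = message.split(REASON_MARKER, 1)[1].strip().rstrip(".")
        # Collect the comma-separated KEY=value pairs (PC, SP, FP, LR, BADVA, ...).
        fields = dict(re.findall(r"(\w+)=([^,]+)", detail))
        failures.append({"time": seconds, "fields": fields})
    return failures


SAMPLE = """\
[   77.107460] Fatal error on slpi!
[   77.107529] slpi subsystem failure reason: err_qdi.c:1079:PC=e61fc160,SP=317931e8,FP=31793268,LR=e621d784,BADVA=0,CAUSE=7003,TASK=Anonymous.
[   77.259994] adsprpc: fastrpc_restart_notifier_cb: slpi subsystem is up
"""

if __name__ == "__main__":
    for failure in slpi_failures(SAMPLE):
        print(failure["time"], failure["fields"])
```

The seconds-since-boot value can then be compared by hand against the journalctl timestamp of the qmi_error abort to confirm the two events line up.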
-
RE: PX4 qmi_error abort
Just following up on the testing that Cleo did with @Eric-Katzfey 's suggestion of installing http://voxl-packages.modalai.com/dists/qrb5165/sdk-1.4/binary-arm64/modalai-slpi_1.1.19_arm64.deb :
Before installing the new package:
- qmi_client_send_msg_sync at PX4 startup during boot number 48
- qmi_client_send_msg_sync at PX4 startup during boot number 62
- qmi_client_send_msg_sync at PX4 startup during boot number 95
- qmi_client_send_msg_sync at PX4 startup during boot number 131
- qmi_client_send_msg_sync at PX4 startup during boot number 22
After installing the new package:
- 550 boots in a row without any failures during PX4 startup
Going forward Cleo will be installing http://voxl-packages.modalai.com/dists/qrb5165/sdk-1.4/binary-arm64/modalai-slpi_1.1.19_arm64.deb on all dronuts we build.
Thank you for your help @Eric-Katzfey !
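As a rough sanity check on what 550 clean boots tells us, here is a small Python sketch. It assumes each boot is an independent trial with a fixed failure probability, which is a simplification and may not hold on real hardware. The function names are made up for this example. Under that assumption, a run of 550 clean boots would happen only about 0.4% of the time if the true startup failure rate were still 1 in 100, and the run bounds the startup failure rate below roughly 1 in 180 at 95% confidence.

```python
def prob_clean_run(failure_rate, boots):
    """Probability of seeing zero failures in `boots` independent boots."""
    return (1.0 - failure_rate) ** boots


def failure_rate_upper_bound(boots, confidence=0.95):
    """Largest failure rate still consistent with zero failures in `boots` boots.

    Solves (1 - p) ** boots = 1 - confidence for p.
    """
    return 1.0 - (1.0 - confidence) ** (1.0 / boots)


if __name__ == "__main__":
    boots = 550
    for rate in (1 / 20, 1 / 100):
        print("rate %.3f -> P(clean run) = %.2e" % (rate, prob_clean_run(rate, boots)))
    print("95%% upper bound on rate: %.5f" % failure_rate_upper_bound(boots))
```

Note that this only speaks to failures during PX4 startup. It says nothing about rarer crashes that happen after a successful boot.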
-
RE: PX4 qmi_error abort
The SLPI image used to be part of the main system image. It was then separated out into its own package for easier maintenance.
Gotcha, makes sense!
So you have a very old version missing many important bug fixes. We've never tried installing the latest modalai-slpi package on an old SDK but I think it will work. Give it a try and see what happens.
Will do, just wanted to confirm that it "might work" so I'm not totally wasting my time exploring this avenue haha.
But, obviously, it's really hard for us to support you when you use such old software with custom modifications on top of it. You should really try to make any customization such that they can easily be used with newer versions of VOXL SDK as they come out.
Yes, this is something that we constantly run into at Cleo as a small company trying to release a stable product but also keep up with the latest and greatest from Modal and other open source vendors. As a dev at Cleo it feels like two things are pulling in opposite directions:
- Having your base platform constantly updating, which then requires patches for API changes and sometimes more low level incompatibilities that come along with those updates.
- Getting the stability and functional improvements that come along with those base platform updates.
I'm sure these competing forces are felt by others as well, not just at Cleo. It's a conversation that is perhaps worthy of a call between Cleo and Modal devs to get aligned on the best way to get the stability and functional improvements from each vendor (ModalAI) release while minimizing the Cleo dev time spent on API patches and incompatibilities, or at least forecasting incompatibilities before we spend time on an upgrade only to find one.
-
RE: PX4 qmi_error abort
@Eric-Katzfey The first distro I see modalai-slpi in is http://voxl-packages.modalai.com/dists/qrb5165/sdk-1.2/binary-arm64/
We install http://voxl-packages.modalai.com/dists/qrb5165/sdk-1.0/binary-arm64/ which is probably why I don't see it in voxl-version!
Is it okay for me to install the latest distro's (http://voxl-packages.modalai.com/dists/qrb5165/sdk-1.4/binary-arm64/) modalai-slpi on my voxl2 alongside the existing older software, or will that break anything?
If modalai-slpi has something to do with PX4 communication with the SLPI, how is it possible that I don't have any version of modalai-slpi installed but PX4 can still run software on the SLPI?
Thank you,
Rowan
-
RE: PX4 qmi_error abort
@Eric-Katzfey I do not see modal-slpi in the output of voxl-version:
voxl2:/$ voxl-version | grep slpi
qrb5165-slpi-test-sig 01-r0
voxl-slpi-uart-bridge 1.0.1
Here is the full output:
voxl2:/$ voxl-version
--------------------------------------------------------------------------------
system-image: 1.7.8-M0054-14.1a-perf
kernel: #1 SMP PREEMPT Sat May 18 00:10:25 UTC 2024 4.19.125
--------------------------------------------------------------------------------
hw version: M0054
--------------------------------------------------------------------------------
voxl-suite: 1.0.0
--------------------------------------------------------------------------------
Packages:
Repo: http://voxl-packages.modalai.com/ ./dists/qrb5165/sdk-1.0/binary-arm64/
Last Updated: 2023-03-02 13:01:31
List:
kernel-module-voxl-fsync-mod-4.19.125 1.0-r0
kernel-module-voxl-gpio-mod-4.19.125 1.0-r0
kernel-module-voxl-platform-mod-4.19.125 1.0-r0
libmodal-c2d 0.1
libmodal-cv 0.3.2
libmodal-exposure 0.0.0+89cd3ac03
libmodal-journal 0.2.2
libmodal-json 0.4.3
libmodal-pipe 2.10.3
libqrb5165-io 0.3.3
libvoxl-cci-direct 0.2.3
libvoxl-cutils 0.1.1
mv-voxl 0.1-r0
qrb5165-bind 0.1-r0
qrb5165-dfs-server 0.1.0
qrb5165-imu-server 0.6.0
qrb5165-slpi-test-sig 01-r0
qrb5165-system-tweaks 0.2.2
qrb5165-tflite 2.8.0-2
voxl-bind-spektrum 0.1.0
voxl-camera-calibration 0.4.0
voxl-camera-server 0.0.0+89cd3ac03
voxl-configurator 0.2.7
voxl-cpu-monitor 0.4.6
voxl-docker-support 1.2.5
voxl-eigen3 3.4.0
voxl-elrs 0.0.7
voxl-esc 1.2.2
voxl-feature-tracker 0.2.3
voxl-flow-server 0.3.3
voxl-fsync-mod 1.0-r0
voxl-gphoto2-server 0.0.10
voxl-gpio-mod 1.0-r0
voxl-imu-server 0.0.0+89cd3ac03
voxl-jpeg-turbo 2.1.3-5
voxl-lepton-server 1.1.2
voxl-libgphoto2 0.0.4
voxl-libuvc 1.0.7
voxl-logger 0.3.4
voxl-mavcam-manager 0.5.1
voxl-mavlink 0.1.1
voxl-mavlink-server 1.2.0
voxl-microdds-agent 2.4.1-0
voxl-modem 1.0.5
voxl-mongoose 7.7.0-1
voxl-mpa-to-ros 0.3.6
voxl-mpa-tools 1.0.4
voxl-opencv 4.5.5-1
voxl-platform-mod 1.0-r0
voxl-portal 0.5.9
voxl-px4 1.14.0-2.0.36+deb
voxl-px4-imu-server 0.1.2
voxl-px4-params 0.1.8
voxl-qvio-server 0.0.0+89cd3ac03
voxl-remote-id 0.0.8
voxl-slpi-uart-bridge 1.0.1
voxl-streamer 0.0.0+89cd3ac03
voxl-suite 1.0.0
voxl-tag-detector 0.0.4
voxl-tflite-server 0.3.1
voxl-utils 1.3.1
voxl-uvc-server 0.1.6
-
RE: PX4 qmi_error abort
@Eric-Katzfey Gotcha, thanks for the info, I didn't know about that! Is the version of modalai-slpi highly coupled with the version of PX4 that we are using, or can we update modalai-slpi to get bug fixes without having to worry about compatibility with a specific version of PX4?
I will look into which version of modalai-slpi we are using and get back to you!
-
RE: PX4 qmi_error abort
@Eric-Katzfey I am not familiar with the "modalai-slpi" codebase. Could you elaborate on what that is?
-
RE: PX4 qmi_error abort
@Eric-Katzfey Thanks for the response!
Are you using a recent version of VOXL SDK?
Cleo branched off of your repo at this tag: https://github.com/modalai/px4-firmware/tree/v1.14.0-2.0.36-dev
Have you made any modifications to the SDK?
Yup, we actively develop the PX4 modules, including the controllers and the EKF that run on the DSP.
So it may be our code running on the DSP causing the DSP crash, or it could be related to the bugs in the https://github.com/modalai/px4-firmware/tree/v1.14.0-2.0.36-dev tag itself that you mentioned have been fixed.
As far as a path forward, are there any methods you can suggest for inspecting the DSP to find the root cause of crashes? Things we can add to the code, perhaps a debug mode we can run the DSP modules in, etc.?
Also, do you know of bug fix commits in your repo's mainline that we at Cleo can attempt to backport to our fork and see if we also no longer see the DSP crashes?
Thank you for your help,
Rowan
-
PX4 qmi_error abort
Hey ModalAI PX4 users, has anyone been running into qmi_error causing the PX4 process to abort? At Cleo it happens at boot about 1/20 or 1/100 times. After booting successfully it's rarer, about 1/200 or 1/500 times.
Here's the full error from journalctl:
terminate called after throwing an instance of 'qmi_error'
Mar 19 15:33:57 m0054 voxl-px4[1854]: what(): qmi_client_send_msg_sync() failed, (client_id=)0, result=0: qmi service error (-2)
Mar 19 15:33:57 m0054 voxl-px4[1854]: /usr/bin/voxl-px4: line 140: 1868 Aborted GPS=$GPS RC=$RC OSD=$OSD EXTRA_STEPS=$EXTRA_STEPS px4 $DAEMON -s /usr/bin/voxl-px4-start
Mar 19 15:33:57 m0054 systemd[1]: voxl-px4.service: Main process exited, code=exited, status=134/n/a
Mar 19 15:33:57 m0054 systemd[1]: voxl-px4.service: Failed with result 'exit-code'.
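A note on the status=134 in that log: by shell convention an exit status above 128 means the child was killed by signal (status - 128), and 134 - 128 = 6 is SIGABRT on Linux, which matches the "Aborted" line from the /usr/bin/voxl-px4 wrapper script. Below is a minimal Python sketch that finds the systemd exit line in a journal capture and decodes it. The function names are hypothetical. It also strips terminal color-code residue such as "[0;1;39m", since some captures of this log contain it.

```python
import re
import signal

# SGR color sequences; the ESC byte is optional because copy/paste often drops it.
ANSI_SGR = re.compile(r"\x1b?\[[0-9;]*m")
EXIT_STATUS = re.compile(r"code=exited, status=(\d+)")


def strip_ansi(text):
    """Remove terminal color codes (or their residue) from log text."""
    return ANSI_SGR.sub("", text)


def signal_from_exit_status(status):
    """Map a shell-style exit status (128 + N) to a signal name, else None."""
    if status <= 128:
        return None
    try:
        return signal.Signals(status - 128).name
    except ValueError:
        return None  # no signal with that number on this platform


def find_abort(journal_text):
    """Return (status, signal_name) for the first systemd exit line, or None."""
    for line in strip_ansi(journal_text).splitlines():
        match = EXIT_STATUS.search(line)
        if match:
            status = int(match.group(1))
            return status, signal_from_exit_status(status)
    return None


if __name__ == "__main__":
    line = ("Mar 19 15:33:57 m0054 systemd[1]: voxl-px4.service: "
            "Main process exited, code=exited, status=134/n/a")
    print(find_abort(line))
```

This only confirms that the process died from an abort, which is what an uncaught C++ exception produces. It does not say why the qmi call failed.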
-
RE: Difference in A65 ToF output in Royale 4 vs. 5
(top is Royale 4, bottom is 5)
-
Difference in A65 ToF output in Royale 4 vs. 5
When upgrading the voxl-camera-server from Royale 4 to Royale 5 I noticed a big change in the output of the A65 ToF. Cleo had tuned our mapping software to the Royale 4 output, and it performs worse on the Royale 5 output. @James-Strawson looks like you were doing some tuning after changing to the Royale 5 processing pipeline, do you have any tips for how to recover the Royale 4 performance on the new voxl-camera-server?
-
RE: Refactor voxl-camera-server into multiple processes (per cam)
I tried just hard coding the num cams and moving on to the setting of callbacks:
M_DEBUG("SUCCESS: Camera module opened on attempt %d\n", i);

//This check should never fail but we should still make it
if (cameraModule->init != NULL) {
    M_DEBUG("Calling init\n");
    cameraModule->init();
    M_DEBUG("Calling init done\n");
    if (cameraModule->init == NULL) {
        M_ERROR("Camera module failed to init\n");
        return NULL;
    }
}

M_DEBUG("Getting num cams\n");
int numCameras = 1;
M_DEBUG("DONE Getting num cams\n");
M_DEBUG("----------- Number of cameras: %d\n\n", numCameras);

M_DEBUG("Setting callbacks\n");
cameraModule->set_callbacks(&moduleCallbacks);
M_DEBUG("DONE Setting callbacks\n");
So now it gets to "setting callbacks" but ABORTS before it gets to "DONE setting callbacks". Here are the logs: dmesg.txt logcat.txt
-
RE: Refactor voxl-camera-server into multiple processes (per cam)
To be clear, it printed "Getting num cams" but did not print "DONE Getting num cams".
-
RE: Refactor voxl-camera-server into multiple processes (per cam)
Tracked it down to when get_number_of_cameras is called after the cameraModule is init().
M_DEBUG("SUCCESS: Camera module opened on attempt %d\n", i);

//This check should never fail but we should still make it
if (cameraModule->init != NULL) {
    M_DEBUG("Calling init\n");
    cameraModule->init();
    M_DEBUG("Calling init done\n");
    if (cameraModule->init == NULL) {
        M_ERROR("Camera module failed to init\n");
        return NULL;
    }
}

M_DEBUG("Getting num cams\n");
int numCameras = cameraModule->get_number_of_cameras();
M_DEBUG("DONE Getting num cams\n");
Everything works until the last line, which aborts.
-
RE: Refactor voxl-camera-server into multiple processes (per cam)
As a first shot at this I just removed the kill_existing_process and make_pid_file code to allow for multiple camera server processes to be running at once and launched two instances with different tracking camera config files. The first one launched fine but the second one immediately aborts and doesn't really print any debug messages even with -d 0:
DEBUG: SUCCESS: Camera module opened on attempt 0
Aborted
-
Refactor voxl-camera-server into multiple processes (per cam)
Hi Modal,
Rowan from Cleo here. To figure out which camera is crashing and try to mitigate that, Simon and I are investigating how to split the voxl-camera-server into multiple processes, where each process runs a single camera (so 4 processes: two for the two tracking cameras, one for the ToF, and one for the Hi-Res).
Do the Modal camera developers have any input on the best path to follow for this development? Any technical issues we might run into? And also, will this split have the desired effect of limiting a CamEx abort to one camera, or will all the separate processes still abort?
Thanks!
Rowan
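To illustrate what we are hoping the split buys us, here is a minimal Python sketch of the OS-level behavior: one child process per camera, where an abort in one child leaves the others untouched and the parent can tell which one died. The camera names and the stand-in child commands are hypothetical. They are not voxl-camera-server invocations, and this says nothing about whether the underlying camera module tolerates being opened from several processes at once.

```python
import signal
import subprocess
import sys

# Hypothetical stand-ins for per-camera server processes.
HEALTHY = [sys.executable, "-c", "pass"]
ABORTING = [sys.executable, "-c", "import os; os.abort()"]


def run_cameras(commands):
    """Start one child process per camera, wait for all, return name -> returncode."""
    children = {name: subprocess.Popen(cmd) for name, cmd in commands.items()}
    return {name: child.wait() for name, child in children.items()}


def describe(returncode):
    """On POSIX, a negative returncode means the child was killed by that signal."""
    if returncode < 0:
        return "killed by " + signal.Signals(-returncode).name
    return "exited with code %d" % returncode


if __name__ == "__main__":
    results = run_cameras({
        "tracking0": HEALTHY,
        "tracking1": ABORTING,  # simulate the camera that aborts
        "tof": HEALTHY,
        "hires": HEALTHY,
    })
    for name, code in results.items():
        print(name, describe(code))
```

A supervisor along these lines could restart only the camera that died. Whether that helps in practice depends on whether the abort originates in state shared across processes, which is the question for the Modal camera developers.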