PX4 with custom driver on DSP chip crashes when mode is not manual
-
@Moderator
Hi, I'm adding support for serial connection to vertiq modules for a project and I'm running into a difficult to debug issue with PX4. I have the driver working on normal PX4 boards, and mostly working on the VOXL2 Mini.When I am disarmed everything works as expected. If I arm in manual mode, acro mode or stabilize mode (probably others that don't require pos/vel as well) everything works normally most of the time and motors respond as expected with telemetry.
If I arm in position mode for example the DSP side of the chip seems to stop responding. The output of PX4 on the app processor becomes this:
INFO [muorb] SLPI: Armed by external command INFO [logger] Start file log (type: full) INFO [logger] [logger] /data/px4/log/2024-03-14/19_52_13.ulg WARN [mavlink] Event dropped (5, 65526) WARN [mavlink] Event dropped (5, 65526) INFO [logger] Opened full log file: /data/px4/log/2024-03-14/19_52_13.ulg WARN [mavlink] Dropped 65521 events (seq=65526) WARN [mavlink] Dropped 65521 events (seq=65526) INFO [muorb] SLPI: Advertising remote topic logger_status INFO [uORB] Advertising remote topic sensor_gps INFO [uORB] Advertising remote topic estimator_gps_status INFO [uORB] Advertising remote topic estimator_aid_src_gnss_hgt INFO [uORB] Advertising remote topic estimator_aid_src_gnss_pos INFO [uORB] Advertising remote topic estimator_aid_src_gnss_vel WARN [mavlink] Event dropped (65527, 9) WARN [mavlink] Event dropped (65527, 9) WARN [mavlink] Dropped 7 events (seq=9) WARN [mavlink] Dropped 7 events (seq=9) >>> Got an exception from send_request <<< >>> Send succeeded after retries <<< Sending topic message --- msg_id: 1033 --- topic name: offboard_control_mode Sending topic message --- msg_id: 1033 --- topic name: vehicle_visual_odometry Sending topic message --- msg_id: 1033 --- topic name: manual_control_input Sending topic message --- msg_id: 1033 --- topic name: offboard_control_mode Sending topic message --- msg_id: 1033 --- topic name: vehicle_visual_odometry Sending topic message --- msg_id: 1033 --- topic name: manual_control_input Sending topic message --- msg_id: 1033 --- topic name: offboard_control_mode Sending topic message --- msg_id: 1033 --- topic name: vehicle_visual_odometry Sending topic message --- msg_id: 1033 --- topic name: manual_control_input Sending topic message --- msg_id: 1033 --- topic name: offboard_control_mode Sending topic message --- msg_id: 1033 --- topic name: vehicle_visual_odometry Sending topic message --- msg_id: 1033 --- topic name: offboard_control_mode Sending topic message --- msg_id: 1033 --- topic name: vehicle_visual_odometry Sending topic message --- msg_id: 1033 --- topic name: manual_control_input Sending topic message --- msg_id: 1033 --- topic name: offboard_control_mode Sending topic message --- msg_id: 1033 --- topic name: vehicle_visual_odometry Sending topic message --- msg_id: 1033 --- topic name: manual_control_input Connection error: connection reset Sending topic message --- msg_id: 1033 --- topic name: offboard_control_mode Got response cb 0 Sending topic message --- msg_id: 1033 --- topic name: vehicle_visual_odometry Got response cb 0 Got flight controller event Received standard error event SNS_STD_ERROR_NOT_SUPPORTED Sending topic message --- msg_id: 1033 --- topic name: cpuload Got flight controller event Received standard error event SNS_STD_ERROR_NOT_SUPPORTED Got response cb 0 Sending topic message --- msg_id: 1033 --- topic name: manual_control_input Got response cb 0 Got flight controller event Received standard error event SNS_STD_ERROR_NOT_SUPPORTED Sending topic message --- msg_id: 1033 --- topic name: offboard_control_mode Got response cb 0 Got flight controller event Received standard error event SNS_STD_ERROR_NOT_SUPPORTED Sending topic message --- msg_id: 1033 --- topic name: vehicle_visual_odometry Got response cb 0 Got flight controller event Received standard error event SNS_STD_ERROR_NOT_SUPPORTED Sending topic message --- msg_id: 1033 --- topic name: manual_control_input Got flight controller event Received standard error event SNS_STD_ERROR_NOT_SUPPORTED Got response cb 0 Sending topic message --- msg_id: 1033 --- topic name: offboard_control_mode Got response cb 0 Got flight controller event Received standard error event SNS_STD_ERROR_NOT_SUPPORTED Sending topic message --- msg_id: 1033 --- topic name: vehicle_visual_odometry Got response cb 0 Got flight controller event Received standard error event SNS_STD_ERROR_NOT_SUPPORTED Sending topic message --- msg_id: 1033 --- topic name: manual_control_input Got response cb 0 Got flight controller event Received standard error event SNS_STD_ERROR_NOT_SUPPORTED Sending topic message --- msg_id: 1033 --- topic name: offboard_control_mode Got response cb 0 Got flight controller event Received standard error event SNS_STD_ERROR_NOT_SUPPORTED Sending topic message --- msg_id: 1033 --- topic name: vehicle_visual_odometry Got response cb 0 Got flight controller event Received standard error event SNS_STD_ERROR_NOT_SUPPORTED Sending topic message --- msg_id: 1033
It seems to be some sort of issue with muorb? Maybe the overall volume of messages is too much? I'm not sure, but would love some help as it seems like it's at the intersection of the two chips possibly.
Thanks,
Luca
-
@LucaVertiq , does the issue persist if you disable your custom module for vertiq ESCs?
-
@LucaVertiq It's probably just an indication that the DSP crashed. Unfortunately the debugging support on DSP isn't great right now so you'll have to just start ruling things out. As Alex suggested verify that everything runs fine without your new driver. If so then you can start your driver with most functionality commented out and then slowly start adding it back in to see what is causing the crash.
-
@Eric-Katzfey @Alex-Kushleyev We determined it was only when we had the telemetry section of our driver working. For now I've disabled it, but it seems like either we're bogging it down, or that the communication through lib sensors is getting bogged down. When I restart the voxl-px4 on the apps side it works again so it seems like the dsp side hasn't actually crashed. I've started looking at work_queue and it seems that we slow down the rate a bit, so we might need to optimize. I may check out how you guys are using reporting esc telemetry. Maybe we should only report it once all escs have been reported on.
-
@LucaVertiq you can check if DSP crashed during execution:
dmesg -w
and look for something like this :
[21911.381543] Fatal error on slpi! [21911.381625] slpi subsystem failure reason: err_qdi.c:1063:PC=b21d1a28,SP=e62c1fa8,FP=e62c1ff0,LR=b21d1a28,BADVA=89,CAUSE=2001,TASK=PX4_muorb_agg. [21911.381652] subsys-restart: subsystem_restart_dev(): Restart sequence requested for slpi, restart_level = RELATED. [21911.382686] adsprpc: fastrpc_restart_notifier_cb: slpi subsystem is restarting [21911.382692] subsys-restart: subsystem_shutdown(): [kworker/u19:2:17850]: Shutting down slpi
Note that currently, the DSP (SLPI) will crash when you stop PX4 (it is "normal" behavior right now). So if you are debugging your issue, make sure to pay attention when the crash message appears (during execution of voxl-px4 or after you stopped it)
-
@Alex-Kushleyev When you say 'the crash message' do you mean the ">>> Got an exception from send_request <<<"?
That occurs while voxl-px4 is running. It won't crash and will continue to output the semi repeating messages seen after that until I kill voxl-px4. I'll test the dmesg -w on Monday when I get back into work.
-
@LucaVertiq , no i mean the crash message from
dmesg
telling you that SLPI has crashed. See if that message happens at the same time as yourGot an exception from send_request
in thevoxl-px4
output.Please note that the SLPI can crash and you could still have
voxl-px4
process running on the CPU. -
@Alex-Kushleyev So it looks like this message just repeats in dmesg -w
[ 85.669815] Fatal error on slpi! [ 85.669887] slpi subsystem failure reason: err_qdi.c:1063:PC=e616ed3c,SP=31782ca0,FP=31782cc8,LR=e616ed40,BADVA=b220fb58,CAUSE=1e01,TASK=Anonymous. [ 85.669921] subsys-restart: subsystem_restart_dev(): Restart sequence requested for slpi, restart_level = RELATED. [ 85.671120] adsprpc: fastrpc_restart_notifier_cb: slpi subsystem is restarting [ 85.671127] subsys-restart: subsystem_shutdown(): [kworker/u19:0:1857]: Shutting down slpi [ 85.689766] adsprpc: fastrpc_rpmsg_remove: closed rpmsg channel of slpi [ 85.690236] coresight-remote-etm soc:ssc_etm0: Connection disconnected between QMI handle and 8 service [ 85.690290] sysmon-qmi: ssctl_del_server: Connection lost between QMI handle and slpi's SSCTL service [ 85.690690] adsprpc: fastrpc_restart_notifier_cb: received RAMDUMP notification for slpi [ 85.692605] subsys-restart: subsystem_powerup(): [kworker/u19:0:1857]: Powering up slpi [ 85.693232] subsys-pil-tz 5c00000.qcom,ssc: slpi: loading from 0x0000000088c00000 to 0x000000008a600000 [ 85.768512] subsys-pil-tz 5c00000.qcom,ssc: slpi: Brought out of reset [ 85.824221] subsys-pil-tz 5c00000.qcom,ssc: Subsystem error monitoring/handling services are up [ 85.824375] subsys-pil-tz 5c00000.qcom,ssc: slpi: Power/Clock ready interrupt received [ 85.824481] qcom_rpmh DRV:apps_rsc TCS Busy, retrying RPMH message send: addr=0x30040 [ 85.825742] adsprpc: fastrpc_restart_notifier_cb: slpi subsystem is up [ 85.825747] subsys-restart: subsystem_restart_wq_func(): [kworker/u19:0:1857]: Restart sequence for slpi completed. [ 85.827958] -1620030744:Entered [ 85.828116] -1620030744:SMD QRTR driver probed [ 85.831088] sysmon-qmi: ssctl_new_server: Connection established between QMI handle and slpi's SSCTL service [ 85.831100] coresight-remote-etm soc:ssc_etm0: Connection established between QMI handle and 8 service [ 85.834589] adsprpc: fastrpc_rpmsg_probe: opened rpmsg channel for slpi [ 85.882137] diag: In diag_send_peripheral_buffering_mode, buffering flag not set for 3 [ 87.173372] Fatal error on slpi! [ 87.173448] slpi subsystem failure reason: err_qdi.c:1063:EF:sensor_process:0x1:SNS_SEE_I_1:0x67:sns_stream_service.c:436:req_found. [ 87.173497] subsys-restart: subsystem_restart_dev(): Restart sequence requested for slpi, restart_level = RELATED. [ 87.175158] adsprpc: fastrpc_restart_notifier_cb: slpi subsystem is restarting [ 87.175170] subsys-restart: subsystem_shutdown(): [kworker/u19:0:1857]: Shutting down slpi [ 87.192557] adsprpc: fastrpc_rpmsg_remove: closed rpmsg channel of slpi [ 87.192709] coresight-remote-etm soc:ssc_etm0: Connection disconnected between QMI handle and 8 service [ 87.192730] sysmon-qmi: ssctl_del_server: Connection lost between QMI handle and slpi's SSCTL service [ 87.194300] adsprpc: fastrpc_restart_notifier_cb: received RAMDUMP notification for slpi [ 87.197243] subsys-restart: subsystem_powerup(): [kworker/u19:0:1857]: Powering up slpi [ 87.197711] subsys-pil-tz 5c00000.qcom,ssc: slpi: loading from 0x0000000088c00000 to 0x000000008a600000 [ 87.270116] subsys-pil-tz 5c00000.qcom,ssc: slpi: Brought out of reset [ 87.325674] subsys-pil-tz 5c00000.qcom,ssc: Subsystem error monitoring/handling services are up [ 87.325726] subsys-pil-tz 5c00000.qcom,ssc: slpi: Power/Clock ready interrupt received [ 87.325841] qcom_rpmh DRV:apps_rsc TCS Busy, retrying RPMH message send: addr=0x30040 [ 87.331325] adsprpc: fastrpc_restart_notifier_cb: slpi subsystem is up [ 87.331330] subsys-restart: subsystem_restart_wq_func(): [kworker/u19:0:1857]: Restart sequence for slpi completed. [ 87.331891] -1620030744:Entered [ 87.340471] -1620030744:SMD QRTR driver probed [ 87.343180] adsprpc: fastrpc_rpmsg_probe: opened rpmsg channel for slpi [ 87.344830] sysmon-qmi: ssctl_new_server: Connection established between QMI handle and slpi's SSCTL service [ 87.344850] coresight-remote-etm soc:ssc_etm0: Connection established between QMI handle and 8 service [ 87.350149] diag: In diag_send_peripheral_buffering_mode, buffering flag not set for 3 [ 88.637779] Fatal error on slpi!
Then everything from Fatal error repeats.
-
@LucaVertiq , ok so basically it looks like the SLPI does crash while the CPU-side voxl-px4 executable is still running, so the slpi is trying to re-start px4 and crashes again.
But the main point is that SLPI does crash, so there is a potential low-level issue like accessing wrong memory address or something like that.
-
@Alex-Kushleyev Thank you! I'll look through the code a bit harder to make sure I'm not doing anything silly. I'll let you know if I have any more questions.