PX4 with custom driver on DSP chip crashes when mode is not manual

LucaVertiq

@Moderator
Hi, I'm adding support for serial connection to vertiq modules for a project and I'm running into a difficult to debug issue with PX4. I have the driver working on normal PX4 boards, and mostly working on the VOXL2 Mini.

When I am disarmed everything works as expected. If I arm in manual mode, acro mode or stabilize mode (probably others that don't require pos/vel as well) everything works normally most of the time and motors respond as expected with telemetry.

If I arm in position mode for example the DSP side of the chip seems to stop responding. The output of PX4 on the app processor becomes this:

INFO  [muorb] SLPI: Armed by external command
INFO  [logger] Start file log (type: full)
INFO  [logger] [logger] /data/px4/log/2024-03-14/19_52_13.ulg
WARN  [mavlink] Event dropped (5, 65526)
WARN  [mavlink] Event dropped (5, 65526)
INFO  [logger] Opened full log file: /data/px4/log/2024-03-14/19_52_13.ulg
WARN  [mavlink] Dropped 65521 events (seq=65526)
WARN  [mavlink] Dropped 65521 events (seq=65526)
INFO  [muorb] SLPI: Advertising remote topic logger_status
INFO  [uORB] Advertising remote topic sensor_gps
INFO  [uORB] Advertising remote topic estimator_gps_status
INFO  [uORB] Advertising remote topic estimator_aid_src_gnss_hgt
INFO  [uORB] Advertising remote topic estimator_aid_src_gnss_pos
INFO  [uORB] Advertising remote topic estimator_aid_src_gnss_vel
WARN  [mavlink] Event dropped (65527, 9)
WARN  [mavlink] Event dropped (65527, 9)
WARN  [mavlink] Dropped 7 events (seq=9)
WARN  [mavlink] Dropped 7 events (seq=9)
>>> Got an exception from send_request <<<
>>> Send succeeded after retries <<<
Sending topic message
--- msg_id: 1033
--- topic name: offboard_control_mode
Sending topic message
--- msg_id: 1033
--- topic name: vehicle_visual_odometry
Sending topic message
--- msg_id: 1033
--- topic name: manual_control_input
Sending topic message
--- msg_id: 1033
--- topic name: offboard_control_mode
Sending topic message
--- msg_id: 1033
--- topic name: vehicle_visual_odometry
Sending topic message
--- msg_id: 1033
--- topic name: manual_control_input
Sending topic message
--- msg_id: 1033
--- topic name: offboard_control_mode
Sending topic message
--- msg_id: 1033
--- topic name: vehicle_visual_odometry
Sending topic message
--- msg_id: 1033
--- topic name: manual_control_input
Sending topic message
--- msg_id: 1033
--- topic name: offboard_control_mode
Sending topic message
--- msg_id: 1033
--- topic name: vehicle_visual_odometry
Sending topic message
--- msg_id: 1033
--- topic name: offboard_control_mode
Sending topic message
--- msg_id: 1033
--- topic name: vehicle_visual_odometry
Sending topic message
--- msg_id: 1033
--- topic name: manual_control_input
Sending topic message
--- msg_id: 1033
--- topic name: offboard_control_mode
Sending topic message
--- msg_id: 1033
--- topic name: vehicle_visual_odometry
Sending topic message
--- msg_id: 1033
--- topic name: manual_control_input
Connection error: connection reset
Sending topic message
--- msg_id: 1033
--- topic name: offboard_control_mode
Got response cb 0
Sending topic message
--- msg_id: 1033
--- topic name: vehicle_visual_odometry
Got response cb 0
Got flight controller event
Received standard error event SNS_STD_ERROR_NOT_SUPPORTED
Sending topic message
--- msg_id: 1033
--- topic name: cpuload
Got flight controller event
Received standard error event SNS_STD_ERROR_NOT_SUPPORTED
Got response cb 0
Sending topic message
--- msg_id: 1033
--- topic name: manual_control_input
Got response cb 0
Got flight controller event
Received standard error event SNS_STD_ERROR_NOT_SUPPORTED
Sending topic message
--- msg_id: 1033
--- topic name: offboard_control_mode
Got response cb 0
Got flight controller event
Received standard error event SNS_STD_ERROR_NOT_SUPPORTED
Sending topic message
--- msg_id: 1033
--- topic name: vehicle_visual_odometry
Got response cb 0
Got flight controller event
Received standard error event SNS_STD_ERROR_NOT_SUPPORTED
Sending topic message
--- msg_id: 1033
--- topic name: manual_control_input
Got flight controller event
Received standard error event SNS_STD_ERROR_NOT_SUPPORTED
Got response cb 0
Sending topic message
--- msg_id: 1033
--- topic name: offboard_control_mode
Got response cb 0
Got flight controller event
Received standard error event SNS_STD_ERROR_NOT_SUPPORTED
Sending topic message
--- msg_id: 1033
--- topic name: vehicle_visual_odometry
Got response cb 0
Got flight controller event
Received standard error event SNS_STD_ERROR_NOT_SUPPORTED
Sending topic message
--- msg_id: 1033
--- topic name: manual_control_input
Got response cb 0
Got flight controller event
Received standard error event SNS_STD_ERROR_NOT_SUPPORTED
Sending topic message
--- msg_id: 1033
--- topic name: offboard_control_mode
Got response cb 0
Got flight controller event
Received standard error event SNS_STD_ERROR_NOT_SUPPORTED
Sending topic message
--- msg_id: 1033
--- topic name: vehicle_visual_odometry
Got response cb 0
Got flight controller event
Received standard error event SNS_STD_ERROR_NOT_SUPPORTED
Sending topic message
--- msg_id: 1033

It seems to be some sort of issue with muorb? Maybe the overall volume of messages is too much? I'm not sure, but would love some help as it seems like it's at the intersection of the two chips possibly.

Thanks,

Luca

Alex Kushleyev

@LucaVertiq , does the issue persist if you disable your custom module for vertiq ESCs?

Eric Katzfey

@LucaVertiq It's probably just an indication that the DSP crashed. Unfortunately the debugging support on DSP isn't great right now so you'll have to just start ruling things out. As Alex suggested verify that everything runs fine without your new driver. If so then you can start your driver with most functionality commented out and then slowly start adding it back in to see what is causing the crash.

LucaVertiq

@Eric-Katzfey @Alex-Kushleyev We determined it was only when we had the telemetry section of our driver working. For now I've disabled it, but it seems like either we're bogging it down, or that the communication through lib sensors is getting bogged down. When I restart the voxl-px4 on the apps side it works again so it seems like the dsp side hasn't actually crashed. I've started looking at work_queue and it seems that we slow down the rate a bit, so we might need to optimize. I may check out how you guys are using reporting esc telemetry. Maybe we should only report it once all escs have been reported on.

Alex Kushleyev

@LucaVertiq you can check if DSP crashed during execution:

dmesg -w

and look for something like this :

[21911.381543] Fatal error on slpi!
[21911.381625] slpi subsystem failure reason: err_qdi.c:1063:PC=b21d1a28,SP=e62c1fa8,FP=e62c1ff0,LR=b21d1a28,BADVA=89,CAUSE=2001,TASK=PX4_muorb_agg.
[21911.381652] subsys-restart: subsystem_restart_dev(): Restart sequence requested for slpi, restart_level = RELATED.
[21911.382686] adsprpc: fastrpc_restart_notifier_cb: slpi subsystem is restarting
[21911.382692] subsys-restart: subsystem_shutdown(): [kworker/u19:2:17850]: Shutting down slpi

Note that currently, the DSP (SLPI) will crash when you stop PX4 (it is "normal" behavior right now). So if you are debugging your issue, make sure to pay attention when the crash message appears (during execution of voxl-px4 or after you stopped it)

LucaVertiq

@Alex-Kushleyev When you say 'the crash message' do you mean the ">>> Got an exception from send_request <<<"?

That occurs while voxl-px4 is running. It won't crash and will continue to output the semi repeating messages seen after that until I kill voxl-px4. I'll test the dmesg -w on Monday when I get back into work.

Alex Kushleyev

@LucaVertiq , no i mean the crash message from dmesg telling you that SLPI has crashed. See if that message happens at the same time as your Got an exception from send_request in the voxl-px4 output.

Please note that the SLPI can crash and you could still have voxl-px4 process running on the CPU.

LucaVertiq

@Alex-Kushleyev So it looks like this message just repeats in dmesg -w

[   85.669815] Fatal error on slpi!
[   85.669887] slpi subsystem failure reason: err_qdi.c:1063:PC=e616ed3c,SP=31782ca0,FP=31782cc8,LR=e616ed40,BADVA=b220fb58,CAUSE=1e01,TASK=Anonymous.
[   85.669921] subsys-restart: subsystem_restart_dev(): Restart sequence requested for slpi, restart_level = RELATED.
[   85.671120] adsprpc: fastrpc_restart_notifier_cb: slpi subsystem is restarting
[   85.671127] subsys-restart: subsystem_shutdown(): [kworker/u19:0:1857]: Shutting down slpi
[   85.689766] adsprpc: fastrpc_rpmsg_remove: closed rpmsg channel of slpi
[   85.690236] coresight-remote-etm soc:ssc_etm0: Connection disconnected between QMI handle and 8 service
[   85.690290] sysmon-qmi: ssctl_del_server: Connection lost between QMI handle and slpi's SSCTL service
[   85.690690] adsprpc: fastrpc_restart_notifier_cb: received RAMDUMP notification for slpi
[   85.692605] subsys-restart: subsystem_powerup(): [kworker/u19:0:1857]: Powering up slpi
[   85.693232] subsys-pil-tz 5c00000.qcom,ssc: slpi: loading from 0x0000000088c00000 to 0x000000008a600000
[   85.768512] subsys-pil-tz 5c00000.qcom,ssc: slpi: Brought out of reset
[   85.824221] subsys-pil-tz 5c00000.qcom,ssc: Subsystem error monitoring/handling services are up
[   85.824375] subsys-pil-tz 5c00000.qcom,ssc: slpi: Power/Clock ready interrupt received
[   85.824481] qcom_rpmh DRV:apps_rsc TCS Busy, retrying RPMH message send: addr=0x30040
[   85.825742] adsprpc: fastrpc_restart_notifier_cb: slpi subsystem is up
[   85.825747] subsys-restart: subsystem_restart_wq_func(): [kworker/u19:0:1857]: Restart sequence for slpi completed.
[   85.827958] -1620030744:Entered
[   85.828116] -1620030744:SMD QRTR driver probed
[   85.831088] sysmon-qmi: ssctl_new_server: Connection established between QMI handle and slpi's SSCTL service
[   85.831100] coresight-remote-etm soc:ssc_etm0: Connection established between QMI handle and 8 service
[   85.834589] adsprpc: fastrpc_rpmsg_probe: opened rpmsg channel for slpi
[   85.882137] diag: In diag_send_peripheral_buffering_mode, buffering flag not set for 3
[   87.173372] Fatal error on slpi!
[   87.173448] slpi subsystem failure reason: err_qdi.c:1063:EF:sensor_process:0x1:SNS_SEE_I_1:0x67:sns_stream_service.c:436:req_found.
[   87.173497] subsys-restart: subsystem_restart_dev(): Restart sequence requested for slpi, restart_level = RELATED.
[   87.175158] adsprpc: fastrpc_restart_notifier_cb: slpi subsystem is restarting
[   87.175170] subsys-restart: subsystem_shutdown(): [kworker/u19:0:1857]: Shutting down slpi
[   87.192557] adsprpc: fastrpc_rpmsg_remove: closed rpmsg channel of slpi
[   87.192709] coresight-remote-etm soc:ssc_etm0: Connection disconnected between QMI handle and 8 service
[   87.192730] sysmon-qmi: ssctl_del_server: Connection lost between QMI handle and slpi's SSCTL service
[   87.194300] adsprpc: fastrpc_restart_notifier_cb: received RAMDUMP notification for slpi
[   87.197243] subsys-restart: subsystem_powerup(): [kworker/u19:0:1857]: Powering up slpi
[   87.197711] subsys-pil-tz 5c00000.qcom,ssc: slpi: loading from 0x0000000088c00000 to 0x000000008a600000
[   87.270116] subsys-pil-tz 5c00000.qcom,ssc: slpi: Brought out of reset
[   87.325674] subsys-pil-tz 5c00000.qcom,ssc: Subsystem error monitoring/handling services are up
[   87.325726] subsys-pil-tz 5c00000.qcom,ssc: slpi: Power/Clock ready interrupt received
[   87.325841] qcom_rpmh DRV:apps_rsc TCS Busy, retrying RPMH message send: addr=0x30040
[   87.331325] adsprpc: fastrpc_restart_notifier_cb: slpi subsystem is up
[   87.331330] subsys-restart: subsystem_restart_wq_func(): [kworker/u19:0:1857]: Restart sequence for slpi completed.
[   87.331891] -1620030744:Entered
[   87.340471] -1620030744:SMD QRTR driver probed
[   87.343180] adsprpc: fastrpc_rpmsg_probe: opened rpmsg channel for slpi
[   87.344830] sysmon-qmi: ssctl_new_server: Connection established between QMI handle and slpi's SSCTL service
[   87.344850] coresight-remote-etm soc:ssc_etm0: Connection established between QMI handle and 8 service
[   87.350149] diag: In diag_send_peripheral_buffering_mode, buffering flag not set for 3
[   88.637779] Fatal error on slpi!

Then everything from Fatal error repeats.

Alex Kushleyev

@LucaVertiq , ok so basically it looks like the SLPI does crash while the CPU-side voxl-px4 executable is still running, so the slpi is trying to re-start px4 and crashes again.

But the main point is that SLPI does crash, so there is a potential low-level issue like accessing wrong memory address or something like that.

LucaVertiq

@Alex-Kushleyev Thank you! I'll look through the code a bit harder to make sure I'm not doing anything silly. I'll let you know if I have any more questions.