ModalAI Forum

    PX4 qmi_error abort

    VOXL SDK
    • Rowan Dempster @Eric Katzfey

      @Eric-Katzfey

      > The SLPI image used to be part of the main system image. It was then separated out into its own package for easier maintenance.

      Gotcha, makes sense!

      > So you have a very old version missing many important bug fixes. We've never tried installing the latest modalai-slpi package on an old SDK but I think it will work. Give it a try and see what happens.

      Will do. I just wanted to confirm that it "might work" so I'm not totally wasting my time exploring this avenue, haha.

      > But, obviously, it's really hard for us to support you when you use such old software with custom modifications on top of it. You should really try to make any customizations such that they can easily be used with newer versions of VOXL SDK as they come out.

      Yes, this is something we constantly run into at Cleo as a small company trying to release a stable product while also keeping up with the latest and greatest from Modal and other open-source vendors. As a dev at Cleo, it feels like two things are pulling in opposite directions:

      1. Having your base platform constantly updating, which then requires patches for API changes and sometimes deeper low-level incompatibilities that come along with those updates.
      2. Getting the stability and functional improvements that come along with those base platform updates.

      I'm sure these competing forces are felt by others as well, not just at Cleo. It's perhaps worth a call between Cleo and Modal devs to align on the best way to get the stability and functional improvements from each vendor (ModalAI) release while minimizing the Cleo dev time needed for API patches, and minimizing incompatibilities, or at least forecasting them before we spend time on an upgrade only to find one.

      • Rowan Dempster @Rowan Dempster

        Just following up on the testing Cleo did following @Eric-Katzfey's suggestion of installing http://voxl-packages.modalai.com/dists/qrb5165/sdk-1.4/binary-arm64/modalai-slpi_1.1.19_arm64.deb:

        Before installing the new package:

        • qmi_client_send_msg_sync failure at PX4 startup during boot number 48
        • qmi_client_send_msg_sync failure at PX4 startup during boot number 62
        • qmi_client_send_msg_sync failure at PX4 startup during boot number 95
        • qmi_client_send_msg_sync failure at PX4 startup during boot number 131
        • qmi_client_send_msg_sync failure at PX4 startup during boot number 22

        After installing the new package:

        • 550 boots in a row without any failures during PX4 startup
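A rough way to gauge how strong the 550-boot result is: the sketch below assumes each boot number above is the count of boots until the first crash in a separate test run, and that crashes happen independently with a fixed per-boot probability (a geometric model). Neither assumption is stated in the post, so treat the numbers as illustrative.

```python
# Illustrative check under assumed conditions: each listed boot number is the
# boots-to-first-failure of an independent run, and every boot fails with the
# same fixed probability p (geometric model). These are assumptions.
failure_boots = [48, 62, 95, 131, 22]  # from the "before" list above
boots_after_fix = 550                  # consecutive clean boots after the update


def per_boot_failure_rate(boots_to_failure):
    """Maximum-likelihood estimate of p for a geometric distribution:
    number of observed failures divided by total boots observed."""
    return len(boots_to_failure) / sum(boots_to_failure)


def prob_clean_streak(p, n):
    """Probability of n consecutive boots with no failure if each fails with prob p."""
    return (1.0 - p) ** n


p_old = per_boot_failure_rate(failure_boots)
chance = prob_clean_streak(p_old, boots_after_fix)
print(f"estimated old failure rate: {p_old:.4f} per boot (~1 in {1 / p_old:.0f})")
print(f"chance of {boots_after_fix} clean boots at the old rate: {chance:.2e}")
```

Under those assumptions, the old package would get through 550 clean boots in a row less than 0.1% of the time, which supports (but does not prove) that the update fixed the startup crash.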

        Going forward, Cleo will install http://voxl-packages.modalai.com/dists/qrb5165/sdk-1.4/binary-arm64/modalai-slpi_1.1.19_arm64.deb on all dronuts we build.

        Thank you for your help @Eric-Katzfey !

        • Rowan Dempster @Rowan Dempster

          @Eric-Katzfey Unfortunately, it seems the specific crash that was happening 12 seconds after power-on during boot-up was only one of the issues. After boot-up completes, we are still seeing PX4 crashes with the same error message about 77 seconds after power-up:

          Mar 02 12:59:17 m0054 voxl-px4[1832]: terminate called after throwing an instance of 'qmi_error'
          Mar 02 12:59:17 m0054 voxl-px4[1832]:   what():  qmi_client_send_msg_sync() failed, (client_id=)0, result=0: qmi service error (-2)
          Mar 02 12:59:17 m0054 voxl-px4[1832]: /usr/bin/voxl-px4: line 140:  1838 Aborted                 GPS=$GPS RC=$RC OSD=$OSD EXTRA_STEPS=$EXTRA_STEPS px4 $DAEMON -s /usr/bin/voxl-px4-start
          Mar 02 12:59:17 m0054 systemd[1]: voxl-px4.service: Main process exited, code=exited, status=134/n/a
          Mar 02 12:59:17 m0054 systemd[1]: voxl-px4.service: Failed with result 'exit-code'.
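The `status=134` in the systemd lines appears to follow the shell convention of reporting a child killed by signal N as 128 + N. That matches the `Aborted` message from the `voxl-px4` wrapper script and the uncaught `qmi_error` (an uncaught C++ exception calls `std::terminate`, which aborts by default). A small sketch using Python's standard `signal` module to decode it; `decode_exit_status` is a hypothetical helper, and signal numbers are platform-specific (6 is SIGABRT on Linux).

```python
import signal


def decode_exit_status(status):
    """Map a shell/systemd-style exit status to a signal name when status > 128.

    By shell convention, a process killed by signal N is reported as 128 + N.
    Statuses of 128 or below are treated as ordinary exit codes.
    """
    if status > 128:
        try:
            return signal.Signals(status - 128).name
        except ValueError:
            return f"unknown signal {status - 128}"
    return f"exit code {status}"


# status=134 from the voxl-px4.service log above
print(decode_exit_status(134))  # on Linux: SIGABRT
```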
          

          We were also able to get the dmesg output from this boot, and these messages line up with when PX4 crashed and printed the qmi_error:

          [   77.107460] Fatal error on slpi!
          [   77.107529] slpi subsystem failure reason: err_qdi.c:1079:PC=e61fc160,SP=317931e8,FP=31793268,LR=e621d784,BADVA=0,CAUSE=7003,TASK=Anonymous.
          [   77.107564] subsys-restart: subsystem_restart_dev(): Restart sequence requested for slpi, restart_level = RELATED.
          [   77.108605] adsprpc: fastrpc_restart_notifier_cb: slpi subsystem is restarting
          [   77.108612] subsys-restart: subsystem_shutdown(): [kworker/u19:0:2966]: Shutting down slpi
          [   77.120971] qcom_rpmh DRV:apps_rsc TCS Busy, retrying RPMH message send: addr=0x30030
          [   77.123099] adsprpc: fastrpc_rpmsg_remove: closed rpmsg channel of slpi
          [   77.123533] adsprpc: fastrpc_restart_notifier_cb: received RAMDUMP notification for slpi
          [   77.123932] coresight-remote-etm soc:ssc_etm0: Connection disconnected between QMI handle and 8 service
          [   77.123941] sysmon-qmi: ssctl_del_server: Connection lost between QMI handle and slpi's SSCTL service
          [   77.124485] subsys-restart: subsystem_powerup(): [kworker/u19:0:2966]: Powering up slpi
          [   77.124863] subsys-pil-tz 5c00000.qcom,ssc: slpi: loading from 0x0000000088c00000 to 0x000000008a600000
          [   77.198746] subsys-pil-tz 5c00000.qcom,ssc: slpi: Brought out of reset
          [   77.254413] subsys-pil-tz 5c00000.qcom,ssc: Subsystem error monitoring/handling services are up
          [   77.254573] subsys-pil-tz 5c00000.qcom,ssc: slpi: Power/Clock ready interrupt received
          [   77.259994] adsprpc: fastrpc_restart_notifier_cb: slpi subsystem is up
          [   77.259999] subsys-restart: subsystem_restart_wq_func(): [kworker/u19:0:2966]: Restart sequence for slpi completed.
          [   77.261053] -1836034584:Entered
          [   77.264781] -1836034584:SMD QRTR driver probed
          [   77.267518] sysmon-qmi: ssctl_new_server: Connection established between QMI handle and slpi's SSCTL service
          [   77.267568] coresight-remote-etm soc:ssc_etm0: Connection established between QMI handle and 8 service
          [   77.268271] adsprpc: fastrpc_rpmsg_probe: opened rpmsg channel for slpi
          [   77.274585] diag: In diag_send_peripheral_buffering_mode, buffering flag not set for 3
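The register values needed to locate the crash are all on the `subsystem failure reason` line. A minimal sketch for pulling them out, assuming the `KEY=value,KEY=value` layout shown above holds for other crashes too (`parse_failure_reason` is a hypothetical helper):

```python
import re

# The "subsystem failure reason" line from the dmesg output above.
LINE = ("[   77.107529] slpi subsystem failure reason: err_qdi.c:1079:"
        "PC=e61fc160,SP=317931e8,FP=31793268,LR=e621d784,BADVA=0,CAUSE=7003,"
        "TASK=Anonymous.")


def parse_failure_reason(line):
    """Extract KEY=value pairs (PC, SP, FP, LR, ...) from an SLPI fatal-error line.

    Values that parse as hex are converted to int; others are kept as strings.
    """
    fields = {}
    for key, value in re.findall(r"([A-Z]+)=([^,]+)", line):
        value = value.rstrip(".")  # the line ends with a trailing period
        try:
            fields[key] = int(value, 16)
        except ValueError:
            fields[key] = value
    return fields


regs = parse_failure_reason(LINE)
print(hex(regs["PC"]), hex(regs["LR"]))  # 0xe61fc160 0xe621d784
```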
          
          • Eric Katzfey, ModalAI Team @Rowan Dempster

            @Rowan-Dempster That program counter (PC=0xe61fc160) indicates that the crash happened in the loaded px4 library.

            • Eric Katzfey, ModalAI Team @Rowan Dempster

              @Rowan-Dempster There is a way to figure out where in the code the crash happened. I can update the modalai-slpi package to add a debug print in mini-dm showing the address where libpx4.so was loaded into memory. I just ran that, and it showed that libpx4.so was loaded at address 0xe6120000. Once you know that:

              1. Disassemble libpx4.so to get the address map. For example: /local/mnt/workspace/Qualcomm/Hexagon_SDK/4.1.0.4/tools/HEXAGON_Tools/8.4.05/Tools/bin/hexagon-llvm-objdump -d build/modalai_voxl2-slpi_default/platforms/qurt/libpx4.so > dsp-image.dis
              2. Take the PC address from the dmesg fatal error (in this case 0xe61fc160) and subtract the base address (0xe61fc160 - 0xe6120000 = 0xdc160).
              3. Look up that offset in the disassembled file. That will show you where it crashed.

              The fatal error message also gives you the LR, which can help show where the crashing code was called from.
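The offset arithmetic and lookup described above can be scripted. The sketch below hard-codes the base, PC, and LR values quoted in this thread, computes the offsets, and scans an `objdump -d` listing for the nearest function header at or below each offset. It assumes the usual objdump header format (`000dc100 <symbol>:`) and that the library's addresses in the listing start at 0. Note that the 0xe6120000 base came from a separate run, so for a real crash you would use the base printed during the same boot. `to_offset`, `enclosing_symbol`, and `locate_in_file` are hypothetical helper names.

```python
import re

# Values quoted in this thread. The base came from a separate debug run, so for a
# real crash, use the libpx4.so base address printed during the same boot.
LOAD_BASE = 0xE6120000  # libpx4.so load address reported via mini-dm
PC = 0xE61FC160         # program counter from the dmesg fatal error
LR = 0xE621D784         # link register from the same message

# objdump -d function headers look like "000dc100 <some_function>:"
HEADER = re.compile(r"^([0-9a-fA-F]+) <([^>]+)>:\s*$")


def to_offset(addr, base=LOAD_BASE):
    """Convert a runtime address into an offset within libpx4.so."""
    return addr - base


def enclosing_symbol(disassembly_lines, offset):
    """Return (symbol, start) of the closest function header at or below offset.

    Assumes the listing's addresses start at 0, so offsets map directly.
    Returns None if no header precedes the offset.
    """
    best = None
    for line in disassembly_lines:
        m = HEADER.match(line)
        if m:
            start = int(m.group(1), 16)
            if start <= offset and (best is None or start >= best[1]):
                best = (m.group(2), start)
    return best


def locate_in_file(path, addresses=(("PC", PC), ("LR", LR))):
    """Print the enclosing symbol for each runtime address, using a .dis file."""
    with open(path) as f:
        lines = f.read().splitlines()
    for name, addr in addresses:
        off = to_offset(addr)
        print(name, hex(off), enclosing_symbol(lines, off))


print(hex(to_offset(PC)), hex(to_offset(LR)))  # 0xdc160 0xfd784
```

On a host with the disassembly, `locate_in_file("dsp-image.dis")` prints the nearest function for the PC and LR; the exact instruction can then be found by searching the file for the offset (e.g. `dc160:`).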

              • Eric Katzfey, ModalAI Team @Rowan Dempster

                @Rowan-Dempster The new package is here: http://voxl-packages.modalai.com/dists/qrb5165/dev/binary-arm64/modalai-slpi_1.1.20-202504131441_arm64.deb

                • Rowan Dempster @Eric Katzfey

                  @Eric-Katzfey Thank you for the awesome debugging tools! We are looking into narrowing down the crash using them today.

                  We also noticed that going from modalai-slpi_1.1.19_arm64.deb to modalai-slpi_1.1.20-202504131441_arm64.deb decreased the CPU utilization reported by mini-dm by 10 percentage points (36% to 26%). Is that expected? What changed from 1.1.19 to 1.1.20? Which should we be using in production?

                  Thank you!

                  • Eric Katzfey, ModalAI Team @Rowan Dempster

                    @Rowan-Dempster First of all, the CPU report for the DSP is only an estimate. It's really tough to get a good CPU estimate due to the way it "sleeps" and its parallel architecture. That said, I have seen 10% differences between builds before, and I don't know what causes that yet, so I wouldn't be too worried about it. v1.1.20 only adds that debug print during startup, so it shouldn't carry any extra risk compared to moving to v1.1.19.

                    • Rowan Dempster @Eric Katzfey

                      @Eric-Katzfey The objdump debugging method was super helpful, thank you! We tracked down the null dereference within 30 minutes with these new tools, something that would have taken weeks, or never been solved at all, without them.

                      • Eric Katzfey, ModalAI Team @Rowan Dempster

                        @Rowan-Dempster No problem! Kind of primitive but effective 🙂
