Possible bug in libmodal_pipe server.c
-
I was looking through the libmodal_pipe code because we have some issues with an error
ERROR in pipe_server_write_to_client, client_id should be between 0 & 15
and I'm not sure what is causing this. We are restarting voxl programs often and I think something goes wrong with connecting/disconnecting pipes but I haven't figured it out yet.But when I looked through the code I saw that here all file descriptors are closed: https://gitlab.com/voxl-public/voxl-sdk/core-libs/libmodal-pipe/-/blob/master/library/src/server.c?ref_type=heads#L1611. But a few lines above that (https://gitlab.com/voxl-public/voxl-sdk/core-libs/libmodal-pipe/-/blob/master/library/src/server.c?ref_type=heads#L1568),
c[ch].n_clients
is set to0
so I think the code at line 1611 isn't doing anything. I don't know if it gives problems and it is probably not related to my problem but I wanted to point it out to you. -
Good find. Likely the file descriptors were being closed in the following step when they are removed from the file system which is why this hasn't presented itself as a problem. However you are right that the loop at line 1611 was not doing anything. I made the following change to address this:
Let me know what you find with regards to the client_id error and if there are any other issues in libmodal_pipe you come across.
Best,
James -
@James-Strawson I still haven't found the root cause but I wanted to share some of my findings. Maybe you can help my understanding. A little bit of background: we don't want to power off the drone but when we are not flying we go into a sleep state where we shut down some of the voxl programs and our own programs. Then when we wake up we start up the programs and everything should work again. It is after about 6 of these cycles where we have this issue (so it also takes a while to reproduce).
Here is some logging of where the problem just kicked in:
Aug 19 20:15:14 QUAD voxl-vision-hub[1524]: write to ch: 0 id: 14 result: -1 errno: 32 Aug 19 20:15:14 QUAD voxl-vision-hub[1524]: write error: Broken pipe Aug 19 20:15:14 QUAD voxl-vision-hub[1524]: previous client state was 2 Aug 19 20:15:14 QUAD voxl-vision-hub[1524]: Client voxl_pipe_handler-232508 (id 14) disconnected from channel 0 Aug 19 20:15:14 QUAD voxl-vision-hub[1524]: write to ch: 0 id: 15 result: -1 errno: 32 Aug 19 20:15:14 QUAD voxl-vision-hub[1524]: write error: Broken pipe Aug 19 20:15:14 QUAD voxl-vision-hub[1524]: previous client state was 1 Aug 19 20:15:14 QUAD voxl-vision-hub[1524]: Client voxl_pipe_handler695511 (id 15) disconnected from channel 0 Aug 19 20:15:14 QUAD voxl-vision-hub[1524]: ERROR in pipe_server_write_to_client, client_id should be between 0 & 15 Aug 19 20:15:14 QUAD voxl-vision-hub[1524]: write to ch: 1 id: 14 result: -1 errno: 32 Aug 19 20:15:14 QUAD voxl-vision-hub[1524]: write error: Broken pipe Aug 19 20:15:14 QUAD voxl-vision-hub[1524]: previous client state was 2 Aug 19 20:15:14 QUAD voxl-vision-hub[1524]: Client voxl_pipe_handler672850 (id 14) disconnected from channel 1 Aug 19 20:15:14 QUAD voxl-vision-hub[1524]: write to ch: 1 id: 15 result: -1 errno: 32 Aug 19 20:15:14 QUAD voxl-vision-hub[1524]: write error: Broken pipe Aug 19 20:15:14 QUAD voxl-vision-hub[1524]: previous client state was 1 Aug 19 20:15:14 QUAD voxl-vision-hub[1524]: Client voxl_pipe_handler507424 (id 15) disconnected from channel 1 Aug 19 20:15:14 QUAD voxl-vision-hub[1524]: ERROR in pipe_server_write_to_client, client_id should be between 0 & 15 Aug 19 20:15:14 QUAD voxl-vision-hub[1524]: ERROR in pipe_server_write_to_client, client_id should be between 0 & 15 Aug 19 20:15:14 QUAD voxl-vision-hub[1524]: ERROR in pipe_server_write_to_client, client_id should be between 0 & 15 Aug 19 20:15:14 QUAD voxl-vision-hub[1524]: ERROR in pipe_server_write_to_client, client_id should be between 0 & 15
voxl_pipe_handler
is our own program. When this was logged we weren't properly closing the pipe when our program was terminated so that could explain the write errors. But after handling that correctly, we still have the issue ofclient_id should be between 0 & 15
.A few questions. The id behind the client name seems to be a random number although the commit which adds the random number is after SDK-1.0.0. (https://gitlab.com/voxl-public/voxl-sdk/core-libs/libmodal-pipe/-/commit/b5fc28c9fc41184e0bbeee09c7f1867f9dbc1121). How is that possible? Did you deploy it well? This is the version we use:
QUAD:~$ apt show voxl-vision-hub Package: voxl-vision-hub Version: 1.6.6 Priority: optional Section: base Maintainer: James Strawson <james@modalai.com> Installed-Size: unknown Provides: voxl-vision-px4 Depends: librc-math,libmodal-pipe(>=2.4.0),libmodal-json,voxl-mpa-tools(>=0.2.5),voxl-mavlink-server,libmodal-cv(>=0.3.1) Conflicts: voxl-vision-px4,voxl-mavlink-server(<<1.0.0) Replaces: voxl-vision-px4 Download-Size: 88.4 kB APT-Manual-Installed: yes APT-Sources: file:/data/voxl-suite-offline-packages ./ Packages Description: main hub managing communication between VOXL MPA services and autopilots
In https://gitlab.com/voxl-public/voxl-sdk/core-libs/libmodal-pipe/-/blob/master/library/src/misc.c?ref_type=heads#L168, I think you're taking the modulo of a possible negative number. The modulo can then also be a negative number (https://stackoverflow.com/questions/7594508/modulo-operator-with-negative-values) so your result is not within the specified range. See also the logging above containing a negative number.
I think that reconnecting a pipe with the same name makes that a new client id is used, possibly due to the random number in the name and then no two client names will be the same. I also saw that you changed something in that part of the code in this commit: https://gitlab.com/voxl-public/voxl-sdk/core-libs/libmodal-pipe/-/commit/773de63cb73000c08df1b04bc4e2f1e78e700816 Do you think that that could be the case? I will try to test that.
-
@James-Strawson Any comment on this?
-
Good find on the negative number. I just pushed a fix for that to the dev branch.
The random number feature was added after the last stable SDK release (1.0.0) and is therefore not considered finished or stable until it's included with the SDK 1.0.1 release. So far we have been using it internally in our nightly builds and also the one customer who requested it had included it in their nightly tests without issues since it was tagged v2.9.1
I've never seen pipe_server_write_to_client throw that error and I don't know how to reproduce it. Are you calling that function manually or are you using the higher level pipe_server_write() family of functions? I also just pushed to the dev branch a commit with a couple more helpful warning prints for when it does occur.
reconnecting a pipe with a new random number does not increase the client id unless the old client remains in a connected state. Since you see the message "Client voxl_pipe_handler507424 (id 15) disconnected from channel 1" this means the client ID is being flagged as disconnected and that particular client id is now available for reuse.
Here are a couple comments in the client connection logic to help explain:
https://gitlab.com/voxl-public/voxl-sdk/core-libs/libmodal-pipe/-/blob/dev/library/src/server.c?ref_type=heads#L660
https://gitlab.com/voxl-public/voxl-sdk/core-libs/libmodal-pipe/-/blob/dev/library/src/server.c?ref_type=heads#L788Hopefully this helps you track down the issue. Let me know what you find.
Best,
James -
@James-Strawson Thanks for your reply! It made me realise what the error must be. We use a locally cloned repository for libmodalpipe because we want to build our software directly on the drone. This is then used in the build and link process of our software. This happened to be on the 'master' branch and not the 'SDK-1.0.0' branch. So the pipe clients use a different version than the pipe servers which is likely the cause of our errors.