docker fills up /data, prevents code from running
-
I have been doing development with Docker containers for some time on a particular VOXL. On occasion, I will save the Docker image. I believe this has slowly come to fill up the entire disk space on `/data`.
If I check the disk utilization on `/data`, it shows that there is plenty of disk space:

```
yocto:/data$ df -h /data
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda9        15G  8.1G  6.5G  56% /data
```
However, if I try to create a single file in `/data`, the kernel tells me there is no disk space left:

```
yocto:/data$ touch test.txt
touch: cannot touch 'test.txt': No space left on device
```
These two commands present contradictory information about the state of the disk space remaining. Since I cannot create a single file, I believe the latter command is the one that is reflective of reality.
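In hindsight, these two numbers measure different things: `df -h` reports data blocks, while creating a file can also fail when the filesystem has run out of inodes. As a minimal sketch (a hypothetical helper, not part of any VOXL tooling), both can be checked at once:

```shell
#!/bin/sh
# Report both block usage and inode usage for the filesystem holding a path.
# "No space left on device" can mean either blocks OR inodes are exhausted;
# `df -h` only shows the former. (Assumes a df that supports -P and -i,
# e.g. GNU coreutils or a typical busybox build.)
check_fs() {
    mountpoint="$1"
    blocks=$(df -P "$mountpoint" | awk 'NR==2 {print $5}')
    inodes=$(df -Pi "$mountpoint" | awk 'NR==2 {print $5}')
    echo "blocks used: $blocks, inodes used: $inodes"
}

# e.g. on the VOXL: check_fs /data
check_fs /
```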
Before this occurred, I realized that I was running out of disk space and attempted to clean up old Docker images. I tried commands like `docker system prune` or `docker image prune`, but the version of Docker installed on the VOXL does not support those commands. Instead, I used `docker image rm <name>` to remove everything but two images. While that cleared up what was displayed by `docker images`, it did not appear to affect the disk utilization.

It appears that Docker preserves images and containers in `/data/overlay` (the directory names match the image IDs and container IDs in Docker). After running the above commands, the corresponding directories were not removed from `/data/overlay`. I am afraid to remove them manually because of my (very limited) understanding of how Docker works: images are built on top of other images. I worry that if I delete a particular directory, it could break the entire process used to build/start the latest Docker image I am using.

I currently believe that Docker is somehow masking/hiding how much disk space is actually being used (or the kernel is unaware of disk space that Docker is no longer using). Either way, I don't know how to fix it.
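For inspecting (without deleting anything) which layer directories are the heavy ones, a read-only `du`-based sketch like this might help; it assumes a flat layout of layer directories under the overlay root, as in `/data/overlay` here:

```shell
#!/bin/sh
# List the largest layer directories under an overlay root by disk usage.
# Read-only: safe to run while deciding what might be worth cleaning up.
# Assumes a flat layout of layer directories under the given root.
top_layers() {
    dir="${1:-/data/overlay}"
    # du -sk gives per-directory usage in KiB; sort numerically, largest last
    du -sk "$dir"/* 2>/dev/null | sort -n | tail -10
}

# e.g. on the VOXL: top_layers /data/overlay
```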
At this point, when I turn on the VOXL, the docker-daemon will not stay running. If I try to restart it with `systemctl restart docker-daemon`, it dies after about 15 seconds. I am hoping to get it to run so I can save off my latest code. After that, I'd be fine nuking `/data` and starting over. (If there is a better way to go about this, I'm all ears!)

For completeness, I can use `journalctl -u docker-daemon` to see the error messages:

```
Oct 25 22:45:49 apq8096 systemd[1]: Started docker service for VOXL.
Oct 25 22:45:49 apq8096 docker-prepare.sh[4628]: preparing docker with docker-prepare.sh
Oct 25 22:45:49 apq8096 docker-prepare.sh[4628]: this may take a few seconds
Oct 25 22:45:50 apq8096 docker[4627]: time="2021-10-25T22:45:50.015279000Z" level=info msg="API listen on /var/run/docker.sock"
Oct 25 22:45:50 apq8096 docker[4627]: time="2021-10-25T22:45:50.025349000Z" level=info msg="[graphdriver] using prior storage driver \"overlay\""
Oct 25 22:45:50 apq8096 docker[4627]: time="2021-10-25T22:45:50.143791000Z" level=info msg="Firewalld running: false"
Oct 25 22:45:50 apq8096 docker[4627]: time="2021-10-25T22:45:50.282820000Z" level=info msg="Default bridge (docker0) is assigned with an IP address 172.17.0.1/16. Daemon option --bip can be used to set a preferred IP address"
Oct 25 22:45:50 apq8096 docker[4627]: time="2021-10-25T22:45:50.501537000Z" level=fatal msg="Error starting daemon: unable to open database file"
Oct 25 22:45:50 apq8096 systemd[1]: docker-daemon.service: Main process exited, code=exited, status=1/FAILURE
Oct 25 22:46:06 apq8096 docker-prepare.sh[4628]: docker-prepare: failed to see cpuset appear after 15 seconds
Oct 25 22:46:06 apq8096 systemd[1]: docker-daemon.service: Control process exited, code=exited status=1
Oct 25 22:46:06 apq8096 systemd[1]: docker-daemon.service: Unit entered failed state.
Oct 25 22:46:06 apq8096 systemd[1]: docker-daemon.service: Failed with result 'exit-code'.
```
I looked in `/etc/systemd/system/docker-daemon.service` to learn that this is the command issued at startup: `/usr/bin/docker daemon -g /data`. When I run that command, I get this error:

```
yocto:/data$ /usr/bin/docker daemon -g /data
INFO[0000] API listen on /var/run/docker.sock
INFO[0000] [graphdriver] using prior storage driver "overlay"
INFO[0000] Firewalld running: false
INFO[0000] Default bridge (docker0) is assigned with an IP address 172.17.0.1/16. Daemon option --bip can be used to set a preferred IP address
FATA[0000] Error starting daemon: unable to open database file
```
At some point earlier in my investigation (before I found the disk was full), I found someone in another ModalAI forum post with the same error message. They used this one-liner to fix things:

```
rm /data/network/files/local-kv.db
```

I don't know if this is somehow related to the issue at hand.
Any guidance on how to recover from this situation would be appreciated.
Any guidance on the best way to manage `/data` and keep track of its real disk utilization would be helpful too!
-
There is one other thing I did that is noteworthy. I reflashed the base image with 3.3 and installed the voxl-suite. In this process, I selected the option that left /data intact.
-
Can you show the output of `# df -i /data`?
-
Here you are:

```
yocto:/$ df -i /data
Filesystem     Inodes  IUsed  IFree IUse% Mounted on
/dev/sda9      977280 977280      0  100% /data
```
-
Here is some more info. It looks like Docker is using all the inodes:

```
yocto:/data$ sudo find . -xdev -type f | cut -d "/" -f 2 | sort | uniq -c | sort -n
      1 adb_devid
      1 db
      1 dhcpcd-wlan0.info
      1 dnsmasq.conf
      1 dnsmasq_d.leases
      1 l2tp_cfg.xml
      1 linkgraph.db
      1 mobileap_cfg.xml
      1 modalai
      1 network
      1 repositories-overlay
      2 usb
      2 web_root
      4 persist
      7 iproute2
     20 misc
     53 containers
    189 graph
3265231 overlay
```
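The pipeline above can be wrapped into a small reusable helper (hypothetical, not part of the VOXL image) that counts regular files per top-level subdirectory as a proxy for inode usage:

```shell
#!/bin/sh
# Count regular files under each top-level subdirectory of a path -- a proxy
# for inode usage, same idea as the find|cut|uniq pipeline above.
# -xdev keeps the scan on a single filesystem. (Hypothetical helper.)
inode_breakdown() {
    root="$1"
    # run in a subshell so the caller's working directory is untouched
    (cd "$root" && find . -xdev -type f | cut -d "/" -f 2 | sort | uniq -c | sort -n)
}

# e.g. on the VOXL: inode_breakdown /data
```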
-
System image 3.3.0 increases the number of inodes on `/data` to 3M. However, that may require that `/data` is completely wiped when you install it.
-
I believe I already had system image 3.3 installed on my VOXL. (The number of inodes dedicated to `overlay` is around the 3M mark.) Earlier, when I said I reflashed the base image with 3.3, it was an effort to rule out the system image being corrupt.

Is there some way to clean up all the inodes being used by Docker via the command line? (The docker-daemon still dies on bootup/restart, so I can't use docker commands.)
-
I found a partial solution. I deleted some unused, empty directories in `/data` (e.g., `audio`). This freed up four inodes.

I was able to run `systemctl restart docker-daemon`, and it worked for just a few seconds before crashing. In that time, I was able to run `docker ps -a` and see the list of containers. I tried to `docker start <my_container>`, but the docker-daemon had already crashed.

I went into `/data/containers` and deleted the `hello world` container. I re-ran `systemctl restart docker-daemon` and was able to run `docker rm <container_name>` for some of the unused containers. Doing this a few times freed up ~200 inodes. The docker-daemon was then able to run without crashing.

At that point, I was able to start `<my_container>` and enter it. I was able to push my code to my git repo. This is what I set out to do.

Now that I have my data saved, I could nuke the `/data/overlay` directory and reclaim most of my inodes. I'd prefer a better way to keep the `/data/overlay` directory clean (perhaps as a part of regular maintenance). This would be preferable to nuking the entire directory from time to time as it fills up. If ModalAI knows a good way to do this, please share!

(And thanks for the pointer about the inodes. I didn't consider that that was what was happening.)
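As a stopgap for that regular maintenance, something like the sketch below might work with the old CLI. I haven't verified every flag against Docker 1.9, so the `status=exited` and `dangling=true` filters should be checked against `docker ps --help` and `docker images --help` on the VOXL first:

```shell
#!/bin/sh
# Periodic cleanup using only old-style Docker CLI commands (no
# `docker system prune`, which Docker 1.9 lacks). Verify the filter flags
# against the Docker build on the VOXL before running.
docker_cleanup() {
    # remove stopped containers
    exited=$(docker ps -a -q -f status=exited)
    [ -n "$exited" ] && docker rm $exited
    # remove dangling (untagged) image layers
    dangling=$(docker images -q -f dangling=true)
    [ -n "$dangling" ] && docker rmi $dangling
    return 0
}
```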
-
@Eric-Katzfey, let me ask a follow-up question.

Currently, the Docker version on the VOXL is v1.9. This version doesn't support commands like `docker system prune` or `docker image prune`.

Does ModalAI plan to update the Docker version used on VOXL? Is it possible to upgrade it myself? Or are there reasons why ModalAI is still using v1.9? (A Google search told me the latest version of Docker is 20.10.)
-
We use Yocto to build the system image. It is a fairly old version (Jethro), and that really limits what we can do as far as upgrades. When you try to upgrade a single component, you usually have to upgrade its dependencies, and then the dependencies of those dependencies, etc. So upgrading the Docker version is likely a very large task and not something we are planning to do any time soon.
-
Good to know -- thanks, Eric!