docker fills up /data, prevents code from running
-
I have been doing development with Docker containers for some time on a particular VOXL. On occasion, I will save the Docker image. I believe this has slowly come to fill up the entire disk space on `/data`.
If I check the disk utilization on `/data`, it shows that there is plenty of disk space:

```
yocto:/data$ df -h /data
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda9        15G  8.1G  6.5G  56% /data
```
However, if I try to create a single file in `/data`, the kernel tells me there is no disk space left:

```
yocto:/data$ touch test.txt
touch: cannot touch 'test.txt': No space left on device
```
These two commands present contradictory information about the state of the disk space remaining. Since I cannot create a single file, I believe the latter command is the one that is reflective of reality.
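In hindsight, these two numbers measure different things: `df -h` reports data blocks, while creating a file can also fail when the filesystem has run out of inodes. As a minimal sketch (a hypothetical helper, not part of any VOXL tooling), both can be checked at once:

```shell
#!/bin/sh
# Report both block usage and inode usage for the filesystem holding a path.
# "No space left on device" can mean either blocks OR inodes are exhausted;
# `df -h` only shows the former. (Assumes a df that supports -P and -i,
# e.g. GNU coreutils or a typical busybox build.)
check_fs() {
    mountpoint="$1"
    blocks=$(df -P "$mountpoint" | awk 'NR==2 {print $5}')
    inodes=$(df -Pi "$mountpoint" | awk 'NR==2 {print $5}')
    echo "blocks used: $blocks, inodes used: $inodes"
}

# e.g. on the VOXL: check_fs /data
check_fs /
```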
Before this occurred, I realized that I was running out of disk space and attempted to clean up old Docker images. I tried commands like `docker system prune` or `docker image prune`, but the version of Docker installed on the VOXL does not support those commands. Instead, I used `docker image rm <name>` to remove everything but two images. While that cleared up what was displayed by `docker images`, it did not appear to affect the disk utilization.

It appears that Docker preserves images and containers in `/data/overlay` (the directory names match the image IDs and container IDs in Docker). After running the above commands, the corresponding directories were not removed from `/data/overlay`. I am afraid to remove them manually because of my (very limited) understanding of how Docker works: images are built on top of other images. I worry that if I delete a particular directory, it could break the entire process used to build/start the latest Docker image I am using.

I currently believe that Docker is somehow masking/hiding how much disk space is actually being used (or the kernel is unaware of disk space that Docker is no longer using). Either way, I don't know how to fix it.
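For inspecting (without deleting anything) which layer directories are the heavy ones, a read-only `du`-based sketch like this might help; it assumes a flat layout of layer directories under the overlay root, as in `/data/overlay` here:

```shell
#!/bin/sh
# List the largest layer directories under an overlay root by disk usage.
# Read-only: safe to run while deciding what might be worth cleaning up.
# Assumes a flat layout of layer directories under the given root.
top_layers() {
    dir="${1:-/data/overlay}"
    # du -sk gives per-directory usage in KiB; sort numerically, largest last
    du -sk "$dir"/* 2>/dev/null | sort -n | tail -10
}

# e.g. on the VOXL: top_layers /data/overlay
```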
At this point, when I turn on the VOXL, the docker-daemon will not stay running. If I try to restart it with `systemctl restart docker-daemon`, it dies after about 15 seconds. I am hoping to get it to run so I can save off my latest code. After that, I'd be fine nuking `/data` and starting over. (If there is a better way to go about this, I'm all ears!)

For completeness, I can use `journalctl -u docker-daemon` to see the error messages:

```
Oct 25 22:45:49 apq8096 systemd[1]: Started docker service for VOXL.
Oct 25 22:45:49 apq8096 docker-prepare.sh[4628]: preparing docker with docker-prepare.sh
Oct 25 22:45:49 apq8096 docker-prepare.sh[4628]: this may take a few seconds
Oct 25 22:45:50 apq8096 docker[4627]: time="2021-10-25T22:45:50.015279000Z" level=info msg="API listen on /var/run/docker.sock"
Oct 25 22:45:50 apq8096 docker[4627]: time="2021-10-25T22:45:50.025349000Z" level=info msg="[graphdriver] using prior storage driver \"overlay\""
Oct 25 22:45:50 apq8096 docker[4627]: time="2021-10-25T22:45:50.143791000Z" level=info msg="Firewalld running: false"
Oct 25 22:45:50 apq8096 docker[4627]: time="2021-10-25T22:45:50.282820000Z" level=info msg="Default bridge (docker0) is assigned with an IP address 172.17.0.1/16. Daemon option --bip can be used to set a preferred IP address"
Oct 25 22:45:50 apq8096 docker[4627]: time="2021-10-25T22:45:50.501537000Z" level=fatal msg="Error starting daemon: unable to open database file"
Oct 25 22:45:50 apq8096 systemd[1]: docker-daemon.service: Main process exited, code=exited, status=1/FAILURE
Oct 25 22:46:06 apq8096 docker-prepare.sh[4628]: docker-prepare: failed to see cpuset appear after 15 seconds
Oct 25 22:46:06 apq8096 systemd[1]: docker-daemon.service: Control process exited, code=exited status=1
Oct 25 22:46:06 apq8096 systemd[1]: docker-daemon.service: Unit entered failed state.
Oct 25 22:46:06 apq8096 systemd[1]: docker-daemon.service: Failed with result 'exit-code'.
```
I looked in `/etc/systemd/system/docker-daemon.service` to learn that this is the command issued at startup: `/usr/bin/docker daemon -g /data`. When I run that command, I get this error:

```
yocto:/data$ /usr/bin/docker daemon -g /data
INFO[0000] API listen on /var/run/docker.sock
INFO[0000] [graphdriver] using prior storage driver "overlay"
INFO[0000] Firewalld running: false
INFO[0000] Default bridge (docker0) is assigned with an IP address 172.17.0.1/16. Daemon option --bip can be used to set a preferred IP address
FATA[0000] Error starting daemon: unable to open database file
```
At some point earlier in my investigation (before I found the disk was full), I found someone in another ModalAI forum post with the same error message. They used this one-liner to fix things:

```
rm /data/network/files/local-kv.db
```

I don't know if this is somehow related to the issue at hand.
Any guidance on how to recover from this situation would be appreciated.
Any guidance on the best way to manage `/data` and keep track of its real disk utilization would be helpful too!
-
There is one other thing I did that is noteworthy. I reflashed the base image with 3.3 and installed the voxl-suite. In this process, I selected the option that left /data intact.
-
Can you show the output of `# df -i /data`?
-
Here you are:

```
yocto:/$ df -i /data
Filesystem     Inodes  IUsed  IFree IUse% Mounted on
/dev/sda9      977280 977280      0  100% /data
```
-
Here is some more info. It looks like Docker is using all the inodes:

```
yocto:/data$ sudo find . -xdev -type f | cut -d "/" -f 2 | sort | uniq -c | sort -n
      1 adb_devid
      1 db
      1 dhcpcd-wlan0.info
      1 dnsmasq.conf
      1 dnsmasq_d.leases
      1 l2tp_cfg.xml
      1 linkgraph.db
      1 mobileap_cfg.xml
      1 modalai
      1 network
      1 repositories-overlay
      2 usb
      2 web_root
      4 persist
      7 iproute2
     20 misc
     53 containers
    189 graph
3265231 overlay
```
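The pipeline above can be wrapped into a small reusable helper (hypothetical, not part of the VOXL image) that counts regular files per top-level subdirectory as a proxy for inode usage:

```shell
#!/bin/sh
# Count regular files under each top-level subdirectory of a path -- a proxy
# for inode usage, same idea as the find|cut|uniq pipeline above.
# -xdev keeps the scan on a single filesystem. (Hypothetical helper.)
inode_breakdown() {
    root="$1"
    # run in a subshell so the caller's working directory is untouched
    (cd "$root" && find . -xdev -type f | cut -d "/" -f 2 | sort | uniq -c | sort -n)
}

# e.g. on the VOXL: inode_breakdown /data
```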
-
System image 3.3.0 increases the number of inodes on `/data` to 3M. However, that may require that `/data` is completely wiped when you install it.
-
I believe I already had system image 3.3 installed on my VOXL. (The number of inodes dedicated to `overlay` is around the 3M mark.) Earlier, when I said I reflashed the base image with 3.3, it was an effort to rule out the system image being corrupt.

Is there some way to clean up all the inodes being used by Docker via the command line? (The docker-daemon still dies on bootup/restart, so I can't use docker commands.)
-
I found a partial solution. I deleted some unused, empty directories in `/data` (e.g., `audio`). This freed up four inodes.

I was able to run `systemctl restart docker-daemon`, and it worked for just a few seconds before crashing. In that time, I was able to run `docker ps -a` and see the list of containers. I tried to `docker start <my_container>`, but the docker-daemon had already crashed.

I went into `/data/containers` and deleted the `hello world` container. I re-ran `systemctl restart docker-daemon` and was able to run `docker rm <container_name>` for some of the unused containers. Doing this a few times freed up ~200 inodes. The docker-daemon was then able to run without crashing.

At that point, I was able to start `<my_container>` and enter it. I was able to push my code to my git repo. This is what I set out to do.

Now that I have my data saved, I could nuke the `/data/overlay` directory and reclaim most of my inodes. I'd prefer a better way to keep the `/data/overlay` directory clean (perhaps as a part of regular maintenance). This would be preferable to nuking the entire directory from time to time as it fills up. If ModalAI knows a good way to do this, please share!

(And thanks for the pointer about the inodes. I didn't consider that that was what was happening.)
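As a stopgap for that regular maintenance, something like the sketch below might work with the old CLI. I haven't verified every flag against Docker 1.9, so the `status=exited` and `dangling=true` filters should be checked against `docker ps --help` and `docker images --help` on the VOXL first:

```shell
#!/bin/sh
# Periodic cleanup using only old-style Docker CLI commands (no
# `docker system prune`, which Docker 1.9 lacks). Verify the filter flags
# against the Docker build on the VOXL before running.
docker_cleanup() {
    # remove stopped containers
    exited=$(docker ps -a -q -f status=exited)
    [ -n "$exited" ] && docker rm $exited
    # remove dangling (untagged) image layers
    dangling=$(docker images -q -f dangling=true)
    [ -n "$dangling" ] && docker rmi $dangling
    return 0
}
```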
-
@Eric-Katzfey, let me ask a follow-up question.

Currently, the Docker version on the VOXL is v1.9. This version doesn't support commands like `docker system prune` or `docker image prune`.

Does ModalAI plan to update the Docker version used on VOXL? Is it possible to upgrade it myself? Or are there reasons why ModalAI is still using v1.9? (A Google search told me the latest version of Docker is 20.10.)
-
We use Yocto to build the system image. It is a fairly old version (Jethro), and that really limits what we can do as far as upgrades. When you try to upgrade a single component, you usually have to upgrade its dependencies, and then the dependencies of those dependencies, etc. So upgrading the Docker version is likely a very large task and not something we are planning to do any time soon.
-
Good to know -- thanks, Eric!