Reproduce DPKG Errors
I really just needed to know how to find the error
So here is the initial error that I was facing in Ansible that would only pop up every now and then:
TASK [Update all packages on the system] ***************************************
22:43:29 fatal: [localhost]: FAILED! => {
22:43:29 "changed": false,
22:43:29 "rc": 100
22:43:29 }
22:43:29
22:43:29 MSG:
22:43:29
22:43:29 '/usr/bin/apt-get upgrade --with-new-pkgs ' failed: E: Could not get lock /var/lib/dpkg/lock-frontend - open (11: Resource temporarily unavailable)
22:43:29 E: Unable to acquire the dpkg frontend lock (/var/lib/dpkg/lock-frontend), is another process using it?
Which is fairly innocuous on its own. There should be some type of way to check for this, rescue it, etc. But the problem that I would be running into is “how do I access this error?” This is usually solved by reproducing the issue and looking at the verbose output to check which variable needs to be registered and used. This is where the trouble started…
Unable to Fail Task Successfully
So, if we just run the offending auto update command manually, that’ll create the lock, right?
# /bin/sh /usr/lib/apt/apt.systemd.daily update
But of course, it wasn’t that easy. That took too short of a time to run, and even if I kicked off ansible manually at the exact same time, I couldn’t reproduce the error.
File Locks
So I went and tried to just acquire a file lock on that file descriptor. I tried holding it open manually, and even calling flock()
on it:
# flock /var/lib/dpkg/lock-frontend sleep 200
But apt still completed an install successfully. Umm… what?
As I see,
dpkg
locks the file/var/lib/dpkg/lock
usinglockf(3)
, which in turn usesfcntl(2)
.This means that you cannot lock the file from the shell using
flock(1)
, because this callsflock(2)
, which in many systems doesn’t interact withfnctl
locks.You can however lock the file with the
with-lock-ex
program […] which acquire[s] a compatible lock.https://unix.stackexchange.com/questions/71716/how-var-lib-dpkg-lock-works
So that’s good news. Let’s try it that way:
# apt install chiark-utils-bin
# with-lock-ex -f /var/lib/dpkg/lock-frontend /bin/sleep 200
At that point, the following ansible command fails successfully!
ansible all -i localhost, -bm apt -a "name='*' state=latest update_cache=True autoremove=True"
Now, working on fixing the issue…
To rescue or proactively check
Historically, I know that this lock disappears within a period of time. So now we have to determine how to handle this eventuality.
There are typically two ways to do this. We can check on the state of the machine directly before we run the upgrade, or we can try to run the upgrade (asking forgiveness rather than permission) and handle the error if it pops up. Each is equally as valuable, and will require touching a lot of things and going through different hoops to get this done.
While trying to run the upgrade just as-is feels better on the front end, it also suffers from having to re-code that same apt update, or write conditionals that retry only on that specific error. Luckily, it looks like with-lock-ex
has given us a way out…
# ansible all -i localhost, -ba "with-lock-ex -f /var/lib/dpkg/lock-frontend /bin/sleep 1"
localhost | FAILED | rc=255 >>
with-lock-ex sleep: could not acquire lock: Resource temporarily unavailablenon-zero return code
So when we cannot acquire the lock, with-lock-ex
will conveniently fail for us. Ansible has a fairly eloquent way to retry this until this particular command succeeds:
# cat playbooks/check_filelock.yml
- hosts: all
gather_facts: False
tasks:
- name: Keep trying to get the filelock until we can
register: filelock
shell: "with-lock-ex -f /var/lib/dpkg/lock-frontend sleep 1"
until: filelock['rc'] == 0
retries: 60
delay: 5
# ansible-playbook -i localhost, playbooks/check_filelock.yml
Using /home/andrewcz/Projects/play-compositional/ansible.cfg as config file
PLAY [all] ***********************************************************************************************************************************************************************************************************************************
TASK [Keep trying to get the filelock until we can] ******************************************************************************************************************************************************************************************
FAILED - RETRYING: Keep trying to get the filelock until we can (60 retries left).
FAILED - RETRYING: Keep trying to get the filelock until we can (59 retries left).
FAILED - RETRYING: Keep trying to get the filelock until we can (58 retries left).
FAILED - RETRYING: Keep trying to get the filelock until we can (57 retries left).
FAILED - RETRYING: Keep trying to get the filelock until we can (56 retries left).
FAILED - RETRYING: Keep trying to get the filelock until we can (55 retries left).
FAILED - RETRYING: Keep trying to get the filelock until we can (54 retries left).
changed: [localhost] => {
"attempts": 8,
"changed": true,
"cmd": "with-lock-ex -f /var/lib/dpkg/lock-frontend sleep 1",
"delta": "0:00:01.006632",
"end": "2021-03-07 05:46:09.237544",
"rc": 0,
"start": "2021-03-07 05:46:08.230912"
}
PLAY RECAP ***********************************************************************************************************************************************************************************************************************************
localhost : ok=1 changed=1 unreachable=0 failed=0 skipped=0 rescued=0 ignored=0
This will obviously time out after 5 minutes if we’re unable to acquire that lock. And since ansible (to my knowledge) does not have decorators or any way to have “all apt commands run this one task first”, then we’ll just put this task before any of the apt commands in hopes that we catch this error whenever it pops up.
Luckily, this will allow me to get rid of what I had been doing, which was to simply waiting 120 seconds before that task and just hoping that did the trick. That’s what we call, “code to be ashamed of.”
Initial Config
But what if we don’t have that program installed on that server? Can we install a program with a package manager to help us check if we’re able to run the package manager successfully?
Of course not, that’s stupid.
But, in a very real sense, yes we can:
- name: Install with-lock-ex package
apt:
name: chiark-utils-bin
state: latest
update_cache: True
autoremove: True
register: restore_install_with_lock_first
# Here we're also saying to continue on if we're facing a 'could not get lock'
# error, since we can handle that later.
failed_when:
- restore_install_with_lock_first['msg'] is defined
- not restore_install_with_lock_first['msg'] | regex_search('Failed to lock apt|Could not get lock')
- name: Install with-lock-ex package
apt:
name: chiark-utils-bin
state: latest
update_cache: True
autoremove: True
register: restore_install_with_lock_second
retries: 60
delay: 5
until:
- restore_install_with_lock_second['msg'] is not defined
when:
- restore_install_with_lock_first['msg'] is defined
- restore_install_with_lock_first['msg'] | regex_search('Failed to lock apt|Could not get lock')
This allows us to use the same logic to loop over the install that we did with the initial check. I’m pretty sure one of the conditionals in the second task is superfluous, but we’ll just leave it like that for now.
P.S. This took me 4 hours to figure out after a couple of iterations (look at the git history for this file), so I’m pretty sure it’s right.