My Profile Photo

AndrewCz


Using liberty-minded opensource tools, and using them well


Reproduce DPKG Errors




Every now and then on Ubuntu systems, Ansible runs into an error where it fails to install/upgrade packages because the daily update is occurring. However, you can't reproduce it easily, for a myriad of reasons. Let's find out why...

I really just needed to know how to find the error

So here is the initial error that I was facing in Ansible that would only pop up every now and then:

TASK [Update all packages on the system] ***************************************
22:43:29        fatal: [localhost]: FAILED! => {
22:43:29            "changed": false,
22:43:29            "rc": 100
22:43:29        }
22:43:29        
22:43:29        MSG:
22:43:29        
22:43:29        '/usr/bin/apt-get upgrade --with-new-pkgs ' failed: E: Could not get lock /var/lib/dpkg/lock-frontend - open (11: Resource temporarily unavailable)
22:43:29        E: Unable to acquire the dpkg frontend lock (/var/lib/dpkg/lock-frontend), is another process using it?

Which is fairly innocuous on its own. There should be some type of way to check for this, rescue it, etc. But the problem that I would be running into is “how do I access this error?” This is usually solved by reproducing the issue and looking at the verbose output to check which variable needs to be registered and used. This is where the trouble started…

Unable to Fail Task Successfully

So, if we just run the offending auto update command manually, that’ll create the lock, right?

# /bin/sh /usr/lib/apt/apt.systemd.daily update

But of course, it wasn’t that easy. That took too short of a time to run, and even if I kicked off ansible manually at the exact same time, I couldn’t reproduce the error.

File Locks

So I went and tried to just acquire a file lock on that file descriptor. I tried holding it open manually, and even calling flock() on it:

# flock /var/lib/dpkg/lock-frontend sleep 200

But apt still completed an install successfully. Umm… what?

As I see, dpkg locks the file /var/lib/dpkg/lock using lockf(3), which in turn uses fcntl(2).

This means that you cannot lock the file from the shell using flock(1), because this calls flock(2), which in many systems doesn’t interact with fnctl locks.

You can however lock the file with the with-lock-ex program […] which acquire[s] a compatible lock.

https://unix.stackexchange.com/questions/71716/how-var-lib-dpkg-lock-works

So that’s good news. Let’s try it that way:

# apt install chiark-utils-bin
# with-lock-ex -f /var/lib/dpkg/lock-frontend /bin/sleep 200

At that point, the following ansible command fails successfully!

ansible all -i localhost, -bm apt -a "name='*' state=latest update_cache=True autoremove=True"

Now, working on fixing the issue…

To rescue or proactively check

Historically, I know that this lock disappears within a period of time. So now we have to determine how to handle this eventuality.

There are typically two ways to do this. We can check on the state of the machine directly before we run the upgrade, or we can try to run the upgrade (asking forgiveness rather than permission) and handle the error if it pops up. Each is equally as valuable, and will require touching a lot of things and going through different hoops to get this done.

While trying to run the upgrade just as-is feels better on the front end, it also suffers from having to re-code that same apt update, or write conditionals that retry only on that specific error. Luckily, it looks like with-lock-ex has given us a way out…

# ansible all -i localhost, -ba "with-lock-ex -f /var/lib/dpkg/lock-frontend /bin/sleep 1"
localhost | FAILED | rc=255 >>
with-lock-ex sleep: could not acquire lock: Resource temporarily unavailablenon-zero return code

So when we cannot acquire the lock, with-lock-ex will conveniently fail for us. Ansible has a fairly eloquent way to retry this until this particular command succeeds:

# cat playbooks/check_filelock.yml
- hosts: all
  gather_facts: False
  tasks:
    - name: Keep trying to get the filelock until we can
      register: filelock
      shell: "with-lock-ex -f /var/lib/dpkg/lock-frontend sleep 1"
      until: filelock['rc'] == 0
      retries: 60
      delay: 5

# ansible-playbook -i localhost, playbooks/check_filelock.yml
Using /home/andrewcz/Projects/play-compositional/ansible.cfg as config file

PLAY [all] ***********************************************************************************************************************************************************************************************************************************

TASK [Keep trying to get the filelock until we can] ******************************************************************************************************************************************************************************************
FAILED - RETRYING: Keep trying to get the filelock until we can (60 retries left).
FAILED - RETRYING: Keep trying to get the filelock until we can (59 retries left).
FAILED - RETRYING: Keep trying to get the filelock until we can (58 retries left).
FAILED - RETRYING: Keep trying to get the filelock until we can (57 retries left).
FAILED - RETRYING: Keep trying to get the filelock until we can (56 retries left).
FAILED - RETRYING: Keep trying to get the filelock until we can (55 retries left).
FAILED - RETRYING: Keep trying to get the filelock until we can (54 retries left).
changed: [localhost] => {
    "attempts": 8,
    "changed": true,
    "cmd": "with-lock-ex -f /var/lib/dpkg/lock-frontend sleep 1",
    "delta": "0:00:01.006632",
    "end": "2021-03-07 05:46:09.237544",
    "rc": 0,
    "start": "2021-03-07 05:46:08.230912"
}

PLAY RECAP ***********************************************************************************************************************************************************************************************************************************
localhost            : ok=1    changed=1    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0

This will obviously time out after 5 minutes if we’re unable to acquire that lock. And since ansible (to my knowledge) does not have decorators or any way to have “all apt commands run this one task first”, then we’ll just put this task before any of the apt commands in hopes that we catch this error whenever it pops up.

Luckily, this will allow me to get rid of what I had been doing, which was to simply waiting 120 seconds before that task and just hoping that did the trick. That’s what we call, “code to be ashamed of.”

Initial Config

But what if we don’t have that program installed on that server? Can we install a program with a package manager to help us check if we’re able to run the package manager successfully?

Of course not, that’s stupid.

But, in a very real sense, yes we can:

- name: Install with-lock-ex package
  apt:
    name: chiark-utils-bin
    state: latest
    update_cache: True
    autoremove: True
  register: restore_install_with_lock_first
  # Here we're also saying to continue on if we're facing a 'could not get lock'
  # error, since we can handle that later.
  failed_when:
    - restore_install_with_lock_first['msg'] is defined
    - not restore_install_with_lock_first['msg'] | regex_search('Failed to lock apt|Could not get lock')

- name: Install with-lock-ex package
  apt:
    name: chiark-utils-bin
    state: latest
    update_cache: True
    autoremove: True
  register: restore_install_with_lock_second
  retries: 60
  delay: 5
  until:
    - restore_install_with_lock_second['msg'] is not defined
  when:
    - restore_install_with_lock_first['msg'] is defined
    - restore_install_with_lock_first['msg'] | regex_search('Failed to lock apt|Could not get lock')

This allows us to use the same logic to loop over the install that we did with the initial check. I’m pretty sure one of the conditionals in the second task is superfluous, but we’ll just leave it like that for now.

P.S. This took me 4 hours to figure out after a couple of iterations (look at the git history for this file), so I’m pretty sure it’s right.


References: