Reboot Plugin for Linux in Ansible 2.7
Rebooting Linux systems with Ansible has always been possible, but was often tricky and error-prone. In Ansible 2.7, I am happy to say that rebooting Linux hosts with Ansible is now easier and can be done with a single task using the newly minted reboot plugin.
Some History
The win_reboot module was written by Matt Davis and included with Ansible 2.1. Rebooting Windows hosts is a much more common occurrence than rebooting Linux hosts. Necessity is the mother of invention, so it made sense that win_reboot appeared before the equivalent for Linux. And while less than elegant, it is possible to reboot Linux hosts using shell and wait_for or wait_for_connection[1].
Rebooting Linux systems with Ansible never felt right to me: much too error-prone and finicky. It finally bugged me enough that I refactored win_reboot into reboot so Linux hosts could join the reboot party with their Windows counterparts.
Development Story
When I set out to make the reboot plugin[2], the goal was to create a common class that win_reboot (and potentially others) could easily subclass to override specific parts of the reboot process. I was also working in reverse, deconstructing win_reboot into a new base class that it would then subclass.
After reading through the existing code in win_reboot, I came up with a general outline for how to break things up:
- construct the command to run on the remote host
- do the reboot
- validate that the reboot was good
I wanted the class methods to be reusable between Linux and Windows simply by redefining appropriate variables. They also needed to be modular enough so that only certain aspects of the reboot process could be overridden if needed and not the entire run() method.
For example, I could see up front that there were more reboot edge cases to code around in Windows than Linux. But constructing the command and its arguments, as well as the basic reboot and validate mechanics were the same. Therefore, I broke up command construction, reboot, validation, and the retry or timeout logic into separate pieces.
The overall strategy of what the plugin does has not changed:
- construct a reboot command
- capture the current last boot time of the target system
- reboot
- continuously check for the connection to be reestablished, or time out
- continuously validate the system is actually up by running a command, or time out
But the implementation is now more modular and less procedural.
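To illustrate the kind of modularity described above, here is a minimal sketch, not the plugin's actual code. The class names (RebootSketch, WindowsRebootSketch, FakeHost), the runner callable, the Windows arguments, and the placeholder boot time command are all hypothetical and exist only for this example. Each step is its own method, so a subclass can redefine a variable or override one step without rewriting run().

```python
class RebootSketch:
    """Hypothetical base class: every step of the reboot is a separate method."""

    # Defaults that a subclass can redefine without touching any method.
    SHUTDOWN_COMMAND = "shutdown"
    SHUTDOWN_ARGS = "-r {delay_min}"
    BOOT_TIME_COMMAND = "who -b"

    def __init__(self, runner):
        # runner is any callable that takes a command string and returns its output.
        self.runner = runner

    def construct_command(self, delay_min=0):
        args = self.SHUTDOWN_ARGS.format(delay_min=delay_min)
        return "%s %s" % (self.SHUTDOWN_COMMAND, args)

    def get_system_boot_time(self):
        return self.runner(self.BOOT_TIME_COMMAND).strip()

    def perform_reboot(self, command):
        return self.runner(command)

    def validate_reboot(self, previous_boot_time):
        # The reboot counts as done once the reported boot time has changed.
        return self.get_system_boot_time() != previous_boot_time

    def run(self):
        command = self.construct_command()
        previous_boot_time = self.get_system_boot_time()
        self.perform_reboot(command)
        return self.validate_reboot(previous_boot_time)


class WindowsRebootSketch(RebootSketch):
    """Hypothetical subclass: new defaults plus one overridden step."""

    SHUTDOWN_ARGS = "/r /t {delay_sec}"  # illustrative flags only
    BOOT_TIME_COMMAND = "hypothetical-last-boot-time-command"  # placeholder

    def construct_command(self, delay_sec=2):
        args = self.SHUTDOWN_ARGS.format(delay_sec=delay_sec)
        return "%s %s" % (self.SHUTDOWN_COMMAND, args)


class FakeHost:
    """Stand-in for a remote host: records commands, changes boot time on reboot."""

    def __init__(self):
        self.commands = []
        self.boot_time = "2018-10-01 09:00"

    def __call__(self, command):
        self.commands.append(command)
        if command.startswith("shutdown"):
            self.boot_time = "2018-10-01 09:05"
            return ""
        return self.boot_time


if __name__ == "__main__":
    host = FakeHost()
    print(RebootSketch(host).run(), host.commands)
```

The subclass inherits get_system_boot_time(), perform_reboot(), validate_reboot(), and run() unchanged, which is the reuse the refactoring was after.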
Accounting for Different Operating Systems
I wanted this plugin to work on as many Linux (and Linux-like) operating systems as possible (and keep working the same on Windows!). To do this, I had to come up with a method to identify the operating system of the target host and account for subtle differences between distributions and versions.
While there are standard command line tools for rebooting and getting system boot time, the flags and exact syntax these commands accept vary enough that I could not use the same flags for everything. Also, the sudo environment PATH in some operating systems does not contain the shutdown command, so I had to solve for that as well.
The first thing I needed to do was probe the target system to figure out what operating system was running. My first attempt was to run self._execute_module(name='setup'). While this does work, it’s a bit “heavy” since it copies over and runs all of the Python code associated with gathering facts on the target system. All I really needed was the distribution name.
I decided to give uname a try. The output of uname -a varies widely across operating systems, but plain old uname gave me what I needed: it reports Linux, FreeBSD, SunOS, or Darwin on the systems I tested[3]. I used the _low_level_execute_command() method to run the command on the target host and capture the output, lowercasing it to avoid any ambiguity.

    uname_result = self._low_level_execute_command('uname')
    distribution = uname_result['stdout'].strip().lower()

If the output from uname doesn’t match any of these, it defaults to using the Linux values. win_reboot overrides the construct_command() method, so it does not probe the target system.
Once I had a value to use as a lookup key, I needed to determine the command and parameters needed to actually reboot the system. Of course almost all of them are different.

    DEFAULT_SHUTDOWN_COMMAND = 'shutdown'

    SHUTDOWN_COMMANDS = {
        'linux': DEFAULT_SHUTDOWN_COMMAND,
        'freebsd': DEFAULT_SHUTDOWN_COMMAND,
        'sunos': '/usr/sbin/shutdown',
        'darwin': '/sbin/shutdown',
    }

    SHUTDOWN_COMMAND_ARGS = {
        'linux': '-r {delay_min} "{message}"',
        'freebsd': '-r +{delay_sec}s "{message}"',
        'sunos': '-y -g {delay_sec} -r "{message}"',
        'darwin': '-r +{delay_min_macos} "{message}"'
    }

Let’s start with the commands. Linux and FreeBSD were able to use shutdown without a full path, but Solaris and macOS (darwin) do not have the shutdown command in the PATH in the sudo environment. The parameters passed to the command varied even more.
The values in SHUTDOWN_COMMAND_ARGS are strings that will be run through str.format(). One nice feature of str.format() that I’m taking advantage of is the fact that you can pass in arguments that you don’t use in the format string. This let me have a different time value (minutes or seconds, and 0 or 1 minutes for Linux or macOS, respectively) and still build the command arguments in a single line:

    delay_min = pre_reboot_delay // 60
    delay_min_macos = delay_min | 1
    shutdown_command_args = shutdown_command_args.format(delay_sec=pre_reboot_delay, delay_min=delay_min, delay_min_macos=delay_min_macos, message=msg)
    reboot_command = '%s %s' % (shutdown_command, shutdown_command_args)

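To see what these pieces produce for each operating system, here is a self-contained sketch that combines the lookup tables and the str.format() call shown above. The function name build_reboot_command and its default message are assumptions made for this example; the plugin itself spreads this logic across its own methods.

```python
DEFAULT_SHUTDOWN_COMMAND = 'shutdown'

SHUTDOWN_COMMANDS = {
    'linux': DEFAULT_SHUTDOWN_COMMAND,
    'freebsd': DEFAULT_SHUTDOWN_COMMAND,
    'sunos': '/usr/sbin/shutdown',
    'darwin': '/sbin/shutdown',
}

SHUTDOWN_COMMAND_ARGS = {
    'linux': '-r {delay_min} "{message}"',
    'freebsd': '-r +{delay_sec}s "{message}"',
    'sunos': '-y -g {delay_sec} -r "{message}"',
    'darwin': '-r +{delay_min_macos} "{message}"',
}


def build_reboot_command(uname_output, pre_reboot_delay=0, msg='Rebooting'):
    """Hypothetical helper: turn raw uname output into a full reboot command."""
    distribution = uname_output.strip().lower()
    if distribution not in SHUTDOWN_COMMANDS:
        # Unknown systems fall back to the Linux values.
        distribution = 'linux'

    pre_reboot_delay = max(pre_reboot_delay, 0)  # guard against negative input
    delay_min = pre_reboot_delay // 60
    delay_min_macos = delay_min | 1

    # str.format() ignores keyword arguments the template does not use,
    # so every template can be filled from the same call.
    args = SHUTDOWN_COMMAND_ARGS[distribution].format(
        delay_sec=pre_reboot_delay,
        delay_min=delay_min,
        delay_min_macos=delay_min_macos,
        message=msg,
    )
    return '%s %s' % (SHUTDOWN_COMMANDS[distribution], args)


if __name__ == '__main__':
    for name in ('Linux', 'FreeBSD', 'SunOS', 'Darwin', 'SomethingElse'):
        print(name, '->', build_reboot_command(name, pre_reboot_delay=90))
```

With a 90 second delay, the Linux template receives 1 (minutes), FreeBSD and SunOS receive 90 (seconds), and the macOS template receives 1 (minutes, never 0).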
I want to talk briefly about the calculations for delay_min and delay_min_macos. The value for pre_reboot_delay is an integer in seconds that defaults to 0 but can be overridden. Since this number may not divide cleanly by 60 and it needs to be a valid integer when passed to the shutdown command, I use the // operator, which performs floor division: the quotient is rounded down to a whole number[4]. This gives me a nice clean integer I can pass to the shutdown command, and it returns 0 for any value less than 60 (I did some defensive programming earlier to set pre_reboot_delay to 0 if for some reason a negative number is passed in).
This worked great on everything except macOS. Passing shutdown -r +0 to macOS terminates the connection so abruptly that Ansible fails the play. The easy thing to do would be to just default to a 1 minute delay for everything. But that one minute seems like an eternity when you are watching a playbook run. Plus, one minute multiplied by thousands (maybe millions?) of Ansible users rebooting their systems starts to add up to a lot of person-years really fast. So hopefully I’m collectively saving humanity years with this optimization.
In order to default to 1 for macOS, I used the bitwise Or operator |, affectionately known as the “pipe” character, to set delay_min_macos to 1 if delay_min is 0. Bitwise Or with 1 sets the lowest bit of the number, so 0 | 1 evaluates to 1. This is not the same as the logical or: an odd value is left unchanged, but an even value is bumped up by one, so a delay of 2 minutes becomes 3.
Accounting for Windows was done by overriding construct_command() and perform_reboot() as well as defining appropriate defaults for the shutdown command flags and the command to get the last boot time[5]. win_reboot uses all the same code for capturing last boot time, validating the system came back up, and continuously checking the connection or timing out. Nice!
Once I had it working and tested it on as many different operating systems as I could find virtual machines for, I started asking around for others to test.
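To make the delay arithmetic described earlier in this section concrete, here is a small sketch that applies the same two expressions to a handful of values. The helper name delay_minutes is hypothetical. Keep in mind that | is a bitwise operation: it sets the lowest bit, so it turns 0 into 1 but also turns 2 into 3.

```python
def delay_minutes(pre_reboot_delay):
    """Hypothetical helper returning (delay_min, delay_min_macos) for a delay in seconds."""
    pre_reboot_delay = max(pre_reboot_delay, 0)  # negative input is treated as 0
    delay_min = pre_reboot_delay // 60           # floor division: whole minutes only
    delay_min_macos = delay_min | 1              # bitwise Or: sets the lowest bit
    return delay_min, delay_min_macos


if __name__ == '__main__':
    for seconds in (0, 59, 60, 125, 180):
        print(seconds, delay_minutes(seconds))
    # 0   -> (0, 1)
    # 59  -> (0, 1)
    # 60  -> (1, 1)
    # 125 -> (2, 3)
    # 180 -> (3, 3)
```

If only the zero case should change, max(delay_min, 1) would be one alternative; that is a suggestion for comparison, not what the plugin does.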
Exponential Backoff
In the course of code review, another of my amazing teammates suggested that I use exponential backoff for polling rather than just hitting the system once a second repeatedly until it successfully rebooted. I had never heard of this before, so it was another great opportunity to learn something new.
After doing some reading, I learned that exponential backoff is a technique for gradually increasing the time between each check. I found some examples as inspiration, and one interesting thing I read was that it’s a good idea to introduce a bit of randomness in the algorithm to prevent the same code running on distributed systems potentially all hitting the same central service in lock step. I don’t believe that was entirely necessary in this scenario, but I put it in there just in case.
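As a generic illustration of the technique, and not the plugin's code, the sketch below generates a capped exponential sleep schedule with a little random jitter. The function name backoff_delays and its parameters (base, max_delay, jitter, rng) are assumptions chosen for this example.

```python
import random


def backoff_delays(attempts, base=2, max_delay=12, jitter=1.0, rng=random):
    """Yield one sleep interval (in seconds) per failed attempt.

    The interval grows as base ** fail_count, is capped at max_delay,
    and gets up to `jitter` seconds of randomness added on top.
    """
    for fail_count in range(attempts):
        delay = min(base ** fail_count, max_delay)
        yield delay + rng.uniform(0, jitter)


if __name__ == '__main__':
    # With jitter disabled the underlying schedule is easy to see.
    print(list(backoff_delays(6, jitter=0)))  # [1.0, 2.0, 4.0, 8.0, 12.0, 12.0]
    # With jitter, two clients retrying at once drift apart over time.
    print([round(d, 2) for d in backoff_delays(6)])
```

The cap keeps the gap between checks bounded, and the jitter keeps many clients from retrying in lock step.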
Armed with a general understanding of the technique and a few good examples, I experimented and tuned the algorithm to get acceptable behavior for the plugin. Here is the algorithm I came up with.

    fail_count = 0
    max_fail_sleep = 12

    while datetime.utcnow() < max_end_time:
        try:
            action()
            if action_desc:
                display.debug('%s: %s success' % (self._task.action, action_desc))
            return
        except Exception as e:
            # Use exponential backoff with a max timeout, plus a little bit of randomness
            random_int = random.randint(0, 1000) / 1000
            fail_sleep = 2 ** fail_count + random_int
            if fail_sleep > max_fail_sleep:
                fail_sleep = max_fail_sleep + random_int
            if action_desc:
                display.debug("{0}: {1} fail '{2}', retrying in {3:.4} seconds...".format(self._task.action, action_desc, to_text(e), fail_sleep))
            fail_count += 1
            time.sleep(fail_sleep)

I set an upper bound with max_fail_sleep to prevent the wait time between each test from getting huge. I didn’t want the system to come back up in the middle of a really long sleep. I arrived at twelve just by experimenting and seeing what felt right and behaved well with my test systems. The end result is one to three queries before the play continues rather than ten or more. Thanks, Sviat, for the suggestion!
Friends Who Break Your Beautiful Code
I’m very fortunate to have some former coworkers who are still good friends, Ansible users, and very savvy with Linux. I asked if they would help test my plugin, helped them get set up to test, and very quickly got a report that “It looks stuck at the reboot task”.
It’s good to have friends that break your code.
After a few hours of late night troubleshooting, we determined that the output of who -b from his system was the epoch: 1970-01-01 00:00. That threw a wrench in my “did the system actually reboot?” logic since it was comparing that value continuously and waiting for it to change. Since that value was the same both before and after reboot, the plugin assumed the system had not yet rebooted and eventually timed out.
After some research, it turns out that systems that lack a real time clock, such as the Orange Pi in my friend’s test, do not properly set the last boot time. I ended up using uptime -s on those particular systems to work around this.
Ideally, I could set the default uptime check command to uptime -s, but the -s flag to uptime is far from universally available. It is, however, available on all recent versions of Armbian and Raspbian, which are the most likely systems to lack a real time clock and have incorrect output from who -b.
I added a check in get_system_boot_time() to account for this scenario, and the plugin now works quite well on several Pi flavors:

    if '1970-01-01 00:00' in command_result['stdout']:
        command_result = self._low_level_execute_command('uptime -s', sudoable=self.DEFAULT_SUDOABLE)

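The fallback and the “did it actually reboot?” comparison can be sketched in isolation as follows. This is a simplified stand-in for the plugin's logic: the run_command callable, the has_rebooted helper, and the sample command output are all assumptions made up for the example.

```python
EPOCH_MARKER = '1970-01-01 00:00'


def get_system_boot_time(run_command):
    """Return the boot time string, falling back to `uptime -s` on epoch output.

    run_command is a hypothetical callable that takes a command string
    and returns that command's stdout as text.
    """
    output = run_command('who -b')
    if EPOCH_MARKER in output:
        # Hosts without a real time clock can report the epoch here.
        output = run_command('uptime -s')
    return output.strip()


def has_rebooted(previous_boot_time, run_command):
    """The host has rebooted once its reported boot time differs."""
    return get_system_boot_time(run_command) != previous_boot_time


if __name__ == '__main__':
    # Made-up output imitating a host without a real time clock.
    before = {
        'who -b': '         system boot  1970-01-01 00:00\n',
        'uptime -s': '2018-10-01 09:00:12\n',
    }
    after = dict(before)
    after['uptime -s'] = '2018-10-01 09:07:45\n'

    previous = get_system_boot_time(before.get)
    print(previous)
    print(has_rebooted(previous, before.get))  # False: nothing changed yet
    print(has_rebooted(previous, after.get))   # True: boot time moved
```

Without the fallback, both calls would compare the same epoch string and the check would never succeed, which matches the “stuck at the reboot task” symptom.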
Examples
Here is an example of what rebooting Linux systems looked like before the reboot plugin.

    - name: Reboot system
      shell: sleep 2 && shutdown -r now
      async: 5
      poll: 0

    - name: Wait for system to come back up
      wait_for:
        host: "{{ ansible_host }}"
        port: 22
        search_regex: OpenSSH
        delay: 15
      delegate_to: localhost

There are several variations on this, but it boils down to a command or shell task followed by wait_for or wait_for_connection. With the reboot plugin, this is much more straightforward:

    - name: Reboot system
      reboot:

If you want to adjust the timeout for systems that take longer to boot, or run a different command to verify the system came back up, you can do that easily with a few parameters:

    - name: Reboot system
      reboot:
        reboot_timeout: 1200
        test_command: mount

The reboot plugin will wait for the system to come back up, then run the test_command until it returns an exit code of 0 or the timeout value is reached. Since this is an action plugin, it runs on the control machine, so there is no need to worry about delegating the task to the appropriate host. There is even a failsafe in the plugin to prevent accidentally rebooting the control node.
If you want to reboot both Windows and Linux hosts with the same task, you can do this using the action keyword. Configure group_vars with the appropriate action plugin name, privilege escalation settings, and any additional arguments that you want to be group specific.

    # group_vars/windows.yml
    reboot_action: win_reboot
    reboot_action_message: Rebooting Windows with Ansible
    ansible_become: no

    - name: Reboot Windows and Linux
      hosts: all
      become: yes

      tasks:
        - name: Reboot
          action: "{{ reboot_action | default('reboot') }}"
          args:
            msg: "{{ reboot_action_message | default(omit) }}"

Future Improvements
I’m very happy with how this turned out and hope it will make rebooting Linux systems with Ansible much easier than it is today. I already have some ideas for future features, such as support for pre-authenticated reboots for FileVault encrypted volumes. I would love to hear from anyone using the reboot plugin and welcome your feedback and pull requests.
Thanks
This was a pretty tough project for me during which I learned a lot, and I absolutely did not do it alone. I relied heavily on my teammates for input and guidance. Thank you to Matt Martz (sivel), Matt Davis (nitzmahone) (the original author of win_reboot whose code I mostly rearranged and polished), Toshio Kuratomi (abadger), and Sviatoslav Sydorenko (webknjaz) for the detailed conversations, wonderful feedback, and answering all my dumb questions with kindness and insight.
[1] The wait_for_connection module uses the exact same validation code that was in win_reboot originally.
[2] It’s an action plugin, not a module. Action plugins are run on the controller, while modules are copied to the managed host and executed there. It wouldn’t make sense for this to be a module since that would mean rebooting the system running the module, leaving nothing behind to verify the machine came back up.
[3] This is just what I had available to test on. I’d love to add more operating systems if anyone has systems to test against and can send me the output of uname.
[4] In Ansible, we use from __future__ import division to make division consistent between Python 2 and Python 3.
[5] Believe it or not, Windows and Linux actually use the same name for the shutdown command: shutdown. It’s very original.
https://www.ansible.com/blog/reboot-plugin-for-linux-in-ansible-2-7
© Lightnetics 2024