March 18, 2016

Safe bash scripting: ‘set -e’ is not enough!

Filed under: Rants, Technical — James Bunton @ 10:53 pm

I know, I know, shell scripting is a bad way to write software. I’ve always had a bit of a soft spot for bash though. Sure there are plenty of gotchas in bash scripting, but isn’t that the case for every programming language?

I like bash scripts

Sometimes you just want to start processes, do some piping, filtering and file redirection. For simple tasks like this a language like Python just gets in the way and makes the code more verbose.

For example the following simple bash script takes stdout from a Java app and conditionally passes it as a command line parameter to a Python application. It does this in 5 lines. No other language makes this so easy! :)

#!/bin/bash
output="$(java -jar SomeApp.jar | grep somevalue)"
if [ -n "$output" ]; then
    ./do-thing.py "$output"
fi

My recent problem

#!/bin/bash

set -e

function init_database {
    curl -o data.sql.gz --fail --retry 3 http://example.com/data.sql.gz
    gunzip data.sql.gz
    psql mydb < data.sql
}

if ! init_database; then
    echo "Failed to init database!"
    exit 1
fi

echo "Do more things..."

I have a function which downloads some SQL, decompresses it and loads it into the database. If there's an error at any point I want to detect this and handle it. Specifically, I expect that in the snippet above, if there is an error downloading the file, then init_database will stop and I will see the message "Failed to init database!", followed by the script exiting. That's what the set -e option does, right?

Not so! If you run this, here's what actually happens:

  • curl --fail returns a non-zero exit code, perhaps because of a networking error.
  • gunzip is executed and decompresses as much as it can.
  • psql is executed and inserts whatever data is available. It returns 0 because the file exists and it connected to the DB.
  • init_database returns 0 because the last command it executed returned 0.
  • The script continues and prints "Do more things..."
  • Probably it fails much later in an annoying and/or surprising way.

Why why why!?

set -e corresponds to the bash shell option errexit. The errexit feature seems to have been hacked into the language without much up-front design. I guess this is because of the 40-something year history of the Unix shell. Bash itself dates back to 1989.

Internally, when you enable the errexit option, bash sets a global variable called exit_immediately_on_error. Now that's a revealing choice of variable name: the shell doesn't exit immediately on all errors, even with the option enabled. If it did then all sorts of things would be impossible. In the example below the expression in the if statement is sometimes going to exit with an error. This should not cause the script to exit; it should merely skip over the contents of the if block.

if [ -f somefile ]; then
    gzip somefile
fi

This behaviour is described in the documentation for the set builtin.

The shell does not exit if the command that fails is part of the command list immediately following a while or until keyword, part of the test in an if statement, part of any command executed in a && or || list except the command following the final && or ||, any command in a pipeline but the last, or if the command’s return status is being inverted with !.

This works because whenever bash starts executing the conditional in an if statement it disables the errexit feature, and it leaves errexit disabled until the conditional has been evaluated. That means the entire init_database function is executed as if we had never asked for errexit! What's more, even if you try to set -e inside your function, bash has a special case to ignore it; otherwise your shell would exit whenever a conditional returned false. This is the reason the error handling did not work in the init_database example earlier.
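
Here's a minimal sketch of that special case (the function name broken is just for illustration): even an explicit set -e inside the function is silently ignored when the call happens inside an if conditional.

#!/bin/bash
set -e

function broken {
    set -e   # ignored: this function was called from an if conditional
    false    # does not stop the function
    echo "still running after the failure"
}

if ! broken; then
    echo "never printed: broken returns 0, the exit status of its final echo"
fi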

So how do I safely execute a function and check its result?

Glad you asked! Here's a simple example you can run yourself. I've tested this with Bash 4.3.42, but I believe the behaviour should be the same on older versions.

See the inline comments in the script for explanations of why each example fails or succeeds.

#!/bin/bash
set -e

function do_work {
    echo Working

    false # fail to do something

    echo "We should never get here, there should be no more work!"
}

echo
echo "Start1"
if ! do_work; then
    echo "Failed work, this is what we want."
fi
echo "End1"
# bash disables errexit in the function because it is executed as
# "part of the test in an if statement"


echo
echo "Start2"
do_work || echo "Failed work, this is what we want."
echo "End2"
# bash disables errexit in the function because it is executed as
# "part of any command executed in a && or ||"


echo
echo "Start3"
set +e
do_work
if [ "$?" -ne 0 ]; then
    echo "Failed work, this is what we want."
fi
set -e
echo "End3"
# This obviously doesn't work because I've explicitly disabled errexit.
# If I left errexit enabled then when do_work fails the shell would
# exit immediately, which means my if statement would not be executed.


echo
echo "Start4"
do_work | cat
if [ "${PIPESTATUS[0]}" -ne 0 ]; then
    echo "Failed work, this is what we want."
fi
echo "End4"
# Success! This example leaves errexit enabled for the function, however
# the exit status is still available to be checked in the PIPESTATUS array.
# Note that this won't work if you enable the non-default option: set -o pipefail

Other issues to be aware of with bash scripting

Just for the record, here are a few other common problems I see in shell scripting:

  • Variables must be quoted or the tokeniser will split on spaces within the variable.
  • People tend to write spaghetti code instead of using functions.
  • Functions cannot have named parameters.
  • It's not practical to build new data structures, however the builtin arrays and maps work well enough.
  • Reading unset variables gives an empty string. Use set -u.
  • Almost all variables you set are global to the script. Remember to use local foo=1 in your functions!
  • Variables set inside subshells cannot be shared with the rest of the script. This includes while loops that read from a pipe, but not for loops (see the sketch after this list).
  • If a command fails by returning non-zero status execution will cheerfully continue. Use set -e.
  • Errors from commands in a pipeline are ignored, except for the command at the end. This will succeed: false | true. You can use set -o pipefail.
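
As a quick sketch of the subshell pitfall above: piping into a while loop runs the loop body in a subshell, so any variable it sets is lost, whereas feeding the loop with process substitution keeps it in the current shell. The printf input here is just example data.

#!/bin/bash

count=0
printf 'a\nb\nc\n' | while read -r line; do
    count=$((count + 1))   # runs in a subshell, so this copy of count is discarded
done
echo "$count"              # prints 0

count=0
while read -r line; do
    count=$((count + 1))   # runs in the current shell, so the change sticks
done < <(printf 'a\nb\nc\n')
echo "$count"              # prints 3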

My recommendations on safe bash scripting

I think it would be a mistake to be too afraid of using bash just because of these problems. There are no other languages which come close to the simplicity of shell scripting when it comes to executing a series of commands in order.

If you follow these rules then you'll be ok:

  • Always use set -eu at the top of your scripts.
  • Always put quotes around everything!
  • In a pipeline don't use commands which may fail, except as the last command.
  • Don't use backticks, instead use: x="$(echo foo)".
  • Use functions for any script more than a few lines long. This is better than using comments to describe the intended behaviour of a series of commands.
  • Use local variables in your functions whenever possible.
  • Don't call functions from within an if conditional; run them through a pipe and check PIPESTATUS instead, as in the sketch after this list.
  • Consider writing your script in another language like Python. Particularly if your script is doing lots of complex tests and conditionals, rather than simply being a series of commands executed in sequence.
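
Putting a few of these rules together, here's a minimal sketch of how the init_database function from earlier could be called safely, using the pipe and PIPESTATUS trick from example 4 above (this assumes pipefail is left at its default, off):

#!/bin/bash
set -eu

function init_database {
    curl -o data.sql.gz --fail --retry 3 http://example.com/data.sql.gz
    gunzip data.sql.gz
    psql mydb < data.sql
}

# Running the function as part of a pipeline keeps errexit active inside it,
# and its exit status is still available afterwards in the PIPESTATUS array.
init_database | cat
if [ "${PIPESTATUS[0]}" -ne 0 ]; then
    echo "Failed to init database!"
    exit 1
fi

echo "Do more things..."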

8 comments

warchinal says:

Would a short fix to your problem above (i.e. wanting init_database to return a non-zero status if the curl fails) be the following?

function init_database {
    curl -o data.sql.gz --fail --retry 3 http://example.com/data.sql.gz
    curl_status="$(echo $?)"
    gunzip data.sql.gz
    psql mydb < data.sql
    [ "$curl_status" -eq 0 ]
}

That could allow you to explicitly care about the status of a certain command(s) in your function, and have the function's return value reflect that status.

James Bunton says:

Hi Will!

Yes, something like that would be a simple fix to this specific problem. Actually I'd check the curl exit status and return immediately after the curl, without running gunzip or psql.

However I wanted a general fix. If gunzip or psql fails, or if one of the other functions called from an if conditional fails, I want all of these cases to be handled. I don't want to have to check the exit code of every command; that's what set -e is supposed to do for me! :)

warchinal says:

Hey James :)

Thanks for the quick reply. Given that this was obviously a solution to a real world problem (in fact I think I recognize the code), I'm glad to hear you've gone with a more sustainable solution rather than the quick fix.

reubsnoobs says:

Hmm… excuse me if my head seems to be stuck in the BSD’s; but could you kindly explain to me why the following is unsuitable?

#! /bin/sh

init_database()
{
curl --fail --retry 3 http://example.com/data.sql.gz |
gunzip >data.sql &&
psql mydb <data.sql ||
echo 'Failed to init database!'
}

# Do more things…

P.s. I haven’t read your bash code, because it really is code, as opposed to a programme.

James Bunton says:

@reubsnoobs

That wouldn’t work as-is, you’d need to add return 1 at the end.

Your basic idea is sound, however I prefer to rely on set -e rather than needing to remember to separate each statement with && and terminate functions with || return 1.

reubsnooobs says:

$ echo $?
1

James,
so I read the article now, and can sympathize w.r.t. set -e; but I hope you don’t use this super-weird hack that only works on bash in anything you release to the public… bash(1) is such an ugly thing; I’d much rather use rc(1); or pdksh, with all its faults.

If by “at the end” you mean:

/^echo/c
return 1

then this only changes all non-zero exit statuses to 1; because, in sh(1), the exit status of a function is that of the last command executed.
fi

Maybe you mean curl > data.sql.gz || return, which is also useful to prematurely abort functions.

I will admit sh(1) has rather rusty pipes; but seeing as you originally were abusing the file-system (which maybe isn’t so bad when net-working); I don’t see your problem with the very elegant && and || syntax.

Similar to your (resentful) || return idea, I do have in my .profile (simplified):

alias loop='while true; do'

that I may write:

loop
increment_commands
[ condition ] && break
commands
done

simply because I despise any “structured” programming more complicated than a loop.

Kudos to you for grabber. I was annoyed that it wasn’t very Unix-y as it came; but I couldn’t be bothered figuring out how to get stuff from the A.B.C. (in particular). Now I will no longer be pestered by someone complaining they can’t watch Gardening Australia like they do on Windows. It certainly isn’t my A.B.C. if it doesn’t support freedom.

Kudos also to Media Watch for supporting freedom, while I’m on the topic.

reubsnoobs says:

Ahh. I see your confusion. The following is what your usage requires.

#! /bin/sh

init_database()
{
curl --fail --retry 3 http://example.com/data.sql.gz >data.sql.gz &&
gunzip data.sql.gz &&
psql mydb <data.sql
}

init_database ||
{
printf '%s' 'Oh'
printf ' %s.\n' 'no'
}

I had just hoped to show before how the error-handling may be concisely included in the function. Obviously, deleting the || clause will allow you to handle errors in the calling script. You could use variables to do both; but it would not be so concise.

If I'm honest I just think your solution is really silly, so take that how you will.

Anyway, kids, stay in school; don't run bash; and get ksh93 from http://www.github.com/att/ast

James Bunton says:

Hi @reubsnoobs, there’s no confusion. I understand how your solution could be made to work, but I think it’s error prone to need to remember to suffix each command with &&.

The point of this blog post is that once upon a time I believed that set -e was a neat solution to this problem; unhappily, I discovered it has some serious issues.

I think alternate shells like ksh, zsh, etc are not really a good solution either. I would sooner write the script in something like Python which is likely to be already installed on the target :)
