Marmelab Blog

How to dump your Docker-ized database on Amazon S3?

We sometimes use Docker for our staging environments. Putting code in staging or production also means setting up regular data backups. Here is a post describing how we regularly upload database dumps directly from a Docker container to Amazon S3 servers.

Dumping the Docker-ized Database on the Host

The first step is to create a dump of the database. Let's consider we have an awesomeproject_pgsql container:

#!/bin/bash

# Configuration
CONTAINER_NAME="awesomeproject_pgsql"
FILENAME="awesomeproject-`date +%Y-%m-%d-%H:%M:%S`.sql"
DUMPS_FOLDER="/home/awesomeproject/dumps/"

# Backup from docker container
docker exec $CONTAINER_NAME sh -c "PGPASSWORD=\"\$POSTGRES_PASSWORD\" pg_dump --username=\$POSTGRES_USER \$POSTGRES_USER > /tmp/$FILENAME"
docker cp $CONTAINER_NAME:/tmp/$FILENAME $DUMPS_FOLDER
docker exec $CONTAINER_NAME sh -c "rm /tmp/awesomeproject-*.sql"

After setting a few configuration variables to ease reuse of this script, we run a dump of our PostgreSQL database via the docker exec command. For a MySQL database, the process would be much the same: just replace pg_dump with mysqldump. We write the dump into the container's /tmp folder.

Note the sh -c command. This way, we can pass a whole command (including file redirections) as a string, without worrying about conflicts between host and container paths.

Database credentials are passed via environment variables. Note the backslashes inside the Docker shell command: they are important, as we want these variables to be expanded inside the Docker container, not when the host interprets the command. In this example, we are using the official PostgreSQL image from Docker Hub, setting these variables at container creation.
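For reference, here is how such a container might have been created with the official image. This is a hedged sketch: the container name matches our script, but the credentials are placeholders, and your container may well come from a docker-compose file instead.

```shell
# Hypothetical container creation: the official postgres image reads these
# variables at first start to create the user, password and default database.
docker run -d \
    --name awesomeproject_pgsql \
    -e POSTGRES_USER=awesomeproject \
    -e POSTGRES_PASSWORD=changeme \
    postgres
```

The same POSTGRES_USER and POSTGRES_PASSWORD variables are then available inside the container, which is exactly what our backup script relies on.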

We then copy the file from the container to the host using the docker cp command, and clean the container temporary folder.

Keeping Only Last Backups

We generally keep data backups for a week. That gives us 7 days to detect data anomalies and restore a dump. Some developers prefer to keep another backup per month but, except in very sensitive environments, month-old data is too old to be useful. So, let's focus on the last 7 days:

# Keep only 7 most recent backups
cd $DUMPS_FOLDER && (ls -t | head -n 7 ; ls -t) | sort | uniq -u | xargs --no-run-if-empty rm
cd $DUMPS_FOLDER && bzip2 --best $FILENAME

The first command probably looks like voodoo. Let's explain it. First, we go into the dumps folder. We repeat the cd on the next line, so that each line works on its own, whatever the current directory, even if the lines are later reordered or run in isolation.

The (ls -t | head -n 7 ; ls -t) part runs two commands one after the other inside a subshell, concatenating their output. We first list (ls) our files by modification time (-t), newest first, and keep only the 7 first lines (head -n 7).

The second ls -t then lists all available files. So, every file we should keep appears twice in the combined output.

Finally, we sort the whole list so that duplicated lines become adjacent (uniq only detects consecutive duplicates), keep only the unique lines using uniq -u, and rm each result. The --no-run-if-empty option just prevents an error when no file needs to be deleted.
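To see the rotation trick in action without touching real dumps, here is a small self-contained demo on dummy files (the directory and file names are made up for the example). Note the sort before uniq -u: uniq only spots adjacent duplicates.

```shell
# Demo of the rotation pipeline on 10 dummy dump files
demo_dir=$(mktemp -d)
cd "$demo_dir"

# Create 10 files with distinct modification times
for day in 01 02 03 04 05 06 07 08 09 10; do
    touch -d "2015-01-$day" "awesomeproject-2015-01-$day.sql"
done

# Keep the 7 most recent files, delete the rest
(ls -t | head -n 7 ; ls -t) | sort | uniq -u | xargs --no-run-if-empty rm

ls | wc -l    # → 7: only the 7 newest files remain
```

The three oldest files (days 01 to 03) appear only once in the combined listing, so they are the ones passed to rm.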

As Amazon bills us depending on the amount of data stored on S3, we also compress our dumps as much as possible, using bzip2 --best. Even if prices are really cheap, it is always worth a single command to save money, isn't it?
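As a quick sanity check, bzip2 compresses the file in place, replacing it with a .bz2 version. Here is a throwaway example on a dummy dump file:

```shell
# bzip2 --best compresses in place: foo.sql becomes foo.sql.bz2 (dummy file)
work_dir=$(mktemp -d)
printf 'CREATE TABLE demo (id integer);\n' > "$work_dir/awesomeproject-demo.sql"

bzip2 --best "$work_dir/awesomeproject-demo.sql"

ls "$work_dir"    # → awesomeproject-demo.sql.bz2
```

When restoring, bunzip2 reverses the operation, giving back the plain .sql file to feed to psql.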

Uploading Database Dumps on Amazon S3

Storing dumps on the same machine as the databases is not a good idea. Hard disks may crash, users may lock themselves out with too-restrictive iptables rules (true story), etc. So, let's use a cheap and easy-to-use solution: moving our dumps to Amazon S3.

For those unfamiliar with Amazon services, S3 should have been called "Amazon Unlimited FTP Servers", according to the AWS in Plain English post. It is used to:

  • Store images and other assets for websites,
  • Keep backups and share files between services,
  • Host static websites.

Also, many of the other AWS services write and read from S3.

Of course, you can restrict access to your files, as we are going to do now.

Creating a Bucket

The first step is to create an S3 bucket. You can mentally replace bucket with hosting for a better understanding. So, log in to your Amazon Web Services (AWS) console, and select S3. Then, click Create bucket, give it a name, and choose a location for it.

Create a bucket on Amazon S3

Keep default values for all other parameters. We are now going to restrict access to our bucket, by creating a specific user.

Restricting Access to the Bucket

This is the only painful step: creating a user with the correct permissions to access our bucket. Indeed, our server-side script needs to connect to AWS. We could of course give it our own root AWS credentials. But, just as it is safer to use a low-privilege account instead of root on Linux, we are going to create a dedicated awesomeproject user.

So, go to the IAM (Identity and Access Management) service. In the left menu, choose Policies and create a new one from scratch. Give it a name and a description, and add the following content:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:GetBucketLocation",
                "s3:ListAllMyBuckets"
            ],
            "Resource": "arn:aws:s3:::*"
        },
        {
            "Effect": "Allow",
            "Action": [
                "s3:ListBucket"
            ],
            "Resource": [
                "arn:aws:s3:::awesomeproject-private"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "s3:PutObject",
                "s3:GetObject",
                "s3:DeleteObject"
            ],
            "Resource": [
                "arn:aws:s3:::awesomeproject-private/*"
            ]
        }
    ]
}

This policy allows three operations, respectively:

  • List all buckets available to the user (required to connect to S3),
  • List the content of the specified bucket,
  • Write, read, and delete objects in that bucket.

Create an Amazon IAM policy

Validate the policy and create a new user (in the Users menu). Do not forget to save the credentials: the secret access key cannot be retrieved later.

Finally, select the freshly created user, and attach the AwesomeProjectPrivate policy to it. This cumbersome configuration is now complete. Let's write our last lines of Bash.

Uploading our Dumps to Amazon S3

To upload our files, we use the AWS CLI tools. Installing them is as simple as:

sudo pip install awscli

After configuring user credentials through the aws configure command, we can complete our script with the following lines:
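If you prefer to script the configuration (when provisioning the server, for instance), the same settings can be written non-interactively with aws configure set. The values below are obviously placeholders for the credentials generated in the previous step, and the region is just an example:

```shell
# Non-interactive equivalent of the interactive `aws configure` prompt
aws configure set aws_access_key_id "YOUR_ACCESS_KEY_ID"
aws configure set aws_secret_access_key "YOUR_SECRET_ACCESS_KEY"
aws configure set default.region eu-west-1
```

These commands simply write to the ~/.aws/credentials and ~/.aws/config files of the user running the script.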

# Configuration
BUCKET_NAME="awesomeproject-private"

# [...]

# Upload on S3
/usr/bin/aws s3 sync --delete $DUMPS_FOLDER s3://$BUCKET_NAME

We synchronize our dumps folder with our bucket, removing remote files that no longer exist locally (--delete). Do not forget to use the full path to the aws binary; otherwise, you may get errors when the script is launched from a cron job.
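Before trusting the cron job with a --delete flag, the transfer can be previewed with --dryrun, which prints what would be uploaded or removed without actually doing it (paths and bucket name as in our script):

```shell
# Preview the synchronization without modifying anything
/usr/bin/aws s3 sync --delete --dryrun /home/awesomeproject/dumps/ s3://awesomeproject-private
```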

Final Script

For the record, here is the full script:

#!/bin/bash

# Configuration
CONTAINER_NAME="awesomeproject_pgsql"
FILENAME="awesomeproject-`date +%Y-%m-%d-%H:%M:%S`.sql"
DUMPS_FOLDER="/home/awesomeproject/dumps/"
BUCKET_NAME="awesomeproject-private"

# Backup from docker container
docker exec $CONTAINER_NAME sh -c "PGPASSWORD=\"\$POSTGRES_PASSWORD\" pg_dump --username=\$POSTGRES_USER \$POSTGRES_USER > /tmp/$FILENAME"
docker cp $CONTAINER_NAME:/tmp/$FILENAME $DUMPS_FOLDER
docker exec $CONTAINER_NAME sh -c "rm /tmp/awesomeproject-*.sql"

# Keep only 7 most recent backups
cd $DUMPS_FOLDER && (ls -t | head -n 7 ; ls -t) | sort | uniq -u | xargs --no-run-if-empty rm
cd $DUMPS_FOLDER && bzip2 --best $FILENAME

# Upload on S3
/usr/bin/aws s3 sync --delete $DUMPS_FOLDER s3://$BUCKET_NAME

Cron Task

Finally, let's cron our script to automatically save our data each day. Add the following file into /etc/cron.d/awesomeproject:

# Daily DB backup
00 23 * * * awesomeproject /home/ubuntu/awesomeproject/bin/db-save.sh >> /home/ubuntu/cron.out.log 2>&1

It launches the script every day at 11pm, as the awesomeproject user. Prefer absolute paths to relative ones: it will save you some serious headaches.

Another useful tip when dealing with cron jobs is to redirect their output to a file. It is really essential for debugging. Here, we append the standard output to the cron.out.log file (via the >> operator), and redirect the error output to the standard output thanks to 2>&1.
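The redirection behavior is easy to check in isolation. In this throwaway example, both the normal message and the error message end up in the same log file:

```shell
# Both stdout and stderr are appended to the same log file
log_file=$(mktemp)
{ echo "backup ok"; echo "backup failed" >&2; } >> "$log_file" 2>&1

cat "$log_file"
```

Without the 2>&1, the "backup failed" line would have leaked to cron's mail system instead of the log file.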

We can now enjoy the security of daily database dumps. Great!