Hey!
Over the last few days I've been investigating different options for
performing backup and restore for Keycloak Operator. I've been discussing
different parts of this functionality with some of you, and now I'd like to
get everybody on the same page.
Grab a large cup of coffee and let's dig in!
1. Old Operator's behavior
The old Keycloak Operator used a CronJob to schedule database backups and
upload the results to AWS S3 [1]. However, it seems (at least I cannot find
it) that the Operator doesn't perform a restore operation.
The biggest advantage of this approach is consistency (across other
Operators created by the Integreatly Team [2]). But there are some downsides
as well: it's limited to AWS, it uploads large images over the open
Internet, and there's no way to enforce any retention policy (as we just
upload the backup and forget about it).
3. Related work
There's ongoing work on Persistent Volume Snapshots [3], which is targeted
at Kubernetes 1.16, which probably means OpenShift 4.3+ [4]. The
functionality allows us to create a Persistent Volume Snapshot, which could
be used as a backup. Later on, we could mount such a snapshot into the
PostgreSQL Pod. However, I'm not sure what will happen with the in-flight
transactions when we take such a snapshot from PostgreSQL. Once the snapshot
functionality is in, we can test it out.
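To make the snapshot idea a bit more concrete, here is a rough sketch of
what taking a snapshot and restoring it into a fresh volume could look like.
It assumes the `snapshot.storage.k8s.io/v1` API shape (which is newer than
the alpha version discussed here) and a CSI driver that supports snapshots.
All names, the snapshot class and the size are hypothetical placeholders.

```yaml
# Sketch only: every name below is a placeholder, not something the
# Operator creates today.
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: keycloak-db-snapshot            # hypothetical
spec:
  volumeSnapshotClassName: csi-snapclass   # hypothetical, depends on the CSI driver
  source:
    # PVC currently mounted by the PostgreSQL Pod (name is an assumption)
    persistentVolumeClaimName: keycloak-postgresql-claim
---
# Restore: a new PVC pre-populated from the snapshot, which could then be
# mounted into a PostgreSQL Pod.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: keycloak-db-restored            # hypothetical
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi                      # must be at least the snapshot size
  dataSource:
    apiGroup: snapshot.storage.k8s.io
    kind: VolumeSnapshot
    name: keycloak-db-snapshot
```

Note that this does not answer the question about in-flight transactions.
It only shows the moving parts.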
I asked the Storage SIG whether there are any plans to create an automated
tool that makes creating backups for the whole cluster a bit easier, but
I've been told the Storage SIG will create only the building blocks (like
the snapshots) for it. Nothing more.
The Snapshot functionality could be taken one step further: we could
imagine creating a backup of a whole namespace (with all ConfigMaps, Secrets
etc.). It seems there's no out-of-the-box tool that does this. The closest
is Velero [5] by Heptio/VMware. I've also heard (through my private
channels) about companies selling closed-source products like that.
The Persistent Volumes in Kubernetes use Storage Classes [6] to indicate
which underlying storage mechanism to use. Some of them (like GlusterFS or
AzureDisk) may natively support backups. Externally configured backups
might be tightly connected to budget (e.g. in AWS the slower the disk, the
less money you pay for it).
4. An implementation idea with pg_dump
I did some experiments with spinning up a new Pod (or a Job) with a brand
new Persistent Volume and using the pg_dumpall utility to back up
PostgreSQL [7]. We may also let the user decide on the storage class at
this point (e.g. use slow and cheap storage for backups).
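For illustration, the experiment roughly corresponds to the manifests below.
This is a minimal sketch and not the code from [7]: the image tag, the
storage class, the Service name and the Secret are all assumptions. The
connection settings rely on the standard libpq environment variables
(PGHOST, PGUSER, PGPASSWORD), which pg_dumpall picks up on its own.

```yaml
# Sketch only: names, image and storage class are hypothetical.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: keycloak-backup-claim
spec:
  storageClassName: slow-and-cheap      # hypothetical, user-selected class
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
---
apiVersion: batch/v1
kind: Job
metadata:
  name: keycloak-backup
spec:
  backoffLimit: 0                       # do not retry a failed dump
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: pg-dumpall
          image: postgres:11            # assumption: any image shipping pg_dumpall
          command: ["pg_dumpall", "--file=/backup/keycloak.sql"]
          env:
            - name: PGHOST
              value: keycloak-postgresql    # hypothetical Service name
            - name: PGUSER
              valueFrom:
                secretKeyRef:
                  name: keycloak-db-secret  # hypothetical Secret
                  key: user
            - name: PGPASSWORD
              valueFrom:
                secretKeyRef:
                  name: keycloak-db-secret
                  key: password
          volumeMounts:
            - name: backup
              mountPath: /backup
      volumes:
        - name: backup
          persistentVolumeClaim:
            claimName: keycloak-backup-claim
```

Once the Job completes, the dump stays on the claim, which is the artifact
we would keep around as the backup.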
The idea is to act when a user creates a KeycloakBackup CR:
apiVersion: keycloak.org/v1alpha1
kind: KeycloakBackup
metadata:
  name: example-keycloakbackup
spec:
  # This field will be used for restoring a backup, I will explain it a bit
  # later
  #restore: true
  instanceSelector:
    matchLabels:
      app: keycloak
This triggers the Operator to create a Pod (or a Job) with a new Persistent
Volume mounted and use pg_dumpall to create a backup. Once the backup is
created, we leave the Persistent Volume in the cluster. Sending it to
external storage would be the user's responsibility. At this point we also
don't care about periodic backups. If someone wishes to create them, they
need to create a CronJob that creates KeycloakBackups on their behalf
(here's a link showing how to call the Kubernetes API from a Job/Pod [8]).
Once a user decides to restore a backup, they just set the restore flag to
`true`. Then the CR is in its terminal state - you can't do anything more
with a restored backup.
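A user-side CronJob for periodic backups could look more or less like the
sketch below. It is only an illustration under several assumptions: the
image is a placeholder for any image that ships kubectl and a shell, the
ServiceAccount is hypothetical and needs RBAC permission to create
KeycloakBackup resources (not shown), and on older clusters the CronJob
apiVersion would be `batch/v1beta1`.

```yaml
# Sketch only: image, ServiceAccount and schedule are hypothetical.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: keycloak-backup-schedule
spec:
  schedule: "0 2 * * *"                 # every day at 02:00
  concurrencyPolicy: Forbid             # never run two backups at once
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: keycloak-backup-creator
          restartPolicy: Never
          containers:
            - name: create-backup-cr
              image: registry.example.com/tools/kubectl:latest
              command:
                - /bin/sh
                - -c
                - |
                  # generateName gives every scheduled run its own CR,
                  # which keeps the 1:1 mapping between CR and backup.
                  cat <<EOF | kubectl create -f -
                  apiVersion: keycloak.org/v1alpha1
                  kind: KeycloakBackup
                  metadata:
                    generateName: scheduled-keycloakbackup-
                  spec:
                    instanceSelector:
                      matchLabels:
                        app: keycloak
                  EOF
```

`kubectl create` is used on purpose here, since `generateName` is not
supported by `kubectl apply`.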
This solution has some advantages - it creates a nice 1:1 mapping between a
CR and a backup. It also maintains this mapping with each restore. Finally,
we don't need to care about retention policy or scheduled backups - it's
the user's (or K8s admin's) responsibility to do that. We just create an
additional Persistent Volume that contains a database backup. Of course,
the lack of retention policy and scheduling might be considered a drawback -
it's a valid point of view.
5. Integreatly Team requirements
@David Martin <davmarti(a)redhat.com> sent me a set of requirements for the
Keycloak Operator around backups:
- The operator can do backups (scheduled and manually triggered)
- The backup process should push resources offsite. Otherwise I have to
code that bit. If it doesn't do this, why bother with any backup logic at
all
- The operator should make the configuration of this as easy as possible
e.g. allow a schedule to be configured, and a location/credentials to push
- The operator should make a restore easy to do e.g. point to offsite
location & credentials
- Not directly related to this thread, but mentioning it for the larger
picture. Problems with the backup should trigger an alert in Prometheus.
6. Final thoughts
Unfortunately there's no ultimate solution for backing up the whole
namespace yet. We are at the point where all necessary building blocks are
being built as we speak (like Persistent Volume Snapshots). But we're
months if not years from the final solution.
I like the idea of backups I explained in #4. I believe it's very
extensible but it doesn't fulfill most of the requirements from David's
list. However, we could do a few tricks to make the situation a bit better:
- we might implement a CronJob that creates KeycloakBackup CRs according to
the given schedule.
- we might reuse (or slightly modify) the Integreatly upload utility to
support Persistent Volumes that already contain a backup. In other words
- the utility will need to skip the pg_dump call.
Alternatively, we may take the path of least resistance and implement
backups the same way as in other Operators, while separating the
implementation behind a clean interface (so that we could extend it in the
future).
Thanks,
Sebastian
[1] https://github.com/integr8ly/keycloak-operator/blob/d4aa7f0fdcf765b578ed1...
[2] https://github.com/search?q=integreatly%2Fbackup-container&type=Code
[3] https://kubernetes-csi.github.io/docs/snapshot-restore-feature.html
[4] https://blog.openshift.com/wp-content/uploads/Red-Hat-OpenShift-4.0-Roadm...
[5] https://velero.io/
[6] https://kubernetes.io/docs/concepts/storage/storage-classes/#the-storagec...
[7] https://github.com/slaskawi/keycloak-operator/blob/INTLY-3367-Backups/bac...
[8] https://kubernetes.io/docs/tasks/administer-cluster/access-cluster-api/#w...