Deploy Applications in Kubernetes

Learn the ins and outs of deploying a Nuxt frontend, a Go api, and a Postgres database in Kubernetes.

This is the second part of the Kubernetes (k8s) series where we will focus on deploying three applications inside a k8s cluster, namely a Postgres database, a Go API server, and a Nuxt frontend single-page application (SPA).

Along the way, we will learn about typical k8s objects including Deployment, Service, PersistentVolume, and Secret, but also related topics like port-forwarding, authorization, building container images, and many others. Just like the first part, this post is going to be elaborate and detailed because k8s is a behemoth, and I think it is a good idea to look a little under the hood to get a better understanding of how all the pieces work together. We will simply break it down into smaller parts, and you will see that a handful of yaml files is enough to easily deploy applications on k8s.

Circling back to the applications, we are going to deploy three items:

  1. A frontend written in Nuxt 3 that you will see and interact with

  2. A database using Postgres that will store our data

  3. A backend api server written in Go

The frontend will be a simple interface that lists the latest ten UUID items stored in the database. It has an ‘add’ button which generates a new random UUID and stores it in the database. Once that is done, the frontend makes another api call to retrieve the latest UUIDs.

This is what it will look like:

Figure 1: Demonstration of the final product — a Vue 3 SPA listing the last ten random UUIDs; clicking the ‘add’ button appends a new one to the list.

Since all three applications are going to live inside k8s, we need a way to access them. To expose the applications, we will port-forward a couple of them to our local computer. Firstly, the frontend needs to be port-forwarded so that we can access it in the browser from the local computer. The frontend needs access to the api, so we will port-forward that too. The database does not need to be port-forwarded because it only interacts with the api within the cluster. There is an important step regarding the address the api uses to access the database, which we will see later.

Figure 2: Overall architecture of the applications — the database, Vue.js frontend, and Go api live in the cluster; the api connects to the database, and both the frontend and the api are port-forwarded to the local computer.

Finally, the backend api server acts as an intermediary between the frontend and the database. Although its purpose is simple, we need to think about where this api will read its configuration settings from. When deploying an update to the api, we want to ensure in-progress requests are not cancelled, so we need a graceful way of shutting down the program. By extension, k8s needs to know if the api is running; otherwise it needs to route incoming requests to another Pod. If there are any database schema changes, those need to be applied as well. As you can see, there are many things to consider, and for that, some principles of 12-factor apps are going to be applied. Lots of questions need to be answered, and we will look at designing such an application.

~~

To follow along with this post, these are the three repositories you may clone and play with. The devops repo is intentionally separate from the api and web repos, the idea being that the cluster admin manages the k8s side and the database, while full-stack developers only concern themselves with the api and frontend. This blog post will assume you are part of a devops team. Sections performed in the full-stack engineer role will be marked in the code block with the comment ### Run by user dev007.

  1. K8s administration: https://github.com/gmhafiz/k8s-devops (includes database deployment)
  2. Api: https://github.com/gmhafiz/k8s-api
  3. Web: https://github.com/gmhafiz/k8s-web


Prerequisite

A working k8s cluster: see part 1, k3s, kind, minikube, etc.

Database

We need a database that can persist our random UUID data so the api can store and retrieve it. For that, let us deploy a Postgres database in the k8s cluster. There are several k8s objects we require for this setup. You will see that k8s tends to use the terms ‘object’ and ‘resource’ interchangeably, as will I. First up is configuration.

Configuration

Before creating a database, we need to set credentials such as username, password, and database name. There are various ways to do this, but the simplest is to set them as environment variables — some might argue this is a best practice. Anyhow, let us look at a first approach, which is a ConfigMap.

The following is a k8s object called ConfigMap, named ‘db-credentials’. This is the name that identifies this particular ConfigMap object. You will see a lot of yaml files from now on. These files are written in a declarative way that states what we want the final picture to look like — we declare that we are setting my_db as the database name, and so on.

# /k8s-devops/db/configmap.yaml

apiVersion: v1
kind: ConfigMap
metadata:
  name: db-credentials
  labels:
    app: postgresdb
data:
  POSTGRES_DB: "my_db"
  POSTGRES_USER: "user"
  POSTGRES_PASSWORD: "Password" # sensitive information should not be left unencrypted!
  PGDATA: "/var/lib/postgresql/data"

As you can see, there is a glaring issue here: the database password is in plain sight, so this file cannot be committed to a public repository.

We can use another k8s object called Secret. Any sensitive data, including passwords, tokens, or keys, can be stored in a Secret object. So we remove POSTGRES_PASSWORD from the ConfigMap, and we use a Secret object instead as follows:

# /k8s-devops/db/secrets.yaml

apiVersion: v1
kind: Secret
type: Opaque
metadata:
  name: postgres-secret-config
data:
  POSTGRES_PASSWORD: UGFzc3dvcmQ=

This looks better because the sensitive data is no longer in plain sight. However, the value is merely the base64-encoded password, produced with the echo -n "Password" | base64 command. That means anyone with access to the cluster and the right authorization will be able to decode the string. Try it using the following commands:

kubectl edit secret postgres-secret-config

returns

# Please edit the object below. Lines beginning with a '#' will be ignored, 
# and an empty file will abort the edit. If an error occurs while saving this file will be 
# reopened with the relevant failures. 
# 
apiVersion: v1 
data: 
  POSTGRES_PASSWORD: UGFzc3dvcmQ= 
kind: Secret
type: Opaque
metadata: 
  creationTimestamp: "2023-07-02T06:58:48Z" 
  name: postgres-secret-config 
  namespace: default
...

and

echo 'UGFzc3dvcmQ=' | base64 --decode

# returns
Password

Clearly this approach is unacceptable: not only is the password accessible within the cluster, but we cannot commit this to a public source control repository because anyone with access to the repository could read the password. Fortunately, this is not the end of the road. We can salvage the situation by going a step further and encrypting this Secret object with a third-party tool called SealedSecrets from Bitnami. As is often the case, there are other choices, such as Vault from HashiCorp and sops from Mozilla. To use SealedSecrets, we need to install its controller into our k8s cluster along with its accompanying cli called kubeseal.

Sealed Secrets

Install kubeseal to /usr/local/bin/kubeseal.

mkdir -p ~/Downloads/kubeseal
cd ~/Downloads/kubeseal
wget https://github.com/bitnami-labs/sealed-secrets/releases/download/v0.22.0/kubeseal-0.22.0-darwin-amd64.tar.gz
tar -xvzf kubeseal-0.22.0-darwin-amd64.tar.gz
sudo install -m 755 kubeseal /usr/local/bin/kubeseal 

Install SealedSecret controller by applying this controller.yaml file.

kubectl apply -f https://github.com/bitnami-labs/sealed-secrets/releases/download/v0.22.0/controller.yaml

Now we have everything we need (cli, controller, and Secret object) to create a SealedSecret object. The controller above was installed into kube-system, the default cluster controller namespace, which is why we point kubeseal at it below. The output format can be either json or yaml; we choose yaml to be consistent with the other files.

cat secrets.yaml | kubeseal \
--controller-namespace kube-system \
--controller-name sealed-secrets-controller \
--format yaml \
> sealed-secrets.yaml

Kubeseal uses a public key from this controller to encrypt the data. It can only be decrypted by the cluster because only this cluster has access to the public key’s corresponding private key.
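If you are curious, kubeseal can also fetch that public certificate so you can inspect it, or seal secrets offline without talking to the cluster. A small sketch using kubeseal’s --fetch-cert flag (the output file name is arbitrary):

kubeseal \
--controller-namespace kube-system \
--controller-name sealed-secrets-controller \
--fetch-cert > sealed-secrets-cert.pem

# Later, seal offline using the saved certificate:
kubeseal --cert sealed-secrets-cert.pem --format yaml < secrets.yaml > sealed-secrets.yaml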

The resulting sealed-secrets.yaml is as follows.

# /k8s-devops/db/sealed-secrets.yaml

apiVersion: bitnami.com/v1alpha1
kind: SealedSecret
metadata:
  creationTimestamp: null
  name: postgres-secret-config
  namespace: default
spec:
  encryptedData:
    POSTGRES_PASSWORD: AgCAJxd4LMUD9O18E5ytJ07a6IHuaDqeIDCi0GfByXQykSV+2zvK2e0xApIft6kjPHlEelmMerfSA2INDWLniFq/PKIvjwIcy7a5s1EVNLpzSK1VHzdSJBDunnOCIqxSL1G18+py4gtLrBYwhBrl+yCrArqPG3HjFwytP7uyH1wuymWKKcWIVR41WO8yyC5le9guAwdRIG2ltKft3d2w64ayRpEW0JNrgeLktlS4wvsAgPKFvU2SqKBLMK2eShFoEmbng35xqo9Zu98nzq/iDocExfjAxWUlVDm6O2d42WKKOyrVZ8uNeZvgHlWA7NkfFucOVeI4QrEc7duKB5sYTG8ipDOT3u6r9iTCbw4lQG9JRlWqKnJWq9Wk7pwrSWOklBBxj+sYAPh6BPL4fDm9PXjKfkSbt0x5qIz9KfRYQmmslKtH7ym+WXQAx1H/Ba91k5e5Eh+2437nqbJZYNmqdfV0U5xodOdgC5gKZc4+yr9shmj24FaPo8tmJupDiO8+4014Q0rjJx78b/bahJWw0BzrgWO/xD2HlyeboJXMi4cch0IrIbSei3V6UaIUd61ybIJkyJKiY6QQesR0pCQQcOP1wLxEz3cMZBY0PxAzpaRVDqRlCxBLcxD8sVZYxTl0R0G9exhBUyl9hsPYPP6Zvnw+2DlIddnuRID5ryfpleh+7K4GU4JHVpSBrEFuEuzWhP/v0KfkVOMkVsG2VQ==
  template:
    metadata:
      creationTimestamp: null
      name: postgres-secret-config
      namespace: default

Looking good so far. The value of .spec.encryptedData.POSTGRES_PASSWORD is now encrypted. Let us apply this SealedSecret along with our ConfigMap.

kubectl apply -f configmap.yaml
kubectl apply -f sealed-secrets.yaml

If you inspect the resulting Secret, you will see POSTGRES_PASSWORD’s base64-encoded value in plain sight. Anyone who can access your cluster with the right permission can still view the password with the following command.

kubectl edit secret postgres-secret-config

returns…

# Please edit the object below. Lines beginning with a '#' will be ignored, 
# and an empty file will abort the edit. If an error occurs while saving this file will be 
# reopened with the relevant failures. 
# 
apiVersion: v1 
data: 
  POSTGRES_PASSWORD: UGFzc3dvcmQ= 
kind: Secret
type: Opaque
metadata: 
  creationTimestamp: "2023-07-02T06:58:48Z" 
  name: postgres-secret-config 
  namespace: default 
  ...

… and we get its base64-encoded string. Are we back to square one? Let us say that you, the devops engineer, are not the only person who can access this cluster. You might allow a developer to also access it so that they can deploy their api. However, you may not want to expose the database password to them. If that is the case, you may choose an authorization system like RBAC to restrict access.

Remember that our api server itself will get access to the database password, because otherwise the api cannot talk to the database and store data! Since the Secret object cannot be accessed by developers, one thing they could try is to sneak in code that reads the password and sends it elsewhere — if code review does not manage to flag that, of course! If it is okay for the developer to know the production database password, none of this matters, SealedSecret is sufficient, and you can let the developer access the cluster normally.

In conclusion, you need to create a threat model assessment before deciding what to do. There is a lot more to the threat modelling process, so whatever security decisions you make need to be made deliberately.

Volumes

We need to create storage space for the database. We will create two objects: a PersistentVolume (PV) and a PersistentVolumeClaim (PVC). A PersistentVolume is how we create a ‘space’ for the data to live in. A PersistentVolumeClaim, on the other hand, is how we request a portion of space from that PersistentVolume. We will only provision 5Gi, which is more than enough for our purpose. hostPath uses the /mnt/data directory on one of the nodes to emulate network-attached storage. This is a bad idea in production because if you lose this node, all data is gone. For a production setup, use proper external storage like EBS or NFS. More details are in Kubernetes’ PV documentation.

# /k8s-devops/db/pv.yaml

apiVersion: v1
kind: PersistentVolume
metadata:
  name: postgres-pv-volume
  labels:
    type: local
spec:
  storageClassName: manual
  capacity:
    storage: 5Gi
  accessModes:
    - ReadWriteMany
  hostPath:
    path: "/mnt/data"

Next we claim that much space using PersistentVolumeClaim. Claiming 5Gi means we have exhausted all available storage.

# /k8s-devops/db/pvc.yaml

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: postgres-pv-claim
spec:
  storageClassName: manual
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 5Gi

Note the PersistentVolumeClaim name, ‘postgres-pv-claim’; it is important because our next object, the Deployment, refers to it.

Apply with

kubectl apply -f pv.yaml
kubectl get pv postgres-pv-volume 

# returns
NAME                 CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS      CLAIM   STORAGECLASS   REASON   AGE
postgres-pv-volume   5Gi        RWX            Retain           Available           manual                  5s

Notice that its status is ‘Available’. We will see how that changes once our claim succeeds.

kubectl apply -f pvc.yaml
kubectl get pv postgres-pv-volume 

# returns
NAME                 CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM                       STORAGECLASS   REASON   AGE
postgres-pv-volume   5Gi        RWX            Retain           Bound    default/postgres-pv-claim   manual                  45s

You can see it has changed to ‘Bound’. The ‘Claim’ column now shows which PersistentVolumeClaim has claimed it.
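You can verify the same from the claim’s side as well. The ages will differ, but expect output along these lines:

kubectl get pvc postgres-pv-claim

# returns
NAME                STATUS   VOLUME               CAPACITY   ACCESS MODES   STORAGECLASS   AGE
postgres-pv-claim   Bound    postgres-pv-volume   5Gi        RWX            manual         30s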

HostPath

Another important thing to remember, especially when using hostPath, is that the Postgres data will persist even when you delete your PersistentVolume. Verify with

kubectl delete pvc postgres-pv-claim
kubectl delete pv postgres-pv-volume

Find out which node the PersistentVolume stored the data on and navigate to /mnt/data (see the PersistentVolume above).

# SSH into each node
ssh kubeadmin@172.16.16.20{1,2,3} # IP address depends on how you create or access your cluster. 
sudo ls -la /mnt/data

In my particular instance, the PersistentVolume mount was created on node 172.16.16.201:

kubeadmin@kworker1:/$ sudo ls -la /mnt/data/
total 132
drwx------ 19  999 root             4096 Jul 15 01:12 .
drwxr-xr-x  3 root root             4096 Jul 14 16:14 ..
-rw-------  1  999 systemd-journal     3 Jul 14 16:14 PG_VERSION
drwx------  6  999 systemd-journal  4096 Jul 14 16:14 base
drwx------  2  999 systemd-journal  4096 Jul 15 01:07 global
drwx------  2  999 systemd-journal  4096 Jul 14 16:14 pg_commit_ts
drwx------  2  999 systemd-journal  4096 Jul 14 16:14 pg_dynshmem
-rw-------  1  999 systemd-journal  4821 Jul 14 16:14 pg_hba.conf
-rw-------  1  999 systemd-journal  1636 Jul 14 16:14 pg_ident.conf
drwx------  4  999 systemd-journal  4096 Jul 15 01:12 pg_logical
drwx------  4  999 systemd-journal  4096 Jul 14 16:14 pg_multixact
drwx------  2  999 systemd-journal  4096 Jul 14 16:14 pg_notify
drwx------  2  999 systemd-journal  4096 Jul 14 16:14 pg_replslot
drwx------  2  999 systemd-journal  4096 Jul 14 16:14 pg_serial
drwx------  2  999 systemd-journal  4096 Jul 14 16:14 pg_snapshots
drwx------  2  999 systemd-journal  4096 Jul 15 01:12 pg_stat
drwx------  2  999 systemd-journal  4096 Jul 14 16:14 pg_stat_tmp
drwx------  2  999 systemd-journal  4096 Jul 14 16:14 pg_subtrans
drwx------  2  999 systemd-journal  4096 Jul 14 16:14 pg_tblspc
drwx------  2  999 systemd-journal  4096 Jul 14 16:14 pg_twophase
drwx------  3  999 systemd-journal  4096 Jul 14 16:14 pg_wal
drwx------  2  999 systemd-journal  4096 Jul 14 16:14 pg_xact
-rw-------  1  999 systemd-journal    88 Jul 14 16:14 postgresql.auto.conf
-rw-------  1  999 systemd-journal 29525 Jul 14 16:14 postgresql.conf
-rw-------  1  999 systemd-journal    36 Jul 15 01:06 postmaster.opts

The data is still here even though the PersistentVolume was already deleted. Keep this in mind when updating the password in your Secret object, because the password needs to be updated within Postgres as well.

Deployment

The last two remaining objects are Deployment and Service. A Deployment runs our database as a Pod, while a Service exposes the application so other apps in the cluster can communicate with it. A Pod is the smallest worker unit in your cluster. Inside, you can have one or more containers; if you have more than one, these containers are always scheduled together as a single unit, sharing resources. To scale your application up, you increase the number of replicas in your Deployment, and the number of Pods will match the replicas you set.
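As a sketch of what scaling looks like (do not scale this particular Postgres Deployment — multiple Pods writing to one hostPath data directory will corrupt it; the stateless api Deployment we create later, named ‘server’, is a safer target):

kubectl scale deployment server --replicas=5

# or declaratively: edit 'replicas' in the yaml and re-apply
kubectl apply -f deployment.yaml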

Here we name our Deployment ‘postgres’ with a single Pod, enforced by replicas: 1; so only one Pod for this one Deployment. The container image is postgres:16.0. I like to explicitly state the major and minor version instead of using the latest tag because it makes things easier to reproduce when you run into an issue.

For configuration values, we load them from environment variables in two ways. .spec.template.spec.containers[0].envFrom loads our ConfigMap object by name while .spec.template.spec.containers[0].env loads our database password from the Secret object.

#  /k8s-devops/db/deployment.yaml

apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: postgres
  name: postgres
spec:
  replicas: 1
  selector:
    matchLabels:
      app: postgres
  template:
    metadata:
      labels:
        app: postgres
    spec:
      containers:
        - image: postgres:16.0
          name: postgres
          imagePullPolicy: IfNotPresent
          ports:
            - containerPort: 5432
          envFrom:
            - configMapRef:
                name: db-credentials
          env:
            - name: POSTGRES_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: postgres-secret-config
                  key: POSTGRES_PASSWORD
          volumeMounts:
            - mountPath: /var/lib/postgresql/data
              name: postgres-pv-storage

      volumes:
        - name: postgres-pv-storage
          persistentVolumeClaim:
            claimName: postgres-pv-claim

The .spec.template.spec.containers[0].env[0].name has to match what Postgres expects as an environment variable, which is POSTGRES_PASSWORD. The .valueFrom.secretKeyRef.key, on the other hand, although it happens to be the same string, has to match what we have in sealed-secrets.yaml:

#  /k8s-devops/db/deployment.yaml

env:
  - name: POSTGRES_PASSWORD
    valueFrom:
      secretKeyRef:
        name: postgres-secret-config
        key: POSTGRES_PASSWORD # <- must match with…

#  /k8s-devops/db/sealed-secrets.yaml

spec:
  encryptedData:
    POSTGRES_PASSWORD: AgCAJxd4LMUD9O18E5yt... # <- …this

Next, Postgres’ default volume mount, /var/lib/postgresql/data, will use the PersistentVolumeClaim we have already created.

Apply with

kubectl apply -f deployment.yaml

Check that your database is up by using kubectl’s get subcommand followed by the object (Deployment) and its name (postgres).

kubectl get deployment postgres

# returns
NAME       READY   UP-TO-DATE   AVAILABLE   AGE
postgres   1/1     1            1           3m49s

If we check how many Pods we have, we only get one, as expected due to replicas: 1.

kubectl get po
 
# returns
NAME                        READY   STATUS    RESTARTS   AGE
postgres-5598db5bb4-jznqd   1/1     Running   0          11m

Service

A Service object is how Pods and Deployments can be accessed. It creates a corresponding DNS entry for other applications to refer to. The DNS entry is created in the format <service-name>.<namespace-name>.svc.cluster.local, which means ours will be postgres.default.svc.cluster.local.

The Service name can be arbitrary, but has been set as ‘postgres’ like the Deployment name, just to make it confusing.

There are four types of Services: NodePort, ClusterIP, LoadBalancer, and ExternalName. If not specified, it defaults to ClusterIP. ClusterIP means this Service is mapped to an IP address reachable across nodes, but only available within the cluster. NodePort makes it available to the outside by exposing a static port on each node, then forwards requests to the ClusterIP. The LoadBalancer type is used with an external load balancer and an Ingress object, or with managed k8s clusters such as Amazon’s or Google’s. ExternalName maps a Service to a DNS name.

#  /k8s-devops/db/service.yaml

apiVersion: v1
kind: Service
metadata:
  name: postgres
  labels:
    app: postgres
spec:
  type: ClusterIP
  ports:
    - port: 5432
  selector:
    app: postgres

The port is 5432, Postgres’ default port number. Of course, this setting can be changed using the ConfigMap technique we have seen.

Apply with

kubectl apply -f service.yaml

Check with

kubectl get svc

# returns
NAME         TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)    AGE
kubernetes   ClusterIP   10.96.0.1       <none>        443/TCP    10d
postgres     ClusterIP   10.105.21.101   <none>        5432/TCP   3s

At this stage, the database can be accessed within the cluster. I think it is important to attempt this because it can dispel some of the magic of k8s networking, so let us do that. The IP address 10.105.21.101 we see above is curious, so maybe we can do something with it.

Let us try a simple netcat to show that the database listens on the address and port we think it does. First, SSH into one of the worker nodes. If you followed my previous post, it is one of 172.16.16.20{1,2,3}.

ssh kubeadmin@172.16.16.201

If we run the netcat program (nc) with the verbose output flag,

nc -v 10.105.21.101 5432

# returns
10.105.21.101: inverse host lookup failed: Unknown host
(UNKNOWN) [10.105.21.101] 5432 (postgresql) open

Only the last line is relevant, and it shows that port 5432 is indeed open. If you SSH into other nodes, it will show the same. This is because we defined a Service with the ClusterIP type — this single IP can be used to access our database from anywhere in the cluster. In reality, however, our (micro)services are not going to reference this IP address because it can always change. Instead, we use its DNS entry, postgres.default.svc.cluster.local. The short version, postgres, can be used as long as you are calling from the same default namespace.
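To see the DNS entry for ourselves, we can resolve it from a throwaway Pod inside the cluster. A minimal sketch — the busybox image tag is an assumption; any image that ships nslookup will do:

kubectl run dns-test \
    --rm \
    -i \
    --tty \
    --restart=Never \
    --image busybox:1.36 \
    -- nslookup postgres.default.svc.cluster.local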

~~

Using netcat was a success, but it only shows the port is open. Stronger evidence that this is working is to connect using a Postgres client like psql. The worker nodes do not have such a client installed, but we can run a one-off Pod that runs the psql command and attempts to log in. We need to retrieve the database password and store it in a variable. The secret is base64-encoded, so we pipe it through a decoder.

POSTGRES_PASSWORD=$(kubectl get secret --namespace default postgres-secret-config -o jsonpath="{.data.POSTGRES_PASSWORD}" | base64 --decode)

Without -o jsonpath..., it would return the full JSON document. We traverse it the k8s way with {.data.POSTGRES_PASSWORD}. It is similar to jq syntax, but with a leading dot.

Next, we run a Pod named psql-client that deletes itself after it exits, thanks to --rm and --restart='Never'. The critical part is setting the password in the PGPASSWORD environment variable along with the other parameters. The user, database name, and port are what we expect. What is interesting is the host. We did not specify any namespace in the command below, so implicitly we are running under the default namespace. So to access the database, we simply use the name of the Service, postgres. Accessing the database by this DNS name is the important step mentioned at the beginning of this post.

kubectl run psql-client \
    --rm \
    --tty \
    -i \
    --restart='Never' \
    --namespace default \
    --image postgres:16.0 \
    --env="PGPASSWORD=$POSTGRES_PASSWORD" \
    --command -- psql --host postgres -U user -d my_db -p 5432

You will see the psql prompt. Success!

If you don't see a command prompt, try pressing enter.
my_db=# 

From here, you can try out a command

my_db=# SELECT TRUE;
 ?column? 
----------
 t
(1 row)

Exiting (type \q) will delete the Pod

my_db=# \q
pod "psql-client" deleted

Should You Use A Database in Kubernetes?

The big question now is whether it is a good idea to use this database. For demonstration purposes, yes. But there are many considerations before running a database inside a cluster. Data needs to persist such that it is not gone the next time we recreate our cluster. In the same vein, backing up and restoring are important aspects of database management. It also needs to be scalable, whether you opt for high availability (HA) or separate write and read nodes. You also want a monitoring solution so you know if slow queries are bringing down your applications. Many times it is easier to use a managed database solution than to host your own.

Things to consider when self-hosting a database

  • Can Backup AND Verify Restore
    • Includes stored procedures, triggers
    • A backup is useless if a restore has not been verified
  • Upgrade with zero downtime
    • Minor and major version
  • Security and TLS
  • Scale horizontally
    • Replicas
    • Read and Write Nodes
  • High Availability
    • Failover
    • Uptime - five nines
  • Monitoring
  • Cloning
  • Connection Pooling
  • Partitioning / Sharding

Nevertheless, deploying a proper database in k8s is certainly possible. There are several projects that have the above features integrated, like CrunchyData Postgres Operator, Zalando, CloudNativePG, and others.

We did not go full throttle with those solutions, but deploying this way gave us plenty of opportunities to learn various k8s objects, particularly PersistentVolume, PersistentVolumeClaim, and Secret.

Backend

Our api backend is going to listen to and process requests from the frontend, so this part needs to be port-forwarded. Within the cluster, the api needs to communicate with the database, and from the previous section we already know how: we are not going to use its IP address because that can change; instead, we use its DNS entry, postgres. Lastly, a database migration needs to happen because no table has been created yet.

Why Go

The backend is going to be a Go program, just to simplify things. A Go program compiles to a single self-contained, statically linked binary. That means you can simply hand the binary’s path to systemd, or even type ./api, and it will run without any external dependencies. Go’s built-in standard library is extensive and covers many use cases, meaning that for our simple api, we do not need many third-party libraries or any frameworks to create a production-grade api server. In fact, this api only uses two external libraries — one for the Postgres database driver, and another for database migration. Everything else, including the router multiplexer (mux), CORS middleware, reading config, graceful shutdown, and the HTTP server, is handled by the standard library. To be fair though, this api is a simple program with no authentication or authorization, no input validation, no caching, and no tests.

Only snippets are shown here. For the full code, please visit https://github.com/gmhafiz/k8s-api

Entrypoint

The api’s entry point is located at cmd/api/main.go. We create a new server that reads configuration from environment variables, sets up the router, and connects to the database. We have three routes: a /healthz endpoint to see if the api is up, a /ready endpoint to see if it has a database connection by running SELECT true;, and a /randoms endpoint that lists the latest ten UUIDs as well as creates a new one. We also have CORS middleware set up because the origin address and port of the frontend (:8080) are going to be different from this api’s (:3080).

Container Probes

// /k8s-api/cmd/api/main.go

func main() {
	ctx := context.Background()
	srv := app.New(ctx)

	srv.Mux.Handle("/healthz", recovery(cors(http.HandlerFunc(srv.Healthz))))
	srv.Mux.Handle("/ready", recovery(cors(http.HandlerFunc(srv.Ready))))
	srv.Mux.Handle("/randoms", recovery(cors(http.HandlerFunc(srv.Randoms))))

The two endpoints, /healthz and /ready, are typically found in an api program meant to be used with k8s. In a Pod lifecycle, a Pod is in a Running state when the Pod has been bound to a node, all containers have been created (we only have one), and at least one container is running. These endpoints are used by container probes as extra feedback to let k8s know if this Pod is ready to accept user requests. A particular Pod may not have a connection to the database, in which case we would prefer to route incoming requests to another Pod that is ready. K8s queries the /ready endpoint (which does SELECT true;) to find out whether it should route requests to the Pod or not. The /healthz endpoint, on the other hand, is used in conjunction with a livenessProbe. This probe indicates whether the container (our api) is running. If the api crashes, the liveness probe will fail, and k8s will kill the container. We designed our api so that it can recover from a panic (it’s like an exception, but in Go), so hopefully a crash will not happen. More details on the types of probes are in the k8s documentation.
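The handlers themselves can be small. Here is a hedged sketch of what a readiness handler might look like — not necessarily the repo’s exact code — assuming a server struct that holds a pgxpool connection pool:

// A sketch of a readiness handler; s.DB is assumed to be a *pgxpool.Pool.
func (s *Server) Ready(w http.ResponseWriter, r *http.Request) {
	var ok bool
	// Probe the database. If this fails, the readiness probe fails and
	// k8s stops routing traffic to this Pod until it passes again.
	if err := s.DB.QueryRow(r.Context(), "SELECT true;").Scan(&ok); err != nil {
		w.WriteHeader(http.StatusServiceUnavailable)
		return
	}
	w.WriteHeader(http.StatusOK)
}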

Environment Variables

The app connects to the database using a connection pool. All of these settings are read from environment variables.

// /k8s-api/database.go

func NewDB(ctx context.Context, cfg Database) (*pgxpool.Pool, error) {
	dsn := fmt.Sprintf("postgres://%s:%d/%s?sslmode=%s&user=%s&password=%s",
		cfg.Host,
		cfg.Port,
		cfg.Name,
		cfg.SslMode,
		cfg.User,
		cfg.Pass,
	)

	return pgxpool.New(ctx, dsn)
}

Moving on, the api listens for requests at the address host:port via server.ListenAndServe(), run in a separate goroutine by calling it inside an anonymous function prefixed with the go keyword.

// /k8s-api/cmd/api/main.go

...continue from main()

addr := fmt.Sprintf("%s:%d", srv.Api.Host, srv.Api.Port)
log.Printf("running api at %v\n", addr)

server := &http.Server{Addr: addr, Handler: srv.Mux}
go func() {
    if err := server.ListenAndServe(); !errors.Is(err, http.ErrServerClosed) {
        log.Fatalf("ListenAndServe(): %v", err)
    }
}()

Graceful Shutdown

Imagine our api is chugging along handling traffic normally, and we want to push an update. We create a new commit, push to a repository, an image is built, and now k8s wants to deploy this new version to the cluster. K8s creates new Pods with the new version, and for a while traffic goes to Pods containing both old and new versions of the api. Once all (or some, depending on the deployment strategy) new Pods are up, old Pods are shut down. Two questions need to be asked: what happens to requests that are still being processed when k8s wants to shut a Pod down? And does the api not need to free (close) its database connections so others can use them?

If the Pods are killed abruptly to make way for new Pods, requests that have not completed will be terminated immediately and users will get an error. This is definitely not desirable behaviour, and we need to remedy it.

When k8s wants to shut a Pod down, it first sends the operating system signal SIGTERM to the Pod, which marks the beginning of a default 30-second grace period. We need the api to capture this signal so it knows not to accept any more requests and to begin shutting down, taking the chance to release remaining resources: the database connection, in-flight HTTP calls, opened files, and so on.

To implement graceful shutdown, we create a buffered channel, stop, to catch the terminate (SIGTERM) and interrupt (SIGINT) operating system signals. (SIGSTOP, like SIGKILL, cannot be caught by a process, so there is no point registering it.) Once the channel receives one of these signals, the line <-stop unblocks and the subsequent lines run. First, server.Shutdown() runs with a timeout we define as five seconds; it stops the server from accepting new requests and waits for in-flight ones to finish. If everything goes well, the database connection is closed next, and finally the api exits and the Pod is removed.

...continue from above

stop := make(chan os.Signal, 1)
// Note: SIGSTOP and SIGKILL cannot be caught, so we listen only for SIGINT and SIGTERM.
signal.Notify(stop, syscall.SIGINT, syscall.SIGTERM)

<-stop

ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
defer cancel()
if err := server.Shutdown(ctx); err != nil {
    log.Printf("error while shutting down: %v\n", err)
}

srv.DB.Close()

However, if the process has not exited by the end of the default 30-second grace period, k8s does a hard kill using SIGKILL. This signal cannot be caught by the api process, so any remaining resources cannot be closed gracefully. This is why it is important to handle graceful shutdown in the api ourselves.

If we zoom out and look at how k8s handles the termination lifecycle, k8s immediately sets the Pod’s state to ‘Terminating’ and stops sending any more traffic to it. This is great: k8s already does one job for us. Closing resources is still necessary, though.

Note: In applications where nginx sits in front of an api, nginx will intercept that signal and perform a graceful shutdown. However, the api still needs to close all its other resources like database connections, network connections, and so on.

Note: Requests that take longer than the grace period will be cancelled, causing a bad user experience. You can extend the grace period, but evaluate whether doing so is the correct approach.

Trivia: Ever had to use Ctrl+C to stop things in the terminal? You are actually sending a SIGINT signal! Ctrl+Z sends SIGTSTP, a stop signal that, unlike SIGSTOP, can be caught.

Migration

For database migration, we support both one-level ‘up’ and ‘down’ operations. Which operation to run is determined by a flag that accepts either ‘up’ or ‘down’. We will see that, through Kubernetes, we can pass command-line flags to set the direction of the migration.

//go:embed migrations/*.sql
var embedMigrations embed.FS

func (m *Migrate) Up() {
	if err := goose.Up(m.DB, "migrations"); err != nil {
		log.Panic(err)
	}
	...
}

func (m *Migrate) Down() {
	if err := goose.Down(m.DB, "migrations"); err != nil {
		log.Panic(err)
	}
    ...
}
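Goose discovers the embedded .sql files and applies them in order. For reference, a goose migration file is a plain sql file with special comment markers. A hypothetical one for our table might look like this — the column definitions are illustrative only; the real schema lives in the k8s-api repo:

-- migrations/20230302080119_create_randoms_table.sql

-- +goose Up
CREATE TABLE randoms (
    id   BIGSERIAL PRIMARY KEY,
    uuid UUID NOT NULL
);

-- +goose Down
DROP TABLE randoms;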

All in all, we have built a simple production-grade api server that can list and create records, run migrations, and shut down gracefully, using only Go’s excellent standard library plus two external dependencies, and without any HTTP server like nginx in front of it.

Tagging

After talking at great length about the api design, we can move on to building container images. We will make a new image for every git commit so that we can differentiate one image from another. One easy way is to use tags, which are simply strings we can use to identify a particular commit or build. We need them to be unique-ish so there are no name conflicts when choosing which image k8s is going to pull. There are three (but not limited to) ways to construct this string. Firstly, we can tag the image the same as our api’s version; for example, if our api version is v1.2.3, we tag the image the same. Secondly, we can simply use the api’s latest git commit. Thirdly, a combination of the first two. Here, I am showing the second approach.

TAG=$(git rev-parse HEAD)
echo $TAG

This yields a SHA hash that looks like this: 334ed70d2fcbdb6936eaef84b7385fab8b545b0a. Some people like the shorter version, which can be had by adding the --short parameter.

TAG=$(git rev-parse --short HEAD)
echo $TAG

# returns 
334ed70

The short version can look nicer when combined with the api’s version, for example gmhafiz/api:v1.2.3-334ed70.
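If you want that third, combined approach, it could be scripted along these lines (the VERSION value here is an assumption — take it from wherever you track your api’s version):

VERSION=v1.2.3 # or e.g. VERSION=$(git describe --tags)
TAG="${VERSION}-$(git rev-parse --short HEAD)"
echo $TAG

# returns
v1.2.3-334ed70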

Once the image is tagged, we can easily reference that from our kubernetes Deployment object. But before going into that, let us take a step back and build an image.

Containerise

What is left is to turn this codebase into an image. We use a multi-stage technique so the final image contains only the necessary files, keeping it small. In the first stage, named build, we copy the dependency files go.{mod,sum} and run go mod download to retrieve third-party dependencies. For subsequent builds, if there are no modifications to these two files, no download is necessary because Docker simply uses its cache, making image builds much faster.

#  /k8s-api/Dockerfile-api

FROM golang:1.21 AS build

WORKDIR /go/src/app/

# Copy dependencies first to take advantage of Docker caching
COPY go.mod ./
COPY go.sum ./
RUN go mod download

To compile, copy all the files and run the go build command. The ldflags -w and -s strip debugging information because it is not needed in production. We name the final binary api using the -o flag.

#  /k8s-api/Dockerfile-api

COPY . .

# Build Go Binary
RUN CGO_ENABLED=0 GOOS=linux go build -ldflags="-w -s" -o ./api cmd/api/main.go

That completes the first stage. In the second stage, we pick a minimal base image called distroless. We could have used the scratch image, which contains nothing, because our Go binary is self-contained. However, the distroless image has the added convenience of CA certificates (for external TLS calls) and timezone data built in. Also, we use the nonroot tag so our api runs in a rootless container. We then copy the binary from the previous stage to /usr/local/bin/api and set it as the entrypoint.

#  /k8s-api/Dockerfile-api

FROM gcr.io/distroless/static-debian12:nonroot

LABEL com.example.maintainers="User <author@example.com>"

COPY --from=build /go/src/app/api /usr/local/bin/api

EXPOSE 3080

ENTRYPOINT ["/usr/local/bin/api"]

To build this image, we supply the path to this file using the -f flag; this is only necessary because we picked a file name other than Dockerfile. We name this image gmhafiz/api. The name must match the repository path we will see later; for now, we simply use the format <docker login name>/<repository name>. We also tag the image with its git SHA hash — without it, the tag defaults to latest. The dot at the end simply means the current directory.

docker build -f Dockerfile-api -t gmhafiz/api:$TAG .

Once the build is done, we can see the created image by running docker images command.

docker images | grep gmhafiz

# returns
gmhafiz/api    334ed70d2fcbdb6936eaef84b7385fab8b545b0a    799e8933a899   5 seconds ago   11.5MB

The resulting image contains only our single compiled binary plus what the distroless image provides — thanks to the multi-stage approach. There are no compile tools, operating system, shell, or other things not needed to run this api, resulting in a relatively small image of around 11 megabytes.

One other thing we need to do is create the database table and populate it with some random UUIDs. The migrate program’s Dockerfile is identical to the api’s except it builds a different binary, and thus has a different entrypoint.

#  /k8s-api/Dockerfile-migrate

FROM golang:1.21 AS build

WORKDIR /go/src/app/

COPY go.mod ./
COPY go.sum ./
RUN go mod download

COPY . .

RUN CGO_ENABLED=0 GOOS=linux go build -ldflags="-w -s" -o ./migrate cmd/migrate/main.go

FROM gcr.io/distroless/static-debian12:nonroot

LABEL com.example.maintainers="User <author@example.com>"

COPY --from=build /go/src/app/migrate /usr/local/bin/migrate

ENTRYPOINT ["/usr/local/bin/migrate"]

To build, pick a different name, like gmhafiz/migrate

docker build -f Dockerfile-migrate -t gmhafiz/migrate:$TAG .

Container registry

For our cluster to retrieve this image, it needs to be stored somewhere. We could buy a domain name and reverse-proxy to a container registry hosted on a server we own, or use ngrok if you do not want to buy a domain name yet. For now, the easiest way is to use a free solution like the Docker Hub registry. A registry is a special repository of images that k8s can pull from. There are many other hosted registries — GitHub, GitLab, the big three cloud providers — self-hosted options like quay.io, Docker Registry, and Harbor, peer-to-peer ones like Uber’s Kraken, and many others.

First, create a new repository from the web at https://hub.docker.com/repository/create. Note that the namespace and repository name match our image’s name, gmhafiz/api.

Create repository

Once the repository is created, you may push the image.

docker push gmhafiz/api:${TAG}

When you refresh the page, you will see that the image has been uploaded and is ready to be pulled by your cluster.

Show container image

Secrets

There is a reason why there are separate demo repositories at the top of this post. In this fictional case, the full-stack developer develops the frontend and backend but, fortunately, has no access to the production database. Cluster management, including managing the database password, is left to us, the devops. If you look in the applications/api directory, there is no Secret object; instead, the api references the same Secret object as the database.

The api requires credentials to the database for it to work. However, the backend developer is not privy to this information as it is sensitive; only the devops or database architect teams are. The database’s sealed-secrets.yaml file is encrypted, so the password cannot be retrieved by the api developer even when the file is hosted in a public repository. A potential issue arises when api developers are also allowed access to the k8s cluster. There can be many reasons for this: we might allow api developers to see if their api is deployed, or allow them to scale its Pods up or down. However, giving users full access to the cluster by sharing the cluster admin’s credentials also means the api developer can retrieve the database password, which we do not want. To prevent this, we need an authorization system. Fortunately, Kubernetes has one built in. Let us make a new user in the cluster with restricted access using Role-Based Access Control.

Note: Username and password are not the only way for the app to access the database. We want the connection to be encrypted, such as with TLS, which means creating a signed client certificate for the app. Even then, we still need to define variables like the database port and host name.

RBAC

With Role-Based Access Control (RBAC), we apply a rule saying an entity (dev007) has a role (dev) permitting it to perform an action (deploy a Deployment). All of this is done with two k8s objects called Role and RoleBinding. The way we design this follows k8s’ default behaviour, the Principle of Least Privilege: by default, everything is denied, and we explicitly allow specific rules (a whitelist).

Create User

The first step is to create a user. There is a great guide at https://docs.bitnami.com/tutorials/configure-rbac-in-your-kubernetes-cluster/ that I refer to a lot, so I will breeze through this part. What I show differs in that our dev007 is going to be in the same ‘default’ namespace, rather than a separate one.

To create a user, we make use of Public Key Infrastructure (PKI). There are alternatives to the PKI approach, like external auth with OPA Gatekeeper, Kyverno, OAuth, etc. The downside of PKI is that we cannot revoke a single user’s certificate without also revoking all other users’ access in the cluster.

The user dev007 needs to run a series of commands:

### Run by user 👷 dev007 only.

# Create a private key
openssl genrsa -out dev007.key 2048 # Very private, don't expose to anyone else except user dev007
chmod 400 dev007.key # ensure only dev007 has the permission to read this file.

# User dev007 creates a certificate signing request (CSR). The CN (Common Name) is the 
# important identity name.
openssl req \
    -new \
    -key dev007.key \
    -out dev007.csr \
    -subj "/CN=dev007"

The .csr file is sent to the cluster admin — us — to be signed, producing an X.509 client certificate.

### Run by 🦸🏽‍♂️ cluster admin

# Locate your cluster's CA (certificate authority) to sign this CSR.
# Location is in /etc/kubernetes/pki/ so you may need to SSH into one of the nodes.
# dev007 sends the csr file to cluster admin. And cluster admin signs it and returns dev007.crt
# file to that user.
openssl x509 \
  -req \
  -in dev007.csr \
  -CA CA_LOCATION/ca.crt \
  -CAkey CA_LOCATION/ca.key \
  -CAcreateserial \
  -out dev007.crt \
  -days 7 # Best practice is to keep it short, and renew.

We can inspect the content of the certificate with the following command and see its validity period.

openssl x509 -in dev007.crt -text -noout

# returns
Certificate:
    Data:
        Version: 1 (0x0)
        Serial Number:
            12:b5:db:df:38:6d:04:22:99:56:0b:ec:f4:54:c6:2d:a3:02:34:24
        Signature Algorithm: sha256WithRSAEncryption
        Issuer: CN = kubernetes
        Validity
            Not Before: Jul 16 04:09:46 2023 GMT
            Not After : Jul 23 04:09:46 2023 GMT
        Subject: CN = dev007
        Subject Public Key Info:
            Public Key Algorithm: rsaEncryption
                RSA Public-Key: (2048 bit)
    ...

Logging In (Authentication)

dev007 is able to log in using two files: the client certificate (dev007.crt) generated by the k8s admin, and the private key (dev007.key) generated by the user earlier. The first step is to create a new credential (new user) using the set-credentials command:

### Run by user 👷 dev007.

kubectl config set-credentials dev007 \
    --client-certificate=/home/dev007/.certs/dev007.crt \
    --client-key=/home/dev007/.certs/dev007.key

The cluster name is kubernetes if you followed my previous guide’s vanilla installation. If using minikube, the cluster name is minikube. Find the name by looking at your ~/.kube/config file.
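A quick way to list the cluster names kubectl knows about:

kubectl config get-clusters

# returns
NAME
kubernetes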

The user dev007 can set the context called ‘dev-context’ with the following:

### Run by user 👷 dev007.

kubectl config set-context dev-context --cluster=kubernetes --user=dev007

At this time, the user dev007 will not be able to access anything yet.

### Run by user 👷 dev007.

kubectl --context=dev-context get pods

# returns
Error from server (Forbidden): pods is forbidden: User "dev007" cannot list resource "pods" in API group "" in the namespace "default"

That is because the default behaviour is to deny all. You need to whitelist what the role that user dev007 belongs to is allowed to do.

Create RBAC Policy

We need two more objects: a Role and a RoleBinding. This Role, named ‘dev’, is in the current default namespace because we did not specify one. Under the rules array, the first item is apiGroups; the empty string "" indicates the core API group.

#  /k8s-devops/role.yaml

kind: Role
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: dev
rules:
- apiGroups: ["", "apps"]
  
  # we allow developers to see deployment, pods, and scale up or down 
  resources: ["deployments", "replicasets", "pods"]
  
  verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]

We also add apps to this apiGroups list. The ‘apiGroups’ documentation is probably hidden somewhere in the official docs — I cannot find it using the search box — but we can list what ‘apps’ covers with the following command. We can see that ‘ReplicaSet’ is among them.

kubectl api-resources | grep apps

# returns
controllerrevisions                  apps/v1                      true         ControllerRevision
daemonsets              ds           apps/v1                      true         DaemonSet
deployments             deploy       apps/v1                      true         Deployment
replicasets             rs           apps/v1                      true         ReplicaSet
statefulsets            sts          apps/v1                      true         StatefulSet

If you notice, I tend to use the short forms: po for pods, deploy for deployments, and so on. kubectl api-resources lists these short forms, which save some typing.

We want to allow dev007 to deploy, scale, and see Pods, so we add those to the ‘resources’ list. Finally, we whitelist what this role is allowed to do (the ‘verbs’). Notice that there is no Secret in the resources list; this prevents anyone with the ‘dev’ role from inspecting any Secret’s content.

To bind user dev007 to this role, we use a RoleBinding object. Since we only have one user so far, there is only one item in the .subjects array.

#  /k8s-devops/role-binding.yaml

kind: RoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: dev-role-binding
subjects:
- kind: User
  name: dev007
  apiGroup: ""
roleRef:
  kind: Role
  name: dev # must match with Role's metadata.name
  apiGroup: ""

As the cluster admin, apply both with

kubectl apply -f role.yaml
kubectl apply -f role-binding.yaml

# returns
role.rbac.authorization.k8s.io/dev created
rolebinding.rbac.authorization.k8s.io/dev-role-binding created

If user dev007 retries the get pods command,

kubectl --context=dev-context get pods

# returns
NAME                        READY   STATUS    RESTARTS      AGE
postgres-5598db5bb4-jznqd   1/1     Running   1 (88m ago)   7d1h

It returns the single database Pod, as expected, because the database has already been deployed. Now, let us try to access the Secret object.

kubectl --context=dev-context get secret

Error from server (Forbidden): secrets is forbidden: User "dev007" cannot list resource "secrets" in API group "" in the namespace "default"

As expected, the user cannot access the Secret object because the role ‘dev’ does not have access to it.
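As the cluster admin, you can also verify the policy without switching contexts by impersonating the user with kubectl auth can-i:

kubectl auth can-i list secrets --as=dev007

# returns
no

kubectl auth can-i list pods --as=dev007

# returns
yes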

This concludes the Secrets section. In summary, this is what we have:

User name    : dev007
context      : dev-context
role         : dev
role-binding : dev-role-binding

API Settings

That was a long detour from the Secrets section. Jumping back to the api: from its point of view, it needs environment variables for the database password, the api’s host, and the api’s port. We have already created a SealedSecret for the database password, so we can reuse that. For the rest of the api settings, we create a new ConfigMap. Remember that the api developer has no idea what these values are; only the k8s devops person knows and manages them.

The API_PORT environment variable is going to be the integer 3080. The database port, on the other hand, is going to be the integer 5432. The rest are the database’s user, name, and host. Notice that the DB_HOST value is ‘postgres’, which is the DNS entry for the postgres Service.

# As devops
# /k8s-devops/api/configmap.yaml

apiVersion: v1
kind: ConfigMap
metadata:
  name: api-parameters
  labels:
    app: api
data:
  # to read a value as integer, use its hex value from ascii command
  API_PORT: "\x33\x30\x38\x30"

  DB_USER: "user"
  DB_NAME: "my_db"

  DB_PORT: "\x35\x34\x33\x32"

  DB_HOST: "postgres"

In both API_PORT and DB_PORT, you can see some strange encoding. A ConfigMap can only store strings — it cannot store an integer or a boolean. Our api, however, cannot accept a string where it expects an integer, because Go is a statically typed language. We could modify our api to read the ports as strings, then convert them to integers. We could consider putting DB_PORT into our Secret; the same goes for API_PORT, although it is not really a secret. Other ways include using a Helm chart and injecting the environment variables. A ConfigMap also allows us to embed a file, so perhaps our api could parse a yaml file instead of reading configuration from environment variables.

All those options are a headache, so one trick we can use is to take the values from an ASCII table with the hexadecimal column switched on, for example the output of the ascii utility.

Figure: an ASCII table showing the hexadecimal value alongside each character

As you can see from the hex column, the encoding for 0 (zero) is hex 30. For the number 5, go down the column and you will see it is 35. To construct a series of digits, we escape each one as a hex literal using \x and pick the numbers.

Decimal    Result
5          \x35
4          \x34
3          \x33
2          \x32

Thus, the integer 5432 then becomes \x35\x34\x33\x32.

Apply with

kubectl apply -f configmap.yaml

At this stage, you should have three ConfigMaps in the cluster.

kubectl get cm

# returns

NAME               DATA   AGE
api-parameters     5      18s
db-credentials     2      25m
kube-root-ca.crt   1      7d11h

Deployment

Now that we have both the ConfigMap and Secret applied, we can create a Deployment. We name it ‘server’ with three replicas, so there are going to be three Pods.

The .spec.strategy specifies how old Pods are replaced with new ones. We did not specify one when deploying our database, so it defaulted to rollingUpdate. The other built-in strategy is recreate; patterns like blue-green, canary, and A/B deployments can be layered on top. With rollingUpdate and three Pods, upon a new deploy, k8s will spin up one new Pod and wait for it to be ready before replacing the next. This is because the maximum surge is 25% — in other words, it replaces a rolling 25% of the Pods at a time. This also means that while a deployment is underway, requests will go to either old or new Pods.

Then we specify an array of containers. We only have one, named ‘api’, using the gmhafiz/api image tagged with our latest git commit hash. When we want to update our deployment, this is where we edit in order to push the latest code to production!

Resources are important. Here we limit memory consumption to 128Mi, and cpu to 500m (half a core).

Then we tell k8s that in order to check if the api is ready to receive requests, it should query the /ready endpoint at port 3080. This is done by the .spec.template.spec.containers[].readinessProbe setting.

The last two remaining parts are the configuration values. The database password is loaded from the ‘postgres-secret-config’ Secret, while all other api configurations are loaded from the ‘api-parameters’ ConfigMap.

#  /k8s-devops/api/server.yaml

apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: server
  name: server
spec:
  replicas: 3
  selector:
    matchLabels:
      app: server
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
  template:
    metadata:
      labels:
        app: server
    spec:
      containers:
      - image: gmhafiz/api:6be26935ae705e3b1ff2b3e6210eb98b9182f0bf
        name: api
        resources:
          limits:
            cpu: "500m"
            memory: "128Mi"
        readinessProbe:
          httpGet:
            path: /ready
            port: 3080
          successThreshold: 3
        envFrom:
          - configMapRef:
              name: api-parameters
        env:
          - name: DB_PASS
            valueFrom:
              secretKeyRef:
                name: postgres-secret-config
                key: POSTGRES_PASSWORD

Although the key to the password is POSTGRES_PASSWORD (which is what Postgres expects), we can rename it to what our api expects, DB_PASS.

Apply with

kubectl apply -f server.yaml

and check with kubectl get deploy; you will see three replicas ready and available:

kubectl get deploy

# returns
NAME       READY   UP-TO-DATE   AVAILABLE   AGE
postgres   1/1     1            1           5d23h
server     3/3     3            3           6d15h
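Later, when you edit the image tag and re-apply, you can watch the rollingUpdate strategy do its thing, and roll back if the new version misbehaves:

kubectl rollout status deployment/server

# if something is wrong with the new version
kubectl rollout undo deployment/server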

Migrate Up

Our api is now deployed and can connect to the database. However, the database has no tables yet. We could SSH into the cluster and use the one-off Pod trick from the database section to connect and manually paste in our migration sql file. But that approach is imperative: you need to remember many commands, with manual copying and pasting.

A more elegant solution is to run a one-off Job that runs the migrate command. So let us prepare its container image first.

# k8s-api/Dockerfile-migrate

FROM golang:1.21 AS build

WORKDIR /go/src/app/

COPY go.mod ./
COPY go.sum ./
RUN go mod download

COPY . .

RUN CGO_ENABLED=0 GOOS=linux go build -ldflags="-w -s" -o ./migrate cmd/migrate/main.go

FROM gcr.io/distroless/static-debian12:nonroot

LABEL com.example.maintainers="User <author@example.com>"

COPY --from=build /go/src/app/migrate /usr/local/bin/migrate

ENTRYPOINT ["/usr/local/bin/migrate"]

As you can see, our migration image is very similar to the api’s. The ENTRYPOINT at the end runs the migrate binary. If only that command is supplied, it does the default ‘up’ migration. If we supply the -go flag, we can migrate in the other direction, ‘down’.

// k8s-api/cmd/migrate/main.go

package main

import (
	"flag"
	"log"

	// Import path is illustrative; use the api repository's actual module path.
	"github.com/gmhafiz/k8s-api/internal/app"
)

func main() {
	log.Println("starting migrate...")

	var migrateCommand string
	flag.StringVar(&migrateCommand, "go", "up", "-go up")
	flag.Parse()

	cfg := app.Config()
	db := app.DB(cfg.Database)

	migrator := app.Migrator(db)

	if migrateCommand == "up" {
		migrator.Up()
	} else if migrateCommand == "down" {
		migrator.Down()
	} else {
		log.Println("operation not supported")
	}
}
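
Before wiring this into a Job, the image can be sanity-checked locally. A sketch, assuming the database settings are supplied as environment variables; DB_HOST is illustrative here, use whichever keys app.Config actually reads:

# DB_HOST is illustrative; match the env keys app.Config reads
docker run --rm \
  -e DB_HOST=localhost \
  -e DB_PASS=secret \
  gmhafiz/migrate:$TAG -go down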

Since both the api and migrate binaries share the same tag, there is an undesirable side effect: the api tends to be updated more often, which means its tag changes often, and the migrate image has to follow suit even when nothing in it has changed.

docker build -f Dockerfile-migrate -t gmhafiz/migrate:$TAG .
docker push gmhafiz/migrate:$TAG

As you can see from the code above, it also needs database credentials, which are read from environment variables. So in the Job manifest below, under the ‘migrate’ container section, we put in the same ‘envFrom’ and ‘env’ to pull values from the ConfigMap and the Secret respectively. The api developers never have to know the database password; only you as the devops do.

# /k8s-devops/api/migrate-job.yaml

apiVersion: batch/v1
kind: Job
metadata:
  name: migrate-job
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: migrate
          image: gmhafiz/migrate:6be26935ae705e3b1ff2b3e6210eb98b9182f0bf
          envFrom:
            - configMapRef:
                name: api-parameters
          env:
            - name: DB_PASS
              valueFrom:
                secretKeyRef:
                  name: postgres-secret-config
                  key: POSTGRES_PASSWORD

Another critical part of this Job is that restartPolicy needs to be ‘Never’. The only other option is ‘OnFailure’, because the ‘Always’ option does not exist for the Job object. Upon a failure, we always want to inspect the logs rather than have k8s retry again and again.

kubectl apply -f migrate-job.yaml 

K8s will run this Job as a Pod. That means we can look up its logs.

kubectl get pods

#returns
NAME                        READY   STATUS      RESTARTS      AGE
migrate-job-8rpjn           0/1     Completed   0             38s

The migration should be quick because it only creates a small table. Notice that although the run has completed, the Pod is not removed from the cluster. This allows us to view the logs. To see them, we pass in the Pod’s name, along with a flag to follow.

kubectl logs -f migrate-job-8rpjn

#returns
2023/09/04 22:37:25 starting migrate...
2023/09/04 22:37:25 reading env
2023/09/04 22:37:25 connecting to database... 
2023/09/04 22:37:25 database connected
2023/09/04 22:37:25 OK   20230302080119_create_randoms_table.sql (18.37ms)
2023/09/04 22:37:25 goose: successfully migrated database to version: 20230302080119

As a developer though, the logs cannot be accessed yet.

# Run as developer

kubectl --context=dev-context logs -f migrate-job-8rpjn

# returns
Error from server (Forbidden): pods "migrate-job-8rpjn" is forbidden: User "dev007" cannot get resource "pods/log" in API group "" in the namespace "default"

As the devops, you need to modify role.yaml to allow developer access.

# k8s-devops/role.yaml

kind: Role
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: dev
rules:
  - apiGroups: ["", "apps"]
    resources: ["deployments", "replicasets", "pods"] # we allow developers to see deployment, pods, and scale up or down
    verbs: ["get", "list", "watch", "create", "update", "patch", "delete"] # You can also use ["*"]
  - apiGroups: ["", "batch"] # <-- add new rule
    resources: ["jobs", "pods/log" ]
    verbs: ["get", "list",  "create", "delete"]

A second rule is added for the ‘batch’ apiGroups. This allows the developer to run (and delete) this Job. Under resources, ‘pods/log’ is added, which allows developers to view the logs.

Once we are done viewing the logs, we can delete this Job. If you do not care about the logs, it can be deleted automatically using .spec.ttlSecondsAfterFinished, documented at https://kubernetes.io/docs/concepts/workloads/controllers/job/#clean-up-finished-jobs-automatically. However, it is a good idea to ship these logs to a centralised log system like Loki, ElasticSearch, or Splunk first. Otherwise, they are gone forever.
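
If automatic clean-up is what you want, it is a one-line addition to the Job spec; a sketch, with 3600 seconds being an arbitrary choice:

spec:
  ttlSecondsAfterFinished: 3600 # delete the Job and its Pods an hour after it finishes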

# Run as 👷 developer

kubectl --context=dev-context delete jobs/migrate-job

# returns
job.batch "migrate-job" deleted

Migrate Down

Because we use ENTRYPOINT in our Dockerfile, we can override the command and flags passed to this program from the yaml manifest. To do that, we supply a combination of both command and args: command stays the same binary, and args is an array of arguments. The invocation we want is migrate -go down, so we give each cli argument as an item in the args array.

# k8s-devops/api/migrate-job.yaml

apiVersion: batch/v1
kind: Job
metadata:
  name: migrate-job
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: migrate
          image: gmhafiz/migrate:6be26935ae705e3b1ff2b3e6210eb98b9182f0bf
          command: ["migrate"]
          args:
            - -go
            - down
          envFrom:

Then simply re-apply this yaml file to migrate down one level. The same Job cannot be run twice, so if you have not deleted the ‘migrate up’ Job we ran before, it needs to be deleted first.

kubectl delete job migrate-job # if you have not deleted it yet
kubectl apply -f migrate-job.yaml

If you ran the above commands, do not forget to run the ‘up’ migration again, lest you have no table in the database!

Service

The final step to make the api accessible is creating a Service object for it. It is similar to the database’s Service, where we restrict it to TCP.

#  /k8s-devops/api/service.yaml

apiVersion: v1
kind: Service
metadata:
  labels:
    app: server
  name: server
spec:
  type: ClusterIP
  ports:
  - port: 3080
    protocol: TCP
    targetPort: 3080
  selector:
    app: server

Apply with

kubectl apply -f service.yaml

Confirm ‘server’ Service has been created and exposed at port 3080 with

kubectl get svc

#returns
NAME         TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)    AGE
kubernetes   ClusterIP   10.96.0.1       <none>        443/TCP    17d
postgres     ClusterIP   10.105.21.101   <none>        5432/TCP   7d2h
server       ClusterIP   10.101.141.71   <none>        3080/TCP   139m

Access

Just like we did with the database, let us SSH into one of the nodes and hit the /ready endpoint to confirm that the api can talk to the database.

ssh kubeadmin@172.16.16.201 # ip address of your node, and it depends on your setup

Then using the ClusterIP address we saw above,

curl -v 10.101.141.71:3080/ready

# returns
*   Trying 10.101.141.71:3080...
* Connected to 10.101.141.71 (10.101.141.71) port 3080 (#0)
> GET /ready HTTP/1.1
> Host: 10.101.141.71:3080
> User-Agent: curl/7.88.1
> Accept: */*
> 
< HTTP/1.1 200 OK
< Access-Control-Allow-Headers: Accept, Content-Type, Content-Length, Authorization
< Access-Control-Allow-Methods: POST, GET, OPTIONS, PUT, DELETE
< Access-Control-Allow-Origin: *
< Date: Sun, 09 Jul 2023 10:26:57 GMT
< Content-Length: 0
< 
* Connection #0 to host 10.101.141.71 left intact

It returns a 200 OK response, which means the api can talk to the database! Notice that we used an IP address once we SSH-ed into the node. The Pods themselves have entries in their /etc/resolv.conf file that resolve the ‘postgres’ host name to the correct IP address.

Inspect DNS

Unfortunately, we cannot inspect the content of this /etc/resolv.conf file because our container’s base image is bare minimum — there’s no shell, cat, tail, nano, etc. What we can do is use another base image that contains the necessary debugging tools. Rebuild a new image using a ‘normal’ base such as debian:12 instead of distroless.

Use debian:12 image and install curl (optionally netcat too)

#FROM gcr.io/distroless/static-debian11
FROM debian:12

RUN apt update && apt install -y curl netcat-traditional

Push the image and apply to your cluster. Grab any one of the newly created Pod names.

kubectl get po

# returns
NAME                        READY   STATUS    RESTARTS      AGE
postgres-867f96c9d8-wd9xh   1/1     Running   2 (76m ago)   6d
server-6b89f77d6c-r5ndr     1/1     Running   0             119s
server-6b89f77d6c-t5466     1/1     Running   0             2m19s
server-6b89f77d6c-xnsrk     1/1     Running   0             2m32s
web-7b84b7d46f-kxhgj        1/1     Running   4 (76m ago)   10d

Inspect the Pod’s /etc/resolv.conf file. The -- separator passes everything after it as the command to run in the Pod. As you can see below, k8s added the entries for us automatically.

kubectl exec -it server-6b89f77d6c-xnsrk -- cat /etc/resolv.conf

# returns
search default.svc.cluster.local svc.cluster.local cluster.local
nameserver 10.96.0.10
options ndots:5

Now you can query the /ready endpoint using the Service’s DNS name instead of its IP address.

kubectl exec -it server-6b89f77d6c-xnsrk -- curl -v server.default.svc.cluster.local:3080/ready

# returns
*   Trying 10.101.141.71:3080...
* Connected to server.default.svc.cluster.local (10.101.141.71) port 3080 (#0)
> GET /ready HTTP/1.1
> Host: server.default.svc.cluster.local:3080
> User-Agent: curl/7.88.1
> Accept: */*
> 
< HTTP/1.1 200 OK
< Access-Control-Allow-Headers: Accept, Content-Type, Content-Length, Authorization
< Access-Control-Allow-Methods: POST, GET, OPTIONS, PUT, DELETE
< Access-Control-Allow-Origin: *
< Date: Fri, 21 Jul 2023 08:55:27 GMT
< Content-Length: 0
< 
* Connection #0 to host server.default.svc.cluster.local left intact

As you can see, the contents of /etc/resolv.conf differ between the node and the Pod: the Pod’s DNS entries are managed by k8s while the node’s are left untouched. Not so magical after all once you figure out what k8s does behind the scenes.

diagram showing difference of using kubectl exec -it to get /etc/resolv.conf file in a Pod versus /etc/resolv.conf of a node

Frontend

To visualise and interact with the api, we will build a simple single-page application (SPA). The frontend is not the main focus of this post, so I will only share the important snippets. The full code is at https://github.com/gmhafiz/k8s-web.

The web is created using the wizard found in the Vue3 documentation site.

npm init vue@latest

We need a view to show our list of ten UUIDs, so we create a new Vue component under the views directory. In it, we need a method that calls the api and stores the result in data(). The URL needs to match the address we will port-forward the api to. To call this get() method automatically, we invoke it when the page is mounted.

// imports added for completeness; this lives inside the component's <script lang="ts"> block
import { defineComponent } from 'vue'

interface Random {
    id: number,
    name: string
}

export default defineComponent({
    mounted() {
        this.get()
    },
    data() {
        return {
            randoms: [] as Random[],

            address: 'http://localhost:3080'
        }
    },
    methods: {
        async get() {
            this.randoms = await fetch(`${this.address}/randoms`)
                .then((res) => res.json())
                .catch((err) => {
                    console.error(err)
                })
        },
    },
})

To display the items, we simply loop over the randoms variable. Because this is TypeScript, we declared the Random data structure where each item has an id and a name, so we can display the fields using dot notation.

<div v-for="item in randoms" :key="item.id" style="display: flex; font-family: monospace">
    <div style="min-width: 1.5rem;">
        {{ item.id }}
    </div>
    <div>
        : {{ item.name }}
    </div>
</div>

We also want a button that calls the api to add a new UUID, just to show that connections can go the other way too. So we create a button that, on a click (@click), calls the add() method, which sends a POST request. On success, it re-retrieves the latest ten UUIDs. Vue takes care of updating the UI, and it will show the latest data automatically. That’s right, all DOM manipulations are done magically. Of course, you could also have the POST endpoint return only the newly created row and append it to the randoms variable instead of re-querying the whole list.

<script lang="ts">
  // inside the same defineComponent({ ... }) options object shown earlier
  methods: {
    add() {
      fetch(`${this.address}/randoms`, {
        method: "POST"
      })
          .then(() => {
            this.get()
          })
          .catch((err) => {
            console.error(err)
          })
    }
  }
</script>

<template>
    <div style="margin-top: 1rem; margin-bottom: 1rem;">
        <button
            style="padding: 0.5rem; background-color: rgba(72,154,238,0.79); color: rgb(255,255,255,0.86)"
            @click="add"
        >
          Add New UUID
        </button>
    </div>
</template>

Finally, we need to containerise this. Unlike our api program, this web UI needs a server to serve the page, because a folder of static files cannot serve itself. Something like nginx or caddy suffices. We continue to use a multi-stage strategy, like we saw with the api container build. Starting from the node image at version 18 (an LTS version), we copy both package.json and its lock file (package-lock.json) and run npm install. In the second stage, we run nginx to serve the static files from /usr/share/nginx/html. It is exposed at port 8080, which is the default port for this chainguard nginx image.

# k8s-web/Dockerfile

FROM node:18 as build-stage

WORKDIR /app

COPY package*.json ./
RUN npm install

COPY . .
RUN npm run build

FROM cgr.dev/chainguard/nginx:1.25.1 as production-stage

COPY --from=build-stage /app/dist /usr/share/nginx/html

EXPOSE 8080

To build, run the following command. Just replace gmhafiz/web with your own repository name if you are following along.

git commit -a -m "initial commit"
TAG=$(git rev-parse HEAD)
docker build -t gmhafiz/web:${TAG} .

Finally, push the resulting image to Docker hub.

docker push gmhafiz/web:${TAG} 

Now, as the devops, we only need to create two objects: a Deployment and a Service. Here we combine both in a single yaml file (using --- as a separator) because these objects are logically grouped together, which is recommended as a best practice. You can keep db.yaml, api.yaml, and web.yaml in a single folder without worrying about file name clashes, and finding them becomes a lot easier.

# k8s-devops/web/web.yaml

apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: web
  name: web
spec:
  replicas: 1
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
        - image: gmhafiz/web:e2adcb710ce2feb645b2a5c3799d67e6d9554631
          name: web
          resources:
            limits:
              cpu: "500m"
              memory: "128Mi"

Two important bits are as follows. The first is .spec.template.metadata.labels.app, which we labelled as ‘web’. This is the label that our Service is going to refer to. Second, we have a single container with the tagged image name.

To expose this Deployment, we need a Service object. The image serves the web application at port 8080, which is what targetPort points at; .spec.ports[0].port is the port the Service itself exposes, which we keep the same. .spec.selector.app is important because that is how this Service finds the Pods of our Deployment above.

# k8s-devops/web/web.yaml

---
apiVersion: v1
kind: Service
metadata:
  labels:
    app: web
  name: web
spec:
  type: ClusterIP
  ports:
    - port: 8080
      protocol: TCP
      targetPort: 8080
  selector:
    app: web

Apply with

kubectl apply -f web.yaml
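
To confirm that both objects landed, you can filter by the label we set:

kubectl get deploy,svc -l app=web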

Port Forward

This brings us to the last piece of this post: port-forwarding. This technique forwards requests we make from the local computer into the k8s cluster. We only need to port-forward the api and the web; it is not necessary to port-forward the database because we do not interact with it directly. While port-forwarding is sufficient for now, in a real-world application you would want an Ingress instead.

To port-forward, we use the port-forward sub-command followed by the Service name. The port number mapping is a bit unintuitive - it follows a <local>:<k8s> format, meaning for example that port 8080 exposed by the k8s web Service is mapped to port 8080 on our local computer.

At this stage you should have both the api (named ‘server’) and ‘web’ Services running in the cluster.

kubectl get svc

#returns
NAME         TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)    AGE
kubernetes   ClusterIP   10.96.0.1        <none>        443/TCP    18d
postgres     ClusterIP   10.105.21.101    <none>        5432/TCP   8d
server       ClusterIP   10.101.141.71    <none>        3080/TCP   30h
web          ClusterIP   10.109.238.205   <none>        8080/TCP   50s

To port-forward, run each of the following commands in its own terminal.

kubectl port-forward svc/web 8080:8080

# returns
Forwarding from 127.0.0.1:8080 -> 8080
Forwarding from [::1]:8080 -> 8080
kubectl port-forward svc/server 3080:3080

# returns
Forwarding from 127.0.0.1:3080 -> 3080
Forwarding from [::1]:3080 -> 3080

Now open your browser at http://localhost:8080, and you will see a Vue single-page application with ten UUIDs on the right!

list UUIDS

If you look at the network tab, you can see this SPA sends a request to localhost port 3080. This request is then forwarded into the k8s cluster.

network tab

To reiterate that the api can communicate with the database, let us add a new UUID by clicking the Add button:

POST request

A POST request was made, and the page re-retrieves the latest ten UUIDs, which now include a new, 101st UUID.

updated list

CORS

Before we finish, let us make an adjustment to the api so it only accepts requests from http://localhost:8080. In the k8s-api repository, open cmd/api/main.go and restrict allowed incoming requests from everything (the asterisk *) to http://localhost:8080.

@@ -57,7 +58,7 @@
 }
 
 func enableCors(w *http.ResponseWriter) {
-	(*w).Header().Set("Access-Control-Allow-Origin", "*")
+	(*w).Header().Set("Access-Control-Allow-Origin", "http://localhost:8080")
 	(*w).Header().Set("Access-Control-Allow-Methods", "POST, GET, OPTIONS, PUT, DELETE")
 	(*w).Header().Set("Access-Control-Allow-Headers", "Accept, Content-Type, Content-Length, Authorization")
 }

Whenever there is a mismatch between the originating URL and the api server, the browser attaches additional headers and sends a preflight OPTIONS HTTP request; if that succeeds, it continues with the intended API call. Our server already returns those three headers to help the browser decide whether to continue with the API call or not.
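
You can simulate the browser’s preflight by hand while the api port-forward from the previous section is running; a sketch where the -H headers mimic what the browser would attach. If the api handles the preflight, the Access-Control-Allow-* headers appear in the response:

curl -i -X OPTIONS http://localhost:3080/randoms \
  -H "Origin: http://localhost:8080" \
  -H "Access-Control-Request-Method: POST"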

Commit and build a new image, then push it to the container registry. There will be a new $TAG, so this value needs to be applied to both the container image and the yaml file.

git commit -a -m "cors restriction"
TAG=$(git rev-parse HEAD)
docker build -f Dockerfile-api -t gmhafiz/api:$TAG .
docker push gmhafiz/api:${TAG}

In the devops repository, edit server’s image tag, so it will deploy our new image

vim k8s-devops/api/server.yaml
    spec:
      containers:
      - image: gmhafiz/api:7f7bec15ce619a43a2ecb75dc3d56673322174f5 # <-- different tag

Simply apply this file and k8s will do its magic of rolling out the new Deployment across the cluster.

kubectl apply -f server.yaml

Once that is done, you will see everything still works the same.

To confirm that CORS is working, let us change the local port our port-forward listens on to something other than 8080. Press Ctrl+C to terminate the web’s port-forward and pick another port, for example 8081.

kubectl port-forward svc/web 8080:8080

Forwarding from 127.0.0.1:8080 -> 8080
Forwarding from [::1]:8080 -> 8080
Handling connection for 8080
Handling connection for 8080
^C  

kubectl port-forward svc/web 8081:8080

Forwarding from 127.0.0.1:8081 -> 8080
Forwarding from [::1]:8081 -> 8080

Instead of http://localhost:8080, open http://localhost:8081 and you will see that no api requests work anymore! Open the network tab and you will see a CORS error.

CORS error because origin is not in the allowed list

Further investigation shows that the browser sends ‘Origin: http://localhost:8081’, which is not the allowed origin set in our api’s CORS settings.

CORS origin is http://localhost:8081 which is not allowed

Calling the api using curl will continue to work: CORS is enforced by the browser, which sends a preflight OPTIONS request or blocks responses based on the returned headers, while curl does neither.
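
For example, this request succeeds regardless of the allowed origin, assuming the api port-forward is still running:

curl http://localhost:3080/randoms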

Conclusion

You have learned how to create three applications (a database, an api server, and a web single-page application) and how to containerise and push each of them to a container registry. The database was deployed with extra authorization steps. The api server was deployed in a more cloud-native way by following many of the 12-factor app principles. Then, using k8s yaml manifest files, we deployed each of them into the cluster and exposed them as Services. Finally, port-forwarding allowed us to access both the api and web applications.

As you can see, a handful of yaml files were enough to deploy all three applications, and because yaml files can be logically combined, that leaves only three yaml files to manage. The SealedSecret object needed an extra step to create, though.

What is next? In a real-life production setting, it is a very good idea to use a managed database solution instead of running the database in the cluster like this. There is also no TLS anywhere between the applications, which is not secure and needs to be remediated. Finally, the applications we have here cannot be accessed from the Internet yet. That is what the next part, on Ingress, is for, and I hope it will be the shortest of this three-part blog series ;)


Further Reads

Highly Available Kubernetes Cluster

Create a highly available kubernetes cluster v1.26 using libvirt, VMs deployed by vagrant with 3 control planes and 3 worker nodes on Debian 11

a photo of cluster of networks, ships with containers, in the clouds, trending on artstation, by ross tran

Kubernetes… everyone’s favourite container orchestration. Love or hate it, it is fast becoming the de-facto solution for managing containers at scale. This blog post will go step by step through how we can create a kubernetes (k8s) cluster on a local machine, from the first command to all nodes being ready, in about 12 minutes. This post assumes readers have basic knowledge of k8s. If you haven’t heard of k8s, you should read this.

Before we jump into the guide, I will outline the tools and the kind of k8s cluster we are going to create. We know that creating a k8s cluster is complex. kubeadm has made the process easier but there are still steps that feel like a maze. Do you not need docker with k8s anymore? What are the various container runtimes? Why are there so many choices for pod networking? What to do with ‘cgroup driver’? This blog post intends to go through each step and be a complete guide into creating a highly available kubernetes cluster.

The plan is to create a series of blog posts to create a production-grade k8s cluster. The first step is to create an HA cluster. This is important because we do not want to have any single point of failure in our architecture. At the end of this post, we will deploy a vanilla nginx web server to demonstrate that we can successfully access its pod with the correct port number.

Second part of the series is going to be deploying more complex applications. A popular architecture for a web application is having a single-page application (SPA) frontend talking to an api backend on a specific port. We will see how we can configure both k8s and our applications to achieve this. We will throw in observability which includes prometheus, jaeger, and loki as well as grafana for visualisation.

The final part is going to be about the ingress controller, so we can access our application on the local machine from the internet. We will need a domain name for this, and we will configure the DNS to point to our k8s cluster and see how our virtual IP (floating IP) from keepalived plays into this. To make our cluster more production-grade, we will also look into backing up and restoring the etcd database.

This guide for the first part does not come out of a vacuum. I heavily referred to this video https://youtu.be/SueeqeioyKY and the highly available (HA) section of k8s at https://github.com/kubernetes/kubeadm/blob/main/docs/ha-considerations.md#options-for-software-load-balancing. The full list of references is at https://github.com/gmhafiz/k8s-ha#reference.

Why is k8s short for kubernetes? There are 8 letters in between ‘k’ and ‘s’   😎

If you do not want to read the rest, tldr:

git clone https://github.com/gmhafiz/k8s-ha
cd k8s-ha
make install 
make up
make cluster

This is part of a series of blog posts including:

  1. Highly Available Kubernetes Cluster (this post)

  2. Deploy Applications in Kubernetes

  3. Ingress


The runnable code is available in the repository https://github.com/gmhafiz/k8s-ha

Preface

There are many third party tools you can use to create a k8s cluster locally, such as minikube, kind, k3s, and k0s among others. Minikube only creates a single node, so it is hard to show how we can create a highly available cluster. Kind can create a multi-node cluster which is great, and it is a certified installer, just like k3s and k0s, which I have not played around with yet. The method I am choosing is kubeadm because it is easy-ish to use with the right amount of magic and transparency. To provision the nodes (virtual machines) I use vagrant with libvirt. Provisioning all eight nodes is fast: subtracting the time to download the images, it takes about a minute to get all VMs up.

To create a k8s cluster, there is a series of commands that need to be run on some, or all, nodes. We could copy and paste each command into the relevant VMs, but there is a simpler way. Ansible allows us to declare the desired state that we want on each VM. With a single command, it can execute the same commands on the nodes we tell it to.

If you want to administer the cluster from a local machine, you can install kubectl. Once the cluster is installed, the config file can be copied into your $HOME/.kube/config.
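
A sketch of that copy step, assuming the kubeadmin user created by the playbooks below and the first control plane’s IP address; the exact paths depend on your setup:

scp kubeadmin@172.16.16.101:.kube/config ~/.kube/config
kubectl cluster-info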

Tools we are going to use:

K8s spec:

vagrant is our VM management workflow. It uses a declarative Vagrantfile (in the Ruby language) that defines the desired states of our VMs. Creating the VMs is just a simple vagrant up command. There is a caveat in the way networking is done though: the IP address we are concerned with is at eth1, not eth0 as found on aws ec2 or digitalocean VMs provisioned using terraform. I am not sure why this is the case. It may have something to do with the way the vagrant box is packaged. You can absolutely skip this tool and use libvirt directly to create VMs, but I like having it in a file where it can be versioned.

kvm or ‘Kernel-Based virtual machine’ is a module in Linux kernel to allow the kernel to function as a hypervisor - basically allow us to run hardware-accelerated virtual machines.

libvirt is an interface for all virtualisation technology. It manages hypervisors like kvm, qemu and others; it is essentially the one that creates and destroys VMs. It is possible to use virtualbox instead by adding --provider=virtualbox to the vagrant up command, but libvirt makes faster VMs than virtualbox.

ansible can send shell commands to each of our VMs simultaneously as long as it can SSH into each of them, so we save on typing and copy-pasting. The commands are defined in a declarative way in files called ‘playbooks’, just like the ‘Vagrantfile’. Not all the playbooks are idempotent yet, which is one thing I would like to fix.

virtual IP from keepalived is used on top of the two haproxy load balancers. In case one load balancer goes down, our floating IP will point to the backup load balancer. Since we are installing the cluster manually, we need to create such a floating IP ourselves. In managed solutions like eks, gke, or aks, they will provide an external load balancer for you if you have chosen an HA setup.

k8s latest version as of now (2022-12-11) is 1.26.0. The biggest change in recent years is the removal of dockershim in v1.24, which means it is advisable to use containerd directly or use another container runtime like CRI-O.

kubeadm is an easy-to-use tool to create a k8s cluster. The official guide is pretty shallow, which is why this blog series was created. It simplifies the creation of a k8s cluster, and you do not have to deal with tedious stuff like certificate creation, SSL, firewall settings, configuring the network, creating the etcd database, and many others. Other choices to create a k8s cluster have been mentioned above: k3s, kind, minikube, and of course Kelsey Hightower’s Kubernetes The Hard Way.

Debian 11 Bullseye is the distro of choice. I like stable, Long-Term Support (LTS) distros like Debian and Ubuntu because you get security updates for ~5 years. There is nothing wrong with choosing Ubuntu, but Debian is seen as more stable and has a smaller attack surface. Alternatives include distros like Rocky Linux and AlmaLinux, which have ten-year security support. There are options for immutable OSes like Talos, and self-updating ones like Fedora CoreOS, too. Ultimately, the base operating system should not matter much.

containerd is the container runtime of choice, implementing the container runtime interface (CRI). It is responsible for creating and destroying pods (a pod is to k8s roughly what a container is to Docker). It is simple and works reliably. Another choice, as mentioned before, is CRI-O. One thing that needs to be checked is the compatibility between the installed containerd and kubelet. For example, k8s version 1.25.4 was fine, but with k8s version 1.26, the containerd package that comes with debian is too old and unsupported.

calico is one of many choices for a container network interface (CNI). It is responsible for allocating pod IP addresses. During cluster initialisation, we need to supply a ‘Classless Inter-Domain Routing’ (CIDR) range for the pods, which differs from one CNI to another. Other choices are Flannel, Weave Net, cilium, etc. There isn’t any particular reason why I am using calico other than it being the most popular.

As you can see, we use five tools not directly related to k8s (vagrant/libvirt/ansible/haproxy/keepalived). While fewer tools are better to reduce learning friction (and fewer things to break!), these tools help us achieve what we want. Vagrant isn’t strictly needed, but it is good to have your node topology versioned. If you had gone for a managed HA k8s provider, you would need neither haproxy nor keepalived. If you use cloud VMs, then you do not need vagrant or libvirt either, since you no longer need to create VMs or have bare-metal machines. Ansible is still going to be greatly beneficial in either case.

What is not covered

This post is long, but it does not cover everything you need to know about creating an HA k8s cluster.

Topology

Highly available k8s cluster means that there isn’t any single point of failure. Before we take a look at our architecture, consider the following topology from official docs,

https://d33wubrfki0l68.cloudfront.net/d1411cded83856552f37911eb4522d9887ca4e83/b94b2/images/kubeadm/kubeadm-ha-topology-stacked-etcd.svg

There are three control plane nodes, which ensures that if one of the nodes goes down, there are two other control plane nodes to take care of all k8s operations. An odd number is chosen to make leader election (of the raft consensus algorithm) easier.

The same goes for worker nodes: there are at least three of them, an odd number, just like the control plane. However, that page does not elaborate on the load balancer. If that load balancer node fails, outside traffic cannot access anything on the cluster. For that, we need to refer to https://github.com/kubernetes/kubeadm/blob/main/docs/ha-considerations.md#options-for-software-load-balancing.

In our chosen architecture, instead of a single load balancer, we create two of them. There are a number of options for a load balancer, including nginx, traefik, or even an API gateway. I chose haproxy because it is not only battle-tested, but also small and consumes minimal resources.

As a result, this is the topology I have chosen

kubernetes HA topology

Etcd is still hosted on each control plane node as before. But now, there are two load balancers that proxy requests to one of the three apiservers, one on each control plane node. If one load balancer goes down, traffic will switch over to the second load balancer. Notice that I used a virtual IP (floating IP), created by keepalived, as the entry point. This floating IP points to whichever node is healthy (the Master), while the other node (Backup) serves as the fail-over. Finally, each worker node communicates with the virtual IP.

The specs of each node are as follows. 2GB RAM and 2 CPUs per k8s node is the minimum required for k8s. Haproxy and keepalived are pretty light, so I only give 512MB and 1 CPU to each load balancer node. The haproxy website gives a recommendation on server specs if you are interested in customising this.

Role Host Name IP OS RAM CPU
Load Balancer loadbalancer1 172.16.16.51 Debian 11 Bullseye 512MB 1
Load Balancer loadbalancer2 172.16.16.52 Debian 11 Bullseye 512MB 1
Control Plane kcontrolplane1 172.16.16.101 Debian 11 Bullseye 2G 2
Control Plane kcontrolplane2 172.16.16.102 Debian 11 Bullseye 2G 2
Control Plane kcontrolplane3 172.16.16.103 Debian 11 Bullseye 2G 2
Worker kworker1 172.16.16.201 Debian 11 Bullseye 2G 2
Worker kworker2 172.16.16.202 Debian 11 Bullseye 2G 2
Worker kworker3 172.16.16.203 Debian 11 Bullseye 2G 2

Preparation

This guide assumes you haven’t installed any of the tools yet. If you have, skip to the Create K8s Cluster section.

Before going any further, we need to check that your CPU supports hardware-accelerated virtualization for the purpose of creating VMs. On a debian-based distro, install libvirt and friends.

sudo apt update && sudo apt upgrade
sudo apt install bridge-utils qemu-kvm virtinst libvirt-dev libvirt-daemon virt-manager

To check if your CPU supports hardware-accelerated virtualization, run the kvm-ok command. A success message looks like this

$ kvm-ok
INFO: /dev/kvm exists
KVM acceleration can be used

Next, install vagrant

wget -O- https://apt.releases.hashicorp.com/gpg | gpg --dearmor | sudo tee /usr/share/keyrings/hashicorp-archive-keyring.gpg
echo "deb [signed-by=/usr/share/keyrings/hashicorp-archive-keyring.gpg] https://apt.releases.hashicorp.com $(lsb_release -cs) main" | sudo tee /etc/apt/sources.list.d/hashicorp.list
sudo apt update && sudo apt install vagrant
vagrant plugin install vagrant-libvirt vagrant-disksize vagrant-vbguest

The vagrant-libvirt plugin is key to allowing us to create VMs using libvirt. It is possible to use virtualbox as well, so I included the vagrant-vbguest plugin too.

To manage your k8s cluster on a local machine, you need to install kubectl. It is highly recommended that you use the same version for both kubectl and kubelet. So when you are upgrading k8s in the VMs, upgrade kubectl on your local machine with the same version too.

curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl"
sudo install -o root -g root -m 0755 kubectl /usr/local/bin/kubectl

Clone the repository with

git clone https://github.com/gmhafiz/k8s-ha

Finally, create a public-private key pair exclusive for accessing the VMs using SSH keys. Accessing this way should be favoured over accessing using username/password nowadays.

ssh-keygen -t rsa -b 4096 -f ansible/vagrant
chmod 600 ansible/vagrant
chmod 644 ansible/vagrant.pub

It is possible to use your own public key if you have one. Simply change all occurrences in the Vagrantfile to the path of your ~/.ssh/id_rsa.pub. For example,

ssh_pub_key = File.readlines("./ansible/vagrant.pub").first.strip
# to
ssh_pub_key = File.readlines("~/.ssh/id_rsa.pub").first.strip

Create K8s Cluster

Provision VMs

Now the fun part begins. To provision the VMs, simply run

vagrant up

link to file https://github.com/gmhafiz/k8s-ha/blob/master/Vagrantfile

If you are running for the first time, it will download a linux image called a vagrant box. If you look into the Vagrantfile, we use Debian 11 Bullseye with box version 11.20220912.1. Note that the sudo password is set to kubeadmin, which is set in the bootstrap.sh file.

VAGRANT_BOX               = "debian/bullseye64"
VAGRANT_BOX_VERSION       = "11.20220912.1"

You can also use Ubuntu 20.04, but I haven’t managed to get Ubuntu 22.04 to work.

VAGRANT_BOX               = "generic/ubuntu2004"
VAGRANT_BOX_VERSION       = "3.3.0"

We also define how many CPUs and how much RAM

CPUS_LB_NODE              = 1
CPUS_CONTROL_PLANE_NODE   = 2
CPUS_WORKER_NODE          = 2
MEMORY_LB_NODE            = 512
MEMORY_CONTROL_PLANE_NODE = 2048
MEMORY_WORKER_NODE        = 2048

For a minimum HA,

LOAD_BALANCER_COUNT = 2
CONTROL_PLANE_COUNT = 3
WORKER_COUNT        = 3

K8s Cluster

Ansible is our choice to send commands to our selected nodes. Change to folder,

cd ansible

Firstly, we take a look at provisioning our VMs. We define the details and grouping of the VMs using a hosts file.

[load_balancer]
lb1 ansible_host=172.16.16.51 new_hostname=loadbalancer1
lb2 ansible_host=172.16.16.52 new_hostname=loadbalancer2

[control_plane]
control1 ansible_host=172.16.16.101 new_hostname=kcontrolplane1
control2 ansible_host=172.16.16.102 new_hostname=kcontrolplane2
control3 ansible_host=172.16.16.103 new_hostname=kcontrolplane3

[workers]
worker1 ansible_host=172.16.16.201 new_hostname=kworker1
worker2 ansible_host=172.16.16.202 new_hostname=kworker2
worker3 ansible_host=172.16.16.203 new_hostname=kworker3

[all:vars]
ansible_python_interpreter=/usr/bin/python3

By default, this file is located in the same folder as all the playbooks. You can modify this by using the -i flag and specifying the location. We have set several variables in vars.yaml, so we refer to it by using the --extra-vars option.

At this stage, you can create the HA cluster with one command. However, it is good to go through each step of the way and explain what happens and the decisions I have made.

# yolo and run all playbooks
ansible-playbook -u root --key-file "vagrant" main.yaml --extra-vars "@vars.yaml"

If you decide against the single command and want to learn step by step (there are a total of 7 playbooks), strap in your seat belt because we are going to take a long and exciting ride!

If you have a team member that you want to give access to the VMs, this is an easy way to create a user and copy the user’s public key to all VMs.

# 01-initial.yaml
- hosts: all
  become: yes
  tasks:      
    - name: create a user
      user: name=kubeadmin append=yes state=present groups=sudo createhome=yes shell=/bin/bash password={{ PASSWORD_KUBEADMIN | password_hash('sha512') }}

    - name: set up authorized keys for the user
      authorized_key: user=kubeadmin key="{{item}}"
      with_file:
        - ~/.ssh/id_rsa.pub

The ansible playbook creates a user named kubeadmin and copies the public key to all VMs! To add more users, simply copy and paste both tasks with a different user and public key path.

For now, this user only belongs to the group sudo. Just make sure that this particular user is allowed sudo access! Notice that a password is also set against this user.

I find myself typing the kubectl command a lot, so I like to alias it to something shorter like k; for example, k get no is short for kubectl get nodes.

- hosts: control_plane
  become: yes
  tasks:
    - name: Add kubectl alias for user
      lineinfile:
        path=/home/kubeadmin/.bashrc
        line="alias k='kubectl'"
        owner=kubeadmin
        regexp="^alias k='kubectl'$"
        state=present
        insertafter=EOF
        create=True

    - name: Source .bashrc
      shell: "source /home/kubeadmin/.bashrc"
      args:
        executable: /bin/bash

Run the playbook with the following command. Make sure you are in the ./ansible directory.

# step 1
ansible-playbook -u root --key-file "vagrant" 01-initial.yaml --extra-vars "@vars.yaml"

Next, we need to install essential packages to the nodes.

ansible-playbook -u root --key-file "vagrant" 02-packages.yaml

We always want to keep our VMs up to date. So we run the necessary apt commands.

Thanks to ansible, we can install both keepalived and haproxy on each of the two load balancer nodes, as well as apt-transport-https to all eight nodes in a single command.

# 02-packages.yaml
- hosts: all
  become: yes
  tasks:
    - name: update repo
      apt: update_cache=yes force_apt_get=yes cache_valid_time=3600
    - name: upgrade packages
      apt: upgrade=dist force_apt_get=yes
      
    - name: packages
      apt:
        name:
          - apt-transport-https
        state: present
        update_cache: true

- hosts: load_balancer
  become: yes
  tasks:
    - name: packages
      apt:
        name:
          - keepalived
          - haproxy
        state: present
        update_cache: true

Next is installing the load balancer, haproxy.

Several things happen here. We set the config file at /etc/haproxy/haproxy.cfg. It has four sections: global, defaults, frontend, and backend. A comprehensive explanation is detailed on haproxy’s website, but the most important sections for us are the frontend and backend. The frontend section is the entry point, and notice that the port is 6443, which is the same as the default apiserver port. Since these load balancers are installed on separate nodes, this is ok. If you install the load balancer on the same nodes as the control planes, you need to change this port to something else.

# 03-lb.yaml
- hosts: load_balancer
  become: yes
  tasks:
    - name: haproxy
      copy:
        dest: "/etc/haproxy/haproxy.cfg"
        content: |
          global
            maxconn 50000
            log /dev/log local0
            user haproxy
            group haproxy

          defaults
            log global
            timeout connect 10s
            timeout client 30s
            timeout server 30s

          frontend kubernetes-frontend
            bind *:6443
            mode tcp
            option tcplog
            default_backend kubernetes-backend

          backend kubernetes-backend
            option httpchk GET /healthz
            http-check expect status 200
            mode tcp
            option check-ssl
            balance roundrobin
              server kcontrolplane1 {{ IP_HOST_CP1 }}:6443 check
              server kcontrolplane2 {{ IP_HOST_CP2 }}:6443 check
              server kcontrolplane3 {{ IP_HOST_CP3 }}:6443 check          

The mode tcp tells us that this is a Layer 4 (the transport layer of the OSI model) load balancer. We use a simple round-robin strategy to distribute the requests among the three control planes. Note that the IP addresses are in double-moustache variables. The port needs to be 6443 because we have no intention of changing this default port.

Next, we ping /healthz, which is the default ‘healthiness’ endpoint of the kubernetes apiserver, and simply check for a 200 HTTP status.

# vars.yaml
...
IP_HOST_CP1: 172.16.16.101
IP_HOST_CP2: 172.16.16.102
IP_HOST_CP3: 172.16.16.103
...

The haproxy website has a guide to calculate how many connections you could set based on available RAM. The timeouts are obviously based on your business case.

With this config file, we can verify if it is valid with the command

haproxy -f /etc/haproxy/haproxy.cfg -c

and it should return

Configuration file is valid

Next, we take a look at the keepalived configuration, specifically the vrrp_script section. Firstly, we define our virtual IP, 172.16.16.100, under the virtual_ipaddress section. There is a path to a script that can tell the ‘aliveness’ of haproxy. This is defined under the track_script section, which points to our /etc/keepalived/check_apiserver.sh script. It checks the health of this load balancer every 3 seconds. If it fails 5 times, it marks this node as down. When that happens, the priority drops from 100 down by 2, as defined in the vrrp_instance section. This means the BACKUP load balancer will have a higher priority (100 > 98), and that backup becomes the master. Meanwhile, if the node returns ok twice (set by rise = 2), the priority goes back up by two.

vrrp_script check_apiserver {
    script "/etc/keepalived/check_apiserver.sh"
    interval 3
    timeout 10
    fall 5
    rise 2
    weight 2
}
    
vrrp_instance VI_1 {
    state BACKUP
    interface eth1
    virtual_router_id 1
    priority 100
    advert_int 5
    authentication {
        auth_type PASS
        auth_pass {{ PASSWORD_KEEPALIVED }}
    }
    virtual_ipaddress {
        172.16.16.100
    }
    track_script {
        check_apiserver
    }
}

Looking at the vrrp_instance section of keepalived, we have a state called BACKUP, the other being MASTER. Surprisingly, both load balancer nodes can have the exact same configuration file. If they do, one node is randomly chosen to become the MASTER. However, we aren’t going to do that; we will explicitly set a distinct config file for each load balancer.

The interface has to be eth1 as explained before, so always check with the ip a command.

This ‘aliveness’ script is adapted from the k8s website. We needed to modify a couple of things: {{ VIRTUAL_IP }} is going to be 172.16.16.100, and the port of haproxy is defined by {{ K8S_API_SERVER_PORT }}, which is going to be 6443. If you decide that you want to install both haproxy and keepalived on each control plane node, then the port number will clash with k8s’s apiserver default port, so in the haproxy config above you must change the port 6443 to something else. Since we have the two load balancers on separate nodes from the control plane, we do not have to worry about this.

#!/bin/sh

errorExit() {
  echo "*** $*" 1>&2
  exit 1
}

curl --silent --max-time 2 --insecure https://localhost:{{ K8S_API_SERVER_PORT }}/ -o /dev/null \
 || errorExit "Error GET https://localhost:{{ K8S_API_SERVER_PORT }}/"
if ip addr | grep -q {{ VIRTUAL_IP }}; then
  curl --silent --max-time 2 --insecure https://{{ VIRTUAL_IP }}:{{ K8S_API_SERVER_PORT }}/ -o /dev/null \
  || errorExit "Error GET https://{{ VIRTUAL_IP }}:{{ K8S_API_SERVER_PORT }}/"
fi

The two keepalived instances communicate using basic auth, and I generated its password with

openssl rand -base64 32

The other two settings aren’t so important. For virtual_router_id, we can pick any arbitrary number from 0 to 255, keeping it identical on both nodes because they are members of the same cluster. Each node advertises at a 5-second interval, set with advert_int.

Run with

ansible-playbook -u root --key-file "vagrant" 03-lb.yaml --extra-vars "@vars.yaml"

At the moment, we have no connection to any of the k8s control planes because we have not initialised the cluster yet. However, we can check if keepalived is working using a netcat command.

nc -v 172.16.16.100 6443

Now, installing k8s itself involves several complex steps. Since we provisioned the VMs using vagrant, there is a caveat that needs to be taken care of: the IP address we are concerned with is actually on interface eth1 instead of eth0. This is different from when I played around with terraform to provision VMs on AWS ec2 instances, because their default interface is eth0. If you are using bare metal, the interface could be different again. So always check by first SSH-ing in and running the ip a command.

ssh -i ./vagrant vagrant@172.16.16.101
ip a

It will look something like this

$ ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
    link/ether 52:54:00:e7:bf:01 brd ff:ff:ff:ff:ff:ff
    inet 192.168.121.190/24 brd 192.168.121.255 scope global dynamic eth0
       valid_lft 2492sec preferred_lft 2492sec
    inet6 fe80::5054:ff:fee7:bf01/64 scope link 
       valid_lft forever preferred_lft forever
3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
    link/ether 52:54:00:a4:cc:84 brd ff:ff:ff:ff:ff:ff
--> inet 172.16.16.51/24 brd 172.16.16.255 scope global eth1 <-- the IP address we defined is at eth1
       valid_lft forever preferred_lft forever
    inet6 fe80::5054:ff:fea4:cc84/64 scope link 
       valid_lft forever preferred_lft forever

Once we’ve installed keepalived, we will be able to see the virtual IP on one of the two load balancer nodes.

kubeadmin@loadbalancer2:~$ ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
    link/ether 52:54:00:52:d1:1d brd ff:ff:ff:ff:ff:ff
    inet 192.168.121.173/24 brd 192.168.121.255 scope global dynamic eth0
       valid_lft 3416sec preferred_lft 3416sec
    inet6 fe80::5054:ff:fe52:d11d/64 scope link 
       valid_lft forever preferred_lft forever
3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
    link/ether 52:54:00:1d:3f:8c brd ff:ff:ff:ff:ff:ff
    inet 172.16.16.52/24 brd 172.16.16.255 scope global eth1
       valid_lft forever preferred_lft forever
--> inet 172.16.16.100/32 scope global eth1 <-- our floating IP appears
       valid_lft forever preferred_lft forever
    inet6 fe80::5054:ff:fe1d:3f8c/64 scope link 
       valid_lft forever preferred_lft forever

The floating IP lands on whichever node currently holds the MASTER state (loadbalancer2 in the output above); the node with the higher priority wins the election. It can take some time for this floating IP to show up, so hopefully it is ready before we initialise k8s. If in doubt, check using the journalctl command on each load balancer node.

sudo journalctl -flu keepalived

Moving forward with k8s cluster installation, it has several server requirements. Swap must be disabled because k8s cannot function with it. We remove its /etc/fstab entry too so that it will not be switched on at reboot.

# 04-k8s.yaml
- hosts: control_plane,workers
  become: yes
  tasks:
    - name: Disable swap
      become: yes
      shell: swapoff -a
      
    - name: Disable SWAP in fstab since kubernetes can't work with swap enabled
      replace:
        path: /etc/fstab
        regexp: '^([^#].*?\sswap\s+sw\s+.*)$'
        replace: '# \1'
...

kubeadm will manage k8s firewall configuration, so we disable the existing firewall.

...
    - name: Disable firewall
      shell: |
                systemctl disable --now ufw >/dev/null 2>&1
...

K8s requires a bridged network and must forward ipv4 traffic. So we write the config to the relevant files.

...
    - name: bridge network
      copy:
        dest: "/etc/modules-load.d/containerd.conf"
        content: |
          overlay
          br_netfilter          
    
    - name: forward ipv4 traffic
      copy:
        dest: "/etc/sysctl.d/kubernetes.conf"
        content: |
          net.bridge.bridge-nf-call-iptables  = 1
          net.bridge.bridge-nf-call-ip6tables = 1
          net.ipv4.ip_forward                 = 1          
    
    - name: apply bridge network
      become: yes
      shell: modprobe overlay && modprobe br_netfilter && sysctl --system
...

We could reboot the VMs to apply these configurations, but there is a quicker way: modprobe adds the modules to the kernel, and the sysctl command applies the ipv4 forwarding settings without a reboot.
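
You can verify that both took effect on a node without rebooting:

lsmod | grep br_netfilter
sysctl net.ipv4.ip_forward  # should print: net.ipv4.ip_forward = 1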

The next task is to install a container runtime. This runtime is the one that starts and stops containers, among other things. The runtime was abstracted away from kubernetes when the container runtime interface (CRI) was created in 2016. There are a couple of well-known CRI implementations, which include containerd and CRI-O. Docker shim support was removed in k8s v1.24, but docker uses containerd underneath anyway.

As mentioned before, we need an up-to-date containerd so that kubelet version 1.26 can work with it.

- name: Apt-key for containerd.io [1/2]
  become: yes
  shell: |
    mkdir -p /etc/apt/keyrings && \
    curl -fsSL https://download.docker.com/linux/debian/gpg | gpg --dearmor -o /etc/apt/keyrings/docker.gpg

- name: Add containerd.io repository [2/2]
  become: yes
  shell: |
    echo \
      "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.gpg] https://download.docker.com/linux/debian \
      $(lsb_release -cs) stable" | sudo tee /etc/apt/sources.list.d/docker.list > /dev/null        

Thanks (but no thanks) to the way containerd is packaged, we need to make an adjustment to the config file it generates. In the config file, we have to tell it to use the systemd cgroup driver. If we don’t, containerd will use cgroupfs instead, and it will no longer match the kubelet, which uses systemd.

Thus, we need to change the line SystemdCgroup = false to true under the section [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]. For that, we use sed.

- name: Tell containerd to use systemd
  shell: |
    mkdir -p /etc/containerd && \
    containerd config default > /etc/containerd/config.toml && \
    sed -i 's/SystemdCgroup \= false/SystemdCgroup \= true/g' /etc/containerd/config.toml    

The desired state of that config.toml is,

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes]
    [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
    ...
    --> [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options] <-- for this
        --> SystemdCgroup = true <-- desired value
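
For the new cgroup setting to take effect, containerd has to be restarted. A sketch of the corresponding ansible task (the actual playbook may already include an equivalent step):

- name: Restart containerd
  become: yes
  service:
    name: containerd
    state: restarted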

Next, we install kubelet, kubeadm, and kubectl.

- name: add Kubernetes apt-key
  apt_key:
    url: https://packages.cloud.google.com/apt/doc/apt-key.gpg
    state: present

- name: add Kubernetes' APT repository
  apt_repository:
    repo: deb https://apt.kubernetes.io/ kubernetes-xenial main
    state: present
    filename: 'kubernetes'

- name: install kubelet
  apt:
    name: kubelet={{ K8S_VERSION }}
    state: present

- name: install kubeadm
  apt:
    name: kubeadm={{ K8S_VERSION }}
    state: present

- name: install kubectl
  apt:
    name: kubectl={{ K8S_VERSION }}
    state: present

The variable K8S_VERSION is defined in the vars.yaml file. Lastly, we freeze all packages so that we do not accidentally upgrade any of the versions.

- name: Hold versions
  shell: apt-mark hold kubeadm={{ K8S_VERSION }} kubelet={{ K8S_VERSION }} kubectl={{ K8S_VERSION }}

The reason we freeze the versions is that any k8s upgrade should be done manually: we need to read the release notes carefully, we do not want to bring down our cluster, and there is an order in which each program should be upgraded. Finally, depending on which version your tools are at, there is a specific range of versions you can upgrade to. Look over k8s’s skew policy.
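
To confirm the hold is in place on a node:

sudo apt-mark showhold

# returns
kubeadm
kubectl
kubelet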

Ok. Run this playbook with

ansible-playbook -u root --key-file "vagrant" 04-k8s.yaml --extra-vars "@vars.yaml"

Our next step is to initialise the control plane nodes. We have already set up haproxy and keepalived, and the floating IP 172.16.16.100 is going to be our entry point.

So the playbook has a declaration on one node, in this case on node control1, but it doesn’t matter which, as long as it is one of the control plane nodes.

# 05-control-plane.yaml
- hosts: control1
  become: yes
  tasks:
    - name: initialise the cluster
      shell: kubeadm init --control-plane-endpoint="{{ VIRTUAL_IP }}:{{ K8S_API_SERVER_PORT }}" \
             --upload-certs \
             --pod-network-cidr=192.168.0.0/16 \
             --apiserver-advertise-address=172.16.16.101
      register: cluster_initialized

We use the kubeadm init command, and we tell it where the control plane entry point is. The port number is the default 6443. Best practice is to use a DNS name for --control-plane-endpoint, for example --control-plane-endpoint="myk8s.local:6443", but then you need to add myk8s.local to the /etc/hosts file on all control plane nodes first.

--upload-certs uploads temporary certificates to a Secret object in the cluster, which expire in two hours. This is enough time to join the other control planes and worker nodes. Say in the future you want to add more nodes to this cluster; then we need to generate a new certificate (and token).
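
When the two hours are up, fresh credentials can be generated from any control plane node using kubeadm’s own sub-commands:

# print a new token together with the full worker join command
kubeadm token create --print-join-command

# re-upload the control plane certificates and print the new certificate key
kubeadm init phase upload-certs --upload-certs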

The pod network CIDR is required to tell k8s the range of pod subnet IPs. This CIDR depends on the Container Network Interface (CNI) that we choose. In our case, we choose Calico, which requires the CIDR to be 192.168.0.0/16. If you had chosen Flannel, for example, it would be 10.244.0.0/16 instead. For a list of some CNI alternatives, look over https://kubernetes.io/docs/concepts/cluster-administration/addons/#networking-and-network-policy.

Finally, a special argument is needed because we are using vagrant with libvirt: we need to tell kubeadm which address to advertise by passing control1’s IP to the --apiserver-advertise-address option, otherwise it may pick vagrant’s NAT interface. This is a special case specific to vagrant. If you were using other types of nodes, such as EC2 or DigitalOcean instances for example, this option might not be needed.
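If you are unsure which address to advertise, the host-only interface on these vagrant boxes is eth1 (the same interface the later playbooks reference via ansible_eth1), so you can check it with:

# on control1
ip -4 addr show eth1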

The kubeadm init step will take some time to complete, almost two minutes for me. If for some reason this step fails, and we want to retry, then we need to reset both the cluster and firewall.

kubeadm reset --force && rm -rf .kube/ /etc/kubernetes/ /var/lib/kubelet/ /var/lib/etcd \
    && iptables -F && iptables -t nat -F && iptables -t mangle -F && iptables -X
# or using ansible
ansible-playbook -u root --key-file "vagrant" XX-kubeadm_reset.yaml

If everything goes well, we want to be able to use the kubectl command. For that to work, we can either point KUBECONFIG at /etc/kubernetes/admin.conf (exporting it in ~/.bashrc to make it stick), or simply copy that config file to $HOME/.kube/config.
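The environment-variable route is a one-liner (plus one more to persist it across sessions):

export KUBECONFIG=/etc/kubernetes/admin.conf
echo 'export KUBECONFIG=/etc/kubernetes/admin.conf' >> ~/.bashrc

The playbook below takes the copy approach instead.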

- name: create .kube directory
  become: yes
  become_user: kubeadmin
  file:
    path: /home/kubeadmin/.kube
    state: directory
    mode: 0755

- name: copy admin.conf to user's kube config
  copy:
    src: /etc/kubernetes/admin.conf
    dest: /home/kubeadmin/.kube/config
    remote_src: yes
    owner: kubeadmin

Once this is done, you can run kubectl cluster-info in control1 node.

kubeadmin@kcontrolplane1:~$ kubectl cluster-info
Kubernetes control plane is running at https://172.16.16.100:6443
CoreDNS is running at https://172.16.16.100:6443/api/v1/namespaces/kube-system/services/kube-dns:dns/proxy

To further debug and diagnose cluster problems, use 'kubectl cluster-info dump'.

Note that its entry point is the virtual IP we made using keepalived. Congratulations, you now have a k8s cluster! Now let us make it highly available.

Before we join more nodes to our control plane, let us see if we can access our first control plane node through the floating IP. Remember that haproxy polls the /healthz endpoint of the kubernetes apiserver to know whether it is alive? Let us see that response for ourselves.

First, we need to SSH into whichever load balancer node currently holds the virtual IP. Check with the ip a command and look for the 172.16.16.100 address.

ssh kubeadmin@172.16.16.51
ip a

Once you are in the right node, simply curl the control plane apiserver’s health endpoint.

# -k ignores TLS certificate verification, -v is verbose mode
curl -kv https://172.16.16.100:6443/healthz

You will get a 200 HTTP status response.

*   Trying 172.16.16.100:6443...
* Connected to 172.16.16.100 (172.16.16.100) port 6443 (#0)
* ALPN, offering h2
* ALPN, offering http/1.1
* successfully set certificate verify locations:
*  CAfile: /etc/ssl/certs/ca-certificates.crt
*  CApath: /etc/ssl/certs
* TLSv1.3 (OUT), TLS handshake, Client hello (1):
* TLSv1.3 (IN), TLS handshake, Server hello (2):
* TLSv1.3 (IN), TLS handshake, Encrypted Extensions (8):
* TLSv1.3 (IN), TLS handshake, Request CERT (13):
* TLSv1.3 (IN), TLS handshake, Certificate (11):
* TLSv1.3 (IN), TLS handshake, CERT verify (15):
* TLSv1.3 (IN), TLS handshake, Finished (20):
* TLSv1.3 (OUT), TLS change cipher, Change cipher spec (1):
* TLSv1.3 (OUT), TLS handshake, Certificate (11):
* TLSv1.3 (OUT), TLS handshake, Finished (20):
* SSL connection using TLSv1.3 / TLS_AES_128_GCM_SHA256
* ALPN, server accepted to use h2
* Server certificate:
*  subject: CN=kube-apiserver
*  start date: Nov 23 09:53:08 2022 GMT
*  expire date: Nov 23 09:55:33 2023 GMT
*  issuer: CN=kubernetes
*  SSL certificate verify result: unable to get local issuer certificate (20), continuing anyway.
* Using HTTP2, server supports multi-use
* Connection state changed (HTTP/2 confirmed)
* Copying HTTP/2 data in stream buffer to connection buffer after upgrade: len=0
* Using Stream ID: 1 (easy handle 0x55ae57c6fad0)
> GET /healthz HTTP/2
> Host: 172.16.16.100:6443
> user-agent: curl/7.74.0
> accept: */*
> 
* TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
* Connection state changed (MAX_CONCURRENT_STREAMS == 250)!
< HTTP/2 200    <-------------------------------------------------------------- we want 200
< audit-id: ebc57fd0-a58d-4a30-8c2a-5584ee50fd90
< cache-control: no-cache, private
< content-type: text/plain; charset=utf-8
< x-content-type-options: nosniff
< x-kubernetes-pf-flowschema-uid: eb7cafce-8a2e-4dde-900e-ef916d1edc9e
< x-kubernetes-pf-prioritylevel-uid: 60dc2404-44cb-4b46-992b-8561247ff4f1
< content-length: 2
< date: Fri, 02 Dec 2022 06:52:06 GMT
< 
* Connection #0 to host 172.16.16.100 left intact

Now that we are successful in pinging apiserver, let us join two more nodes to the control plane. To achieve this, we simply generate both a certificate and a join token.

kubeadm init phase upload-certs --upload-certs (1)
kubeadm token create --print-join-command      (2)
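For these outputs to be usable as hostvars['control1'].cert_command and hostvars['control1'].join_command in the next play, they need to be registered as facts on control1 first. The tasks that do this are not shown in this post, but a sketch would look roughly like the following; the task names and stdout line indexing are assumptions based on how the join play consumes them:

- hosts: control1
  become: yes
  tasks:
    - name: upload certs and capture the certificate key
      shell: kubeadm init phase upload-certs --upload-certs
      register: cert_raw

    - name: create a join token
      shell: kubeadm token create --print-join-command
      register: join_raw

    - set_fact:
        cert_command: "{{ cert_raw.stdout_lines[-1] }}"   # the key is the last line printed by (1)
        join_command: "{{ join_raw.stdout_lines[0] }}"    # the full join command from (2)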

Using the information from these two outputs, we join both control2 and control3 nodes to control1.

- hosts: control2,control3
  become: yes
  tasks:
    - debug: msg="{{ hostvars['control1'].cert_command }}"
    - debug: var=ansible_eth1.ipv4.address

    - name: join control-plane
      shell: "{{ hostvars['control1'].join_command }} \
              --control-plane \
              --certificate-key={{ hostvars['control1'].cert_command }} \
              --apiserver-advertise-address={{ ansible_eth1.ipv4.address|default(ansible_all_ipv4_addresses[0]) }} \
               >> node_joined.txt"
      register: node_joined

The {{ hostvars['control1'].join_command }} portion comes from (2).

--control-plane indicates that we are joining a control plane.

--certificate-key value is obtained from the new temporary certificate we generated from (1).

--apiserver-advertise-address is, again, important because of our vagrant setup (ignore it if you are using other types of nodes). The expression picks up the current ipv4 address of each of the control2 and control3 nodes, falling back to the first address ansible discovered.

Finally, we copy /etc/kubernetes/admin.conf to $HOME/.kube/config to be able to use the kubectl command. Run this playbook with

ansible-playbook -u root --key-file "vagrant" 05-control-plane.yaml --extra-vars "@vars.yaml"

On successful join, run kubectl get no to see all nodes. no is short for nodes.

kubeadmin@kcontrolplane1:~$ kubectl get no
NAME             STATUS   ROLES           AGE   VERSION
kcontrolplane1   Ready    control-plane   82m   v1.26.0
kcontrolplane2   Ready    control-plane   80m   v1.26.0
kcontrolplane3   Ready    control-plane   80m   v1.26.0

The next task is to join our three worker nodes to the cluster. These worker nodes are where our applications will live. The following playbook first fetches a join command from control1, then runs it on every worker node.

# 06-workers.yaml

- hosts: control1
  become: yes
  gather_facts: false
  tasks:
    - name: get join command
      shell: kubeadm token create --print-join-command
      register: join_command_raw

    - name: set join command
      set_fact:
        join_command: "{{ join_command_raw.stdout_lines[0] }}"

- hosts: workers
  become: yes
  tasks:
    - name: join cluster
      shell: "{{ hostvars['control1'].join_command }} >> node_joined.txt"
      args:
        chdir: $HOME
        creates: node_joined.txt

The kubeadm token create --print-join-command step is exactly the same as in the previous playbook. It is repeated here so that, if we need to add more worker nodes in the future, we can simply re-run this playbook; thanks to the creates: node_joined.txt guard, it will not affect currently running worker nodes.

Run with

ansible-playbook -u root --key-file "vagrant" 06-workers.yaml

One last thing: we want to be able to use the kubectl command from the host machine, so we copy /etc/kubernetes/admin.conf from one of the control plane nodes to our host. This step is not compulsory, because we can already run kubectl commands inside the cluster after SSH-ing in. The reason we do it is so that we can port-forward the nginx container and see its output in a browser on our local computer, which I will show shortly.

ansible-playbook -u root --key-file "vagrant" 07-k8s-config.yaml
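A manual equivalent of what that playbook automates, assuming the kubeadmin copy of the config we created earlier, would be roughly:

mkdir -p ~/.kube
scp kubeadmin@172.16.16.101:~/.kube/config ~/.kube/config
kubectl get no   # should now work from the host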

Now you can run kubectl commands on the local computer. If you check the nodes, you will see six of them. Before deploying your apps to this cluster, you need to make sure the worker nodes are ready first, which can take about two minutes. What I like to do is use the watch command, for example watch -n 1 kubectl get no.

kubeadmin@kcontrolplane1:~$ kubectl get no
NAME             STATUS   ROLES           AGE   VERSION
kcontrolplane1   Ready    control-plane   82m   v1.26.0
kcontrolplane2   Ready    control-plane   80m   v1.26.0
kcontrolplane3   Ready    control-plane   80m   v1.26.0
kworker1         Ready    <none>          79m   v1.26.0
kworker2         Ready    <none>          79m   v1.26.0
kworker3         Ready    <none>          79m   v1.26.0

Congratulations. You now have a highly available k8s cluster running on your machine!

Now, to verify that we can deploy something in the k8s cluster, let us try deploying nginx and, additionally, accessing it from the local machine.

To do that, we need to do two things. First, create a deployment; this downloads the nginx image and deploys a pod in our cluster. Second, expose a port so we can reach the pod.

On the local machine, run:

kubectl create deployment nginx-deployment --image=nginx
kubectl expose deployment nginx-deployment --port=80 --target-port=80

Nginx listens on port 80 by default, which is what --target-port points at; --port is the port the resulting service itself listens on, and we pick 80 for that as well.
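The two flags are easy to mix up, so here is a hypothetical variant where they differ: the service listens on 8000 while traffic is still routed to the container’s port 80 (--name is needed because a service called nginx-deployment already exists):

kubectl expose deployment nginx-deployment --name=nginx-alt --port=8000 --target-port=80
# from inside the cluster: curl <nginx-alt cluster ip>:8000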

Now if we check the pods,

kubectl get po

it will show that it is still creating.

$ kubectl get po
NAME                                READY   STATUS              RESTARTS   AGE
nginx-deployment-5fbdf85c67-jgfzj   0/1     ContainerCreating   0          2s

We can use the watch command like this

watch -n 1 kubectl get po

and now we know the pod is ready.

Every 1.0s: kubectl get po

NAME                                READY   STATUS    RESTARTS   AGE
nginx-deployment-5fbdf85c67-jgfzj   1/1     Running   0          85s

We can also confirm the IP addresses of the pods by using the wide output option. You will see that the pod’s IP comes from the 192.168.0.0/16 subnet we handed to calico when we initialised the cluster.

kubeadmin@kcontrolplane1:~$ kubectl get po -o wide
NAME                                READY   STATUS    RESTARTS   AGE     IP              NODE       NOMINATED NODE   READINESS GATES
nginx-deployment-5fbdf85c67-jgfzj   1/1     Running   0          6m40s   192.168.140.1   kworker3   <none>           <none>

There are a number of ways to access our container. When we expose the deployment using the kubectl expose deployment command, it creates a Service of type ClusterIP by default. This type of service is reachable from any node, as long as you are within the k8s cluster, so you need to SSH into one of the control plane nodes beforehand. For example, let us check the service.

$ kubectl get svc

NAME               TYPE           CLUSTER-IP       EXTERNAL-IP   PORT(S)          AGE
kubernetes         ClusterIP      10.96.0.1        <none>        443/TCP          9d
nginx-deployment   ClusterIP      10.103.141.186   <none>        80/TCP           6d23h

As you can see, the service is assigned a ClusterIP of 10.103.141.186. So once you SSH in, you can curl the service, and you will get an HTML response from nginx.

ssh kubeadmin@172.16.16.103
curl 10.103.141.186
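Alternatively, instead of SSH-ing into a node, you can reach the service by its cluster DNS name from a throwaway pod; a sketch using busybox:

# one-off pod that fetches the nginx page via the service's DNS name, then removes itself
kubectl run tmp --rm -it --restart=Never --image=busybox -- \
    wget -qO- http://nginx-deployment.default.svc.cluster.local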

For now, we are most interested in accessing the nginx default page from a local browser. To do that, we perform a port forward. We forward nginx to a different port locally because I already have nginx installed on my machine; let us pick an unprivileged port (between 1024 and 65535), say 8080.

In production, you would instead look at an ingress controller such as ingress-nginx, or a bare-metal load balancer such as MetalLB. We will explore that option in part three. For now, a port forward suffices for local access.

$ kubectl port-forward deployment/nginx-deployment 8080:80 
 
Forwarding from 127.0.0.1:8080 -> 80
Forwarding from [::1]:8080 -> 80
Handling connection for 8080

Now go to your browser and access http://localhost:8080, and you will see the default nginx welcome page!

nginx default welcome page in k8s

Once you are done playing with the cluster, you can suspend the VMs by running vagrant halt. To resume, simply run vagrant up again.

Conclusion

It has been a long journey, and this post has placed a lot of emphasis on creating a highly-available k8s cluster. The post might be long (and complex), but in reality kubeadm has made the process easy by automating many tedious steps. If you want to dive into more detail, however, you can install k8s step by step using Kelsey Hightower’s Kubernetes The Hard Way.

Ok, so the question is: is this k8s cluster production-ready? We do have a highly-available setup. Or do we? We have only installed k8s on a single physical computer. What if someone trips over the Wi-Fi router cable! Best practice is to run each of the three control plane nodes in a different availability zone, for example one node in each of the three data centres in Sydney. That way, if power goes out in one data centre, the other two can pick up the traffic. More details are in the documentation. To add even more resilience, we could run one node in each of AWS, GCP, and Azure.

Of course, there is still a lot more to talk about since we have only scratched the surface of k8s, and we have only deployed a simple nginx server. In the next part of the series, we will deploy containerised frontend and backend applications into our cluster, then do cool stuff like increasing replica counts and updating application versions while watching k8s magically handle it for us. Visibility into our applications is important as well, so we will also look at OpenTelemetry for observability.