kaashif's blog

Programming, with some mathematics on the side

Containerizing my transcript search app

2019-06-07

Until recently, my transcript search web app was running (at https://transcripts.kaashif.co.uk, check it out) in a tmux session, with a PostgreSQL server running on the same machine in the usual way, as a daemon.

The web app knows nothing about its dependency on the database: that information is not recorded anywhere except in the code itself. And the database knows nothing about the web app. This isn't a huge problem, except that the database has a config of its own which isn't recorded anywhere in the source repo. If you try to get the web app to work with a misconfigured database, it won't work, of course.

Wouldn't it be nice if all of that configuration were in one place? And if the services all restarted themselves if they failed? And if you could migrate the entire blob of interconnected web apps and databases to a different machine with a single command?

That's where Docker comes in!

What problem are we trying to solve?

I was thinking of migrating VPS providers, and I realised that my strategy of running my web app in a tmux window was poor. All of the configs are spread out over the system: the database has been configured to run a certain way, and I have a certain way of running the web app which isn't even recorded in any script anywhere. This means someone trying to host the app on their own would have to do a lot of guesswork or rely on me to tell them what to do.

There's also a lack of fault tolerance: if the web app crashes, I have no idea and it certainly doesn't restart itself. If the database crashes, I similarly have no idea. It would be nice if we could get some auto-restarting behaviour.

Which tools will we use?

The solution I chose was Docker Compose. In the words of the Docker Compose documentation:

Compose is a tool for defining and running multi-container Docker applications.

In our case, one container is the web app and one the database. This is a very common setup and is probably exactly the use case Docker Compose was designed for. This is apparent if you go through the examples in the docs.

For those who don't know what Docker is, here is a helpful explanation:

Enterprise Container Platform for High-Velocity Innovation: Securely build, share and run any application, anywhere

-- https://www.docker.com/

Just kidding, that's weapons-grade business nonsense. A container is essentially a really lightweight copy of your machine where exactly one program is running. It shares the kernel and hardware, but has its own network and bundles its own filesystem (with all its dependencies). Here is an actually good explanation.

What does the end result look like?

Rather than describing the long and tedious process of trying to get everything to work and learning how to use Docker, let me just show you the end result.

The git repo for my web app looks like:

.
|-- build.sh
|-- cli
|-- data
|-- docker-compose.yml
|-- Dockerfile
|-- lib
|-- LICENSE
|-- make_pretty.sh
|-- README.md
|-- stack.yaml
|-- transcripts
|-- transcript-search.cabal
`-- web

Assuming you have the transcript parser built (just install Haskell Stack and run stack build --copy-bins, which will get the GHC compiler, all dependencies, build and copy the binaries to the right place), we only need to focus on:

data: contains the transcript data, ready to be loaded into the SQL database
docker-compose.yml: defines the relationships between the containers and what the names of the images used are
Dockerfile: defines our web app container

The data directory

This is produced using transcript-parse, the Swiss Army Knife of sci-fi TV show transcript parsers. A small niche, but a very important one. There are only two files here:

transcripts.tsv, which is a tab-separated values file with the entire transcript database. This is the biggest file, at 72 MB. Not quite big data yet.

load_data.sh, which the database container will pick up (more on this later) and use to load the TSV file.

docker-compose.yml

The docs for Docker Compose aren't bad; you should check them out if you want to see some more examples. My code is essentially just another example to add to the list. They actually don't have a Python Flask web app as one, so maybe my code will be instructive.

Here is the entire file:

version: '3.1'

services:
  web:
    depends_on:
      - postgres
    build: .
    ports:
      - "1234:8000"
    environment:
      DB_HOST: postgres
      DB_PASS: [redacted]
    restart: always
  postgres:
    image: postgres:11
    ports:
      - "5432:5432"
    environment:
      POSTGRES_USER: transcripts
      POSTGRES_PASSWORD: [redacted]
      POSTGRES_DB: postgres
    volumes:
      - ./data:/docker-entrypoint-initdb.d
    restart: always

For a full reference of what all of these keywords mean, check out the compose file reference: https://docs.docker.com/compose/compose-file/.

For now, I'll just focus on how it solves my problems.

The full relationship between the services is defined in one file. This is great, since it means there is no need to trawl through the code to see where the database connections are made.
restart: always! So if the web app crashes, it will just restart and the database won't know. If the database crashes, it will restart. Maybe the web app will crash too, but then it will restart and at some point the system will be working again.
depends_on means the services get started in the correct order: database then web app. Note that Compose only waits for the database container to start, not for Postgres inside it to be ready to accept connections, so the web app can't simply assume the database is always up. If the first connection attempt fails and the web app crashes, restart: always brings it back until the database is ready.
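Here is roughly what the web app's side of this arrangement could look like. This is a minimal sketch, not the app's actual code: it assumes the app reads DB_HOST and DB_PASS from its environment (the two variables set in the compose file) and retries its first connection, since depends_on only orders container startup and doesn't wait for Postgres to accept connections. The names db_settings and connect_with_retry are made up for illustration, and the user and database name are assumed to match POSTGRES_USER and POSTGRES_DB above.

```python
import os
import time


def db_settings(env=os.environ):
    """Build connection settings from the variables set in docker-compose.yml."""
    return {
        "host": env.get("DB_HOST", "localhost"),
        "password": env.get("DB_PASS", ""),
        # Assumption: these match POSTGRES_USER and POSTGRES_DB in the compose file.
        "user": "transcripts",
        "dbname": "postgres",
    }


def connect_with_retry(connect, settings, attempts=5, delay=1.0, sleep=time.sleep):
    """Call connect(**settings), retrying while the database is still starting."""
    for attempt in range(1, attempts + 1):
        try:
            return connect(**settings)
        except Exception:
            # Kept broad for the sketch; real code would probably catch only
            # the driver's connection error (psycopg2.OperationalError).
            if attempt == attempts:
                raise
            sleep(delay)
```

With psycopg2 installed, this would be called as connect_with_retry(psycopg2.connect, db_settings()). Passing the connect function in as an argument keeps the retry logic testable without a running database.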

A technical note: the data directory is mounted into the database container's filesystem at a mount point with a special name, /docker-entrypoint-initdb.d. When the container starts with an empty database directory, which is normally only the first time, the Postgres image's entrypoint script scans this directory looking for scripts to run. There is a lot of complexity hidden inside the prebuilt Postgres image, but we do not need to worry about any of this.
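To make that scanning step a bit less magical, here is a small model of it. The real logic is a shell script inside the Postgres image, so treat this as an illustration of the dispatch rules as I understand them (shell scripts are run, .sql and .sql.gz files are fed to psql, anything else is skipped), not as the image's implementation. The function name plan_init_actions is made up.

```python
def plan_init_actions(filenames):
    """Return (filename, action) pairs in the order the files would be visited."""
    actions = []
    # The entrypoint walks the directory in sorted (roughly alphabetical) order.
    for name in sorted(filenames):
        if name.endswith(".sh"):
            actions.append((name, "run as shell script"))
        elif name.endswith(".sql.gz"):
            actions.append((name, "decompress and run with psql"))
        elif name.endswith(".sql"):
            actions.append((name, "run with psql"))
        else:
            actions.append((name, "ignore"))
    return actions
```

Applied to the two files in the data directory, plan_init_actions(["transcripts.tsv", "load_data.sh"]) reports that load_data.sh is run and transcripts.tsv is ignored, which is why the TSV needs a script to load it.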

Our custom container definition

This is very simple, we just need something to run a Flask app. I picked Gunicorn, a decent enough WSGI HTTP server for Python apps. Here's the entire Dockerfile:

FROM ubuntu:18.04

LABEL maintainer="Kaashif Hymabaccus <kaashif@kaashif.co.uk>"

RUN apt-get update && \
    apt-get install -y gunicorn3 python3-flask python3-psycopg2

COPY ./web /web
WORKDIR /web

EXPOSE 8000
ENTRYPOINT [ "gunicorn3" ]
CMD [ "-b", "0.0.0.0:8000", "app:app" ]

Building the container just involves copying the scripts and assorted goodies (in the web directory) into the container. Running the container just runs the web app.
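The app:app argument in the CMD line is Gunicorn's module:variable syntax: import the module app (so presumably app.py in /web) and serve the WSGI callable named app inside it. The real app is a Flask application, but to show the interface Gunicorn expects, here is a dependency-free stand-in. This is a sketch, not my actual app, and the response body is a placeholder.

```python
def app(environ, start_response):
    """A bare WSGI callable; a Flask application object offers the same interface."""
    body = b"transcript search placeholder\n"
    start_response("200 OK", [
        ("Content-Type", "text/plain; charset=utf-8"),
        ("Content-Length", str(len(body))),
    ])
    # WSGI responses are iterables of byte strings.
    return [body]
```

Saved as app.py, this could be served the same way the Dockerfile does it: gunicorn3 -b 0.0.0.0:8000 app:app.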

Getting this up and running

$ docker-compose up -d

No, really, it's that easy! This builds the web app image if it doesn't exist yet, creates the containers and starts them in the right order, detached from your terminal. With automatic restarts, isolation and so on. Feels too easy.
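If you want to confirm the stack is actually serving requests, a tiny smoke test does the job. This is a sketch under a couple of assumptions: the web app is reachable on host port 1234 (the "1234:8000" mapping in the compose file) and answers a plain GET on / with a 2xx status. The function name is_up is made up.

```python
import urllib.error
import urllib.request


def is_up(url, timeout=5.0, opener=urllib.request.urlopen):
    """Return True if a GET to url completes with a 2xx status."""
    try:
        with opener(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (urllib.error.URLError, OSError):
        # Connection refused, timeouts and HTTP error statuses all land here.
        return False
```

On the Docker host, is_up("http://localhost:1234/") should return True once both containers have come up.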

Conclusion

Docker is great and I hope to use it in more projects in the future. Maybe at some point I'll make it big and have to delve into the world of Kubernetes and Docker Swarm - orchestration on a larger scale.

Until then, I'm happy with this small-scale success and I highly encourage you to try containerizing your web apps.

Happy hacking!