Using PostgreSQL to search transcripts
2019-03-31
Remember my transcript search engine, https://transcripts.kaashif.co.uk?
I'm not a database expert, but even I realised that spending hours and hours trying to optimise my homebrew transcript search engine was a waste of time. For no reason at all other than to try it out, I went with ElasticSearch. The astute reader will notice that ElasticSearch is meant for "big" data. My collection of transcripts tops out at a few dozen megabytes - that is most certainly not big (or even medium-sized) data.
So after getting some real-world experience with SQL databases (at a real company) and taking an in-depth database algorithms course at university, I decided to convert my web app to use a PostgreSQL database to store the transcripts, searching with bog-standard SQL queries.
There were a couple of neat tricks I used to speed things up, too.
The ultimate optimisation
I wrote an article about how I optimised my transcript parser. Lots of great tips there, but the greatest optimisation of them all is not running the code at all. Why do we need any parsing done while the web app is running, anyway? Of course, we don't - the transcripts are static and unchanging, so we can just pre-parse everything.
What happens now is the parser runs once and parses the transcripts into a table with these columns:
create table raw_transcripts (
series text,
season_number int,
episode_number int,
episode_title text,
scene_number int,
line_number int,
person text,
present text,
place text,
speech text);
The old data was nested: a series has episodes, an episode has scenes, each scene has lines. Scenes have the same people present and take place in a single place. So of course, there is a lot of redundancy introduced when we de-nest the transcript structure like this.
The key thing to notice is that the data is so small that none of this matters!
Instead of doing some parsing for each request, we now just do some database calls. An upside to this is that the Haskell code only runs once, offline, so it can be written for maximum readability rather than speed.
Essentially, the whole of that optimisation article is obsolete.
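To make the offline step concrete, here is a rough sketch of what the one-off load could look like. The real parser is written in Haskell; this Python version uses psycopg2, and the transcript_parser module, its nested structure and the connection string are all invented for illustration - only the table and column names come from the schema above.

import psycopg2

# Hypothetical parser module standing in for the real Haskell parser.
# parse_all() is assumed to yield the nested structure described above:
# a series has episodes, an episode has scenes, each scene has lines.
from transcript_parser import parse_all

def flatten(series):
    # De-nest one series into rows matching the raw_transcripts columns
    for episode in series["episodes"]:
        for scene in episode["scenes"]:
            for line in scene["lines"]:
                yield (series["name"], episode["season"], episode["number"],
                       episode["title"], scene["number"], line["number"],
                       line["person"], scene["present"], scene["place"],
                       line["speech"])

conn = psycopg2.connect("dbname=transcripts")  # placeholder connection string
with conn, conn.cursor() as cur:
    for series in parse_all():
        cur.executemany(
            """INSERT INTO raw_transcripts
                 (series, season_number, episode_number, episode_title,
                  scene_number, line_number, person, present, place, speech)
               VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s)""",
            list(flatten(series)))
conn.close()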
Expected workload
We need to know what the expected workload is so that we can organise our indices, primary keys, clustering, etc. This has a huge impact on performance.
In a typical session with the transcript search engine, a user will search for a fragment of speech, maybe with the place, speaker, episode, series or season specified. Indexing on speech is essentially useless: a user will almost never search for a line of speech by exact equality, and an ordinary B-tree index can't help with a substring match anyway. Since speech is the most important column, indexing won't speed up searches - we'll always have to do a full table scan.
While the other columns like season number and episode number seem mildly indexable, the reality is that a user will basically never use these. No-one remembers a line by which season it was in; they usually just remember some fragment of the line.
The most important optimisation here is related to the form the user expects the results in: the line of speech with some surrounding context.
Context is for kings
Our web app wants to make queries that look like:
SELECT * FROM transcripts WHERE speech ILIKE '%somestring%';
As discussed before, this is always going to incur a full table scan - indices can't help.
Then we want to get the context:
SELECT * FROM transcripts
WHERE line_number >= x-2 AND line_number <= x+2;
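In the web app, these queries would presumably be issued with bound parameters rather than by pasting strings together. Here's a rough sketch of that flow using psycopg2 - the function name, cursor handling and context window size are my own choices for illustration, not the app's actual code:

def search_with_context(cur, fragment, context=2):
    # Full table scan: every line whose speech contains the fragment
    cur.execute("SELECT line_number FROM transcripts WHERE speech ILIKE %s",
                ("%" + fragment + "%",))
    line_numbers = [row[0] for row in cur.fetchall()]

    results = []
    for n in line_numbers:
        # Tiny range query for the surrounding lines
        cur.execute(
            "SELECT * FROM transcripts "
            "WHERE line_number >= %s AND line_number <= %s",
            (n - context, n + context))
        results.append(cur.fetchall())
    return results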
But here is an opportunity to cluster the table. Why not order the lines on-disk according to their line numbers? Then we can simply look at the surrounding records, which have probably already been loaded into memory along with the matching record - databases load records into memory in batches or pages.
To take advantage of this, I assign each line an ID which is globally unique and orders the lines chronologically (in the order they appear in each script).
An informed reader might notice that PostgreSQL doesn't really support clustered indices. This is true, but we can cluster the table on an index and, as long as the table never changes (this is true in our case), the clustering will be maintained. Inasmuch as never interacting with the order counts as "maintaining" it.
In PostgreSQL, this is easy: just add an id column as the PRIMARY KEY, and you get an index on it automatically. We can then cluster the table on that index with CLUSTER transcripts USING transcripts_pkey (transcripts_pkey being the index PostgreSQL creates for the primary key).
Since we are exclusively going to be doing very small range queries (maybe 5 records at most), we might wonder whether a hash index or something would suit better - though hash indices only support equality lookups, not ranges. In any case this never caused a performance issue, so I left it as-is, with the default B-tree index.
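Spelled out as a one-off setup script (psycopg2 again; I'm assuming here that the transcripts table the app queries has the same columns as raw_transcripts plus the id column):

import psycopg2

conn = psycopg2.connect("dbname=transcripts")  # placeholder connection string
with conn, conn.cursor() as cur:
    # Assumed table definition: raw_transcripts' columns plus the globally
    # unique, chronologically ordered id as the primary key.
    cur.execute("""
        CREATE TABLE transcripts (
            id int PRIMARY KEY,  -- PostgreSQL creates transcripts_pkey for this
            series text,
            season_number int,
            episode_number int,
            episode_title text,
            scene_number int,
            line_number int,
            person text,
            present text,
            place text,
            speech text)""")
    # ... populate transcripts from the parsed data, in chronological order ...
    # Rewrite the table in index order so neighbouring ids share disk pages
    cur.execute("CLUSTER transcripts USING transcripts_pkey")
conn.close()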
End result
I swapped out the Haskell web app for a bog-standard Python web app written with Flask, about a hundred lines long. I'd never want to write a parser in Python and I didn't really want to write a web app in Haskell, so this situation is much better.
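To give a flavour of how small such an app can be, here is a minimal sketch of the same idea using Flask and psycopg2. The route, query parameter, column selection and context window are my own inventions for illustration, not taken from the real app:

from flask import Flask, jsonify, request
import psycopg2

app = Flask(__name__)
conn = psycopg2.connect("dbname=transcripts")  # placeholder connection string

@app.route("/search")
def search():
    fragment = request.args.get("q", "")
    results = []
    with conn, conn.cursor() as cur:
        # Full scan for matching lines...
        cur.execute("SELECT id FROM transcripts WHERE speech ILIKE %s",
                    ("%" + fragment + "%",))
        ids = [row[0] for row in cur.fetchall()]
        # ...then a tiny range query per match against the id-clustered table
        for i in ids:
            cur.execute(
                "SELECT person, speech FROM transcripts "
                "WHERE id >= %s AND id <= %s ORDER BY id",
                (i - 2, i + 2))
            results.append(cur.fetchall())
    return jsonify(results)

if __name__ == "__main__":
    app.run()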
Performance is through the roof compared to the old handrolled searching code and even the ElasticSearch engine (I'm sure it's fantastic for big data...).
What an effect a tiny bit of database knowhow can have!