No Time for Caution: Getting Started with TimescaleDB
TimescaleDB. Is. Dope.
Back in the bad ol’ days when I worked on transient astronomy, all my timeseries data fit in memory. I’d store it in a CSV and not give it a second thought. But if you want to build a web application that leverages timeseries data, that won’t fly. You might think “I can’t store my timeseries data in a relational database, I need to stand up a whole new database for my timeseries stuff and I need to learn a whole new tsdb system and blah blah blah…”
I’m here to tell you that Postgres has got you covered. TimescaleDB is a Postgres extension that turns Postgres into a fully fledged timeseries database. And it’s not like they shoehorned some no-good raggedy slow tsdb into a Postgres interface. No, TimescaleDB is extremely performant. You get the best of both worlds: you write SQL you already know and (maybe) love, AND you get a lightning fast timeseries database. The cherry on top is that it’s super easy to self-host on your own database server.
Ok, first I decided to upgrade my PostgreSQL installation from 14 to 16. If your database is managed (e.g., on AWS), then you probably already have support for the extension and don’t need to worry about this, but that’s not my situation. Here’s what I did to upgrade my Postgres cluster (credit to this post):
# install postgresql
sudo apt install -y postgresql-common
sudo /usr/share/postgresql-common/pgdg/apt.postgresql.org.sh
# drop the newest cluster
sudo pg_dropcluster 16 main --stop
# upgrade the old cluster to the new version
sudo pg_upgradecluster 14 main
# drop the old cluster
sudo pg_dropcluster 14 main
# remove old PostgreSQL version from the system
sudo apt purge postgresql-14 postgresql-client-14
I also had to reinstall postgis with sudo apt install postgresql-16-postgis-3 && sudo apt install postgis because some other databases in this cluster need GIS support, so beware of that when upgrading your own Postgres clusters. Note that I also created a backup beforehand and restored it afterwards (see pg_dump and pg_restore).
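For reference, a basic dump-and-restore round trip with those tools looks something like this (the database name is a placeholder for whatever lives in your cluster):
# dump one database in custom format before touching anything
pg_dump -Fc -d my_database -f my_database.dump
# restore it into the (upgraded) cluster if needed
pg_restore -d my_database my_database.dump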
Next, install TimescaleDB following the instructions here. This is all pretty straightforward; just make sure you click the tab for your OS. After installing, run sudo timescaledb-tune, restart with sudo systemctl restart postgresql, and then you should be good to run CREATE EXTENSION IF NOT EXISTS timescaledb. And you know what? I’m feeling ambitious; let’s do pgvector too so I can jump on the LLM hype train and play around with some embeddings. In a shell, do sudo apt install postgresql-16-pgvector. Then, in a psql session or in a SQL script, do CREATE EXTENSION IF NOT EXISTS vector.
Verify with \dx:
List of installed extensions
Name | Version | Schema | Description
-------------+---------+------------+---------------------------------------------------------------------------------------
plpgsql | 1.0 | pg_catalog | PL/pgSQL procedural language
timescaledb | 2.16.0 | public | Enables scalable inserts and complex queries for time-series data (Community Edition)
vector | 0.7.3 | public | vector data type and ivfflat and hnsw access methods
Boom, done. Gotta love Postgres amirite? Ok, moving on…
The project I have in mind is pretty simple: I want to track the number of votes and downloads for Kaggle datasets and notebooks over time. Sounds pretty straightforward. Let’s talk schema. Here’s a great blog post from TimescaleDB about timeseries schema design. I’m going to favor a “narrow” schema; I’ll have one table per metric because it’s nice and simple and easy to manage.
Unfortunately, atlas doesn’t seem to play well with the DDL/migration commands needed to set up timeseries tables: it uses a “clean” dev database, and the extra tables the TimescaleDB extension creates keep it from cleaning up properly. Not to worry though, we can brew install golang-migrate and we’re off to the races with a more conventional migration tool. Tutorial is here for the uninitiated.
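If you haven’t used golang-migrate before, the day-to-day loop looks roughly like this (the migrations directory name is my own choice, not gospel):
# create a numbered pair of up/down migration files
migrate create -ext sql -dir migrations -seq create_votes_table
# apply any pending migrations against the database
migrate -database ${DATABASE_URL} -path migrations up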
Also, I’m building the models out with sqlc, so we’ll have to maintain a schema.sql file alongside our migrations. Not ideal, but I’m not going to let perfect be the enemy of good enough. If you really need to, you can run pg_dump -d ${DATABASE_URL} -s -b -v -f pgdump.sql and then copy/paste the schema into your schema.sql.
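For what it’s worth, the sqlc.yaml driving the generation can be as small as the sketch below; the package name, output directory, and pgx/v5 driver choice are my own assumptions, picked to line up with the dbgen import in the handlers later on:
version: "2"
sql:
  - engine: "postgresql"
    schema: "schema.sql"
    queries: "queries.sql"
    gen:
      go:
        package: "dbgen"
        out: "dbgen"
        sql_package: "pgx/v5"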
Alright, let’s get this thing built!
Here’s my 000001_create_votes_table.up.sql migration file that will set up the schema:
BEGIN;

CREATE EXTENSION IF NOT EXISTS timescaledb;
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE IF NOT EXISTS notebook_votes (
    slug VARCHAR(255) NOT NULL,
    ts TIMESTAMPTZ NOT NULL,
    val INTEGER NOT NULL
);
SELECT create_hypertable('notebook_votes', 'ts', if_not_exists => TRUE);
CREATE INDEX notebook_votes_slug ON notebook_votes (slug, ts);

CREATE TABLE IF NOT EXISTS notebook_downloads (
    slug VARCHAR(255) NOT NULL,
    ts TIMESTAMPTZ NOT NULL,
    val INTEGER NOT NULL
);
SELECT create_hypertable('notebook_downloads', 'ts', if_not_exists => TRUE);
CREATE INDEX notebook_downloads_slug ON notebook_downloads (slug, ts);

CREATE TABLE IF NOT EXISTS dataset_votes (
    slug VARCHAR(255) NOT NULL,
    ts TIMESTAMPTZ NOT NULL,
    val INTEGER NOT NULL
);
SELECT create_hypertable('dataset_votes', 'ts', if_not_exists => TRUE);
CREATE INDEX dataset_votes_slug ON dataset_votes (slug, ts);

CREATE TABLE IF NOT EXISTS dataset_downloads (
    slug VARCHAR(255) NOT NULL,
    ts TIMESTAMPTZ NOT NULL,
    val INTEGER NOT NULL
);
SELECT create_hypertable('dataset_downloads', 'ts', if_not_exists => TRUE);
CREATE INDEX dataset_downloads_slug ON dataset_downloads (slug, ts);

COMMIT;
Here’s my 000001_create_votes_table.down.sql for reference:
BEGIN;
DROP TABLE notebook_votes;
DROP TABLE notebook_downloads;
DROP TABLE dataset_votes;
DROP TABLE dataset_downloads;
COMMIT;
Here are the queries I want to run, with their respective sqlc annotations:
-- name: InsertNotebookVotes :exec
INSERT INTO notebook_votes (slug, ts, val)
VALUES (@slug, NOW()::TIMESTAMPTZ, @votes);
-- name: GetNotebookVotes :many
SELECT *
FROM notebook_votes
WHERE slug = @slug;
-- name: InsertNotebookDownloads :exec
INSERT INTO notebook_downloads (slug, ts, val)
VALUES (@slug, NOW()::TIMESTAMPTZ, @downloads);
-- name: GetNotebookDownloads :many
SELECT *
FROM notebook_downloads
WHERE slug = @slug;
-- name: InsertDatasetVotes :exec
INSERT INTO dataset_votes (slug, ts, val)
VALUES (@slug, NOW()::TIMESTAMPTZ, @votes);
-- name: GetDatasetVotes :many
SELECT *
FROM dataset_votes
WHERE slug = @slug;
-- name: InsertDatasetDownloads :exec
INSERT INTO dataset_downloads (slug, ts, val)
VALUES (@slug, NOW()::TIMESTAMPTZ, @downloads);
-- name: GetDatasetDownloads :many
SELECT *
FROM dataset_downloads
WHERE slug = @slug;
Now just run sqlc generate and you’re done. Yeah, that’s all. At this point we’re ready to wire this puppy up to an HTTP server and start ingesting and serving timeseries! Here’s an example:
func handleNotebookVotesPost(l *slog.Logger, q *dbgen.Queries) http.HandlerFunc {
    return func(w http.ResponseWriter, r *http.Request) {
        type payload struct {
            Slug  string  `json:"slug"`
            Votes float64 `json:"votes"`
        }
        var p payload
        defer r.Body.Close()
        err := json.NewDecoder(r.Body).Decode(&p)
        if err != nil {
            writeBadRequestError(w, err)
            return
        }
        if p.Slug == "" {
            writeBadRequestError(w, fmt.Errorf("must supply timeseries slug"))
            return
        }
        err = q.InsertNotebookVotes(
            r.Context(),
            dbgen.InsertNotebookVotesParams{
                Slug: p.Slug, Votes: int32(p.Votes)})
        if err != nil {
            writeInternalError(l, w, err)
            return
        }
        writeOK(w)
    }
}
func handleNotebookVotesGet(l *slog.Logger, q *dbgen.Queries) http.HandlerFunc {
    return func(w http.ResponseWriter, r *http.Request) {
        slug := r.URL.Query().Get("slug")
        if slug == "" {
            writeBadRequestError(w, fmt.Errorf("must supply timeseries slug"))
            return
        }
        res, err := q.GetNotebookVotes(r.Context(), slug)
        if err != nil {
            writeInternalError(l, w, err)
            return
        }
        w.Header().Set("Content-Type", "application/json")
        w.WriteHeader(http.StatusOK)
        json.NewEncoder(w).Encode(res)
    }
}
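The wiring itself isn’t shown above, so here’s a minimal sketch of a main that hangs these handlers off a mux. The route paths, the module path for the dbgen package, and the pgx pool setup are my own assumptions (it also presumes sqlc generated code against pgx/v5), so adapt as needed:
package main

import (
    "context"
    "log/slog"
    "net/http"
    "os"

    "github.com/jackc/pgx/v5/pgxpool"

    "example.com/kaggle-tracker/internal/dbgen" // hypothetical module path for the sqlc output
)

func main() {
    l := slog.New(slog.NewTextHandler(os.Stderr, nil))

    // Assumes sqlc was generated with sql_package pgx/v5, so *pgxpool.Pool satisfies dbgen.DBTX.
    pool, err := pgxpool.New(context.Background(), os.Getenv("DATABASE_URL"))
    if err != nil {
        l.Error("connecting to database", "err", err)
        os.Exit(1)
    }
    defer pool.Close()
    q := dbgen.New(pool)

    mux := http.NewServeMux()
    // Method-qualified patterns need Go 1.22+; the paths themselves are placeholders.
    mux.HandleFunc("POST /notebooks/votes", handleNotebookVotesPost(l, q))
    mux.HandleFunc("GET /notebooks/votes", handleNotebookVotesGet(l, q))

    l.Info("listening", "addr", ":8080")
    if err := http.ListenAndServe(":8080", mux); err != nil {
        l.Error("server stopped", "err", err)
    }
}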
One more thing before I go: materialized views and continuous aggregates. TimescaleDB alone can buy you over an order of magnitude speedup in query performance, and continuous aggregates (which are basically auto-refreshing materialized views) can buy you another 1-2 orders of magnitude on top of that (!!!). They’re dead simple to set up, too. Read the docs linked above, but all you have to do is something like:
CREATE MATERIALIZED VIEW IF NOT EXISTS notebook_votes_mat
WITH (timescaledb.continuous) AS
SELECT slug,
       time_bucket(INTERVAL '15 min', ts) AS bucket,
       MAX(val) AS val
FROM notebook_votes
GROUP BY slug, bucket;

SELECT add_continuous_aggregate_policy('notebook_votes_mat',
    start_offset => INTERVAL '7 days',
    end_offset => INTERVAL '1 hour',
    schedule_interval => INTERVAL '1 hour');
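Reading from the aggregate is then just a plain SELECT against the view (the slug below is a placeholder):
-- 15-minute max-vote buckets for one notebook, served from the continuous aggregate
SELECT slug, bucket, val
FROM notebook_votes_mat
WHERE slug = 'some-user/some-notebook'
ORDER BY bucket;
Keep in mind the policy above only materializes buckets older than an hour, on an hourly schedule, so depending on your TimescaleDB version and real-time aggregation settings the view can lag the raw hypertable a bit.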
Alrighty. I’m going to pause the post here and get to work populating these tables, but I’ll be back with some semantic search stuff with pgvector in the next post or two. Until next time…