Back again! I came across an interesting YouTube video about DuckLake. If you want to know about DuckLake, watch/read that content; I’m not going to regurgitate it here (I simply cannot do it justice). The gist can be summarized by this snippet from the manifesto: “Once a database has entered the Lakehouse stack anyway, it makes an insane amount of sense to also use it for managing the rest of the table metadata!”

Instead, this post is going to be a real-time walkthrough of me standing up a data lakehouse with DuckLake. It’s roughly 6:30pm; let’s see how long it takes me (this will include the time it takes me to RTFM).

Here are my assumptions/selling points motivating this:

  • It’s fast.
  • It’s cheap.
  • Totally decoupled storage and compute.
  • Vendor agnostic (all you need is a SQL DB and object storage).
  • Trivially encrypt your object store data.
  • Finally true “multiplayer” DuckDB.

Ok, I want to see how these prove out. Rather than take their word for it, I’m going to stand up a DuckLake service.

Here’s the tech stack:

  • DuckDB running on my Macbook Pro. Nothing special.
  • Metadata stored in a Neon Postgres DB.
  • Object storage Cloudflare R2.

First things first, let’s install DuckLake. You need DuckDB >= 1.3.0. Install it:

curl https://install.duckdb.org | sh

Now install DuckLake:

 »  duckdb
DuckDB v1.3.0 (Ossivalis) 71c5c07cdd
Enter ".help" for usage hints.
Connected to a transient in-memory database.
Use ".open FILENAME" to reopen on a persistent database.
D install ducklake;
D

…that’s it? Ok…now let’s set up our object storage. Go to Cloudflare and click on R2. I had to add a subscription for it since I hadn’t used it before; it’s free if you’re below 10GB or whatever. Create a bucket, then go into its settings and enable the Public Development URL. We’re going to encrypt our object storage data, so we don’t care if anyone can read our bucket. Here’s my bucket: https://pub-27235883166947ddb7f64d4c40002cad.r2.dev. Grab an API token too; you’ll need the Access Key ID and Secret Access Key for S3-compatible clients. You can also grab your Account ID, which is in your S3 URL.

Let’s use those credentials to enable DuckDB to use S3-compatible storage. Note that R2 does support the S3 API, but DuckDB has a special secret type to make this a tad easier:

INSTALL httpfs;
LOAD httpfs;
CREATE OR REPLACE SECRET secret (
    TYPE r2,
    KEY_ID '12345',
    SECRET '1234567890',
    ACCOUNT_ID 'abc123'
);
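If you’d rather not use the R2 secret type (or you’re pointing at some other S3-compatible store), a plain S3 secret with an explicit endpoint should also work. This is just a sketch: the endpoint assumes R2’s usual `<ACCOUNT_ID>.r2.cloudflarestorage.com` form, reusing the same placeholder credentials as above.

```sql
-- Sketch: generic S3-style secret aimed at R2's S3 endpoint.
-- 'abc123' is the same placeholder Account ID as above.
CREATE OR REPLACE SECRET r2_via_s3 (
    TYPE s3,
    KEY_ID '12345',
    SECRET '1234567890',
    ENDPOINT 'abc123.r2.cloudflarestorage.com'
);
```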

Next, let’s set up our Postgres DB for the catalog. Go into Neon, create a new project and associated database and grab your credentials. Here’s how we set this up in DuckDB (with encryption!):

INSTALL postgres;
ATTACH 'ducklake:postgres:dbname=ducklake_catalog user=[USER] password=[PASSWORD] host=[HOST].us-west-2.aws.neon.tech' AS my_ducklake
    (DATA_PATH 'r2://ducklake-test', ENCRYPTED);
USE my_ducklake;
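To see what that ATTACH actually did, you can peek at the catalog from the Postgres side (e.g. via `psql` against the Neon DB). DuckLake stores its metadata in plain tables; per the spec they’re prefixed `ducklake_`, so something like this should list them (a sketch, assuming the default naming):

```sql
-- Run against the Neon Postgres DB, not DuckDB:
-- list the metadata tables DuckLake just created.
SELECT table_name
FROM information_schema.tables
WHERE table_name LIKE 'ducklake_%'
ORDER BY table_name;
```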

Ummmm…uhhh…wait. We’re done? Shirley you can’t be serious. Let’s smoke test this.

-- Create a simple table
CREATE TABLE ONLINE_GAMING_PLATFORM (
    GAME_ID INTEGER,
    GAME_TITLE VARCHAR,
    GAME_LAUNCH_DATE DATE
);

-- Insert some dummy data
INSERT INTO ONLINE_GAMING_PLATFORM (GAME_ID, GAME_TITLE, GAME_LAUNCH_DATE) VALUES
(1, 'Cosmic Conquest', '2023-03-15'),
(2, 'Mystic Realms', '2022-11-01'),
(3, 'Cyber Sprint', '2024-01-20');

-- Check if the data is there
SELECT * FROM ONLINE_GAMING_PLATFORM;

-- Now, let's see if the data ended up in R2.
FROM glob('r2://ducklake-test/**');

Holy shit, it works. That’s nuts. You can go through the rest of features to convince yourself it works. For instance, you can update like regular SQL, and you’ll see more files pop up:

UPDATE ONLINE_GAMING_PLATFORM SET GAME_TITLE = 'foo' WHERE GAME_ID = 1;
FROM glob('r2://ducklake-test/**');

And you can time travel:

FROM my_ducklake.snapshots();
SELECT GAME_TITLE FROM ONLINE_GAMING_PLATFORM AT (VERSION => 2);
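Per the docs, you can also time travel by timestamp rather than version number. A sketch (the five-minute interval is arbitrary):

```sql
-- Read the table as it looked five minutes ago.
SELECT GAME_TITLE
FROM ONLINE_GAMING_PLATFORM AT (TIMESTAMP => now() - INTERVAL 5 MINUTE);
```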

Finally, to detach, you just need to select another DB as the default, and then detach:

USE memory;
DETACH my_ducklake;
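That “multiplayer” selling point falls out of this design: any other DuckDB client with the same Postgres credentials can attach to the same lake. If a client only needs to read, I believe DuckDB’s standard READ_ONLY option applies to the attach as well (a sketch, same placeholder credentials as before):

```sql
-- Sketch: a second client attaching to the same lake, read-only.
ATTACH 'ducklake:postgres:dbname=ducklake_catalog user=[USER] password=[PASSWORD] host=[HOST].us-west-2.aws.neon.tech' AS my_ducklake
    (DATA_PATH 'r2://ducklake-test', READ_ONLY);
```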

Wow, that’s actually wild. This is really impressive both from a technology standpoint and from a developer experience standpoint. I mean, I’m a DuckDB novice and it’s barely 7:30. It took me less than an hour to stand this thing up from scratch, and that includes time spent setting up Postgres and R2, as well as reading the docs for the configuration nuances. Ok, well, I’m profoundly impressed, BYE.