Simon Willison’s llm, datasette, and sqlite-utils

I came across this post by Xuanwo about starting a link blog in the spirit of Simon Willison’s post here. This is pretty close to what I’m striving for here, and it spurred me to write up some snippets about how I’ve gotten started with Simon’s amazing tools: llm, datasette, and sqlite-utils (and a few helpers I’ll mention below).
llm⌗
First things first, here’s what’s amazing about llm:
- Super easy to install/run (e.g., uvx llm "tell me about yourself").
- Agnostic with respect to the provider (you can use locally served models just as easily as “premium” models hosted by the big providers).
- Every command gets logged to a local sqlite database, so you can do whatever analytics you want after the fact (and it plays nicely with datasette); see the snippet after this list.
- Embeddings work in conjunction with the above right out of the box.
- You don’t need to leave the command line; the first time you use llm to construct an esoteric bash one-liner will be revelatory.
- There’s a Python API so you can integrate it with your applications.
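To make that concrete, here’s a minimal sketch of the loop (the logging commands are real llm subcommands; the tools themselves get installed in the next step):
uvx llm "tell me about yourself"   # zero-install, one-off prompt
llm logs path                      # print the location of the sqlite log database
datasette "$(llm logs path)"       # browse your prompt/response history in datasette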
First, install uv following my last post. Then install Simon’s “suite” of CLI tools:
uv tool install llm
uv tool install datasette
uv tool install sqlite-utils
uv tool install ttok
uv tool install strip-tags
uv tool install shot-scraper
uv tool install files-to-prompt
uv tool list
# csvs-to-sqlite v1.3
# - csvs-to-sqlite
# datasette v0.65.1
# - datasette
# files-to-prompt v0.4
# - files-to-prompt
# llm v0.21
# - llm
# shot-scraper v1.5
# - shot-scraper
# sqlite-utils v3.38
# - sqlite-utils
# strip-tags v0.5.1
# - strip-tags
# ttok v0.3
# - ttok
The most important concept related to these tools is “plugins”. These are built out using pluggy, and Simon has provided lots of in-depth examples for potential contributors, so if you’re interested, go have a read.
For llm, the plugins directory is here. One of the cool aspects of llm is that it’s agnostic with regard to how the LLM you’re using is deployed; it’s just as easy to point it at a “premium” offering like OpenAI as it is a local model you’re running with ollama. Typically it’s just a matter of llm install llm-[provider] or populating a config file.
Here are the “premium” APIs you’re likely to want (OpenAI is available by default):
- llm-claude-3
- llm-mistral
- llm-gemini
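For example, here’s a hedged sketch of setting a key and adding one more provider (the model name is just an example; run llm models to see what’s actually available to you):
llm keys set openai                       # paste your API key when prompted
llm -m gpt-4o-mini 'tell me about yourself'
llm install llm-mistral                   # add another provider via its plugin
llm models                                # list every model llm can now see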
Here are the most accessible local APIs you’re likely to want (at the time of this writing):
- llm-gpt4all
- llm-ollama
- llm-llamafile
It is freakishly easy to run a very good LLM locally these days. You can install and run ollama by downloading a binary and then doing ollama pull my-model (e.g., llama3.2), and you’re off to the races with llm -m llama3.2 'tell me about yourself'. Same deal with llamafile, but it’s even easier: a single binary you can run on any platform contains both the API server and the model itself!
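Concretely, the ollama route looks something like this (assuming ollama itself is installed; llama3.2 is just the example model from above):
ollama pull llama3.2                      # fetch the model weights
llm install llm-ollama                    # point llm at the local ollama server
llm -m llama3.2 'tell me about yourself'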
Of course, you can do embeddings as well. I haven’t had a chance to spend much time on embeddings just yet, but Simon has a few posts you should read. You can run these locally too; you’ll likely want the llm-sentence-transformers plugin. You can do bona fide RAG if you’ve got ~10^5 items; any more than that and you’ll likely start wanting a specialized vector database. The pertinent commands are:
- llm embed: embed content
- llm embed-models: show the available embedding models
- llm embed-multi: embed many items at once (e.g., read them in from a DB)
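Here’s a hedged sketch with a local embedding model (the collection name notes and the strings are made up; the model id is the one llm-sentence-transformers registers by default at the time of writing):
llm install llm-sentence-transformers
llm embed-models                                       # list available embedding models
llm embed notes fox -m sentence-transformers/all-MiniLM-L6-v2 \
  -c 'the quick brown fox' --store                     # embed one item into the "notes" collection
llm similar notes -c 'a speedy auburn fox'             # nearest stored items by similarity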
Here are some other neat plugins you’ll want to check out:
- llm-cmd accepts a prompt for a shell command, runs that prompt, and populates the result in your shell so you can review it, edit it, and then hit Enter to execute or Ctrl+C to cancel (see the example after this list).
- llm-cmd-comp provides a key binding for your shell that will launch a chat to build the command. When ready, hit Enter and it will go right back into your shell command line, so you can run it.
- llm-python adds an llm python command for running a Python interpreter in the same virtual environment as LLM. This is useful for debugging, and also provides a convenient way to interact with the LLM Python API if you installed LLM using Homebrew/pipx/uv.
- llm-cluster adds an llm cluster command for calculating clusters for a collection of embeddings. Calculated clusters can then be passed to a Large Language Model to generate a summary description.
- llm-jq lets you pipe in JSON data and a prompt describing a jq program, then executes the generated program against the JSON.
- llm-classify is a classifier plugin (similar API as skllm).
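A quick taste of llm-cmd (the prompt is arbitrary; nothing runs until you hit Enter):
llm install llm-cmd
llm cmd undo the last git commit     # drops the suggested command into your shell for review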
Templating is a neat feature if you’re doing “prompt engineering”: it lets you save prompts and parametrize them. Read the docs for more info.
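A minimal sketch (the template name relnotes is made up):
llm -s 'You are a terse release-notes writer.' --save relnotes   # save a reusable template
git log --oneline -20 | llm -t relnotes                          # apply it to piped input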
You can pipe file content into llm; this is particularly powerful when used in conjunction with files-to-prompt. You can do a similar thing with shot-scraper, which is more powerful than you might guess from the name: you can run JavaScript against a page and pipe the output of that into llm. So you can parse web pages and then have llm search/summarize them for you, like a poor man’s RAG.
cat README.md | llm -s 'You are groot; summarize this README.'
files-to-prompt *.py | llm -s 'You are groot; summarize these python files.'
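And a hedged sketch of the shot-scraper route mentioned above (the URL and the JavaScript snippet are placeholders):
shot-scraper javascript https://example.com/ \
  "Array.from(document.querySelectorAll('h2')).map(e => e.textContent)" | \
  llm -s 'Summarize the topics covered by these headings.'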
Honorable mention: you can use yt-dlp to pull data from YouTube and pipe that into llm too. Wild.
Ok, that was a regrettably short overview of llm. I only hit the highlights; you should absolutely read the documentation (which is quite nice), and then you’ll really appreciate how intuitive the API is.
datasette⌗
Now let’s jump into datasette. All you need is a sqlite database. Like llm, you can unlock a lot of powerful functionality by installing plugins. Go here to see the most-starred plugins. You may find yourself installing these at some point for particular use cases (you can just datasette [un]install some_package):
datasette install datasette-vega
datasette install datasette-plot
datasette install datasette-render-markdown
datasette install datasette-search-all
datasette install sqlite-vss
datasette install sqlite-lines
datasette install sqlite-html
datasette install sqlite-http
datasette install datasette-dashboards
datasette install datasette-graphql
datasette install datasette-insert
datasette install datasette-leaflet-freedraw
datasette install datasette-copyable
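Once you have a database and a few plugins, serving it is a one-liner (the database name is a placeholder):
datasette plugins            # list installed plugins
datasette mydatabase.db      # serve it at http://localhost:8001/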
Pages and API Endpoint Notes⌗
Here’s how datasette works at a high level. Every row maps to a URL. Rows that are the targets of foreign key references from other tables will show a link to a filtered search for all records that reference that row. datasette provides a JSON API for your SQLite databases: anything you can do through the datasette user interface can also be accessed as JSON via the API. To access the JSON for a page, either click on the JSON link on that page or edit the URL and add a .json extension to it.
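For instance, assuming an instance is already serving a database called mydata on the default port (names are placeholders):
curl 'http://localhost:8001/mydata/mytable.json'             # rows as JSON
curl 'http://localhost:8001/mydata/mytable.json?_labels=on'  # expand foreign keys into labels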
The default JSON representation includes a “next_url” key, which can be used to access the next page of results. If that key is null or missing, you have reached the final page of results. If you know anything about how pagination scales, you’ll be glad to know that datasette does it the Right Way (keyset pagination), which doesn’t require a full table scan, so it’s snappy even when you have lots of pages. Check out this blog for a deep dive on why OFFSET/LIMIT doesn’t scale well.
datasette can detect foreign key relationships and resolve those references into labels. The HTML interface does this by default for every detected foreign key column; you can turn that off using ?_labels=off. The practical upshot is that you see human-readable labels instead of raw foreign key values.
Querying SQL⌗
You can perform arbitrary SQL reads with datasette. Values from named parameters are treated as SQLite strings. If you need to perform numeric comparisons on them, you should cast them to an integer or float first using cast(:name as integer) or cast(:name as real), for example:
select * from Street_Tree_List
where latitude > cast(:min_latitude as real)
and latitude < cast(:max_latitude as real)
As an alternative to adding views to your database (which may be fine), you can define canned queries inside your metadata.json file. Canned queries are read-only by default; you can use the "write": true key to indicate that a canned query can write to the database.
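A minimal sketch of a canned query, assuming a database named mydata with an articles table (both hypothetical):
cat > metadata.json <<'EOF'
{
  "databases": {
    "mydata": {
      "queries": {
        "recent_articles": {
          "sql": "select * from articles order by created desc limit 10"
        }
      }
    }
  }
}
EOF
datasette mydata.db -m metadata.json   # the query shows up at /mydata/recent_articles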
sqlite has the ability to run queries that join across multiple databases; up to ten databases can be attached to a single SQLite connection and queried together. datasette can execute joins across multiple databases if it is started with the --crossdb option.
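A hedged sketch (database, table, and column names are placeholders); in --crossdb mode the attached databases are queryable from the special /_memory database:
datasette customers.db orders.db --crossdb
# then, against the /_memory database:
#   select people.name, invoices.total
#   from customers.people as people
#   join orders.invoices as invoices on invoices.person_id = people.id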
Authentication⌗
This is off by default but available through plugins. The permissions system is very extensive and beyond the scope of this post; the key question it answers is: is this actor allowed to perform this action, optionally against this particular resource?
Binary Data⌗
datasette includes special handling for binary values: the interface detects them and provides a link to download their content. Several plugins are available that change the way datasette treats binary data:
- datasette-render-binary modifies the default interface to show an automatic guess at what type of binary data is being stored, along with a visual representation of the binary value that displays ASCII strings directly in the interface.
- datasette-render-images detects common image formats and renders them as images directly in the datasette interface.
- datasette-media allows datasette instances to be configured to serve binary files from configured SQL queries, and includes the ability to resize images before serving them.
Facets⌗
datasette facets can be used to add a faceted browse interface to any database table. With facets, tables are displayed along with a summary showing the most common values in specified columns; these values can be selected to further filter the table. This is super useful for drilling into data.
The performance of facets can be greatly improved by adding indexes on the columns you wish to facet by. You can add an index using the sqlite3 command-line utility (sqlite3 mydatabase.db followed by CREATE INDEX Food_Trucks_state ON Food_Trucks("state")), or the shorter version with the sqlite-utils tool: sqlite-utils create-index mydatabase.db Food_Trucks state
Full Text Search⌗
This exists and is powerful. You can enable full-text search easily via the Python API, or you can use sqlite-utils if you’d like. Plugins like datasette-search-all are useful once you’ve enabled FTS.
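A sketch with sqlite-utils (table and column names are placeholders):
sqlite-utils enable-fts mydatabase.db documents title body   # build an FTS index on two columns
sqlite-utils search mydatabase.db documents 'quick brown fox'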
Spatial⌗
The SpatiaLite module for sqlite adds features for handling geographic and spatial data. For an example of what you can do with it, see the tutorial Building a location to time zone API with SpatiaLite.
Custom Pages and Templates⌗
You can customize pretty much everything style-wise. I’m not going to go into the specifics here because this is firmly in documentation territory, but you should know that you can.
sqlite-utils⌗
Ok, now for the final bit. sqlite-utils is probably the most mundane of the tools but arguably the most useful. In short, it gives you a suite of tools for creating, reading, and manipulating sqlite databases. The CLI is exceptional due to the way it lets you transform data by inlining Python snippets. It also has a Python API, so you can leverage it in your Python code as well.
Again, the documentation for this is great. I’ll just show you how to get started and then you can take it from there.
To run a vanilla SQL command: sqlite-utils dogs.db "select * from dogs". You can control the output format in various ways. It’s JSON by default; if you’re familiar with jq, you can just pipe it into that.
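A few of the output options (dogs.db is the running example from the docs):
sqlite-utils dogs.db "select * from dogs" --csv      # CSV instead of JSON
sqlite-utils dogs.db "select * from dogs" --table    # pretty-printed table
sqlite-utils dogs.db "select * from dogs" --nl       # newline-delimited JSON
sqlite-utils dogs.db "select * from dogs" | jq '.[0]'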
You can also run DML queries; the return value for those will indicate the number of rows affected.
You can use the --functions option to pass a block of Python code that defines additional functions, which can then be called by your SQL query.
This example defines a function which extracts the domain from a URL:
sqlite-utils query sites.db "select url, domain(url) from urls" --functions '
from urllib.parse import urlparse
def domain(url):
    return urlparse(url).netloc
'
Every callable object defined in the block will be registered as a SQL function with the same name, with the exception of functions with names that begin with an underscore.
A couple other things of note: you can load extensions, attach additional databases, and use an in-memory database (which lets you run queries directly against CSV and/or JSON files).
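A sketch of the in-memory trick and of attaching a second database (file, table, and column names are placeholders):
sqlite-utils memory data.csv "select count(*) from data"     # query a CSV directly, no .db needed
sqlite-utils dogs.db --attach other other.db \
  "select * from dogs join other.owners on owners.id = dogs.owner_id"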
Take a look at the --schema, --analyze, --dump, and --save options. They’re super useful (e.g., sqlite-utils dogs.db --dump). You can also list tables, views, indexes, and triggers; the syntax looks like sqlite-utils tables mydb.db --counts --columns.
Typically you’ll be reading in CSVs. This looks like sqlite-utils insert dogs.db dogs dogs.csv --csv. There are some tricks you can pull when loading unstructured data; this is useful when you leverage transformations with custom functions while reading data in. You can insert/replace, upsert, and execute bulk inserts.
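For instance (dogs.csv is the running example; --pk is required for upserts):
sqlite-utils insert dogs.db dogs dogs.csv --csv            # create/append rows
sqlite-utils upsert dogs.db dogs dogs.csv --csv --pk=id    # insert or update by primary key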
You can transform data in place; this is super useful when it comes to normalizing data: sqlite-utils convert content.db articles headline 'value.upper()'. Here’s something neat you can do: cat mycode.py | sqlite-utils convert content.db articles headline -. You can limit that to just the rows that match a WHERE clause using --where, which itself can be parametrized.
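A small sketch of that last point (the column and clause are placeholders):
sqlite-utils convert content.db articles headline 'value.strip()' \
  --where "headline is not null"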
Those are the basics. For the final time, I recommend reading over the documentation; it’s WILD what you can do with this tool. This is definitely more appealing to me than loading CSVs into Pandas and doing who knows what. What’s even better is that ibis interfaces just as nicely with sqlite as with any other backend. In other words, this sets the groundwork for a pretty seamless workflow for preprocessing and cleaning data.
Alright, this has been quite a lengthy post. Hopefully I’ve included all the starter snippets needed to get rolling; for the remaining questions, just check the docs. Til next time!