How I Learned to Stop Worrying and Trust Statistics

It’s 3am and a dashboard line moved. Someone got paged. They’re awake now, squinting at a wiggle, trying to decide if it’s real. Here’s the uncomfortable truth about that moment: the number always changed. Every point on every chart is different from the last one. The question that actually matters — the only question — is whether the process that generates the number changed. And almost nobody answers that question with anything more principled than vibes.

This post is about answering it with one constant from 1924, a bucket of Parquet files, and DuckDB.

Long-time readers know I have a soft spot for venerable, empirical statistics that punch way above their weight. My Poisson confidence intervals calculator exists because of Gehrels (1986) — a few decades old, basically a lookup table, and still the thing I reach for whenever “we observed zero” needs to become an actual limit on a rate (zero never fluctuates to one, but one sometimes fluctuates to zero). There’s a pattern here: the sturdiest tools in statistics tend to be old, simple, and allergic to assumptions. Today’s entry is older still.

In the last post I replaced Kafka with Postgres and a bucket, and promised a growing gallery of expensive infrastructure short-circuited by boring alternatives. One of the items on that ballot was the observability stack — and before I can short-circuit the storage side of that (coming, I promise), I needed the brain: the thing that looks at a metric and tells you, with a straight face, whether anything actually happened. That brain is called statistical process control, the tool is called an XmR chart, and the whole algorithm fits on a napkin.

The two expensive mistakes

When a metric moves and you have to decide whether to care, there are exactly two ways to be wrong:

Chase routine noise. You burn an investigation on nothing. Worse, if you “fix” a stable process in response to individual points, you’re tampering, and Deming’s funnel experiment showed that tampering increases variation. You made it worse by reacting.
Dismiss a real shift. The regression ships. The pump fails. The fraud continues.

Most alerting setups handle this tradeoff by… someone typing a threshold into a YAML file. Page when latency > 500ms. Says who? Based on what? That number is a vibe with a uniform. SPC replaces it with limits the process itself tells you.

The whole algorithm

Take your metric’s values from a baseline period — a few weeks where you believe nothing weird happened. Compute:

X̄    = mean(x)
mR̄   = mean(|xᵢ − xᵢ₋₁|)        # average gap between consecutive points
UNPL = X̄ + 2.66 · mR̄            # upper natural process limit
LNPL = X̄ − 2.66 · mR̄            # lower natural process limit
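
For anyone who prefers code to notation, here is a minimal Python sketch of those four lines. The function name `xmr_limits` and the sample values are made up for illustration; this is not duck-spc's implementation.

```python
from statistics import fmean

def xmr_limits(baseline):
    """Return (center, mr_bar, lnpl, unpl) for a list of baseline values."""
    if len(baseline) < 2:
        raise ValueError("need at least two points to form a moving range")
    center = fmean(baseline)
    # Moving ranges: absolute gaps between consecutive points.
    moving_ranges = [abs(b - a) for a, b in zip(baseline, baseline[1:])]
    mr_bar = fmean(moving_ranges)
    return center, mr_bar, center - 2.66 * mr_bar, center + 2.66 * mr_bar

# Hypothetical baseline values, purely for illustration.
baseline = [331.0, 330.2, 332.1, 331.5, 330.8, 331.9, 330.4, 331.3]
center, mr_bar, lnpl, unpl = xmr_limits(baseline)
print(f"center={center:.2f} mR-bar={mr_bar:.2f} limits=[{lnpl:.2f}, {unpl:.2f}]")
```

Note that the moving range has one fewer term than the data, because the first point has no predecessor.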

Freeze those limits. Forever (or until you deliberately change the process). Then two rules, and only two:

Rule 1: a point lands outside the limits → something happened, go find it.
Rule 2: nine consecutive points on one side of the center line → the level shifted, go find out why.
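
The two rules fit in a dozen lines. This is a sketch under my own assumptions: the name `xmr_signals` is hypothetical, a point sitting exactly on the center line breaks a run, and Rule 2 keeps firing on every point once the run has reached nine.

```python
def xmr_signals(values, center, lnpl, unpl, run_length=9):
    """Return a list of (index, rule) tuples for points that signal."""
    signals = []
    run_side, run_count = 0, 0  # side: +1 above center, -1 below, 0 on it
    for i, x in enumerate(values):
        # Rule 1: the point is outside the natural process limits.
        if x > unpl or x < lnpl:
            signals.append((i, "rule1"))
        # Rule 2: track how long we have stayed on one side of the center.
        side = (x > center) - (x < center)
        if side != 0 and side == run_side:
            run_count += 1
        else:
            run_side, run_count = side, (1 if side != 0 else 0)
        if run_count >= run_length:
            signals.append((i, "rule2"))
    return signals

# Illustrative data: one spike, then nine points in a row above the center.
values = [9, 11, 17, 9] + [11] * 9
print(xmr_signals(values, center=10, lnpl=4, unpl=16))
# [(2, 'rule1'), (12, 'rule2')]
```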

Inside the limits? Nothing happened. Not “probably nothing” — the honest, statistically defensible answer is that points inside natural process limits carry no explanation. There is no root cause to find. Go back to sleep. That’s the entire pitch: this is anomaly detection you can run in your head, and people have replaced fleets of deep-learning anomaly detectors with exactly this, at three or four orders of magnitude fewer parameters, because a small team can actually understand and trust it.

Where 2.66 comes from (and why your weird data doesn’t break it)

Two ingredients. First, for consecutive points drawn from a stable process, the expected gap between neighbours is roughly the same multiple of sigma for almost any distribution: E|xᵢ − xᵢ₋₁| ≈ 1.128σ. (The constant is 2/√π, which is exact for a normal; many other shapes land near it.) So the mean moving range gives you a sigma estimate: σ̂ = mR̄ / 1.128. Second, limits go at ±3 sigma: Shewhart’s economic choice, backed by a century of practice balancing false-alarm cost against missed-signal cost. Multiply: 3 / 1.128 = 2.66. Not arbitrary, and (this is the part everyone gets wrong) not a normality assumption.
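
The arithmetic is easy to check. The sketch below uses closed-form values I am confident of: for two independent draws, E|x − y| / σ is 2/√π for a normal, 2/√3 for a uniform, and exactly 1 for an exponential. The choice of those three distributions is mine, picked only to show what "roughly" means here.

```python
import math

# Normal: the difference of two draws is N(0, 2*sigma^2), so E|x - y| = 2*sigma/sqrt(pi).
d2 = 2 / math.sqrt(math.pi)
print(round(d2, 3))        # 1.128
print(round(3 / d2, 2))    # 2.66

# Uniform on [0, 1]: E|x - y| = 1/3 and sigma = 1/sqrt(12).
uniform_ratio = (1 / 3) / (1 / math.sqrt(12))
# Exponential with rate 1: E|x - y| = 1 and sigma = 1.
exponential_ratio = 1.0
print(round(uniform_ratio, 3), exponential_ratio)   # 1.155 1.0
```

So the ratio moves by about ten percent across three very different shapes, which is why a single constant gets away with it.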

“But my data isn’t normal!” Good news: nothing above assumed it was. Donald Wheeler — the modern evangelist of this whole approach — has been making this exact point for forty years, most pointedly in his Quality Digest piece Do You Have Leptokurtophobia? and at book length in Understanding Statistical Process Control. The short version: testing your data for normality before you’ll trust a process-behaviour chart is the wrong move, because the chart was never asking for normality in the first place. Suppose the absolute worst. Suppose an adversary designs your distribution:

What you’re willing to assume	P(stable point beyond 3σ)
Nothing at all (finite variance) — Chebyshev	≤ 1/9 ≈ 11.1%
Unimodal, that’s it — Vysochanskij–Petunin	≤ 4/81 ≈ 4.9%
Normal	0.27%

Even against the pathological worst case, 3-sigma limits false-alarm on at most one stable point in nine. Add the single weakest assumption you can check by squinting at a histogram — one hump — and the ceiling drops under 5%.
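
The three rows of that table are one-liners, so here they are as a quick sanity check. These are the standard textbook forms of the bounds, evaluated at k = 3 sigma; the variable names are mine.

```python
import math

k = 3.0
chebyshev = 1 / k**2                      # any distribution with finite variance
vysochanskij_petunin = 4 / (9 * k**2)     # any unimodal distribution (needs k > sqrt(8/3))
normal = math.erfc(k / math.sqrt(2))      # exact two-sided tail of a normal

print(f"Chebyshev            {chebyshev:.1%}")
print(f"Vysochanskij-Petunin {vysochanskij_petunin:.1%}")
print(f"Normal               {normal:.2%}")
```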

But bounds are bounds and vibes are not a benchmark, so I measured the whole procedure: 28-point baseline, mR̄-estimated sigma, frozen limits, then count false alarms on 500 in-control points, 2,000 trials per distribution. Estimation error included, nothing hidden:

distribution	false alarms per stable point
uniform	0.03%
bimodal mixture	0.04%
normal	0.81%
exponential (skewed)	3.43%
lognormal (heavy tail)	4.49%
pareto α=2.5 (very heavy tail)	4.80%

Every monster lands under the unimodal bound and at less than half of Chebyshev’s ceiling. You don’t need to know your distribution. That’s not a slogan, it’s a table.
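
If you want to rerun something like this yourself, here is a rough sketch of the procedure in plain Python. It is not the script behind the table: the distribution parameters, the seed, and the smaller trial count are my assumptions, so expect the numbers to differ from the ones above.

```python
import random
from statistics import fmean

def false_alarm_rate(draw, trials=200, n_baseline=28, n_check=500, seed=0):
    """Fraction of in-control points that land outside limits frozen from a baseline."""
    rng = random.Random(seed)
    alarms = 0
    for _ in range(trials):
        base = [draw(rng) for _ in range(n_baseline)]
        center = fmean(base)
        mr_bar = fmean(abs(b - a) for a, b in zip(base, base[1:]))
        lo, hi = center - 2.66 * mr_bar, center + 2.66 * mr_bar
        # Score fresh points from the same stable process against the frozen limits.
        alarms += sum(not (lo <= draw(rng) <= hi) for _ in range(n_check))
    return alarms / (trials * n_check)

# Assumed parameters; the post does not state the ones it used.
samplers = {
    "uniform": lambda rng: rng.random(),
    "normal": lambda rng: rng.gauss(0.0, 1.0),
    "exponential": lambda rng: rng.expovariate(1.0),
    "lognormal": lambda rng: rng.lognormvariate(0.0, 1.0),
}
rates = {name: false_alarm_rate(draw) for name, draw in samplers.items()}
for name, rate in rates.items():
    print(f"{name:12s} {rate:.2%}")
```

The limits are estimated from only 28 points in every trial, so estimation error is baked into the rate, the same way it would be in production.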

(If you’d rather watch it than read it: I built a little interactive deck where the distribution morphs from normal to lognormal to pareto while the ±3σ lines hold still and the tail-mass counter refuses to budge. Same argument, in motion.)

One trap worth calling out: you might think the limits are just mean ± 3·std(data). They are not, and the difference is the whole trick. The global standard deviation is inflated by the very signals you’re hunting — inject a shift and the SD swells, the limits swell with it, and the chart goes blind to its own signal. The moving range only sees point-to-point variation, so a shift contaminates exactly one of its terms. Never compute limits with the global SD. (Also resist everyone who wants to “tune” 2.66 to 2 or 3.5. That’s how a chart degenerates back into a YAML vibe.)
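
A toy example makes the blindness concrete. The series below is invented: twenty points bouncing between 10 and 12, then twenty bouncing between 20 and 22. Both sets of limits are deliberately computed over the whole contaminated series, which is exactly the situation described above.

```python
from statistics import fmean, stdev

stable = [10.0, 12.0] * 10     # 20 points before the shift
shifted = [20.0, 22.0] * 10    # 20 points after a +10 level shift
series = stable + shifted

center = fmean(series)
mr_bar = fmean(abs(b - a) for a, b in zip(series, series[1:]))
sd = stdev(series)             # global standard deviation, inflated by the shift

mr_limits = (center - 2.66 * mr_bar, center + 2.66 * mr_bar)
sd_limits = (center - 3 * sd, center + 3 * sd)

def outside(limits):
    """Count points in the series that fall outside (lo, hi)."""
    lo, hi = limits
    return sum(x < lo or x > hi for x in series)

print("moving-range limits flag", outside(mr_limits), "points")   # 20
print("global-SD limits flag   ", outside(sd_limits), "points")   # 0
```

Only one of the 39 moving ranges straddles the shift, so the moving-range limits stay tight. The global SD swallows the whole shift and the chart sees nothing.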

So I built duck-spc

A century-old algorithm is cute; a century-old algorithm running over your entire telemetry archive in one SQL query is useful. Following my own advice, my data already lives as date-partitioned Parquet on a bucket. So I built duck-spc: the XmR math — limits, derived streams, both detection rules — pushed down into DuckDB SQL over read_parquet(). The column contract is one line: your rows look like (ts, category…, value[, exposure]), where the categories define the streams (think region, service) and exposure handles observations-per-unit normalization when rows carry unequal weight. The moving range is just a window function:

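WITH mr AS (
  SELECT region, service, value,
         -- gap to the previous point in the same stream; NULL on each stream's first row
         abs(value - lag(value) OVER w) AS moving_range
  FROM derived_stream
  WHERE ts >= ? AND ts < ?          -- the frozen baseline window
  WINDOW w AS (PARTITION BY region, service ORDER BY ts)
)
SELECT region, service,
       avg(value)                            AS center,
       avg(moving_range)                     AS mr_bar,   -- avg() skips the NULL
       avg(value) + 2.66 * avg(moving_range) AS unpl
FROM mr
GROUP BY region, service  -- ...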

Thousands of streams, one scan, nothing materialized in Python except answers.

The happy path is one command. Point it at a bucket and BOOM:

duck-spc look --source 's3://my-bucket/events/' \
  --value latency_ms --group-by region,service --derive day:p95

── region=us-east, service=checkout ────────────────────────────
                          ●
 335.2 ┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄  UNPL
            ·    ·    ·          ·      ·     ··   ·  ·
 331.1 ──·──··─·───··──·──··──·───··─·───··──·───·──··──── X̄
       · ·    ·   ·   ·  ·   ··  ·   · ·    ·   ·    ·
 326.9 ┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄  LNPL
       2026-01-01 ── dim = baseline, checked from 2026-01-29 ──
 ✗ 1 signal point(s) — first 2026-02-10 (rule1)

2/4 group(s) show special-cause variation — go find the cause(s).

ASCII XmR charts per stream, right in your terminal, signals in red. The --derive flag handles the dirty secret of real telemetry — raw streams are seasonal, trending, and noisy, so you chart a derived stationary stream instead: day:mean, day:p95, day:rate (that’s sum(value)/sum(exposure), computed in-engine because the ratio of sums is not the mean of ratios), or first differences.
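
The "ratio of sums is not the mean of ratios" point deserves a tiny example. This is a plain-Python illustration with invented rows, not how duck-spc computes it (that happens in SQL); `day_rate` is a hypothetical helper.

```python
from collections import defaultdict

# (day, value, exposure): say, errors and requests reported by two hosts.
rows = [
    ("2026-01-01", 5, 10),      # tiny host: 50% error rate on 10 requests
    ("2026-01-01", 10, 1000),   # big host: 1% error rate on 1000 requests
]

def day_rate(rows):
    """day:rate as sum(value) / sum(exposure) per day."""
    sums = defaultdict(lambda: [0.0, 0.0])
    for day, value, exposure in rows:
        sums[day][0] += value
        sums[day][1] += exposure
    return {day: v / e for day, (v, e) in sums.items()}

rate = day_rate(rows)["2026-01-01"]
naive = sum(v / e for _, v, e in rows) / len(rows)   # mean of per-row ratios
print(f"ratio of sums:  {rate:.4f}")
print(f"mean of ratios: {naive:.4f}")
```

The mean of ratios lets ten requests on a tiny host count as much as a thousand on a big one. The ratio of sums weights every request equally, which is what a rate should mean.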

When exploration graduates to production, the verbs decompose Unix-style. baseline freezes the limits into a JSON artifact that carries its own provenance — source, derivation, window, per-stream limits — so every verdict is traceable to the data that justified it. check scores new data against the frozen artifact and puts the verdict in the exit code: 0 means stable (no news is good news), 1 means signals. And because reports embed their limits, everything pipes:

duck-spc baseline --source ... --window 2026-01-01:2026-01-29 > limits.json
duck-spc check --limits limits.json          # cron-friendly: exit code talks
duck-spc check --limits limits.json | duck-spc visualize    # human investigating
duck-spc chart --limits limits.json --group us-east,checkout -o incident.png
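
To show the "exit code talks" idea in isolation, here is a hypothetical sketch of a check against a frozen artifact. The JSON layout (`streams`, `lnpl`, `unpl`) is invented for this example and is not duck-spc's actual artifact schema, and only Rule 1 is applied to keep it short.

```python
import json

def check(limits_json, new_points):
    """Return 0 if every point sits inside its stream's frozen limits, else 1."""
    artifact = json.loads(limits_json)
    for stream, values in new_points.items():
        lim = artifact["streams"][stream]
        if any(v < lim["lnpl"] or v > lim["unpl"] for v in values):
            return 1
    return 0

# Hypothetical frozen artifact for a single stream.
limits_json = json.dumps({
    "window": "2026-01-01:2026-01-29",
    "streams": {"us-east,checkout": {"center": 331.1, "lnpl": 326.9, "unpl": 335.2}},
})

quiet = check(limits_json, {"us-east,checkout": [330.0, 332.5, 329.1]})
noisy = check(limits_json, {"us-east,checkout": [330.0, 338.4]})
print(quiet, noisy)   # 0 1
```

A real script would finish with `sys.exit(...)` on that value, so cron or a shell `&&` chain can act on it without parsing anything.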

That artifact is also where the doctrine lives. Limits are computed once from an explicit window and frozen — re-baselining is a deliberate act (re-run baseline after a verified process change), never automatic. Rolling windows are the classic self-own here: the limits absorb every anomaly into the baseline, the chart adapts to the disease, and your monitoring goes quietly, permanently blind.
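
Here is the rolling-window failure on invented data, as a sketch: 28 stable points, then a sustained +10 shift. The 28-point rolling window is my assumption for the demo.

```python
from statistics import fmean

def limits(window):
    """XmR limits (lnpl, unpl) computed from a window of points."""
    center = fmean(window)
    mr_bar = fmean(abs(b - a) for a, b in zip(window, window[1:]))
    return center - 2.66 * mr_bar, center + 2.66 * mr_bar

baseline = [10.0, 12.0] * 14     # 28 stable points
shifted = [20.0, 22.0] * 30      # 60 points after a sustained shift
series = baseline + shifted

# Frozen: limits come from the baseline once and never move.
frozen_lo, frozen_hi = limits(baseline)
frozen_flagged = [i for i in range(28, len(series))
                  if not (frozen_lo <= series[i] <= frozen_hi)]

# Rolling: limits are recomputed from the previous 28 points every time.
rolling_flagged = []
for i in range(28, len(series)):
    lo, hi = limits(series[i - 28:i])
    if not (lo <= series[i] <= hi):
        rolling_flagged.append(i)

print("frozen limits flag ", len(frozen_flagged), "of", len(shifted))
print("rolling limits flag", len(rolling_flagged), "and the last one is at index",
      max(rolling_flagged))
```

The frozen chart keeps flagging for as long as the shift lasts. The rolling chart complains briefly, then absorbs the new level and goes silent while the process is still sitting ten units away from where it belongs.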

The numbers

Measured on my laptop, because — say it with me — vibes are not a benchmark:

2.16 million rows, 1,000 streams: per-stream limits computed in 0.11s; a full check scoring 62,000 daily points against frozen limits in 0.6s. One process, no service, no GPU, no model registry.
The detection math is cross-checked point-for-point against a reference numpy implementation in the test suite, and the synthetic data generator plants known signals (a spike, a sustained shift, a variance change) that the tests must recover exactly — the clean stream must stay silent.

What it’s not

The chart tells you whether and when the process changed — never why (that’s your job) and never what happens next (that’s forecasting; the chart only tells you whether the future is likely to resemble the past). And if you chart raw seasonal data without deriving a stationary stream first, you’ll get noise — that’s not the chart failing, that’s the chart faithfully reporting that Mondays differ from Sundays.

What’s next

This slots straight into the short-circuit worldview from the Kafka post. The roadmap: a DuckLake catalog as a source (same API, snapshots and time travel underneath), nonparametric quantile limits for streams too ugly even for the gauntlet, and bolting this onto the Postgres-log hot path so live telemetry flows in one end and boring, trustworthy verdicts come out the other. The “Splunk + Prometheus → DuckLake and a cron job” short-circuit I’ve written about previously suddenly has its brain.

A hundred years ago Shewhart figured out how to tell signal from noise with arithmetic a clerk could do by hand. We’ve spent the last decade or so re-solving that problem with anomaly-detection services that bill by the gigabyte and page you about Tuesdays. One constant, two rules, a bucket, and a duck. Stop worrying. Trust statistics.