Joseph Lam-Weil

Coding as queues

2026-05-27T00:00:00+00:00

The intuition

An ML training pipeline is not a script — it’s a sequence of transforms connected by queues. Data enters, is validated, cleaned, featurized, split, trained, evaluated. Each stage can fail independently. The question is not “did it run?” but “where did it stop?”

Once you see every ML workflow as a queue, you stop writing monolithic training scripts and start designing resilient data flows.

Why it matters

When a monolithic training script fails at line 142, it can throw away 6 hours of computation, because nothing before the failure was saved. Queue-oriented design lets you checkpoint, retry, and inspect at each stage. It is the single highest-leverage architectural shift for moving from notebook experiments to production ML.

What I learned

I used to write training scripts as one long .py file: load → clean → train → evaluate. When something failed mid-way, I’d comment out half the file and re-run. Now I think in stages: each stage is a function that reads from a known location and writes to a known location. The “queue” is just a directory of files waiting to be processed. This pattern has eliminated more debugging sessions than any linter or type checker.

Deep dive

Formal definition

A queue-oriented pipeline is a directed acyclic graph of stages where each stage:

Reads from exactly one input location (file, topic, table)
Writes to exactly one output location
Is idempotent (re-running on the same input produces the same output)
Reports success/failure atomically (an output is either fully written or absent)

The queue is the intermediate storage between stages. It decouples producers from consumers.
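One way to make the "directed acyclic graph" part concrete is to declare which stage feeds which and let the standard library work out a valid run order. This is a minimal sketch, not part of the original pipeline: the stage names follow the list in the intuition section, and the `dependencies` mapping and `run_order` helper are hypothetical names introduced for illustration.

```python
from graphlib import TopologicalSorter

# Hypothetical stage graph: each key maps to the set of stages it reads from.
dependencies = {
    "validate": set(),
    "clean": {"validate"},
    "featurize": {"clean"},
    "split": {"featurize"},
    "train": {"split"},
    "evaluate": {"train"},
}


def run_order(deps):
    """Return stage names in an order that respects every dependency.

    Raises graphlib.CycleError if the graph is not acyclic.
    """
    return list(TopologicalSorter(deps).static_order())


order = run_order(dependencies)
print(order)
```

Declaring the graph as data also gives a cheap sanity check: a cycle raises an error before any stage runs.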

Full example

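from pathlib import Path
import json

def stage(name, input_dir, output_dir, fn):
    """Run one stage: read all items from input_dir, apply fn, write to output_dir."""
    input_path = Path(input_dir)
    output_path = Path(output_dir)
    output_path.mkdir(parents=True, exist_ok=True)

    # Sorted so that items are processed in the same order on every run
    for item in sorted(input_path.iterdir()):
        # Skip directories and temp files left behind by an interrupted upstream run
        if not item.is_file() or item.suffix == ".tmp":
            continue
        target = output_path / item.name
        # Idempotent: skip if already processed
        if target.exists():
            continue
        try:
            data = json.loads(item.read_text())
            result = fn(data)
            # Write to a temp file, then rename, so a crash never leaves a partial output
            tmp = output_path / (item.name + ".tmp")
            tmp.write_text(json.dumps(result))
            tmp.replace(target)
        except Exception as exc:
            # Name the stage and the item so "where did it stop?" has an answer
            raise RuntimeError(f"stage {name!r} failed on {item.name}") from exc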

# Usage: chain stages
# stage("validate", "raw/", "validated/", validate)
# stage("featurize", "validated/", "features/", build_features)
# stage("train", "features/", "models/", train_model)

Caveats and edge cases

Queue overhead is only worth it when stages take longer than about 30 seconds or can fail independently
Not all workflows need a message broker — a filesystem directory is often enough
Idempotency is the hard part: ensure your transforms are pure functions of their input
Monitoring is essential: if a stage silently stops, its input queue fills up and nobody notices
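For the directory-as-queue setup, the monitoring caveat can be covered by a very small check: count the items a stage has not yet processed and look at how long the oldest one has been waiting. The sketch below assumes the same naming convention as the full example (an output file has the same name as its input). The helper names `queue_status` and `is_stalled` and the one-hour threshold are illustrative choices, not part of the original code.

```python
from pathlib import Path
import time


def queue_status(input_dir, output_dir):
    """Return (pending_count, oldest_pending_age_seconds) for one stage."""
    input_path, output_path = Path(input_dir), Path(output_dir)
    # An item counts as done when a file with the same name exists downstream.
    done = {p.name for p in output_path.iterdir()} if output_path.exists() else set()
    pending = [p for p in input_path.iterdir() if p.is_file() and p.name not in done]
    if not pending:
        return 0, 0.0
    oldest_mtime = min(p.stat().st_mtime for p in pending)
    return len(pending), max(0.0, time.time() - oldest_mtime)


def is_stalled(input_dir, output_dir, max_age_s=3600):
    """True if something has been waiting longer than max_age_s (assumed threshold)."""
    count, age = queue_status(input_dir, output_dir)
    return count > 0 and age > max_age_s
```

Running `is_stalled("validated/", "features/")` from a scheduled job is often enough to catch a stage that has quietly stopped consuming.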

References

“You Are Not a Queue” — the anti-pattern of over-engineering
Apache Beam / TensorFlow Transform for the production version of this pattern
“Data Pipelines with Python” (McMaster, 2023) for directory-as-queue implementations

Small data beats big confusion

2026-05-22T00:00:00+00:00

More data is not always the answer. When examples are scarce, inductive bias — the assumptions baked into your model — dominates performance.

The small-data regime

With enough data, almost any reasonable model will converge to the same solution. With little data, the choice of model is the experiment.

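import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

# Illustrative stand-in data: 30 noisy samples of a smooth function
rng = np.random.default_rng(0)
X_train = np.linspace(0, 10, 30).reshape(-1, 1)
y_train = np.sin(X_train).ravel() + rng.normal(0, 0.1, size=30)

# A kernel that encodes smoothness + noise
kernel = RBF(length_scale=1.0) + WhiteKernel(noise_level=0.1)
gp = GaussianProcessRegressor(kernel=kernel)
gp.fit(X_train, y_train)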

When the underlying relationship is smooth but nonlinear, a GP with a well-chosen kernel can extract signal from 20–50 points that a linear model would miss entirely.
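A quick way to check this claim on a toy problem is to fit both models to the same 30 points and compare their error against the true function. The data below is synthetic (a noisy sine wave), chosen only because it is smooth and clearly nonlinear, so treat it as a sketch under that assumption and not as a benchmark.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel
from sklearn.linear_model import LinearRegression

# Synthetic small-data problem: 30 noisy samples of sin(x)
rng = np.random.default_rng(0)
X_train = np.linspace(0, 10, 30).reshape(-1, 1)
y_train = np.sin(X_train).ravel() + rng.normal(0, 0.1, size=30)

# Dense grid inside the training range, with the noise-free target
X_test = np.linspace(0.5, 9.5, 200).reshape(-1, 1)
y_true = np.sin(X_test).ravel()

kernel = RBF(length_scale=1.0) + WhiteKernel(noise_level=0.1)
gp = GaussianProcessRegressor(kernel=kernel, random_state=0).fit(X_train, y_train)
linear = LinearRegression().fit(X_train, y_train)


def rmse(pred):
    """Root mean squared error against the noise-free target."""
    return float(np.sqrt(np.mean((pred - y_true) ** 2)))


gp_rmse = rmse(gp.predict(X_test))
linear_rmse = rmse(linear.predict(X_test))
print(f"GP RMSE: {gp_rmse:.3f}  linear RMSE: {linear_rmse:.3f}")
```

The gap comes from the prior: the RBF kernel assumes a smooth function, while the linear model assumes a straight line, which is the wrong assumption here.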

The principle

All models are wrong, but the right prior makes the small-data problem tractable.

This is not about “using a complicated model” — it is about encoding what you already know: smoothness, additivity, monotonicity, periodicity.
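In a GP, "encoding what you already know" mostly means choosing the kernel. The sketch below compares a kernel that only assumes smoothness with one that also assumes periodicity, on synthetic sine data, and evaluates both outside the training range. It assumes the period is known in advance (2π), and it fixes the hyperparameters with `optimizer=None` so that the comparison is about kernel structure and not about tuning.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ExpSineSquared, WhiteKernel

# Synthetic data: 25 noisy samples of a periodic signal on [0, 10]
rng = np.random.default_rng(0)
X_train = np.linspace(0, 10, 25).reshape(-1, 1)
y_train = np.sin(X_train).ravel() + rng.normal(0, 0.1, size=25)

# Evaluate beyond the training range, where the prior has to do the work
X_far = np.linspace(12, 18, 100).reshape(-1, 1)
y_far = np.sin(X_far).ravel()

kernels = {
    "smooth": RBF(length_scale=1.0) + WhiteKernel(noise_level=0.01),
    # Assumed prior knowledge: the signal repeats every 2*pi
    "periodic": ExpSineSquared(length_scale=1.0, periodicity=2 * np.pi)
    + WhiteKernel(noise_level=0.01),
}

errors = {}
for label, kernel in kernels.items():
    # optimizer=None keeps the hyperparameters exactly as written above
    gp = GaussianProcessRegressor(kernel=kernel, optimizer=None).fit(X_train, y_train)
    pred = gp.predict(X_far)
    errors[label] = float(np.sqrt(np.mean((pred - y_far) ** 2)))

for label, err in errors.items():
    print(f"{label:>8} kernel, extrapolation RMSE: {err:.3f}")
```

Away from the data, the smooth kernel falls back toward its zero mean, while the periodic kernel can keep repeating the pattern it saw.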

What this means in practice

Spend time on feature engineering — it is a stronger prior than any regulariser.
Use Bayesian methods when you need to quantify uncertainty.
Test on a held-out set even if it is just 5–10 points.
Report uncertainty intervals, not just point estimates.
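The last two points can be combined in a few lines: hold out a handful of points, then report a predictive interval for each of them alongside the point estimate. This sketch uses synthetic data and an approximate 95% interval (mean ± 1.96 standard deviations, which assumes a Gaussian predictive distribution). The hold-out size of 8 is an arbitrary choice within the 5–10 range mentioned above.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel
from sklearn.model_selection import train_test_split

# Synthetic data: 40 noisy samples of a smooth function
rng = np.random.default_rng(1)
X = np.linspace(0, 10, 40).reshape(-1, 1)
y = np.sin(X).ravel() + rng.normal(0, 0.1, size=40)

# Hold out 8 points; an integer test_size is an absolute count
X_train, X_hold, y_train, y_hold = train_test_split(
    X, y, test_size=8, random_state=0
)

kernel = RBF(length_scale=1.0) + WhiteKernel(noise_level=0.1)
gp = GaussianProcessRegressor(kernel=kernel, random_state=0).fit(X_train, y_train)

# Point estimate plus predictive standard deviation for each held-out point
mean, std = gp.predict(X_hold, return_std=True)
lower, upper = mean - 1.96 * std, mean + 1.96 * std

# Fraction of held-out targets that fall inside their interval
coverage = float(np.mean((y_hold >= lower) & (y_hold <= upper)))
for x, m, lo, hi in zip(X_hold.ravel(), mean, lower, upper):
    print(f"x={x:5.2f}  prediction={m:6.3f}  interval=[{lo:6.3f}, {hi:6.3f}]")
print(f"Empirical coverage on {len(y_hold)} held-out points: {coverage:.2f}")
```

With only 8 held-out points the coverage number is itself noisy, so it is a sanity check and not a calibration guarantee.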

Avoiding Validation Leakage in Small ML Experiments

2026-05-22T00:00:00+00:00

Validation leakage usually shows up when preprocessing learns from the full dataset before the split. In small experiments, that can make a weak model look surprisingly strong.

The safest pattern is simple: split first, fit transforms only on the training fold, and keep the test set untouched until the very end.

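from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Illustrative stand-in dataset; substitute your own X and y
X, y = load_breast_cancer(return_X_y=True)

# Split first, so nothing downstream ever sees the test rows
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# The pipeline fits the scaler on the training rows only
model = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

model.fit(X_train, y_train)
score = model.score(X_test, y_test)
print(f"Holdout accuracy: {score:.3f}")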

When the dataset is tiny, cross-validation helps, but only if every step lives inside the CV loop. Otherwise the leakage just becomes harder to see.
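The effect is easiest to see on data with no signal at all. In the sketch below the labels are independent of the features, so an honest estimate should sit near 50%. Feature selection is used as the leaking step because it looks at the labels, which makes the inflation obvious; that choice, the synthetic data, and the sizes (60 rows, 5000 features, 20 selected) are all illustrative assumptions.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Pure noise: 60 rows, 5000 features, labels unrelated to the features
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 5000))
y = np.repeat([0, 1], 30)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Leaky: features are selected using every row, including future test folds
X_selected = SelectKBest(f_classif, k=20).fit_transform(X, y)
leaky_score = cross_val_score(
    LogisticRegression(max_iter=1000), X_selected, y, cv=cv
).mean()

# Safe: selection and scaling are refit inside each training fold
pipeline = Pipeline([
    ("select", SelectKBest(f_classif, k=20)),
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])
safe_score = cross_val_score(pipeline, X, y, cv=cv).mean()

print(f"Leaky CV accuracy: {leaky_score:.3f}")
print(f"Safe CV accuracy:  {safe_score:.3f}")
```

Both runs use the same folds and the same classifier; the only difference is whether the selection step was allowed to see the held-out rows.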