<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://jlamweil.github.io/feed.xml" rel="self" type="application/atom+xml" /><link href="https://jlamweil.github.io/" rel="alternate" type="text/html" /><updated>2026-05-27T13:25:52+00:00</updated><id>https://jlamweil.github.io/feed.xml</id><title type="html">Joseph Lam-Weil</title><subtitle>Personal website — Joseph Lam-Weil</subtitle><entry><title type="html">Coding as queues</title><link href="https://jlamweil.github.io/lessons/coding-as-queues/" rel="alternate" type="text/html" title="Coding as queues" /><published>2026-05-27T00:00:00+00:00</published><updated>2026-05-27T00:00:00+00:00</updated><id>https://jlamweil.github.io/lessons/coding-as-queues</id><content type="html" xml:base="https://jlamweil.github.io/lessons/coding-as-queues/"><![CDATA[<h2 id="the-intuition">The intuition</h2>

<p>An ML training pipeline is not a script: it’s a sequence of transforms connected by queues. Data enters and is validated, cleaned, featurized, and split; a model is then trained and evaluated. Each stage can fail independently. The question is not “did it run?” but “where did it stop?”</p>

<p>Once you see every ML workflow as a queue, you stop writing monolithic training scripts and start designing resilient data flows.</p>

<h2 id="why-it-matters">Why it matters</h2>

<p>A monolithic training script that fails at line 142 can take 6 hours of computation with it. Queue-oriented design lets you checkpoint, retry, and inspect at each stage. In my experience, it is the single highest-leverage architectural shift for moving from notebook experiments to production ML.</p>

<h2 id="what-i-learned">What I learned</h2>

<p>I used to write training scripts as one long .py file: load → clean → train → evaluate. When something failed midway, I’d comment out half the file and re-run it. Now I think in stages: each stage is a function that reads from a known location and writes to a known location. The “queue” is just a directory of files waiting to be processed. This pattern has eliminated more debugging sessions than any linter or type checker.</p>

<hr />

<h2 id="deep-dive">Deep dive</h2>

<h2 id="formal-definition">Formal definition</h2>

<p>A queue-oriented pipeline is a directed acyclic graph of stages where each stage:</p>

<ul>
  <li>Reads from exactly one input location (file, topic, table)</li>
  <li>Writes to exactly one output location</li>
  <li>Is idempotent (re-running produces the same output)</li>
  <li>Reports success/failure atomically</li>
</ul>

<p>The queue is the intermediate storage between stages. It decouples producers from consumers.</p>
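
<p>As a minimal sketch of the definition above, the stage graph can be written as a mapping from each stage to the stages whose output it reads, and the standard-library <code>graphlib</code> module can derive a valid run order. The stage names and the <code>run_order</code> helper are hypothetical, chosen only for illustration.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from graphlib import TopologicalSorter

# Hypothetical stage graph: each key lists the stages whose output it reads.
STAGES = {
    "validate": set(),
    "featurize": {"validate"},
    "train": {"featurize"},
    "evaluate": {"train"},
}

def run_order(stages):
    """Return the stages in an order where every producer runs before its consumer."""
    # static_order() raises graphlib.CycleError if the graph is not acyclic.
    return list(TopologicalSorter(stages).static_order())

print(run_order(STAGES))
</code></pre></div></div>

<p>Keeping the graph as plain data makes the acyclicity requirement checkable before any stage runs.</p>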

<h2 id="full-example">Full example</h2>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">pathlib</span> <span class="kn">import</span> <span class="n">Path</span>
<span class="kn">import</span> <span class="nn">json</span>

<span class="k">def</span> <span class="nf">stage</span><span class="p">(</span><span class="n">name</span><span class="p">,</span> <span class="n">input_dir</span><span class="p">,</span> <span class="n">output_dir</span><span class="p">,</span> <span class="n">fn</span><span class="p">):</span>
    <span class="s">"""Run one stage: read all items from input_dir, apply fn, write to output_dir."""</span>
    <span class="n">input_path</span> <span class="o">=</span> <span class="n">Path</span><span class="p">(</span><span class="n">input_dir</span><span class="p">)</span>
    <span class="n">output_path</span> <span class="o">=</span> <span class="n">Path</span><span class="p">(</span><span class="n">output_dir</span><span class="p">)</span>
    <span class="n">output_path</span><span class="p">.</span><span class="n">mkdir</span><span class="p">(</span><span class="n">parents</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span> <span class="n">exist_ok</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>

    <span class="k">for</span> <span class="n">item</span> <span class="ow">in</span> <span class="n">input_path</span><span class="p">.</span><span class="n">iterdir</span><span class="p">():</span>
        <span class="k">if</span> <span class="ow">not</span> <span class="n">item</span><span class="p">.</span><span class="n">is_file</span><span class="p">():</span>
            <span class="k">continue</span>
        <span class="c1"># Idempotent: skip if already processed
</span>        <span class="k">if</span> <span class="p">(</span><span class="n">output_path</span> <span class="o">/</span> <span class="n">item</span><span class="p">.</span><span class="n">name</span><span class="p">).</span><span class="n">exists</span><span class="p">():</span>
            <span class="k">continue</span>
        <span class="n">data</span> <span class="o">=</span> <span class="n">json</span><span class="p">.</span><span class="n">loads</span><span class="p">(</span><span class="n">item</span><span class="p">.</span><span class="n">read_text</span><span class="p">())</span>
        <span class="n">result</span> <span class="o">=</span> <span class="n">fn</span><span class="p">(</span><span class="n">data</span><span class="p">)</span>
        <span class="n">output_path</span><span class="p">.</span><span class="n">joinpath</span><span class="p">(</span><span class="n">item</span><span class="p">.</span><span class="n">name</span><span class="p">).</span><span class="n">write_text</span><span class="p">(</span><span class="n">json</span><span class="p">.</span><span class="n">dumps</span><span class="p">(</span><span class="n">result</span><span class="p">))</span>

<span class="c1"># Usage: chain stages
# stage("load", "raw/", "validated/", validate)
# stage("featurize", "validated/", "features/", build_features)
# stage("train", "features/", "models/", train_model)
</span></code></pre></div></div>
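
<p>The <code>stage</code> function above writes straight to the final filename, so a crash in the middle of a write could leave a truncated file that the idempotency check then skips on the next run. One possible way to make the write atomic is sketched below: write to a scratch file first, then rename it into place. The helper name <code>write_json_atomically</code> and the <code>.tmp</code> subdirectory are assumptions for illustration, not part of the original code.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import json
import os
import tempfile
from pathlib import Path

def write_json_atomically(target, payload):
    """Write payload as JSON so that target holds the full result or does not exist."""
    target = Path(target)
    target.parent.mkdir(parents=True, exist_ok=True)
    # Hypothetical convention: scratch files live in a ".tmp" subdirectory,
    # which a stage that only iterates over files will skip.
    scratch_dir = target.parent / ".tmp"
    scratch_dir.mkdir(exist_ok=True)
    fd, scratch_name = tempfile.mkstemp(dir=scratch_dir, suffix=".json")
    try:
        with os.fdopen(fd, "w") as handle:
            json.dump(payload, handle)
        # On POSIX systems os.replace is atomic within one filesystem.
        os.replace(scratch_name, target)
    except BaseException:
        # Remove the partial scratch file, then let the error propagate.
        if os.path.exists(scratch_name):
            os.unlink(scratch_name)
        raise
</code></pre></div></div>

<p>With a helper like this, the last line of the loop would become <code>write_json_atomically(output_path / item.name, result)</code>, and the existence check would only ever see complete outputs.</p>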

<h2 id="caveats-and-edge-cases">Caveats and edge cases</h2>

<ul>
  <li>Queue overhead is only worth it when stages take &gt;30s or can fail independently</li>
  <li>Not all workflows need a message broker — a filesystem directory is often enough</li>
  <li>Idempotency is the hard part: ensure your transforms are pure functions of their input</li>
  <li>Monitoring is essential: if a stage silently stops, its input queue fills up and nobody notices</li>
</ul>
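
<p>The monitoring point can be sketched for the directory-as-queue case: count the input files that have no counterpart in the output directory, and track how long the oldest one has been waiting. The function name <code>queue_status</code> and the returned keys are hypothetical.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import time
from pathlib import Path

def queue_status(input_dir, output_dir, now=None):
    """Summarize the backlog between two directory-backed stages."""
    now = time.time() if now is None else now
    out = Path(output_dir)
    # Names already processed; empty if the stage has never produced output.
    done = {p.name for p in out.iterdir() if p.is_file()} if out.is_dir() else set()
    pending = [
        p for p in Path(input_dir).iterdir()
        if p.is_file() and p.name not in done
    ]
    # Age in seconds of the oldest unprocessed item (0.0 when nothing is pending).
    oldest = max((now - p.stat().st_mtime for p in pending), default=0.0)
    return {"pending": len(pending), "oldest_age_s": oldest}
</code></pre></div></div>

<p>A scheduled job could call <code>queue_status("validated/", "features/")</code> and raise an alert when <code>oldest_age_s</code> exceeds whatever threshold suits the stage.</p>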

<h2 id="references">References</h2>

<ul>
  <li>“You Are Not a Queue” — the anti-pattern of over-engineering</li>
  <li>Apache Beam / TensorFlow Transform for the production version of this pattern</li>
  <li>“Data Pipelines with Python” (McMaster, 2023) for directory-as-queue implementations</li>
</ul>]]></content><author><name></name></author><category term="ml" /><category term="data-engineering" /><category term="architecture" /><summary type="html"><![CDATA[ML workflows are not programs — they are pipelines. Treating each stage as a transform in a queue changes how you design for failure.]]></summary></entry><entry><title type="html">Small data beats big confusion</title><link href="https://jlamweil.github.io/lessons/small-data-beats-big-confusion/" rel="alternate" type="text/html" title="Small data beats big confusion" /><published>2026-05-22T00:00:00+00:00</published><updated>2026-05-22T00:00:00+00:00</updated><id>https://jlamweil.github.io/lessons/small-data-beats-big-confusion</id><content type="html" xml:base="https://jlamweil.github.io/lessons/small-data-beats-big-confusion/"><![CDATA[<p>More data is not always the answer. When examples are scarce, inductive bias — the assumptions baked into your model — dominates performance.</p>

<h2 id="the-small-data-regime">The small-data regime</h2>

<p>With enough data, almost any reasonable model will converge to the same solution. With little data, the choice of model <em>is</em> the experiment.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="kn">from</span> <span class="nn">sklearn.gaussian_process</span> <span class="kn">import</span> <span class="n">GaussianProcessRegressor</span>
<span class="kn">from</span> <span class="nn">sklearn.gaussian_process.kernels</span> <span class="kn">import</span> <span class="n">RBF</span><span class="p">,</span> <span class="n">WhiteKernel</span>

<span class="c1"># A kernel that encodes smoothness + noise
</span><span class="n">kernel</span> <span class="o">=</span> <span class="n">RBF</span><span class="p">(</span><span class="n">length_scale</span><span class="o">=</span><span class="mf">1.0</span><span class="p">)</span> <span class="o">+</span> <span class="n">WhiteKernel</span><span class="p">(</span><span class="n">noise_level</span><span class="o">=</span><span class="mf">0.1</span><span class="p">)</span>
<span class="n">gp</span> <span class="o">=</span> <span class="n">GaussianProcessRegressor</span><span class="p">(</span><span class="n">kernel</span><span class="o">=</span><span class="n">kernel</span><span class="p">)</span>
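<span class="c1"># X_train (n_samples, n_features) and y_train are assumed to be defined already
</span><span class="n">gp</span><span class="p">.</span><span class="n">fit</span><span class="p">(</span><span class="n">X_train</span><span class="p">,</span> <span class="n">y_train</span><span class="p">)</span>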
</code></pre></div></div>

<p>A Gaussian process (GP) with a well-chosen kernel can extract nonlinear signal from 20–50 points that a linear model would miss entirely.</p>
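
<p>To make the snippet above runnable end to end, here is a sketch that fits the same kernel to a small dataset and asks scikit-learn for a predictive standard deviation alongside the mean. The 30 noisy sine samples are a synthetic stand-in, not data from any real experiment.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

# Synthetic stand-in for a small dataset: 30 noisy samples of a sine curve.
rng = np.random.default_rng(0)
X_train = rng.uniform(0.0, 6.0, size=(30, 1))
y_train = np.sin(X_train).ravel() + rng.normal(0.0, 0.1, size=30)

kernel = RBF(length_scale=1.0) + WhiteKernel(noise_level=0.1)
gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True, random_state=0)
gp.fit(X_train, y_train)

# Predict on a coarse grid and request the predictive standard deviation too.
X_new = np.linspace(0.0, 6.0, 5).reshape(-1, 1)
mean, std = gp.predict(X_new, return_std=True)
for x, m, s in zip(X_new.ravel(), mean, std):
    print(f"x={x:.1f}  mean={m:+.2f}  std={s:.2f}")
</code></pre></div></div>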

<h2 id="the-principle">The principle</h2>

<blockquote>
  <p>All models are wrong, but the right prior makes the small-data problem tractable.</p>
</blockquote>

<p>This is not about “using a complicated model” — it is about encoding what you already know: smoothness, additivity, monotonicity, periodicity.</p>
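
<p>In a GP, “encoding what you already know” largely comes down to the choice of kernel. The sketch below fits two priors to the same synthetic periodic signal, one that only assumes smoothness and one that also assumes periodicity, and prints the log marginal likelihood of each. The data and the starting hyperparameters are assumptions for illustration, and which prior scores higher depends on the data.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ExpSineSquared, WhiteKernel

# Synthetic stand-in: 25 noisy samples of a signal with period 2.
rng = np.random.default_rng(1)
X = np.sort(rng.uniform(0.0, 6.0, size=(25, 1)), axis=0)
y = np.sin(np.pi * X).ravel() + rng.normal(0.0, 0.1, size=25)

# Two priors: "smooth" versus "periodic".
priors = {
    "smooth": RBF(length_scale=1.0) + WhiteKernel(noise_level=0.1),
    "periodic": ExpSineSquared(length_scale=1.0, periodicity=2.0)
    + WhiteKernel(noise_level=0.1),
}

scores = {}
for name, kernel in priors.items():
    gp = GaussianProcessRegressor(kernel=kernel, random_state=0).fit(X, y)
    # Log marginal likelihood of the training data under the fitted kernel.
    scores[name] = float(gp.log_marginal_likelihood_value_)
    print(name, round(scores[name], 2))
</code></pre></div></div>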

<h2 id="what-this-means-in-practice">What this means in practice</h2>

<ul>
  <li>Spend time on feature engineering: it is often a stronger prior than any regulariser.</li>
  <li>Use Bayesian methods when you need to quantify uncertainty.</li>
  <li>Test on a held-out set even if it is just 5–10 points.</li>
  <li>Report uncertainty intervals, not just point estimates.</li>
</ul>]]></content><author><name></name></author><category term="data-science" /><category term="machine-learning" /><category term="statistics" /><summary type="html"><![CDATA[When you cannot get more data, think harder about the structure of the problem.]]></summary></entry><entry><title type="html">Avoiding Validation Leakage in Small ML Experiments</title><link href="https://jlamweil.github.io/lessons/validation-leakage/" rel="alternate" type="text/html" title="Avoiding Validation Leakage in Small ML Experiments" /><published>2026-05-22T00:00:00+00:00</published><updated>2026-05-22T00:00:00+00:00</updated><id>https://jlamweil.github.io/lessons/validation-leakage</id><content type="html" xml:base="https://jlamweil.github.io/lessons/validation-leakage/"><![CDATA[<p>Validation leakage usually shows up when preprocessing learns from the full dataset before the split. In small experiments, that can make a weak model look surprisingly strong.</p>

<p>The safest pattern is simple: split first, fit transforms only on the training fold, and keep the test set untouched until the very end.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">sklearn.model_selection</span> <span class="kn">import</span> <span class="n">train_test_split</span>
<span class="kn">from</span> <span class="nn">sklearn.pipeline</span> <span class="kn">import</span> <span class="n">Pipeline</span>
<span class="kn">from</span> <span class="nn">sklearn.preprocessing</span> <span class="kn">import</span> <span class="n">StandardScaler</span>
<span class="kn">from</span> <span class="nn">sklearn.linear_model</span> <span class="kn">import</span> <span class="n">LogisticRegression</span>

<span class="c1"># X (features) and y (labels) are assumed to be loaded already
</span><span class="n">X_train</span><span class="p">,</span> <span class="n">X_test</span><span class="p">,</span> <span class="n">y_train</span><span class="p">,</span> <span class="n">y_test</span> <span class="o">=</span> <span class="n">train_test_split</span><span class="p">(</span>
    <span class="n">X</span><span class="p">,</span> <span class="n">y</span><span class="p">,</span> <span class="n">test_size</span><span class="o">=</span><span class="mf">0.2</span><span class="p">,</span> <span class="n">random_state</span><span class="o">=</span><span class="mi">42</span><span class="p">,</span> <span class="n">stratify</span><span class="o">=</span><span class="n">y</span>
<span class="p">)</span>

<span class="n">model</span> <span class="o">=</span> <span class="n">Pipeline</span><span class="p">([</span>
    <span class="p">(</span><span class="s">"scale"</span><span class="p">,</span> <span class="n">StandardScaler</span><span class="p">()),</span>
    <span class="p">(</span><span class="s">"clf"</span><span class="p">,</span> <span class="n">LogisticRegression</span><span class="p">(</span><span class="n">max_iter</span><span class="o">=</span><span class="mi">1000</span><span class="p">)),</span>
<span class="p">])</span>

<span class="n">model</span><span class="p">.</span><span class="n">fit</span><span class="p">(</span><span class="n">X_train</span><span class="p">,</span> <span class="n">y_train</span><span class="p">)</span>
<span class="n">score</span> <span class="o">=</span> <span class="n">model</span><span class="p">.</span><span class="n">score</span><span class="p">(</span><span class="n">X_test</span><span class="p">,</span> <span class="n">y_test</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"Holdout accuracy: </span><span class="si">{</span><span class="n">score</span><span class="si">:</span><span class="p">.</span><span class="mi">3</span><span class="n">f</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
</code></pre></div></div>
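
<p>The same pipeline can also be scored with cross-validation instead of a single holdout. The sketch below assumes a synthetic 60-row dataset from <code>make_classification</code> as a stand-in for real data, and passes the whole pipeline to <code>cross_val_score</code> so that the scaler is refit on each training fold.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a tiny dataset (60 rows, 5 features).
X, y = make_classification(n_samples=60, n_features=5, random_state=0)

model = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# cross_val_score clones and refits the whole pipeline on each training fold,
# so the scaler never sees the fold it is scored on.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=cv)
print(f"CV accuracy: {scores.mean():.3f} (std {scores.std():.3f})")
</code></pre></div></div>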

<p>When the dataset is tiny, cross-validation helps, but only if every step lives inside the CV loop. Otherwise the leakage just becomes harder to see.</p>]]></content><author><name></name></author><category term="machine-learning" /><category term="data-science" /><category term="evaluation" /><summary type="html"><![CDATA[A short note on how leakage sneaks into model selection and how to keep your validation loop honest.]]></summary></entry></feed>