Trust, but Verify: Checking Backprop to 1e-16

How I verified hand-derived gradients against PyTorch autograd to machine precision before training anything on CIFAR-10 — and why the overfit-a-tiny-subset trick catches what gradient checks miss.

  • #machine-learning
  • #numpy
  • #deep-learning
  • #methodology

For KTH's Deep Learning in Data Science course, I built CIFAR-10 classifiers from scratch in NumPy — first a softmax linear model, then a two-layer network with ReLU. "From scratch" means the backward pass is my algebra: every gradient hand-derived and hand-implemented.

Here's the problem with hand-implemented backprop: a slightly wrong gradient still trains. The loss goes down. Accuracy looks plausible. You can run an entire hyperparameter study on top of a subtly broken backward pass and never know. Wrong gradients don't crash — they just quietly cap how good your model can get.

So before training anything, I made verification the first milestone.

Step 1: autograd as ground truth

The trick is to make PyTorch compute the same loss with the same weights, let autograd differentiate it, and compare against my analytic gradients elementwise:

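import numpy as np

def relative_error(g_mine, g_torch, eps=1e-12):
    # Elementwise |a - b| / (|a| + |b|), reported as the worst case over all entries.
    num = np.abs(g_mine - g_torch)
    den = np.maximum(eps, np.abs(g_mine) + np.abs(g_torch))  # eps guards against 0/0
    return (num / den).max()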

My target wasn't "close." It was machine precision: the gap you get from floating-point rounding alone. Final max relative error: on the order of 1e-16, which is about float64 machine epsilon (2.2e-16). That number matters because the outcome is effectively binary: ~1e-16 means the two implementations agree up to rounding; 1e-4 means something is wrong and "probably just numerics" is cope.
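Here is a minimal sketch of what such a check can look like for the softmax linear model. It is not the original course code: the function name softmax_loss_and_grads, the random stand-in data, the problem sizes, and the regularizer form lam * ||W||^2 are all assumptions made for illustration. It assumes PyTorch is installed.

```python
import numpy as np
import torch

def relative_error(g_mine, g_torch, eps=1e-12):
    num = np.abs(g_mine - g_torch)
    den = np.maximum(eps, np.abs(g_mine) + np.abs(g_torch))
    return (num / den).max()

def softmax_loss_and_grads(X, y, W, b, lam):
    """Mean cross-entropy plus lam * ||W||^2, with hand-derived gradients.

    Hypothetical helper. X: (n, d), y: (n,) integer labels, W: (d, k), b: (k,).
    """
    n = X.shape[0]
    scores = X @ W + b
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    P = np.exp(scores)
    P /= P.sum(axis=1, keepdims=True)
    loss = -np.log(P[np.arange(n), y]).mean() + lam * np.sum(W * W)

    G = P.copy()                    # dLoss/dscores = (P - onehot(y)) / n
    G[np.arange(n), y] -= 1.0
    G /= n
    grad_W = X.T @ G + 2.0 * lam * W
    grad_b = G.sum(axis=0)
    return loss, grad_W, grad_b

# Random stand-in data (assumption: small sizes, not real CIFAR-10).
rng = np.random.default_rng(0)
n, d, k, lam = 20, 50, 10, 1e-2
X = rng.standard_normal((n, d))
y = rng.integers(0, k, size=n)
W = 0.01 * rng.standard_normal((d, k))
b = np.zeros(k)

loss_np, gW, gb = softmax_loss_and_grads(X, y, W, b, lam)

# Same weights, same loss, differentiated by autograd.
# torch.from_numpy preserves dtype, so everything here stays float64.
Xt = torch.from_numpy(X)
yt = torch.from_numpy(y)
Wt = torch.from_numpy(W.copy()).requires_grad_(True)
bt = torch.from_numpy(b.copy()).requires_grad_(True)
loss_t = torch.nn.functional.cross_entropy(Xt @ Wt + bt, yt) + lam * (Wt ** 2).sum()
loss_t.backward()

err_W = relative_error(gW, Wt.grad.numpy())
err_b = relative_error(gb, bt.grad.numpy())
print(f"max relative error  W: {err_W:.2e}  b: {err_b:.2e}")
```

The same pattern extends to the two-layer network: build the identical forward pass in PyTorch, call backward() once, and compare every parameter's gradient separately so a single wrong term is easy to localize.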

A subtlety that bit me: dtype. NumPy defaults to float64, PyTorch to float32. Compare across dtypes and you'll chase phantom errors at 1e-7 forever. Cast everything to float64 first.
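The size of that phantom error is easy to reproduce without PyTorch at all. In the sketch below, a float32 recomputation of the same matrix product stands in for a framework running at its float32 default. The helper grad_W and the random inputs are assumptions for illustration.

```python
import numpy as np

def relative_error(g_a, g_b, eps=1e-12):
    num = np.abs(g_a - g_b)
    den = np.maximum(eps, np.abs(g_a) + np.abs(g_b))
    return (num / den).max()

def grad_W(X, G):
    # Hypothetical stand-in for one term of a backward pass: X^T G / n.
    return X.T @ G / X.shape[0]

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 50))
G = rng.standard_normal((100, 10))

g64 = grad_W(X, G)                        # float64 reference
g64_alt = (G.T @ X).T / X.shape[0]        # same math, different operation order, still float64
g32 = grad_W(X.astype(np.float32), G.astype(np.float32))  # same math in float32

err_same_dtype = relative_error(g64, g64_alt)  # rounding noise only
err_mixed_dtype = relative_error(g64, g32)     # many orders of magnitude larger

print(f"float64 vs float64: {err_same_dtype:.2e}")
print(f"float64 vs float32: {err_mixed_dtype:.2e}")
```

Nothing is wrong with either gradient in the second comparison; the gap is purely the float32 precision floor, which is why it never shrinks no matter how long you stare at the algebra.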

Step 2: overfit a tiny subset

Gradient checking proves the backward pass matches the forward pass. It does not prove the training loop is right — learning-rate application, batch shuffling, regularization bookkeeping can all still be broken.

The cheapest test for that: take ~100 images and train until the model memorizes them. A correct implementation should hit ~100% training accuracy and near-zero loss. If it can't overfit 100 images, it certainly can't learn 50,000. This catches a largely different set of bugs than the gradient check, and the two together cover most of the implementation.
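A sketch of the overfit test, under stated assumptions: random Gaussian vectors with random labels stand in for the ~100 images, the model is a plain softmax classifier with regularization switched off, and the function name overfit_tiny_subset and all hyperparameters are made up for this example. The loop deliberately includes shuffling and mini-batching, since those are the parts the gradient check never touches.

```python
import numpy as np

def softmax(scores):
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    e = np.exp(scores)
    return e / e.sum(axis=1, keepdims=True)

def overfit_tiny_subset(X, y, n_classes, lr=0.1, batch_size=20, epochs=300, seed=0):
    """Train a softmax classifier with mini-batch SGD and no regularization.

    Returns the final (training loss, training accuracy) on the same data.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W = 0.01 * rng.standard_normal((d, n_classes))
    b = np.zeros(n_classes)

    for _ in range(epochs):
        order = rng.permutation(n)                 # reshuffle every epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            G = softmax(X[idx] @ W + b)
            G[np.arange(len(idx)), y[idx]] -= 1.0  # P - onehot(y)
            G /= len(idx)
            W -= lr * (X[idx].T @ G)
            b -= lr * G.sum(axis=0)

    P = softmax(X @ W + b)
    loss = -np.log(P[np.arange(n), y]).mean()
    acc = (P.argmax(axis=1) == y).mean()
    return loss, acc

# Stand-in "tiny subset": 100 random samples with random labels.
rng = np.random.default_rng(1)
X_small = rng.standard_normal((100, 200))
y_small = rng.integers(0, 10, size=100)

loss, acc = overfit_tiny_subset(X_small, y_small, n_classes=10)
print(f"train loss {loss:.4f}, train accuracy {acc:.2%}")
```

Because the labels here are random, any accuracy above chance is pure memorization, which is exactly what the test is supposed to demonstrate. If the loop has a sign error in the update or applies the wrong batch indices, the accuracy stalls well short of 100%.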

What verified arithmetic buys you

Everything downstream became trustworthy measurement instead of vibes:

  • The cyclical learning-rate schedule (Smith 2015) did exactly what the paper promised, visible cleanly in the loss curves.
  • The coarse-to-fine λ search over ~45k images produced smooth, reproducible validation surfaces — λ = 1.07e-3 for the final 3-cycle run.
  • The bonus experiments surfaced a genuinely interesting result: adding data augmentation consistently shifted the optimal λ toward smaller values. Augmentation and L2 act as substitute regularizers, a claim I could make confidently because I knew the gradients underneath were exact.
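For readers who haven't met the cyclical schedule from the first bullet: in its triangular form the learning rate ramps linearly from a minimum to a maximum and back, once per cycle. A minimal sketch follows; the function name and the values of eta_min, eta_max, and step_size are placeholders, not the settings used in these experiments.

```python
def cyclical_lr(t, eta_min, eta_max, step_size):
    """Triangular cyclical learning rate at update step t.

    One full cycle lasts 2 * step_size updates: step_size up, step_size down.
    """
    pos = t % (2 * step_size)
    if pos <= step_size:
        frac = pos / step_size          # rising half: 0 -> 1
    else:
        frac = 2.0 - pos / step_size    # falling half: 1 -> 0
    return eta_min + frac * (eta_max - eta_min)

# Placeholder hyperparameters, chosen only to illustrate the shape.
eta_min, eta_max, step_size = 1e-5, 1e-1, 500
n_cycles = 3
schedule = [cyclical_lr(t, eta_min, eta_max, step_size)
            for t in range(2 * step_size * n_cycles)]
```

Plotting schedule next to the loss curve is a quick sanity check in its own right: the loss should visibly breathe with the same period as the learning rate.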

Final test accuracy: 39.30% for the linear model, 49.77% for the two-layer network with all four bonus improvements. Respectable for raw pixels — but the number I'm actually proud of is the 1e-16.

The general lesson

This is the same instinct as hardware bring-up: you don't trust a sensor until you've calibrated it against a reference. Backprop is no different. Verify the instrument before you run the experiment.