Ever built something so carefully that its most valuable answer turned out to be “no”?

That’s pretty much the story of my last few weeks. I went looking for edge in the mean reversion behind pairs trading — the whole statistical-arbitrage playbook — and built a framework as rigorous as I could to test the idea honestly instead of flatteringly. Then I ran it on three different asset universes, one after another, and the backtest results failed me three times in a row.

This whole project started life as my final project for the CQF (Certificate in Quantitative Finance). I passed — but when I handed it in, I had that nagging feeling that I’d rushed it, and that I could do it far more thoroughly if I actually gave it the time it deserved. A couple of months later, I went back, rebuilt the framework from scratch, and let it run on real data until it told me something I didn’t want to hear. This post is the story of those three rejections — stocks, ETFs, and commodity futures — and why I’ve come around to thinking of this carefully built framework like the manager who rejects every proposal you bring him… and turns out to be right every time.

First, what does it even mean for a spread to “revert”?

The dream of pairs trading is simple and seductive — and you’ve probably read this pitch a hundred times, so I’ll keep it quick. You find two things that move together — two bond ETFs, gold and silver, crude oil at two different delivery months — and you trade the gap between them instead of betting on direction. When the gap stretches too wide, you bet it snaps back. You don’t care if the market goes up or down; you only care that the rubber band returns to its resting length.

A mean-reverting spread oscillating around its resting level with entry bands

You trade the gap, not the direction — when the spread stretches past the entry band, you bet it relaxes back to the mean

The hard part is that little word “together.” Two lines on a chart can look like dance partners for a year and then wander off in opposite directions forever. So the framework I built is really one long, suspicious interrogation of that word:

Cointegration (Johansen test / VECM): the statistical question of whether two price series are genuinely tethered, not just briefly correlated. Correlation is a fair-weather friend; cointegration is supposed to be the real marriage (Engle & Granger, 1987; Johansen, 1991).
A Kalman filter to track the hedge ratio — how many units of B you short against A — as it drifts over time, instead of freezing it once and hoping.
An Ornstein–Uhlenbeck fit to measure the spread’s half-life: how many days it typically takes the rubber band to relax halfway back. Short half-life means fast, frequent, tradeable. Long half-life means you’ll grow old holding the position — and trust me, you don’t want that one.
A Bertram optimal threshold (a*): the math (Bertram, 2010) for where to enter, replacing the traditional fixed Z-score with the deviation that maximizes your Sharpe per unit of time, rather than a hand-waved “two standard deviations.”
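To make the half-life idea concrete, here is a minimal sketch in standard-library Python. It is not the framework's code: the function names, the simulated spread, and the choice of an AR(1) regression as the estimator are all illustrative assumptions. The logic is that a discretely sampled OU process behaves like an AR(1) series x[t+1] = a + b·x[t] + noise, so the half-life in bars is −ln(2) / ln(b).

```python
import math
import random


def ou_half_life(spread):
    """Estimate the half-life (in bars) from an AR(1) fit x[t+1] = a + b*x[t] + noise."""
    x, y = spread[:-1], spread[1:]
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    var_x = sum((xi - mean_x) ** 2 for xi in x)
    b = cov_xy / var_x  # OLS slope = estimated AR(1) coefficient
    if not 0.0 < b < 1.0:
        return math.inf  # no measurable mean reversion
    return -math.log(2.0) / math.log(b)


def simulate_ou(n, half_life, sigma=1.0, seed=0):
    """Hypothetical test data: an AR(1) path whose true half-life is known."""
    rng = random.Random(seed)
    b = 0.5 ** (1.0 / half_life)  # per-bar decay that halves a gap in `half_life` bars
    path = [0.0]
    for _ in range(n - 1):
        path.append(b * path[-1] + sigma * rng.gauss(0.0, 1.0))
    return path


spread = simulate_ou(n=20000, half_life=10.0)
estimated = ou_half_life(spread)
print(f"estimated half-life: {estimated:.1f} bars (true value: 10)")
```

On real spreads the estimate is only as good as the assumption that the mean is fixed over the fitting window, which is exactly what the drift filters go on to question.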

And then — the part that ended up mattering the most — a set of drift filters. Before the framework trusts a spread at all, it asks three more uncomfortable questions — the kind designed to make the data confess.

The three drift filters: drift_test, Hurst exponent, and out-of-sample reversion

The three drift filters run after the OU half-life screen. A pair has to pass all three or it gets thrown out — no matter how pretty the cointegration looked

Is the spread quietly trending instead of oscillating (a HAC drift_test)? Is it persistent like a random walk rather than mean-reverting (the Hurst exponent)? And does the equilibrium itself stay put from one month to the next (mu_e_drift)? Fail any one of these and the pair gets thrown out.
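As an illustration of the first question, here is one way a HAC drift test could look: regress the spread on time and compute a Newey-West (Bartlett kernel) t-statistic for the slope, so that autocorrelated noise does not fake a significant trend. This is a sketch under assumptions, not the framework's actual `drift_test`: the function names, the lag rule of thumb, and the simulated data are mine.

```python
import math
import random


def hac_drift_tstat(spread, max_lag=None):
    """t-statistic of a linear time trend, using Newey-West (HAC) standard errors."""
    n = len(spread)
    if max_lag is None:
        # A common rule-of-thumb bandwidth; an assumption, not a tuned value.
        max_lag = int(4 * (n / 100.0) ** (2.0 / 9.0))
    t_mean = (n - 1) / 2.0
    s_mean = sum(spread) / n
    t_dev = [t - t_mean for t in range(n)]  # demeaned time index
    sxx = sum(d * d for d in t_dev)
    slope = sum(d * (s - s_mean) for d, s in zip(t_dev, spread)) / sxx
    resid = [(s - s_mean) - slope * d for d, s in zip(t_dev, spread)]
    score = [d * e for d, e in zip(t_dev, resid)]
    # Long-run variance of the score, with Bartlett weights on the autocovariances.
    long_run = sum(u * u for u in score)
    for lag in range(1, max_lag + 1):
        weight = 1.0 - lag / (max_lag + 1.0)
        long_run += 2.0 * weight * sum(score[t] * score[t - lag] for t in range(lag, n))
    return slope / math.sqrt(long_run / (sxx * sxx))


def ar1_series(n, phi, trend=0.0, seed=0):
    """Hypothetical spread: autocorrelated noise plus an optional linear drift."""
    rng = random.Random(seed)
    noise, out = 0.0, []
    for t in range(n):
        noise = phi * noise + rng.gauss(0.0, 1.0)
        out.append(trend * t + noise)
    return out


t_flat = hac_drift_tstat(ar1_series(1000, phi=0.8, trend=0.0, seed=1))
t_trend = hac_drift_tstat(ar1_series(1000, phi=0.8, trend=0.01, seed=1))
print(f"no drift: t = {t_flat:.2f}   with drift: t = {t_trend:.2f}")
```

A filter built on this would reject a pair when the absolute t-statistic exceeds some cutoff; where to put that cutoff is a design choice this sketch does not settle.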

I ran this whole machine on S&P 500 stocks at daily resolution first. It found nothing worth trading. That was the first no — and honestly, fair enough. Single stocks carry their own idiosyncratic, unpredictable noise. So I went looking for cleaner tethers.

Round 1: four ETF pairs, four different ways to fail

ETFs felt like the obvious upgrade. A bond-curve pair, a couple of commodity-country pairs, the classic stock-vs-bond hedge — each one has a story for why the gap should revert. So I tested four, and here’s the thing that still makes me smile bitterly: all four failed, and no two failed the same way.

Pair	Why they should revert	How it actually failed
TLT / IEF	Two points on the same Treasury yield curve	Spread too quiet — the optimizer wanted to enter at noise-level wiggles (`a*` clamped to 1)
EWA / EWC	Two commodity-exporting economies (Australia, Canada)	The commodity cycle dragged the spread into a slow trend — drift filter rejected 100%
GLD / SLV	Gold and silver, the original “physical” arb	`a*` actually unlocked — but a multi-year trending ratio tripped the drift filter
SPY / TLT	The textbook stocks-vs-bonds seesaw	The hedge ratio regime-switched from −0.2 to +2.3 — no stable relationship to trade

Four pairs, four failure modes, one verdict. And here’s why I didn’t find that depressing: the framework wasn’t broken. GLD/SLV proved it — there, the Bertram optimizer did unlock a real entry threshold, which means the machine responds correctly when a spread genuinely has magnitude. It wasn’t rejecting everything reflexively. It was rejecting these spreads for specific, different, correct reasons.

But it left me with a suspect. ETFs are wrappers. They carry roll yield, tracking error, dividend-cash-flow noise: distortions that have nothing to do with the underlying assets and everything to do with the fund structure. What if the drift my filters kept flagging was partly wrapper drift? The most famous cautionary tale here is USO in 2020, the oil ETF that bled out on contango and lost far more than the oil it tracked. If the wrapper was the villain, the fix was obvious: drop the wrapper. Trade the actual contracts.

Round 2: futures, and the promise of physical arbitrage

This is the part I was genuinely excited about. With commodity futures you get something equities and ETFs simply can’t offer: an arbitrage that real humans with refineries, crushers, and storage tanks are economically forced to defend.

Crack spread diagram: crude oil into a refinery out to gasoline, with the spread equal to the refinery margin

The crack spread is the refinery’s margin. When it gets out of line, real money piles in to pull it back — the reversion isn’t a statistical coincidence, it’s produced by people whose jobs depend on it

A refinery buys crude and sells gasoline, so the crack spread between them is literally their margin — and when it gets out of line, real money piles in to pull it back. That’s supposed to be the one corner of the market where edge survives. So I went after the two Tier-1 spreads:

The calendar spread — crude oil now vs. crude oil three months out. The tether is the cost of carry (storage + financing).
The crack spread — gasoline vs. crude. The tether is the refinery margin itself.
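For anyone who has not built these before, the arithmetic behind the two spreads can be sketched in a few lines. Everything here is an assumption for illustration: the prices are made up, the 1:1 crack ratio is the simplest variant, and the framework itself may well work on log prices or Kalman-weighted legs instead. The one fixed fact is the unit conversion: gasoline futures are quoted in dollars per gallon, crude in dollars per barrel, and a barrel is 42 gallons.

```python
GALLONS_PER_BARREL = 42


def crack_spread(gasoline_usd_per_gal, crude_usd_per_bbl):
    """Simple 1:1 crack spread in USD per barrel (the refinery's gross margin)."""
    return gasoline_usd_per_gal * GALLONS_PER_BARREL - crude_usd_per_bbl


def calendar_spread(front_usd_per_bbl, deferred_usd_per_bbl):
    """Deferred minus front contract: positive in contango, negative in backwardation."""
    return deferred_usd_per_bbl - front_usd_per_bbl


# Hypothetical prices, chosen only to show the units lining up.
print(crack_spread(2.50, 80.0))     # gasoline at $2.50/gal against crude at $80/bbl
print(calendar_spread(80.0, 81.5))  # three-month contract trading $1.50 over the front
```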

Before any strategy code, I spent an afternoon on data plumbing on the QuantConnect backtesting platform. Once the data was clean, I ran a quick gross-of-cost probe on each spread, looking only at the statistics. Both looked promising: strong cointegration, a tradeable half-life, and, for the crack, a rich, fat spread that swings 20–50%. By every reversion metric, this was the best-looking setup I’d seen in the whole project. All of a sudden, my confidence in finding an edge was through the roof.

And then I ran them through the full framework.

The plot twist: clean reversion around a moving target

Both spreads reverted beautifully. And both got rejected anyway. Here’s the funnel for the crack spread — how many pairs survive each filter:

Screening funnel for the crack spread showing 40 candidates narrowing to 1 tradeable pair

40 cointegrating candidates go in; exactly one survives every filter. The two coral stages — drift_test and mu_e_drift — do almost all the killing

Look where the cliffs are. The spread passes the “does it revert?” tests with flying colors — the Hurst exponent came in around 0.09 (anything under 0.5 is mean-reverting; 0.09 is emphatically mean-reverting), and it snapped back inside the out-of-sample window almost every time. The reversion is real. It is, if anything, cleaner than the calendar spread.
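If the Hurst number feels abstract, here is a minimal sketch of one common estimator: measure how the standard deviation of lagged differences grows with the lag, and read the exponent off a log-log regression. A random walk grows like lag^0.5, while a strongly reverting spread stops growing almost immediately and so scores close to zero. The estimator choice, function names, and simulated series are assumptions for illustration, not the framework's implementation.

```python
import math
import random


def hurst_exponent(series, max_lag=50):
    """Slope of log(std of lagged differences) against log(lag)."""
    log_lags, log_sds = [], []
    for lag in range(2, max_lag + 1):
        diffs = [series[t + lag] - series[t] for t in range(len(series) - lag)]
        mean = sum(diffs) / len(diffs)
        sd = math.sqrt(sum((d - mean) ** 2 for d in diffs) / len(diffs))
        log_lags.append(math.log(lag))
        log_sds.append(math.log(sd))
    x_mean = sum(log_lags) / len(log_lags)
    y_mean = sum(log_sds) / len(log_sds)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(log_lags, log_sds))
    sxx = sum((x - x_mean) ** 2 for x in log_lags)
    return sxy / sxx


def ar1_path(n, phi, seed):
    """Hypothetical data: phi=1.0 is a random walk, phi<1 is mean-reverting."""
    rng = random.Random(seed)
    value, path = 0.0, []
    for _ in range(n):
        value = phi * value + rng.gauss(0.0, 1.0)
        path.append(value)
    return path


h_walk = hurst_exponent(ar1_path(5000, phi=1.0, seed=7))
h_reverting = hurst_exponent(ar1_path(5000, phi=0.8, seed=7))
print(f"random walk: H = {h_walk:.2f}   reverting spread: H = {h_reverting:.2f}")
```

Note what this statistic does not see: a spread can snap back hard over a few bars, and so score a very low H over short lags, while its center still wanders over months.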

But two filters do almost all the killing, and they’re the same two for both spreads: the drift_test (the spread is quietly trending inside the window) and mu_e_drift (the spread’s resting point jumps around from month to month). In plain English:

These commodity spreads mean-revert cleanly — but around a center that keeps moving. The rubber band snaps back reliably; the peg it’s tied to slides across the table. A strategy that assumes a fixed mean gets run over, and the drift filters correctly refuse to play.
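One way to put a number on “the peg slides across the table” is to compare how far the window-by-window mean moves with how noisy the spread is inside each window. The sketch below does that with 21-bar windows (roughly a trading month). It is my illustration of the idea behind `mu_e_drift`, not its actual definition; the function names, window length, and simulated spreads are all assumptions.

```python
import random
import statistics


def equilibrium_drift(spread, window=21):
    """Average shift between consecutive window means, in units of within-window std."""
    chunks = [spread[i:i + window] for i in range(0, len(spread) - window + 1, window)]
    means = [statistics.fmean(chunk) for chunk in chunks]
    typical_noise = statistics.fmean(statistics.pstdev(chunk) for chunk in chunks)
    shifts = [abs(later - earlier) for earlier, later in zip(means, means[1:])]
    return statistics.fmean(shifts) / typical_noise


def reverting_spread(n, slide_per_bar, phi=0.5, seed=3):
    """Hypothetical spread: fast mean reversion around a center that slides linearly."""
    rng = random.Random(seed)
    noise, out = 0.0, []
    for t in range(n):
        noise = phi * noise + rng.gauss(0.0, 1.0)
        out.append(slide_per_bar * t + noise)
    return out


fixed_score = equilibrium_drift(reverting_spread(1260, slide_per_bar=0.0))
sliding_score = equilibrium_drift(reverting_spread(1260, slide_per_bar=0.1))
print(f"fixed center: {fixed_score:.2f}   sliding center: {sliding_score:.2f}")
```

Both simulated spreads use identical reverting noise; only the center differs, which is the whole point of the comparison.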

Two-panel comparison of a fixed mean versus a drifting center that a fixed-mean model gets run over by

Left: what a fixed-mean model assumes. Right: what the commodity spreads actually do — clean reversion around a centre that slides. Same clean snapping-back, completely different tradeability

That’s the same fingerprint I saw on the trending ETF pairs. Escaping the wrapper didn’t change the answer — which tells me the wrapper was never really the villain. The villain is structural: a fixed-mean mean-reversion model meeting spreads whose equilibrium genuinely drifts, at daily resolution.

Along the way I confidently predicted that the crack’s rich spread would “unlock” the Bertram threshold (a*) the way it never did for the quiet bond ETFs. I was wrong — and wrong in an instructive way. When you run gross-of-cost (zero fees, to isolate raw edge), the cost term in the Bertram math goes to zero, and the optimizer pins the threshold to its floor regardless of how fat the spread is. The “clamp” I’d been reading as a signal about spread size was, in that mode, just an artifact of turning costs off. The fat spread was real; my interpretation of that one number was not. Good frameworks catch your data errors. They don’t catch your overconfidence — that one’s on me.
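The clamp is easy to reproduce on paper. Assume the symmetric-band form of the expected return per unit time in Bertram (2010): enter at a distance a below the mean, exit at a above it, pay cost c per round trip, giving μ(a) = α(2a − c) / (2π·erfi(a√α/η)) for an OU spread with reversion speed α and volatility η. The sketch below grid-searches that objective. It is a simplification under stated assumptions: the framework targets a Sharpe-type objective that this does not reproduce, and the parameter values, grid, and function names are hypothetical.

```python
import math


def erfi(x, terms=80):
    """Imaginary error function via its power series (fine for the small x used here)."""
    total, term = 0.0, x  # term holds x^(2n+1) / n!
    for n in range(terms):
        total += term / (2 * n + 1)
        term *= x * x / (n + 1)
    return 2.0 / math.sqrt(math.pi) * total


def return_per_unit_time(a, cost, alpha=1.0, eta=1.0):
    """Assumed Bertram-style objective for entry at -a and exit at +a."""
    scaled = a * math.sqrt(alpha) / eta
    return alpha * (2.0 * a - cost) / (2.0 * math.pi * erfi(scaled))


def best_threshold(cost, floor=0.01, ceiling=3.0, step=0.01):
    """Grid search for the entry distance that maximizes the objective."""
    steps = int(round((ceiling - floor) / step))
    grid = [floor + i * step for i in range(steps + 1)]
    return max(grid, key=lambda a: return_per_unit_time(a, cost))


a_no_cost = best_threshold(cost=0.0)
a_with_cost = best_threshold(cost=0.1)
print(f"zero cost: a* = {a_no_cost:.2f}   cost 0.1: a* = {a_with_cost:.2f}")
```

With c = 0 the numerator grows linearly in a while erfi grows faster than linearly, so the objective only ever falls as a increases and the search lands on whatever floor you give it. Add any positive cost and an interior optimum appears.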

Why “no” is the actual product

It would be easy to read this as four weeks of failure. I don’t, because every one of these rejections was specific and correct. Too-quiet spread. Commodity-cycle drift. Regime-switching hedge. Drifting equilibrium. The framework never once gave me a false green light, never sold me a backtest that would’ve melted in production. In a domain where the default failure mode is fooling yourself with an overfit equity curve, a tool whose strongest feature is a well-reasoned no is not a consolation prize. It’s the whole asset.

Where this leaves me

I went looking for a commodity arbitrage and came back with four ETF rejections, two futures rejections, and a much sharper question than the one I started with. The holy grail of an ever-profitable mean-reversion strategy stayed exactly as hidden as it was before. But I trust my map of where it isn’t a great deal more — and I trust the instrument I’m using to draw that map.

Maybe that’s the quiet lesson in all of this. The market is always happy to hand you a reason to trade. But the mechanisms hidden behind the curtain (drifting equilibria, shifting hedge ratios, slow commodity-cycle trends) are what quietly wear down your profit and make a strategy bleed. And from the futures backtests I now have something better than an edge: a precise description of why there isn’t one here, which points me directly to what to try next.

A moving-center model. The spreads revert — that part is real and strong. The fixed mean is what fails. So instead of measuring deviations from a frozen average, track the center as it moves (a Kalman-filtered mean, or a detrended spread) and trade deviations from that. This is a genuine redesign, not a tweak — but those clean Hurst numbers are practically begging for it.
Minute resolution. The center drifts over days and months. Zoom in to intraday bars and that drift nearly vanishes within a session — the mean is approximately stationary for the few hours you’d hold. The exact same framework that rejected everything at daily resolution might finally find an equilibrium that sits still long enough to trade. That’s the next experiment.
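To show what the moving-center idea might look like in its simplest form, here is a sketch of a one-dimensional Kalman filter (a local-level model) that tracks the center and measures each bar's deviation from the previous bar's estimate, so nothing looks ahead. This is an untested direction, not a result: the noise variances `q` and `r`, the function names, and the simulated spread are all hypothetical.

```python
import random


def kalman_center(spread, q=0.01, r=1.0):
    """One-step-ahead estimates of a slowly moving center (local-level Kalman filter)."""
    level, var = spread[0], r
    priors = []
    for obs in spread:
        var += q                       # predict: the center may have moved a little
        priors.append(level)           # estimate available before seeing this bar
        gain = var / (var + r)
        level += gain * (obs - level)  # update with the new observation
        var *= 1.0 - gain
    return priors


def sliding_spread(n, slide_per_bar=0.02, phi=0.5, seed=11):
    """Hypothetical spread: clean reversion around a center that slides."""
    rng = random.Random(seed)
    noise, out = 0.0, []
    for t in range(n):
        noise = phi * noise + rng.gauss(0.0, 1.0)
        out.append(slide_per_bar * t + noise)
    return out


def mean_abs(values):
    return sum(abs(v) for v in values) / len(values)


spread = sliding_spread(2000)
warmup = 250
frozen_mean = sum(spread[:warmup]) / warmup  # the fixed-mean model's view of "the" mean
centers = kalman_center(spread)

dev_frozen = [s - frozen_mean for s in spread[warmup:]]
dev_tracked = [s - c for s, c in zip(spread, centers)][warmup:]
print(f"mean |deviation| from frozen mean:  {mean_abs(dev_frozen):.2f}")
print(f"mean |deviation| from tracked mean: {mean_abs(dev_tracked):.2f}")
```

The catch is the ratio q/r: set it too high and the filter chases the very deviations you want to trade, set it too low and you are back to a frozen mean. Choosing it without overfitting is the real work in that redesign.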

I hope you enjoyed following the dead ends as much as I enjoyed walking them. If you’ve got a sharper idea for handling a drifting equilibrium, let me know — I’d love to hear it. On to minute bars.

References

Bertram, W. K. (2010). Analytic solutions for optimal statistical arbitrage trading. Physica A: Statistical Mechanics and its Applications, 389(11), 2234–2243.
Engle, R. F., & Granger, C. W. J. (1987). Co-integration and error correction: Representation, estimation, and testing. Econometrica, 55(2), 251–276.
Johansen, S. (1991). Estimation and hypothesis testing of cointegration vectors in Gaussian vector autoregressive models. Econometrica, 59(6), 1551–1580.
Uhlenbeck, G. E., & Ornstein, L. S. (1930). On the theory of the Brownian motion. Physical Review, 36(5), 823–841.
Hurst, H. E. (1951). Long-term storage capacity of reservoirs. Transactions of the American Society of Civil Engineers, 116, 770–799.
Kalman, R. E. (1960). A new approach to linear filtering and prediction problems. Journal of Basic Engineering, 82(1), 35–45.
CQF — Certificate in Quantitative Finance, CQF Institute (Fitch Learning).

Michael's blog

Is There Still Edge in Pairs Trading? My Framework Said "Not Here" — Three Times

What testing cointegration and mean reversion across stocks, ETFs, and commodity futures taught me about statistical arbitrage