Look-Ahead Bias: How a One-Bar Mistake Manufactures a Sharpe of 15 From Pure Noise

Part of the "Backtests Without Illusions" series.

📄 This article grew into a research paper that puts three subtle look-ahead leaks to a controlled test against known ground truth (4,000 simulated histories). Read the paper online (interactive version + PDF) at lookahead.marketmaker.cc; code and data are at github.com/suenot/lookahead-inflation.

A few weeks ago our parameter-search benchmark was lying to us, and we almost didn't notice.

The engine looked clean. Closed-bar logic, an honest rolling walk-forward split, a Sobol/QMC search over the parameter space, a held-out test window. The search found configurations that looked good in-sample. The only problem: out-of-sample, almost everything was negative. We assumed the strategy was simply weak.

Then we found one line. The signal was decided on the close of bar i, but the fill was booked on the same bar i instead of the next bar's open. One off-by-one in the execution index. We moved the fill to open[i+1] — the only price you could actually transact at after seeing the bar-i close — and the out-of-sample result flipped sign. The Sobol search went from a loss to a profit. Nothing about the strategy changed. We had just stopped trading in the past.

That is look-ahead bias, and the unsettling part is how small the mistake was and how large the consequence. This article is a controlled self-audit: we build a simulator where the ground truth is known by construction, inject the subtle leaks one at a time, and measure exactly how much each one inflates the backtest. The headline: with no real edge at all, a same-bar fill manufactures an annualized Sharpe of +14.8 out of pure noise.

What look-ahead bias actually is

The three places look-ahead hides — execution, normalization, and indicators — as channels of unequal danger feeding one trading decision

Look-ahead bias is any point in your pipeline where a decision or a measurement uses information that would not have been available, in real time, at the moment it is used. The textbook examples are coarse — using a stock's full-year earnings in January, or a restatement that wasn't published yet. Those are easy to spot. The ones that survive code review are subtle, and they hide in three places:

Execution — you decide on bar i and fill on bar i (or use bar i's high/low for stops on the very bar that generated the signal). You transact at a price that is correlated with the thing that triggered you.
Normalization — you z-score, min-max, or otherwise scale a feature using statistics computed over the whole series, including the future. The scaler "knows" the test set.
Indicators / features — you smooth or filter with a window that is centered (or otherwise peeks forward), so the value at bar i already contains a piece of bar i+1.

All three are forms of what the machine-learning literature calls leakage: the contamination of training/evaluation with information from the target's future (Kaufman et al., 2012; Kapoor & Narayanan, 2023). In finance the canonical treatment is López de Prado's Advances in Financial Machine Learning (2018) — purged cross-validation, embargoing, the dangers of backtesting. The point-in-time discipline goes back at least to Fama & French (1992), who deliberately lag accounting data by six months so the variable is known before the return it explains.

The question this article answers is quantitative: not "is leakage bad" (everyone agrees) but "how many Sharpe points does each form buy you, and which ones are dangerous?" Without a number you can't reason about it. You can't tell whether a +0.3 inflation is noise or a +14 inflation is a smoking gun.

A simulator with known ground truth

A controlled synthetic market with a known edge dial: a null world with no real edge beside an edge world whose equity genuinely rises

To measure inflation you need to know the truth. Real data never tells you the truth — it gives you one realization and no oracle. So we build a synthetic market where we set the edge.

The data-generating process is strictly causal and non-explosive:

$g_t = \phi\, g_{t-1} + \sqrt{1-\phi^2}\; u_t, \qquad u_t \sim \mathcal{N}(0,1)$

$r_t = a\, g_{t-1} + \sigma\, \varepsilon_t, \qquad \varepsilon_t \sim \mathcal{N}(0,1)$

Here $g_t$ is an exogenous persistent latent drift (an AR(1) with $\phi = 0.95$), and the bar return $r_t$ has a small drift $a\,g_{t-1}$ that is known one bar in advance. Because $g$ does not depend on past returns, there is no feedback and nothing explodes. The parameter $a$ is the dial for how much real edge exists:

$a = 0$ — the null: no edge whatsoever. Any positive backtest Sharpe is 100% artifact.
$a > 0$ — a real, tradable edge: an honest momentum rule actually makes money.
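
The two equations above can be sketched in a few lines of NumPy. This is a minimal illustration, not the companion repository's code: the function name `simulate` and its argument names are assumptions made here, and the first bar is simply given no drift because $g_{-1}$ does not exist.

```python
import numpy as np

def simulate(n, a, phi=0.95, sigma=0.01, seed=0):
    """Return (g, r): the latent AR(1) drift and the bar returns, both of length n."""
    rng = np.random.default_rng(seed)
    u = rng.standard_normal(n)
    eps = rng.standard_normal(n)

    g = np.empty(n)
    g[0] = u[0]                                   # start at the stationary variance of 1
    scale = np.sqrt(1.0 - phi**2)
    for t in range(1, n):
        g[t] = phi * g[t - 1] + scale * u[t]      # exogenous: no feedback from returns

    r = sigma * eps                               # pure noise part
    r[1:] += a * g[:-1]                           # drift of bar t is set by g[t-1]
    return g, r

g_null, r_null = simulate(n=4000, a=0.0)          # a = 0: the null world, no edge
g_edge, r_edge = simulate(n=4000, a=0.0011)       # a > 0: a real, tradable edge
```

Because the drift of bar t depends only on `g[t-1]`, anything the strategy learns from data through bar t-1 is legitimately usable for bar t, and nothing later is.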

The strategy is deliberately simple: a momentum sign rule. The feature is the trailing-$L$ sum of returns ($L = 24$ bars), and the position is its sign:

import numpy as np

# r: array of n bar returns; L: lookback in bars (24)
csum = np.concatenate(([0.0], np.cumsum(r)))      # csum[k] = sum r[0..k-1]
mom = np.full(n, np.nan)                           # undefined until L bars exist
tt = np.arange(L - 1, n)
mom[tt] = csum[tt + 1] - csum[tt - L + 1]          # sum r[t-L+1..t], includes r[t]

signal = np.sign(mom)                              # position for the next bar

This momentum feature is the perfect vehicle for studying the same-bar leak, because it has a property real indicators share: it mechanically contains the current bar. mom[t] includes r[t]. So if you book r[t] as your trade, you are partly betting on a quantity that is already inside your own signal. That is the leak, made concrete.
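
A quick numerical check makes the overlap visible. Under the null, `mom[t]` is a sum of L independent returns, one of which is `r[t]`, so their correlation should be about 1/sqrt(L), while the correlation with `r[t+1]` should be about zero. The sketch below is our own illustration on freshly drawn noise; the sample size and seed are arbitrary choices.

```python
import numpy as np

L, n, sigma = 24, 200_000, 0.01
rng = np.random.default_rng(0)
r = sigma * rng.standard_normal(n)                # null world: pure noise

csum = np.concatenate(([0.0], np.cumsum(r)))
t = np.arange(L - 1, n - 1)                       # bars where mom[t] and r[t+1] both exist
mom = csum[t + 1] - csum[t - L + 1]               # trailing-L sum ending at bar t

corr_same = np.corrcoef(mom, r[t])[0, 1]          # signal vs. the bar it was built from
corr_next = np.corrcoef(mom, r[t + 1])[0, 1]      # signal vs. the tradable next bar

print(f"same bar: {corr_same:.3f} (theory 1/sqrt(L) = {1 / np.sqrt(L):.3f})")
print(f"next bar: {corr_next:.3f} (theory 0)")
```

A correlation of roughly 0.2 between the signal and the return you book is a very large edge at hourly frequency, and it comes entirely from the indexing.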

Setup: $\sigma = 0.01$ (1% per-bar volatility), a one-way fee of 0.00045 (round-trip 0.09%, matching our engine), Sharpe annualized by $\sqrt{8760}$ (hourly bars), 4,000 independent histories of 4,000 bars each. Everything is seeded and deterministic.

The honest pipeline (the only tradable one)

Decide on the close of bar t, earn the next bar's return, pay fees on position changes:

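FEE_ONEWAY = 0.00045                                       # one-way fee from the setup above

def sharpe(sig, ret_booked):
    """Annualized Sharpe of a position series against the returns it is booked on."""
    dpos = np.abs(np.diff(np.concatenate(([0.0], sig))))   # position change, incl. the initial entry
    pnl  = sig * ret_booked - FEE_ONEWAY * dpos            # fee charged on every unit of turnover
    return pnl.mean() / pnl.std() * np.sqrt(8760)          # hourly bars -> annualized

# idx: the evaluation bars t (signal defined at t, bar t+1 exists)
honest = sharpe(signal[idx], r[idx + 1])           # earn r[t+1]: tradable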

The three leaks, each a single surgical change

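# Leak 1, execution: book the signal bar's own return r[t] instead of r[t+1]
same_bar  = sharpe(signal[idx], r[idx])

# Leak 2, normalization: z-score with whole-series mean/std (valid = bars where mom is defined)
z_full    = (mom - mom[valid].mean()) / mom[valid].std()
norm_full = sharpe(np.sign(z_full[idx]), r[idx + 1])

# Leak 3, indicator: centered 3-bar smoother, aligned so the value at t uses t-1, t, t+1
z_sm       = np.full(n, np.nan)
z_sm[1:-1] = (mom[:-2] + mom[1:-1] + mom[2:]) / 3.0
indicator  = sharpe(np.sign(z_sm[idx]), r[idx + 1])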

Each leak is one line away from the honest pipeline. That is the whole point: these are not exotic mistakes, they are the kind of thing that passes review.
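
For readers who want to run the four pipelines side by side, here is a self-contained sketch on a single pure-noise history. It is our own assembly of the fragments above, not the paper's script: the helper name `run_pipelines`, the choice of evaluation bars `idx`, the history length, and the seed are all assumptions, so the printed numbers will differ from the 4,000-history averages reported in this article.

```python
import numpy as np

L, FEE_ONEWAY, ANN = 24, 0.00045, np.sqrt(8760)

def sharpe(sig, ret_booked):
    dpos = np.abs(np.diff(np.concatenate(([0.0], sig))))
    pnl = sig * ret_booked - FEE_ONEWAY * dpos
    return pnl.mean() / pnl.std() * ANN

def run_pipelines(r):
    n = len(r)
    csum = np.concatenate(([0.0], np.cumsum(r)))
    mom = np.full(n, np.nan)
    tt = np.arange(L - 1, n)
    mom[tt] = csum[tt + 1] - csum[tt - L + 1]
    signal = np.sign(mom)

    # Assumed evaluation bars: mom[t-1], mom[t], mom[t+1] and r[t+1] all exist.
    idx = np.arange(L, n - 1)
    valid = ~np.isnan(mom)

    z_full = (mom - mom[valid].mean()) / mom[valid].std()   # whole-series statistics
    z_sm = np.full(n, np.nan)
    z_sm[1:-1] = (mom[:-2] + mom[1:-1] + mom[2:]) / 3.0     # centered: value at t sees t+1

    return {
        "honest": sharpe(signal[idx], r[idx + 1]),
        "same_bar": sharpe(signal[idx], r[idx]),
        "norm_full": sharpe(np.sign(z_full[idx]), r[idx + 1]),
        "indicator": sharpe(np.sign(z_sm[idx]), r[idx + 1]),
    }

rng = np.random.default_rng(0)
r = 0.01 * rng.standard_normal(20_000)            # null world: no edge at all
results = run_pipelines(r)
for name, value in results.items():
    print(f"{name:>10}: {value:+.2f}")
```

Keeping all four variants inside one function that shares `mom`, `idx`, and `sharpe` is deliberate: the only thing that differs between pipelines is the single leaked line, so any difference in the output is attributable to that line.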

Results: the magnitude of each leak

Pure market noise funneled through a same-bar leak into a soaring, fake equity curve and a performance gauge pinned near its maximum

Run across 4,000 seeds, here is the annualized Sharpe each pipeline reports, under the null (no edge) and under a real edge ($a = 0.0011$, tuned so the honest Sharpe is a believable +1.57):

Pipeline	Null (no edge)	Real edge
Honest (the truth)	−0.74	+1.57
Same-bar fill	+14.79	+15.85
Indicator peek (1 bar)	+4.76	+6.62
Whole-series normalization	−0.84	+1.46

95% confidence intervals across seeds are ±0.05 or tighter on every cell; paired t-tests on the inflation are astronomically significant where the effect is real (t > 400, p ≈ 0).

Read the null column first, because it is the cleanest possible experiment: there is no edge, so the honest pipeline correctly loses money (−0.74, the drag of paying fees to trade noise). Now look at what the leaks do to that same nothing:

Same-bar fill: −0.74 → +14.79. A strategy with zero predictive power, trading random noise, reports an annualized Sharpe of nearly 15. This is not a subtle bias; it is a fabrication. The mechanism is exactly the one we built in: the momentum feature contains r[t], so booking r[t] is betting on your own signal.
Indicator peek: −0.74 → +4.76. Letting the smoother see one bar into the future manufactures a Sharpe near 5 from noise, because the smoothed value at t now correlates with the r[t+1] you are about to earn.
Whole-series normalization: −0.74 → −0.84. Essentially no inflation. This is the honest, non-obvious finding (more on it below).
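
The size of the same-bar number can be sanity-checked with a back-of-envelope calculation. The sketch below is our own approximation, not a derivation from the paper. It assumes Gaussian returns under the null, uses the standard result that E[sign(X)·Y] = ρ·σ_Y·sqrt(2/π) for jointly Gaussian zero-mean X and Y, uses arccos(ρ)/π for the probability that two Gaussians with correlation ρ differ in sign, and ignores the small variance contributed by fees.

```python
import math

L, sigma, fee, bars_per_year = 24, 0.01, 0.00045, 8760
ann = math.sqrt(bars_per_year)

rho = 1 / math.sqrt(L)                            # corr(mom[t], r[t]) under the null
gross = rho * sigma * math.sqrt(2 / math.pi)      # E[sign(mom[t]) * r[t]] per bar

# Consecutive momentum values share L-1 of their L terms, so their correlation is (L-1)/L.
p_flip = math.acos((L - 1) / L) / math.pi         # chance the sign flips from one bar to the next
fee_drag = 2 * fee * p_flip                       # a flip changes the position by 2 units

honest_est = -fee_drag / sigma * ann              # no edge: only the fee drag remains
same_bar_est = (gross - fee_drag) / math.sqrt(sigma**2 - gross**2) * ann

print(f"honest   ~ {honest_est:+.2f}")
print(f"same bar ~ {same_bar_est:+.2f}")
```

Both estimates land close to the simulated −0.74 and +14.79, which suggests the inflation is a mechanical property of the indexing and the lookback length rather than a quirk of particular seeds.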

The edge column delivers the more insidious message. When a real edge does exist (honest +1.57), the leaks don't just add a constant — they push the measured Sharpe to +15.85 and +6.62, far above the +1.57 you could actually trade. So the measured number cannot distinguish skill from leak. A leaked +6 and an honest +6 look identical on the report. You only find out which one you had after you've deployed capital.

The leak is a gradient, not a switch

The same-bar leak as a smooth dose-response: capturing a larger fraction of the signal bar lifts the equity curve monotonically across the deployable threshold

A natural objection: "booking the entire signal bar is an extreme, unrealistic mistake." So we swept the dose — the fraction $f$ of the signal bar captured by the leak, from 0 (honest) to 1 (full same-bar):

Capture fraction $f$	Null Sharpe	Edge Sharpe
0.00 (honest)	−0.74	+1.57
0.25	+3.90	+6.41
0.50	+9.86	+12.20
1.00 (full leak)	+14.79	+15.85

Capturing just a quarter of the signal bar takes a no-edge strategy from −0.74 to +3.90. You do not need the full off-by-one to be fooled; a fill that is slightly too favorable — a touch of optimistic slippage on the signal bar, an intrabar stop checked against the bar that triggered it — is enough to clear most "deployable" thresholds. The inflation is smooth and monotone in how much of the present you let yourself trade.
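
One way to model a partial leak is to book a convex blend of the signal bar and the next bar, `f * r[t] + (1 - f) * r[t+1]`. That blend is an assumption of this sketch, chosen because f = 0 recovers the honest pipeline and f = 1 recovers the full same-bar fill; the paper's exact definition is not reproduced here. The helper name `dose_sweep`, the history length, and the seed are likewise illustrative.

```python
import numpy as np

L, FEE_ONEWAY = 24, 0.00045

def sharpe(sig, ret_booked):
    dpos = np.abs(np.diff(np.concatenate(([0.0], sig))))
    pnl = sig * ret_booked - FEE_ONEWAY * dpos
    return pnl.mean() / pnl.std() * np.sqrt(8760)

def dose_sweep(r, fractions):
    n = len(r)
    csum = np.concatenate(([0.0], np.cumsum(r)))
    t = np.arange(L - 1, n - 1)                   # mom[t] and r[t+1] both exist
    sig = np.sign(csum[t + 1] - csum[t - L + 1])
    out = {}
    for f in fractions:
        # ASSUMED leak model: a fraction f of the booked return comes from the signal bar.
        booked = f * r[t] + (1.0 - f) * r[t + 1]
        out[f] = sharpe(sig, booked)
    return out

rng = np.random.default_rng(0)
r = 0.01 * rng.standard_normal(20_000)            # null world: no edge
sweep = dose_sweep(r, [0.0, 0.25, 0.5, 1.0])
for f, s in sweep.items():
    print(f"f = {f:.2f}: Sharpe {s:+.2f}")
```

On a single history the values are noisier than the table, but the ordering should be the same: each additional slice of the signal bar raises the reported Sharpe.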

How often does this put a losing strategy into production?

The number that should worry a practitioner is the false-deployment rate: how often a leak makes a truly money-losing configuration clear the bar you would use to greenlight it. Using "annualized Sharpe ≥ 1.0" as the deploy criterion, under the null:

Same-bar fill: 68% of no-edge strategies look deployable and are truly loss-making. Two out of three pure-noise configurations would pass a Sharpe-≥-1 gate and lose money live. (This rate is cleanly defined here because the leak is purely in execution — the honest counterpart is the same signal with an honest fill.)
Indicator peek: it pushes essentially every no-edge configuration over the deploy bar too (99.9% clear Sharpe ≥ 1) — it would wave noise straight into production.
Whole-series normalization: 12% clear the bar — essentially the base rate of noise, no leakage premium.
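
The false-deployment rate for the same-bar leak can be estimated directly. The sketch below reads "looks deployable and is truly loss-making" as: leaked Sharpe ≥ 1.0 and honest Sharpe < 0 on the same history. That reading, the reduced number of histories (300 rather than 4,000), and the helper name `honest_and_leaked` are assumptions made here to keep the example fast.

```python
import numpy as np

L, FEE_ONEWAY, N_BARS, N_HIST = 24, 0.00045, 4000, 300

def sharpe(sig, ret_booked):
    dpos = np.abs(np.diff(np.concatenate(([0.0], sig))))
    pnl = sig * ret_booked - FEE_ONEWAY * dpos
    return pnl.mean() / pnl.std() * np.sqrt(8760)

def honest_and_leaked(r):
    csum = np.concatenate(([0.0], np.cumsum(r)))
    t = np.arange(L - 1, len(r) - 1)
    sig = np.sign(csum[t + 1] - csum[t - L + 1])
    return sharpe(sig, r[t + 1]), sharpe(sig, r[t])   # (honest fill, same-bar fill)

rng = np.random.default_rng(0)
pairs = np.array([
    honest_and_leaked(0.01 * rng.standard_normal(N_BARS))   # null world each time
    for _ in range(N_HIST)
])
honest, leaked = pairs[:, 0], pairs[:, 1]

passes_gate = leaked >= 1.0                       # would be greenlit on the leaked report
false_deploy = np.mean(passes_gate & (honest < 0.0))

print(f"pass the gate:      {passes_gate.mean():.1%}")
print(f"pass and lose live: {false_deploy:.1%}")
```

Pairing the honest and leaked Sharpe on the same history is what makes the rate well defined: the gate is evaluated on the number you would have seen, and the outcome on the number you would have earned.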

The taxonomy, and how to detect each one

The three leaks are not equally dangerous, and the differences are instructive.

1. Execution leakage (the expensive one)

The same-bar fill off-by-one: an execution arrow curling back into the bar that generated the signal, versus the honest fill on the next bar's open

Symptom: the fill price is correlated with the signal because they come from the same bar. Magnitude: enormous (+15 from noise at full dose, +3.9 at quarter dose). Why it's the worst: your signal is, almost by definition, built from recent price action, so the signal bar's return is exactly the thing your feature is most correlated with. Booking it is close to looking up the answer.

Detection: the one-bar shift test. This is the single most valuable diagnostic in this article. Take your backtest and shift every fill one bar later (decide on i, fill on open[i+1]). If the result barely moves, your execution was honest. If the result collapses or flips sign, you were trading in the past. This is precisely what happened to our Sobol search: we shifted the fills, the out-of-sample result flipped sign, and the real relationship surfaced once the leak was removed.

entry_price = open_[i + 1]      # NOT close[i], NOT open[i]
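
The shift test itself is only a few lines. The sketch below works on return arrays rather than order fills to stay short; `shift_test`, its arguments, and the demo data are hypothetical names and choices for illustration. It takes the positions and the returns a backtest booked against them, then re-scores the same positions against returns one bar later.

```python
import numpy as np

L, FEE_ONEWAY = 24, 0.00045

def sharpe(sig, ret_booked):
    dpos = np.abs(np.diff(np.concatenate(([0.0], sig))))
    pnl = sig * ret_booked - FEE_ONEWAY * dpos
    return pnl.mean() / pnl.std() * np.sqrt(8760)

def shift_test(position, booked_ret):
    """Return (as_coded, shifted): the same decisions, scored as-is and one bar later."""
    as_coded = sharpe(position, booked_ret)
    shifted = sharpe(position[:-1], booked_ret[1:])   # every fill delayed by one bar
    return as_coded, shifted

# Demo on pure noise with the momentum sign rule.
rng = np.random.default_rng(0)
r = 0.01 * rng.standard_normal(100_000)
csum = np.concatenate(([0.0], np.cumsum(r)))
t = np.arange(L - 1, len(r) - 1)
sig = np.sign(csum[t + 1] - csum[t - L + 1])

leaky_coded, leaky_shifted = shift_test(sig, r[t])          # backtest that books the signal bar
honest_coded, honest_shifted = shift_test(sig, r[t + 1])    # backtest that books the next bar

print(f"leaky : {leaky_coded:+.2f} -> {leaky_shifted:+.2f}")
print(f"honest: {honest_coded:+.2f} -> {honest_shifted:+.2f}")
```

The leaky backtest should collapse when shifted, while the honest one should move only by sampling noise. A real strategy with a fast-decaying edge will lose something from an extra bar of delay, so judge the size of the drop rather than expecting exactly zero change.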

2. Indicator / feature leakage (the quiet one)

Symptom: an indicator at bar i depends on data from i+1 or later — a centered moving average, a filter with no causal delay, a peak/trough label that needs future bars to confirm, a Heikin-Ashi-style transform fed future candles. Magnitude: large (+4.8 from noise). Why it hides: the leak is buried inside a library call. scipy.signal.filtfilt is zero-phase — and zero-phase means non-causal. A "this bar is a local maximum" feature is unknowable until the next bar prints.

Detection: for every indicator, ask one question: what is the highest index it reads? If computing the value at t ever touches t+1, it is non-causal. Compute indicators on an expanding/rolling causal window and verify that the value at bar t is identical whether or not bars after t exist in the array. (Our HMA/ADX implementations pass this: every output at t reads only inputs at ≤ t.)
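
The "identical whether or not later bars exist" check can be automated by recomputing the indicator on truncated prefixes. The sketch below is a generic illustration with hypothetical helper names (`trailing_mean`, `centered_mean`, `is_causal`); it is not our HMA/ADX test suite, and the centered smoother assumes an odd window.

```python
import numpy as np

def trailing_mean(x, w=3):
    """Causal: the value at t averages x[t-w+1..t]."""
    out = np.full(len(x), np.nan)
    c = np.concatenate(([0.0], np.cumsum(x)))
    out[w - 1:] = (c[w:] - c[:-w]) / w
    return out

def centered_mean(x, w=3):
    """Non-causal: the value at t averages x[t-w//2..t+w//2] (w assumed odd)."""
    half = w // 2
    out = np.full(len(x), np.nan)
    c = np.concatenate(([0.0], np.cumsum(x)))
    out[half:len(x) - half] = (c[w:] - c[:-w]) / w
    return out

def is_causal(indicator, x, n_checks=50, seed=0):
    """True if indicator(x)[:t+1] equals indicator(x[:t+1]) at randomly chosen cut points t."""
    rng = np.random.default_rng(seed)
    full = indicator(x)
    for t in rng.integers(10, len(x) - 10, size=n_checks):
        prefix = indicator(x[: t + 1])            # the world as it looked at bar t
        if not np.allclose(full[: t + 1], prefix, equal_nan=True):
            return False
    return True

x = np.random.default_rng(1).standard_normal(500)
print("trailing_mean causal:", is_causal(trailing_mean, x))
print("centered_mean causal:", is_causal(centered_mean, x))
```

The check treats the indicator as a black box, so it also works on library calls whose internals you have not read. It can only prove a leak, not its absence: passing at sampled cut points on one input is evidence of causality, not a guarantee.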

3. Normalization leakage (the channel-specific one)

Symptom: a scaler (StandardScaler, min-max, a global z-score) is fit on the whole dataset, test set included. The canonical ML warnings are explicit about this — Hastie, Tibshirani & Friedman's Elements of Statistical Learning §7.10.2 ("the wrong and right way to do cross-validation"), and scikit-learn's own common-pitfalls guide: "the average should be the average of the train subset, not the average of all the data."

Magnitude in our test: ≈ zero (−0.74 → −0.84). This is the surprising, honest result, and it is worth understanding rather than memorizing.

Why didn't it inflate? Because our strategy uses the feature only through its sign (a zero threshold). Standard-deviation scaling never changes a sign, and global-mean-centering only nudges the zero crossing slightly. So whole-series standardization of a pure sign rule is nearly innocuous.

Do not over-generalize this. Normalization leakage is channel-specific. The moment your strategy uses the feature's magnitude — position sizing proportional to a z-score, a non-zero entry threshold chosen by looking at the scaled distribution, a neural net consuming standardized inputs — the future-aware scaler starts to matter, and it matters more the more the global statistics differ from the causal ones. Our result is not "normalization leakage is safe." It is "leakage magnitude depends on the channel through which the leaked quantity enters the decision, and you should measure it rather than assume it." A sign rule is the one case where this particular leak is cheap.
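
The channel dependence is easy to see on a toy feature. The sketch below compares a whole-series z-score with a causal expanding z-score on a synthetic feature whose volatility drifts upward over time. The feature, the threshold of 1.0, and the helper names are assumptions made for illustration; this is not one of the paper's experiments.

```python
import numpy as np

def global_zscore(x):
    return (x - x.mean()) / x.std()               # leaky: mean and std include the future

def expanding_zscore(x, min_periods=50):
    """Causal: the value at t uses only the mean and std of x[0..t]."""
    n = len(x)
    k = np.arange(1, n + 1)
    mean = np.cumsum(x) / k
    var = np.cumsum(x**2) / k - mean**2
    z = np.full(n, np.nan)
    ok = (k >= min_periods) & (var > 0)
    z[ok] = (x[ok] - mean[ok]) / np.sqrt(var[ok])
    return z

rng = np.random.default_rng(0)
n = 5000
x = rng.standard_normal(n) * np.linspace(0.5, 2.0, n)   # synthetic feature, volatility drifts up

z_leaky, z_causal = global_zscore(x), expanding_zscore(x)
ok = ~np.isnan(z_causal)

# Channel 1: the decision uses only the sign of the scaled feature.
sign_agree = np.mean(np.sign(z_leaky[ok]) == np.sign(z_causal[ok]))
# Channel 2: the decision uses the magnitude (enter when z > 1).
thresh_agree = np.mean((z_leaky[ok] > 1.0) == (z_causal[ok] > 1.0))

print(f"sign rule agreement:      {sign_agree:.1%}")
print(f"threshold rule agreement: {thresh_agree:.1%}")
```

The two scalers should agree almost everywhere on the sign and noticeably less often on the threshold crossing, which is the sense in which the same leak is cheap through one channel and costly through another.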

Where this connects

Look-ahead bias is the first link in a chain this series has been documenting:

It corrupts the input to validation. A leaked backtest will sail through a walk-forward split and look like a broad plateau rather than an overfit peak — the leak is consistent across folds, so cross-validation cannot catch it. Leakage is the failure mode upstream of overfitting, and no amount of honest validation downstream will save you.
It interacts with parameter search: a search over thousands of trials on leaked data will find the configuration that exploits the leak most aggressively. The "winner" is the worst offender.
It is why backtest-live parity diverges. A leak is the cleanest explanation for a 30–50% gap between backtest and bot, because live trading is, mechanically, the one place where you cannot peek.

The discipline that catches all of this is the same one the academic literature has been urging for years: treat a backtest as a statistical experiment with a strict information boundary. Bailey, Borwein, López de Prado & Zhu showed how easily overfitting manufactures fake performance (2014); Arnott, Harvey & Markowitz's backtesting protocol (2019) codifies the hygiene. Look-ahead bias is the most basic boundary of all — the boundary in time — and the cheapest to violate by accident.

Takeaways

The one-bar shift diagnostic: nudging every fill one bar later deflates a fake soaring curve back down to the honest truth

Look-ahead bias is quantitatively huge and qualitatively invisible. A single one-bar execution error turned a Sharpe of −0.74 (pure noise, correctly losing) into +14.79. The mistake is one line; the consequence is a fabricated track record.
It is a gradient. Capturing even 25% of the signal bar yields +3.90 from nothing. You do not need a blatant bug — a little too much optimism in your fills is enough.
The measured number cannot tell skill from leak. When a real edge exists, leaks inflate the report far past the tradable truth. The only defense is the process, not the metric.
The one-bar shift test is your fastest diagnostic. Move every fill one bar later. If performance collapses, you were trading in the past.
Leakage magnitude is channel-specific. Execution and indicator peeks are devastating; whole-series normalization of a sign rule is nearly free. Measure the leak through the channel it actually enters — don't assume.

The full controlled study — all three leaks, the dose sweep, the false-deployment analysis, the formal methods, and every number reproducible from a single deterministic script — is in the companion paper at lookahead.marketmaker.cc, with code and data at github.com/suenot/lookahead-inflation.

The strategy in our null experiment had no edge at all. It still showed a Sharpe of 15. If your backtest looks too good, the first thing to suspect is not your genius — it is your clock.
