The Probability of Backtest Overfitting: Did Your Search Beat a Coin Flip?

Part of the "Backtests Without Illusions" series.

📄 This article grew into a research paper. Every number below comes from one deterministic script that builds controlled ground truth — zero-edge searches, planted-edge searches, and a real moving-average parameter grid on a random walk — then runs Combinatorially Symmetric Cross-Validation (CSCV) to estimate the Probability of Backtest Overfitting against it, measuring directly how well the selection procedure generalizes. Read the paper online (interactive version + PDF) at pbo-search.marketmaker.cc, code and data at github.com/suenot/pbo-search.

The Deflated Sharpe Ratio put your winner on trial: given that you searched N configurations, is this one Sharpe beyond what luck buys? This article puts something else on trial — the act of picking. You ran a grid, you kept the best cell, you moved on. But was the selection itself trustworthy? If you re-ran the whole in-sample/out-of-sample split a different way, would the same configuration still come out on top — or did you just crown the luckiest of a hundred coins?

The Probability of Backtest Overfitting (PBO), introduced by Bailey, Borwein, López de Prado & Zhu (2017), answers exactly that question, and it does so with a number most people misread on sight. Here is the single most important sentence in this article, so read it twice:

PBO's null is 0.5, not 1. A search with no out-of-sample skill scores PBO ≈ 0.5. Half is not "half overfit" — half is fully overfit, a coin flip. You want PBO near zero.

That trips everyone. We are trained to read probabilities against a null of "nothing," and for overfitting our gut says the "innocent" reading is 0. It is not. PBO is the probability that the configuration you selected as best in sample lands in the bottom half of the field out of sample. If your search has genuinely learned nothing that generalizes, the in-sample winner is, out of sample, equally likely to be anywhere in the ranking — so it falls into the bottom half about half the time. PBO ≈ 0.5 means your selection procedure is a coin flip. PBO ≈ 0 means the in-sample winner reliably stays a winner out of sample — the selection is trustworthy. Everything below is built to make that one calibration fact concrete, on data where we know the ground truth.

Regime (200 configs, T = 1000, S = 16)	What it is	In-sample Sharpe of the winner	Out-of-sample Sharpe	PBO	Verdict
Zero-edge field (200 iid noise strategies)	pure luck, no edge anywhere	1.98	0.06	0.476	overfit — a coin flip
Planted edge (20 configs carry annualized Sharpe 2.38)	genuine, robust skill	3.73	2.34	0.001	trustworthy
MA-crossover grid on a pure random walk (170 configs)	a tempting mirage	0.97	0.04	0.463	overfit — a coin flip

Sharpe ratios annualized ×√252. All three rows average the selected strategy's Sharpe over 60 Monte-Carlo matrices — apples to apples, so the overfit grid is scored the same way as the null and the planted edge. On this averaged footing the grid's selected in-sample Sharpe (0.97) is actually lower than the null's inflated 1.98, its out-of-sample Sharpe is a slightly positive 0.04, and its PBO (0.463) sits just below ½ — statistically indistinguishable from the null. The dramatic single-matrix numbers (a best-in-grid in-sample Sharpe of 2.33 collapsing to a median out-of-sample −0.22, PBO 0.573) belong to one representative random-walk seed and appear, clearly labeled, in Act 4. Every number traces to the results file.

Three regimes, one lesson. A no-edge search sits at the 0.5 coin-flip line whether the noise is iid (PBO 0.476) or dressed up as a real moving-average grid (PBO 0.463) — the two are statistically indistinguishable, and both are damning. A genuine edge drops PBO to 0.001. Averaged over matrices the grid's selected winner is unremarkable — an in-sample Sharpe of 0.97, below the null's inflated 1.98 — which is itself the honest diagnosis: a no-edge search reads as the null. The drama lives in the tail. On one representative random-walk matrix (Act 4) the grid's best cell posts an in-sample Sharpe of 2.33 — essentially equal to the planted edge's out-of-sample 2.34, a dead heat — yet out of sample it lands in the bottom half about as often as the top. That gap between a gorgeous backtest and a worthless selection is invisible in the winner's own Sharpe and visible only when you score the procedure. That is what PBO does.

Act 1 — The procedure on trial: what CSCV actually does

[Figure: a tall performance matrix of one thousand rows and two hundred strategy columns, sliced horizontally into sixteen equal blocks. Eight blocks are routed to a training panel and the other eight to a testing panel; one in-sample-best column is highlighted, with an arrow tracing where that same column lands in the out-of-sample ranking.]

DSR is parametric: it models the distribution of the maximum Sharpe under a null and deflates the winner's significance analytically. CSCV is the non-parametric answer to the same selection-bias problem — instead of modeling the maximum, it resamples the train/test split every way it can and watches, empirically, whether the in-sample winner keeps winning. No distributional assumption, no counting of "effective trials." Just: does the choice generalize?

Start with the raw material. You backtested N = 200 configurations of a strategy class over T = 1000 synchronous observations. Stack each configuration's return series into a column and you get a T × N performance matrix M — 1,000 rows of time, 200 columns of strategy. This is the only input CSCV needs.

Now the construction, in four moves:

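Split time into S = 16 disjoint blocks of equal length (T/S rows each, rounded down; with T = 1000 that is 62 rows per block, so 992 rows are used and the 8 leftover rows are dropped). Blocks preserve local time structure, a design choice that matters the moment returns have memory.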
Choose every way to use half the blocks to train and half to test. With S = 16, that is all C(16, 8) = 12,870 ways to pick 8 of 16 blocks as the training set; the other 8 are the test set. This is where "combinatorially symmetric" comes from: each split has a mirror (swap train and test), so the scheme uses your data symmetrically instead of the one privileged past→future cut a single walk-forward gives you.
On each split, rank all 200 configurations by in-sample Sharpe and pick the winner n*. Then find where that same configuration n* ranks out of sample, on the held-out 8 blocks.
Record the winner's relative out-of-sample rank and turn it into a logit. PBO is the fraction of the 12,870 splits where that logit is ≤ 0.

The enumeration is tiny to write:

from itertools import combinations

S = 16                                          # number of time blocks
combos = list(combinations(range(S), S // 2))   # C(16, 8) = 12,870 splits

For each split $c$, let $\bar r^{\,c}_{n^*}$ be the out-of-sample rank of the in-sample winner among the $N$ configurations (rank 1 = worst, $N$ = best). Normalize it to a relative rank $\bar\omega_c \in (0,1)$, take its logit $\lambda_c$, and average the indicator $\mathbf{1}\{\lambda_c \le 0\}$ over the set $C_S$ of all splits:

$\bar\omega_c = \frac{\bar r^{\,c}_{n^*}}{N+1}, \qquad \lambda_c = \ln\!\frac{\bar\omega_c}{1 - \bar\omega_c}, \qquad \text{PBO} = \frac{1}{\#C_S}\sum_{c \,\in\, C_S} \mathbf{1}\{\lambda_c \le 0\}$

The logit is just a convenient ruler. $\lambda_c > 0$ means the winner landed in the top half out of sample (relative rank above ½) — in-sample/out-of-sample consistency, good. $\lambda_c \le 0$ means it landed at or below the out-of-sample median — the in-sample choice did not generalize on that split. PBO is the fraction of splits where the in-sample winner failed to beat the median out of sample. The whole matrix determines it: given M and S, PBO is deterministic — no resampling seed, all 12,870 splits are enumerated exhaustively.
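The per-split Sharpe matrices that the estimator consumes can be built without looping over rows. The sketch below is one possible way to do it, not the companion script's implementation: the function name `split_sharpes` and the block-sum shortcut are assumptions introduced here. It keeps per-block sums and sums of squares, then combines them with a split-membership matrix, so each split's mean and sample variance come from two small matrix products. Sharpe is left per-observation; annualizing by √252 rescales every entry equally and does not change any rank.

```python
from itertools import combinations

import numpy as np


def split_sharpes(M, S):
    """Per-split Sharpe of every column on the train half and on the test half.

    M is a T x N return matrix and S an even number of time blocks.
    Returns (R_tr, R_te), each of shape (n_splits, N).
    """
    T, N = M.shape
    rows_per_block = T // S                    # leftover rows beyond S equal blocks are dropped
    blocks = M[: rows_per_block * S].reshape(S, rows_per_block, N)

    # Per-block sufficient statistics: sum and sum of squares of returns.
    block_sum = blocks.sum(axis=1)             # shape (S, N)
    block_sq = (blocks ** 2).sum(axis=1)       # shape (S, N)

    # Membership matrix: row c has a 1 for each block in split c's training set.
    combos = list(combinations(range(S), S // 2))
    in_train = np.zeros((len(combos), S))
    for c, train_blocks in enumerate(combos):
        in_train[c, list(train_blocks)] = 1.0
    in_test = 1.0 - in_train                   # the mirror half

    n_obs = rows_per_block * (S // 2)          # observations on each side of a split

    def sharpe(membership):
        total = membership @ block_sum         # (n_splits, N) sums of returns
        total_sq = membership @ block_sq       # (n_splits, N) sums of squared returns
        mean = total / n_obs
        var = (total_sq - n_obs * mean ** 2) / (n_obs - 1)   # sample variance
        return mean / np.sqrt(var)

    return sharpe(in_train), sharpe(in_test)
```

With the article's dimensions, `R_tr, R_te = split_sharpes(M, 16)` would return two 12,870 × 200 arrays, each row built from 496 observations.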

In code, once you have the in-sample and out-of-sample Sharpe of every configuration on every split (matrices R_tr and R_te, each 12,870 × 200), the heart of the estimator is a handful of lines:

import numpy as np

N       = R_tr.shape[1]                             # number of configurations
rows    = np.arange(R_tr.shape[0])                  # one row index per split
n_star  = R_tr.argmax(axis=1)                       # in-sample winner, per split
oos_sh  = R_te[rows, n_star]                        # that winner's OWN out-of-sample Sharpe
rank    = (R_te <= oos_sh[:, None]).sum(axis=1)     # its OOS rank among N configs, 1..N
omega   = np.clip(rank / (N + 1.0), 1e-6, 1 - 1e-6) # relative OOS rank in (0,1)
lambdas = np.log(omega / (1.0 - omega))             # logit

pbo = float(np.mean(lambdas <= 0.0))                # fraction of splits with lambda <= 0

Notice what is not here: no p-value, no threshold on the winner's Sharpe, no model of the null distribution. PBO never asks whether the winner is good. It asks whether picking the in-sample best is a decision that survives contact with held-out data. That is a property of your search, not of your strategy — which is precisely why it catches things the winner's own statistics cannot.

Act 2 — Calibration is the whole argument: the null is 0.5

[Figure: a horizontal PBO dial running from zero on the left to one on the right, with a bright danger line at exactly one-half labeled "the overfitting line" and a spinning coin balanced on that midpoint. Beside it, a zero-edge strategy's tall in-sample Sharpe bar collapses to a flat near-zero bar out of sample.]

A diagnostic you cannot calibrate is a rumor. So before trusting PBO on anything real, pin down two endpoints on data where the answer is known: a field with no edge anywhere, and a field with a genuine edge. If PBO doesn't land near 0.5 on the first and near 0 on the second, it is worthless.

The null endpoint. Build M from 200 columns of independent, zero-drift, zero-edge Normal noise — true Sharpe exactly 0 for every configuration — and run CSCV. Average over 60 such matrices. The selected (in-sample-best) strategy posts an average in-sample annualized Sharpe of 1.98. That is not a small number; it is the same selection inflation the DSR article measured — the best of 200 noise columns looks like a fundable strategy. Out of sample, that same winner delivers an annualized Sharpe of 0.06. It gave essentially all of it back. And the verdict on the procedure:

$\text{PBO}_{\text{null}} = 0.476 \quad (\pm\, 0.137)$

That is the coin flip, measured. Across the 12,870 splits, the in-sample winner is as likely to land below the out-of-sample median as above it — 0.476, a hair under ½, indistinguishable from 0.5 given the Monte-Carlo spread. The companion diagnostic agrees: the probability that the selected strategy's out-of-sample Sharpe is negative is 0.475 — pick the in-sample best out of pure noise and it loses money out of sample about half the time. There is no skill in the selection because there is no skill to find, and PBO reports exactly that: 0.5 is the overfitting line, and pure noise sits on it.

Why 0.5 and not 1? Because under a true null all 200 columns are exchangeable: statistically interchangeable draws from the same noise process. The in-sample winner is special only in sample; out of sample it is just another column, equally likely to rank anywhere. So its relative out-of-sample rank $\bar\omega_c$ is uniform on $(0,1)$, the logit $\lambda_c$ is symmetric around 0, and the fraction with $\lambda_c \le 0$ converges to ½. A PBO of 1 would be worse than a coin flip: it would mean in-sample success reliably predicts out-of-sample failure, which needs an active anti-persistence mechanism, not mere absence of edge (more on that in the honesty notes).
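The exchangeability argument can be checked by direct arithmetic, with no simulation. If every out-of-sample rank from 1 to N is equally likely for the in-sample winner, the expected fraction of non-positive logits is just the share of ranks in the bottom half. The snippet below is a minimal illustration under that uniform-rank assumption; the variable names are chosen here for the example.

```python
import numpy as np

N = 200
ranks = np.arange(1, N + 1)              # every possible OOS rank, equally likely under the null
omega = ranks / (N + 1.0)                # relative rank, strictly inside (0, 1)
lambdas = np.log(omega / (1.0 - omega))  # logit of the relative rank

# Share of ranks whose logit is <= 0: the PBO a no-skill search converges to.
pbo_under_null = float(np.mean(lambdas <= 0.0))
```

For even N the bottom N/2 ranks all have a relative rank below ½, so `pbo_under_null` is exactly 0.5, and the logits are mirror images of each other around zero because the relative ranks of rank r and rank N + 1 − r sum to 1.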

The edge endpoint. Now build a field where 20 of the 200 configurations carry a real, planted edge, a per-observation Sharpe of 0.15 that annualizes to 2.38 (derived: $0.15 \times \sqrt{252} \approx 2.38$), and leave the other 180 as noise. Run the identical CSCV. The story inverts completely:

Regime	In-sample Sharpe (ann.)	Out-of-sample Sharpe (ann.)	PBO	P(OOS loss)
Null (0 edge)	1.98	0.06	0.476	0.475
Planted edge (Sharpe 2.38)	3.73	2.34	0.001	0.0006

The planted-edge winner posts an in-sample annualized Sharpe of 3.73 — inflated by selection, as always — but this time it keeps an out-of-sample 2.34, and PBO collapses to 0.001. Across all 12,870 splits, the in-sample winner falls into the bottom half out of sample essentially never. The probability of an out-of-sample loss drops to 0.0006. This is what a trustworthy selection procedure looks like: whichever way you cut train against test, the same kind of configuration keeps winning, because there is a real, robust effect there for the search to lock onto. The two endpoints — 0.476 on noise, 0.001 on a genuine edge — are the calibration. PBO works.
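Both endpoints can be reproduced in miniature. The following is a toy sketch, not the companion script: `cscv_pbo` and `make_field` are names invented here, the fields are unit-variance Normal columns, and the sizes (T = 400, N = 50, S = 8, 20 matrices, 5 planted columns) are deliberately small assumptions so it runs in a moment. Expect the same qualitative picture as the table, with noisier values than the article's 0.476 and 0.001.

```python
from itertools import combinations

import numpy as np


def cscv_pbo(M, S):
    """PBO of a T x N return matrix via exhaustive CSCV over S equal blocks."""
    T, N = M.shape
    rows_per_block = T // S
    blocks = M[: rows_per_block * S].reshape(S, rows_per_block, N)
    lambdas = []
    for train in combinations(range(S), S // 2):
        test = [b for b in range(S) if b not in train]
        r_tr = blocks[list(train)].reshape(-1, N)        # training rows
        r_te = blocks[test].reshape(-1, N)               # held-out rows
        sh_tr = r_tr.mean(axis=0) / r_tr.std(axis=0, ddof=1)
        sh_te = r_te.mean(axis=0) / r_te.std(axis=0, ddof=1)
        n_star = int(sh_tr.argmax())                     # in-sample winner
        rank = int((sh_te <= sh_te[n_star]).sum())       # its OOS rank, 1..N
        omega = rank / (N + 1.0)
        lambdas.append(np.log(omega / (1.0 - omega)))
    return float(np.mean(np.array(lambdas) <= 0.0))


def make_field(rng, T, N, n_edge, edge_per_obs):
    """Unit-variance Normal returns; the first n_edge columns get mean edge_per_obs."""
    M = rng.normal(size=(T, N))
    M[:, :n_edge] += edge_per_obs                        # per-observation Sharpe = mean / 1
    return M


rng = np.random.default_rng(0)
T, N, S, n_matrices = 400, 50, 8, 20   # toy sizes, not the article's 1000 / 200 / 16 / 60

pbo_null = np.mean([cscv_pbo(make_field(rng, T, N, 0, 0.0), S) for _ in range(n_matrices)])
pbo_edge = np.mean([cscv_pbo(make_field(rng, T, N, 5, 0.15), S) for _ in range(n_matrices)])
print(f"null PBO ~ {pbo_null:.3f}, planted-edge PBO ~ {pbo_edge:.3f}")
```

The null average should hover around ½ and the planted-edge average should sit clearly below it. With only 200 observations per side, the toy edge is harder to detect than in the article's setup, so the planted-edge value will not be as close to zero.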

Act 3 — A continuous thermometer, not a yes/no test

[Figure: a downward-sloping thermometer curve. The horizontal axis is the strength of a real planted edge and the vertical axis is PBO; the curve falls smoothly from about one-half at zero edge toward zero as the edge grows, with a mirror curve of the selected strategy's out-of-sample Sharpe rising in lockstep.]

Two endpoints prove PBO can tell noise from edge. But the deeper property is that it does so smoothly. Sweep the planted edge from nothing to strong and PBO does not snap from 0.5 to 0 — it slides down a monotone ramp, and the selected strategy's out-of-sample Sharpe rises to meet it, step for step:

Planted true Sharpe (annualized)	PBO	Selected strategy's OOS Sharpe (ann.)
0.00	0.52	−0.05
0.48	0.44	0.19
0.95	0.21	0.81
1.59	0.03	1.65
2.38	0.001	2.48
3.17	0.00	3.29

Read the two data columns together. At zero true edge PBO is 0.52 and the selected strategy earns −0.05 out of sample — the coin flip, again, and a winner that loses money. Add a whisper of edge (annualized 0.48) and PBO ticks down to 0.44. By an annualized true Sharpe of 0.95 — a genuinely modest, believable edge — PBO is already 0.21 and the out-of-sample Sharpe has climbed to 0.81. At 1.59 it is 0.03; at 2.38, 0.001; at 3.17, effectively 0.00, with the selected strategy carrying a 3.29 out of sample. PBO falls monotonically as the real edge grows, and the winner's out-of-sample performance rises in lockstep — the two are the same fact seen from two sides.
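The annualized values in the first column follow from the ×√252 convention stated under the summary table. As a quick check, the sketch below assumes a per-observation grid of 0, 0.03, 0.06, 0.10, 0.15 and 0.20. That grid is inferred from the table, not quoted from the companion script; only the 0.15 → 2.38 pairing is stated in the text.

```python
import numpy as np

ANNUALIZE = np.sqrt(252)   # the article's annualization factor

# Hypothetical per-observation Sharpe grid, inferred from the annualized column of the table.
per_obs = np.array([0.00, 0.03, 0.06, 0.10, 0.15, 0.20])
annualized = np.round(per_obs * ANNUALIZE, 2)
```

Under that assumption the rounded products reproduce the table's first column: 0.00, 0.48, 0.95, 1.59, 2.38, 3.17.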

This is the property that makes PBO usable in practice: it is a continuous overfitting thermometer, not a binary alarm. A PBO of 0.21 doesn't just say "not overfit" — it says your selection has partial out-of-sample skill: the in-sample winner beats the out-of-sample median 79% of the time, but the edge is thin enough that a fifth of the splits still bury it. You can watch the number move as you strengthen your signal, tighten your universe, or prune your grid, and know which direction is honest. The paper's own rule of thumb — reject when PBO exceeds 0.05 — falls naturally out of this ramp: below annualized Sharpe ~1.5 the search hasn't cleared it; above ~1.6 it has. But the ramp itself is more informative than any single cutoff, because it tells you not just whether you overfit but how close to a coin flip you are.

Act 4 — The realistic trap: a beautiful backtest, certified worthless

[Figure: a glowing moving-average crossover parameter grid on a pure random walk, with one tempting bright cell at an in-sample Sharpe of 2.33. Beside it, a scatter of that same winner's out-of-sample results is centered below zero, and a PBO gauge is pinned near one-half under the caption "certified worthless".]

The iid-noise null is honest but easy to dismiss — "my strategies aren't random Normal columns." So here is the trap in the shape practitioners actually walk into. Take a moving-average crossover, the most-backtested rule in the world: go long when a fast MA crosses above a slow MA, flat otherwise. Grid it — 10 fast lengths $\times$ 17 slow lengths, keeping the valid fast-below-slow pairs, for K = 170 configurations. Now run that grid on a series with provably zero edge: a pure random walk. There is nothing to find. A crossover cannot predict a random walk. We know the answer is "no strategy."
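A grid like this can be turned into a performance matrix in a few lines. The sketch below is illustrative only: the article gives the grid's shape (10 fast × 17 slow, 170 valid pairs) but not its exact lengths, so fast lengths 2 to 11 and slow lengths 15 to 95 in steps of 5 are assumptions, as are the helper names, the seed, and the use of simple price differences as returns. The position is decided on the close of one bar and applied to the next bar's move, so the signal never sees the return it trades.

```python
import numpy as np


def moving_average(x, window):
    """Trailing simple moving average; the first window-1 entries are NaN."""
    out = np.full(x.shape, np.nan)
    csum = np.cumsum(np.insert(x, 0, 0.0))
    out[window - 1:] = (csum[window:] - csum[:-window]) / window
    return out


def ma_grid_returns(prices, fast_lengths, slow_lengths):
    """One strategy-return column per valid (fast < slow) crossover pair."""
    rets = np.diff(prices)                               # per-bar price change
    columns, labels = [], []
    for fast in fast_lengths:
        for slow in slow_lengths:
            if fast >= slow:
                continue                                 # keep only fast-below-slow pairs
            long_signal = moving_average(prices, fast) > moving_average(prices, slow)
            position = long_signal[:-1].astype(float)    # set at bar t, held over t -> t+1
            columns.append(position * rets)              # long when fast MA is above slow, else flat
            labels.append((fast, slow))
    return np.column_stack(columns), labels


rng = np.random.default_rng(0)                           # arbitrary seed, not the article's
prices = np.cumsum(rng.normal(size=1101))                # pure random walk, zero drift
fast_lengths = range(2, 12)                              # 10 values (assumed)
slow_lengths = range(15, 100, 5)                         # 17 values (assumed)
M, labels = ma_grid_returns(prices, fast_lengths, slow_lengths)
M = M[-1000:]                                            # drop warm-up bars so every MA is defined
```

The result is a 1000 × 170 matrix, the input CSCV needs. Because neighboring cells share most of their positions, the columns are strongly correlated, which is part of what makes the grid look more convincing than 170 independent noise columns.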

The grid does not know that. It hands you a winner, and the winner is tempting:

Diagnostic (one representative random-walk matrix, seed 3000, K = 170, S = 16)	Value
Best in-sample Sharpe (annualized)	2.33
PBO	0.573
Median out-of-sample Sharpe (annualized)	−0.22
Probability of an out-of-sample loss	0.63
Out-of-sample-vs-in-sample degradation slope	−0.92
Median logit $\lambda$	−0.25

This is one seeded matrix. Averaged over 60 independent random-walk matrices these same diagnostics read PBO 0.463 ± 0.223, a selected in-sample Sharpe of 0.97 decaying to 0.04, and P(OOS loss) 0.47 — statistically indistinguishable from the null. Seed 3000's 0.573 is one draw on the high side of the ~0.5 null band — sampling noise around the coin-flip value, well inside the ±0.223 matrix-to-matrix spread — and the story is identical either way.

An in-sample annualized Sharpe of 2.33 on a moving-average crossover is the kind of result that ends up in a pitch deck. It is essentially equal to the out-of-sample Sharpe of our genuinely planted edge from Act 2 (2.34, a dead heat). If you stopped at the backtest, you would fund it. CSCV refuses. PBO is a coin flip here: 0.463 averaged over the 60 matrices, 0.573 on this particular one, and both say the search has no out-of-sample skill. Do not over-read the 0.573: it sits 0.073 above ½, sampling noise around the 0.5 null and well inside the ±0.223 matrix-to-matrix band. A PBO genuinely above 0.5, where in-sample success would actively predict out-of-sample failure, needs an anti-persistence or trading-cost structure this random walk does not contain (see the honesty notes). On this matrix the median logit of −0.25 corresponds to a relative out-of-sample rank of about 0.44 for the in-sample winner (derived: $1/(1+e^{0.25}) \approx 0.44$), roughly 75th from the bottom out of 170 (derived: $0.44 \times 171 \approx 75$), just below the middle of the field it was supposed to lead. The median out-of-sample Sharpe of that winner is −0.22, and it takes an out-of-sample loss 63% of the time. A backtest Sharpe of 2.33 whose out-of-sample expectation is a loss: the definition of a mirage.

The degradation slope of −0.92 is the second knife. Regress each split's out-of-sample Sharpe of the selected winner on its in-sample Sharpe; the slope is steeply negative — the better a configuration looks in sample, the worse it does out of sample. This is the fingerprint of overfitting on a series with memory: the crossover latches onto transient patterns in the training blocks that, being artifacts of a random walk, reverse out of sample. One subtlety worth stating so you don't over-read the slope: a negative slope is not itself a verdict. Even the genuine-edge regime from Act 2 has a negative degradation slope (−0.52) — regression to the mean always pulls the selected maximum down a little out of sample. What separates the mirage from the real edge is not that the slope is negative but where the winner lands: the genuine edge stays near the top (PBO 0.001) while giving a little back; the mirage sits on the coin-flip line (PBO 0.463 averaged, 0.573 on this seed), its winner no more likely to be above the out-of-sample median than below. Read the slope for how much shrinkage; read PBO for whether it still generalizes. The mirage fails on both.
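The companion diagnostics in the table all come from the same two per-split Sharpe matrices as PBO. The function below is a hedged sketch of how they might be computed. The name `companion_diagnostics` and the dictionary keys are invented here, and the slope is a plain least-squares fit via `np.polyfit`, which may differ in detail from the companion script.

```python
import numpy as np


def companion_diagnostics(R_tr, R_te):
    """Diagnostics of the in-sample winner across splits (rows = splits, columns = configs)."""
    n_splits, N = R_tr.shape
    rows = np.arange(n_splits)
    n_star = R_tr.argmax(axis=1)               # in-sample winner, per split
    is_sh = R_tr[rows, n_star]                 # winner's in-sample Sharpe
    oos_sh = R_te[rows, n_star]                # same config's out-of-sample Sharpe

    rank = (R_te <= oos_sh[:, None]).sum(axis=1)
    omega = rank / (N + 1.0)
    lambdas = np.log(omega / (1.0 - omega))

    # Ordinary least squares of out-of-sample Sharpe on in-sample Sharpe, one point per split.
    slope, _intercept = np.polyfit(is_sh, oos_sh, 1)

    return {
        "pbo": float(np.mean(lambdas <= 0.0)),
        "p_oos_loss": float(np.mean(oos_sh < 0.0)),        # share of splits with negative OOS Sharpe
        "median_oos_sharpe": float(np.median(oos_sh)),
        "degradation_slope": float(slope),
        "median_logit": float(np.median(lambdas)),
    }
```

The slope does not depend on whether both Sharpe matrices are annualized or left per-observation, since the same factor scales both axes. The median out-of-sample Sharpe is in whatever units the inputs use.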

This is why PBO earns its place next to a raw backtest. The in-sample Sharpe of 2.33 is not a lie — the strategy really did earn it, in sample, on that random walk. It is selection, dressed in a familiar rule on a realistic-looking grid, and no amount of staring at the equity curve reveals it. Only scoring the procedure does.

Act 5 — PBO and DSR: two honest questions, one plateau

[Figure: two complementary measuring instruments aimed at the same parameter search from different angles. One, labeled PBO, asks "did the selection procedure overfit?" and reads the whole train-test resampling; the other, labeled DSR, asks "is this one Sharpe beyond luck?" and deflates a single winning bar.]

PBO and the Deflated Sharpe Ratio are the two halves of the same honesty check, and they are not redundant — they interrogate different objects:

	Deflated Sharpe Ratio (DSR)	Probability of Backtest Overfitting (PBO)
Object on trial	the winner	the selection procedure
Question	is this Sharpe beyond what luck buys across N trials?	does picking the in-sample best generalize out of sample?
Method	parametric — deflate the significance threshold	non-parametric — resample all C(S, S/2) train/test splits
Null value	DSR ≈ 0.5 (winner just matches the noise ceiling)	PBO ≈ 0.5 (winner is a coin flip out of sample)
You want	DSR near 1	PBO near 0
Needs the trial count N?	yes — and correlated grids make N ambiguous	no — the split resampling handles dependence natively

They can even disagree, and the disagreement is diagnostic. DSR can be fooled by a correlated grid into over-deflating (the trap the DSR article's final act is entirely about — 640 correlated cells are not 640 independent trials, and feeding the raw count over-inflates the noise ceiling). PBO never counts trials; it resamples the actual return matrix, so grid correlation is baked into the splits for free. Conversely, PBO tells you the procedure generalizes but not whether the winner clears a hurdle rate — a search can have low PBO and still select something whose out-of-sample Sharpe, while reliably above the field median, is too small to trade. DSR prices the winner; PBO prices the procedure. Run both.

[Figure: two three-dimensional parameter-grid surfaces side by side. On the left, a broad smooth plateau of good neighboring configurations that every train-test split agrees on, giving low PBO; on the right, a single lonely spike surrounded by flat noise that different splits disagree about, giving PBO near one-half.]

There is a geometric intuition underneath all of this, and it is the most useful thing to carry away. A genuine edge is a plateau; overfitting is a spike. When a real effect drives your grid, the good configurations cluster — fast=3/slow=55 works, and so do its neighbors, because they are all sampling the same underlying signal. That plateau is robust to resampling: whichever 8 of 16 blocks you train on, the in-sample winner is drawn from the same broad region, and that region is still on top out of sample. Many splits agree → low PBO. When overfitting drives your grid, the "winner" is a lonely spike — one cell that happened to fit the training blocks' noise, surrounded by mediocre neighbors. That spike is fragile: a different train/test split crowns a different lonely spike, and none of them survives to the test set. Splits disagree → PBO ≈ 0.5. This is the same lesson our plateau-analysis study reaches from the parameter-map side; PBO is, in effect, the plateau-vs-spike distinction measured across every symmetric resampling of your data at once.

It also explains why CSCV beats the practitioner-default walk-forward split. Walk-forward gives you one past→future cut and one verdict; CSCV gives you 12,870 symmetric cuts and asks whether the winner survives all of them. A spike can survive one arbitrary cut by luck; it cannot survive 12,870. (López de Prado's Combinatorial Purged Cross-Validation, CPCV, extends exactly this idea with purging and embargoing to kill the label-leakage that plain resampling can suffer under serial dependence — the natural next rung once your labels overlap.) The same structural warning threads through the whole series: the metric you optimize secretly picks your strategy (objective-function design), a one-bar leak manufactures a Sharpe of 15 from noise (look-ahead bias), a multiple-testing search manufactures a Sharpe of 1.63 from noise (DSR) — and here, a resampled selection procedure manufactures a worthless winner that only PBO can expose.

Honesty notes

Four caveats, stated plainly, because a controlled study earns its conclusions only by naming its limits.

The data-generating processes are synthetic — on purpose. iid Normal noise for the null, a planted-Sharpe field for the edge sweep, and a moving-average grid on a pure random walk for the trap. None is a claim about market realism; each is chosen for controlled ground truth. We can only prove PBO reads 0.5 on "no skill" and 0 on "real skill" by generating data where we know which is which. Real returns are fat-tailed, autocorrelated, and non-stationary; the deliverable here is the calibrated diagnostic, not a strategy.
PBO's null is 0.5, and that is a feature, not a quirk. State it every time you report a PBO, because half your readers will otherwise treat 0.5 as "half-safe." A no-out-of-sample-skill search sits at 0.5; a genuine edge drives it to 0. There is no "innocent" reading of PBO ≈ 0.5 — it is the fully-overfit verdict.
PBO > 0.5 is a "perverse" region we do not force. A PBO systematically above 0.5 means in-sample success actively predicts out-of-sample failure — the IS-worst configurations become the OOS-best. That requires an anti-persistence or trading-cost structure, not mere absence of edge. Our overfit searches sit at ≈ 0.5 (0.476 for iid noise; 0.463 averaged for the MA grid; 0.573 on one high-side seed, within the ±0.14–0.22 Monte-Carlo band across 60 matrices), which already means "no out-of-sample skill." We do not manufacture the perverse region; we only show that overfitting lands you on the coin-flip line, which is damning enough.
PBO is deterministic given the matrix; only the matrix is random. For a fixed M and S = 16, all C(16, 8) = 12,870 splits are enumerated exhaustively — there is no bootstrap seed and no sampling variance in PBO itself. The spread we report (±0.137 on the null, ±0.223 on the MA grid) is variance across the 60 Monte-Carlo matrices, not within the estimator. The Sharpe on each side is estimated on about 500 observations — 496 after CSCV block truncation, since T = 1000 divided into 16 equal blocks leaves 992 usable rows, split into two halves of 496; because Sharpe is order-invariant, the row order within a train or test set does not matter (it would, for path-dependent metrics like a return/drawdown ratio).

Takeaways

PBO scores the selection procedure, not the winner — and its null is 0.5. It is the probability that the configuration you picked as best in sample lands in the bottom half out of sample. PBO ≈ 0.5 is a coin flip (fully overfit); PBO ≈ 0 is a trustworthy selection. You want it near zero, and you must say so out loud, because 0.5 reads as "safe" to an untrained eye and means the exact opposite.
Calibration proves it works. On 200 iid zero-edge strategies the best in-sample annualized Sharpe of 1.98 collapses to 0.06 out of sample and PBO = 0.476 — noise sits on the coin-flip line, losing money out of sample 47.5% of the time. Plant a genuine edge (annualized Sharpe 2.38) and the in-sample 3.73 survives to an out-of-sample 2.34 while PBO drops to 0.001. Two endpoints, one calibrated diagnostic.
PBO is a continuous thermometer. Sweep the planted edge and PBO falls monotonically — 0.52 → 0.44 → 0.21 → 0.03 → 0.001 → 0.00 at annualized true Sharpes of 0.00 / 0.48 / 0.95 / 1.59 / 2.38 / 3.17 — with the selected strategy's out-of-sample Sharpe rising in lockstep (−0.05 up to 3.29). It measures how close to a coin flip you are, not just yes/no.
The realistic trap is the whole point. A 170-config moving-average grid on a pure random walk averages a selected in-sample Sharpe of just 0.97 decaying to 0.04, with PBO 0.463 — statistically indistinguishable from the null, a no-edge search reading as the null. On one representative matrix the mirage is vivid: a best in-sample Sharpe of 2.33 (a pitch-deck number), a median out-of-sample Sharpe of −0.22, a 63% chance of an out-of-sample loss, PBO 0.573, and a steep degradation slope of −0.92. A beautiful backtest with a negative out-of-sample expectation, invisible to every statistic printed next to the winner and visible only when you score the procedure.
Pair PBO with the Deflated Sharpe Ratio. DSR prices the winner (is this Sharpe beyond luck, given N trials?); PBO prices the procedure (does the selection generalize?). DSR needs a trial count and can be fooled by correlated grids; PBO resamples the matrix and never counts trials. A genuine edge is a broad plateau many splits agree on (low PBO); a lonely in-sample spike is overfit (splits disagree, PBO ≈ 0.5). Run both, and read the plateau.

The winner of a search is guilty until proven innocent — and PBO cross-examines the search itself, not the alibi it hands you. It ignores how good the winner looks in sample and asks only whether picking it was a decision that survives being re-cut 12,870 ways. When it doesn't — when your gorgeous 2.33 Sharpe turns out to land in the bottom half out of sample as often as not — you have not found a strategy. You have found the luckiest coin, and PBO is the number that catches it flipping.

The full experiment — the null-calibration harness, the planted-edge thermometer sweep, the random-walk grid trap, and every number in this article regenerable from one deterministic script — is in the companion paper at pbo-search.marketmaker.cc, with code and data at github.com/suenot/pbo-search.
