Objective-Function Design: The Metric You Optimize Secretly Picks Your Strategy

Part of the "Backtests Without Illusions" series.

📄 This article grew into a research paper. Every number below comes from one deterministic script that builds controlled ground truth — a synthetic market with a known edge in a moderate signal band and fat-tailed noise everywhere — then runs one threshold search under six different objective functions and measures, out of sample, which strategy each objective actually selects. Read the paper online (interactive version + PDF) at objective-design.marketmaker.cc, code and data at github.com/suenot/objective-design-degeneracy.

You want the best strategy. So you run a search — sweep a threshold, a lookback, a stop distance, and keep whichever setting scores highest. The search finishes and hands you a winner. Reasonable. Standard. It is what every optimizer, grid search, and hyperparameter tuner on earth does.

But look at the verb: scores highest. Highest on what? Before the search can crown anything, you had to hand it a single number to maximize — an objective function. PnL. Sharpe. Sharpe-on-the-bars-you-traded. Return-over-max-drawdown. You typed one of these, probably without much thought, and then the search spent a million evaluations doing exactly what you asked.

That one choice is not a formality. It is the entire decision. The search does not find "a good strategy" — there is no such thing in the abstract. It finds the strategy that maximizes the scalar you picked, and different scalars point at wildly different strategies on the same data. The objective is the secret hand on the wheel, and most of the time nobody is looking at it.

Here is the whole article in one table. One threshold search, one synthetic market with a real, known edge, six objectives — and the six strategies they select, measured on held-out data:

Objective (what the search maximizes)	Mean market exposure	In-sample Sharpe	Out-of-sample Sharpe	Degenerate winners
Raw PnL	0.859	1.76	1.61	0%
Full-timeline Sharpe	0.740	1.82	1.71	0%
Per-trade ("active") Sharpe	0.286	1.00	0.70	57%
Exposure floor ( $e_{\min}=0.20$ )	0.740	1.82	1.71	0%
Trade-count shrinkage (conf_k $=40$ )	0.523	1.54	1.31	20.7%
Robust (floor + conf_k)	0.675	1.78	1.70	0.2%

600 independent seeds, $T = 2000$ bars each, 80 candidate thresholds per search, in-sample and out-of-sample drawn independently. Annualized Sharpe (252 periods/year). "Degenerate" = the selected winner is in the market less than 5% of the time, or posts a non-positive out-of-sample Sharpe. The true optimum of this market is an out-of-sample annualized Sharpe of 1.77.

Read the third row until it stings. The per-trade Sharpe stands in for the whole family of activity-conditional metrics (per-trade Sharpe, expectancy, Van Tharp's SQN, win rate), all computed on only the bars you traded. It selects a strategy whose out-of-sample Sharpe (0.70) is the lowest of the six and less than half of full-timeline Sharpe's 1.71, and its pick is degenerate 57% of the time. It is not a subtly worse objective. On this data it is a trap, and the search walks into it more than half the time. Now read the row just above it: plain full-timeline Sharpe never degenerates and scores 1.71 out of sample. That is the punchline of the entire repair, spoiled early: the honest fix is simply to measure on the full timeline. The fancier retrofits in the bottom rows, at their best, only ever match that number and never beat it. This article is the anatomy of that trap and its fix, with the ground truth known throughout, so "did the objective pick the right strategy?" is a fact, not an opinion.

Act 1 — The secret decision: Goodhart's law is the search

[Figure: a parameter search drawn as a funnel of many candidate strategy curves. A single lens labeled OBJECTIVE stamps one curve as the winner while identical-looking curves are discarded: the choice of metric silently selects the strategy.]

In 1975 the economist Charles Goodhart wrote a sentence that has outlived everything else he did:

"Any observed statistical regularity will tend to collapse once pressure is placed upon it for control purposes."

The popular paraphrase, usually credited to Marilyn Strathern, is tighter: when a measure becomes a target, it ceases to be a good measure.

A parameter search is the purest possible instance of Goodhart's law. The objective function is the measure. The search is the pressure — thousands, millions of attempts to push that measure as high as it will go. And the search does not care in the slightest what you meant by the measure. It cares only about the number. If there is any way to make the number large that has nothing to do with a real, tradeable edge — trade rarely, sit flat most of the time, catch a couple of lucky outliers — the search will find that way, because finding the maximum is the one thing it is built to do.

This is the same failure the AI-safety literature calls reward hacking: an agent optimizing a proxy for what you want will exploit every gap between the proxy and the goal. Your search is that agent. "Sharpe ratio" is that proxy. "A strategy I can trust to trade real money next quarter" is the goal. The gap between them is where the whole discipline lives.

To watch the gap open, we need a world where we know the truth. So we build one.

The market. Each period, a predictor $s_t$ (a standard-normal signal) arrives, followed by the return $r_t$ it partially predicts. The edge is real but bounded — it exists only inside a moderate signal band $|s_t| \le 1$ , and vanishes outside it:

$r_t = \beta\, s_t \cdot \mathbf{1}\big[\,|s_t| \le 1\,\big] + \varepsilon_t, \qquad \beta = 0.3, \qquad \varepsilon_t \sim t_{4}\ (\text{unit variance}).$

Two design choices matter. First, the edge lives in the moderate band: extreme signals carry no predictive information, so a strategy should trade the middle and skip the tails. Second, the noise $\varepsilon_t$ is fat-tailed (Student- $t$ with 4 degrees of freedom, the kind of heavy tail real returns actually have) — but this ingredient is here for realism, not mechanism. It is tempting to say the fat tails are what make the trap possible, and we assumed exactly that until we ran the control: a Gaussian-noise version of this market (gaussian_control in the results, 300 seeds) reproduces the trap essentially unchanged — the per-trade objective still degenerates 55.7% of the time under Gaussian noise versus 57.0% with fat tails, and its out-of-sample Sharpe is 0.71 versus 0.70. So the trap is not about fat tails. It is a pure small-sample-plus-selection effect: take the maximum, over ~20 low-exposure thresholds, of a Sharpe computed on a handful of observations, and some lucky corner will always look spectacular. Any noise distribution does it; a strategy in the market for only a few bars can, by luck alone, sit through a couple of favorable moves and post an in-sample number that means nothing. We keep the fat tails because real returns have them, not because the trap needs them.
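The small-sample-plus-selection effect can be seen without any market at all. The sketch below is an illustration of ours, not the article's script: it draws candidates that are pure zero-mean Gaussian noise, computes an annualized Sharpe on eight observations each, and compares one fixed candidate with the best of twenty. The candidates here are independent, which is an assumption; in the real threshold search the low-exposure candidates overlap, so the effect there is related but not identical.

```python
import numpy as np

def max_noise_sharpe(n_obs=8, n_candidates=20, n_trials=2000, seed=0):
    """Compare one noise candidate with the best of many (all have zero true edge)."""
    rng = np.random.default_rng(seed)
    # shape: (trials, candidates, observations); every entry is zero-mean noise
    x = rng.standard_normal((n_trials, n_candidates, n_obs))
    sharpe = x.mean(axis=2) / x.std(axis=2, ddof=1) * np.sqrt(252)
    # column 0 is a single pre-chosen candidate; max(axis=1) is the "selected" one
    return sharpe[:, 0], sharpe.max(axis=1)

single, best = max_noise_sharpe()
print(f"one candidate, mean Sharpe: {single.mean():+.2f}")
print(f"best of 20, mean Sharpe   : {best.mean():+.2f}")
```

The single candidate averages out near zero, as it should for noise. The selected maximum does not: picking the best of a handful of eight-observation ratios manufactures a large positive number on its own.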

The strategy. A one-parameter family. Trade in the direction of the signal whenever its magnitude is below a threshold $\theta$ , otherwise stay flat:

$\text{pos}_t = \operatorname{sign}(s_t)\cdot \mathbf{1}\big[\,|s_t| \le \theta\,\big].$

A tiny $\theta$ trades only on the smallest, most unremarkable signals — a rare-trade lottery, in the market almost never. A $\theta$ near the band edge captures the whole real edge and is well-exposed. A huge $\theta$ trades everything, including the out-of-band bars that carry no edge and only add noise.

import numpy as np

def gen_dataset(T, rng, beta=0.30, band=1.0, tail_df=4):
    s    = rng.standard_normal(T)                              # predictor s_t ~ N(0, 1)
    edge = beta * np.where(np.abs(s) <= band, s, 0.0)          # edge ONLY inside |s| <= 1
    t    = rng.standard_t(tail_df, T) / np.sqrt(tail_df / (tail_df - 2.0))  # fat tails, unit var
    return s, edge + t                                         # (signal, return)

def _sharpe(x):
    # per-bar (not annualized) Sharpe; NaN when it cannot be estimated
    if x.size < 2:
        return np.nan
    sd = x.std(ddof=1)
    return x.mean() / sd if sd > 0 else np.nan

def simulate(s, r, theta):
    pos      = np.where(np.abs(s) <= theta, np.sign(s), 0.0)   # trade the band, skip outliers
    strat    = pos * r
    active   = pos != 0.0
    exposure = active.mean()                                   # fraction of bars in a position
    sharpe_full   = _sharpe(strat)                             # on the WHOLE timeline
    sharpe_active = _sharpe(strat[active])                     # on ONLY the active bars
    return dict(exposure=exposure, n_trades=int(active.sum()), # n_trades here = active bars
                sharpe_full=sharpe_full, sharpe_active=sharpe_active, pnl=strat.sum())

Because we built the market, we can compute the truth directly — average performance at every threshold over all 600 seeds, out of sample. The true optimum sits at $\theta \approx 1.04$ : right at the edge of the signal band, in the market about 70% of the time (derived: a standard-normal signal falls inside $|s|\le 1.04$ with probability $P(|s|\le 1.04)=0.70$ ), posting an out-of-sample annualized Sharpe of 1.77. That is the number every objective is trying to find. Keep it in view: θ≈1.04, OOS Sharpe 1.77, about 70% of the timeline in the market. Anything an objective selects that is far from this is the objective failing, not the market being hard.
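Because the market is specified in closed form, the headline numbers can also be cross-checked analytically. The sketch below is our own population-moment calculation, not the article's script (which averages simulated out-of-sample Sharpes over seeds). It assumes only the stated model: a standard-normal signal, $\beta = 0.3$, an edge inside $|s|\le 1$, unit-variance noise independent of the signal, and 252 periods per year.

```python
import math

BETA, BAND, PERIODS = 0.30, 1.0, 252   # values stated in the article

def exposure(theta):
    """P(|s| <= theta) for a standard-normal signal."""
    return math.erf(theta / math.sqrt(2.0))

def population_sharpe(theta):
    """Annualized population Sharpe of pos_t * r_t under the stated model."""
    a = min(theta, BAND)                                   # the edge stops at the band
    phi_a = math.exp(-0.5 * a * a) / math.sqrt(2.0 * math.pi)
    # E[pos * r] = beta * E[|s| ; |s| <= a]
    mean = BETA * math.sqrt(2.0 / math.pi) * (1.0 - math.exp(-0.5 * a * a))
    # E[(pos * r)^2] = beta^2 * E[s^2 ; |s| <= a] + P(|s| <= theta) * Var(noise)
    second = BETA**2 * (math.erf(a / math.sqrt(2.0)) - 2.0 * a * phi_a) + exposure(theta)
    return math.sqrt(PERIODS) * mean / math.sqrt(second - mean**2)

for theta in (0.005, 1.04, 1.84):
    print(f"theta={theta:<5}  exposure={exposure(theta):.3f}  "
          f"Sharpe={population_sharpe(theta):.2f}")
```

At $\theta = 1.04$ this gives an exposure of about 0.70 and an annualized Sharpe of about 1.77, in line with the empirical ground truth quoted above. The agreement is a sanity check on the setup, not a replacement for the simulated curve.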

Act 2 — The trap: eight lucky trades, a Sharpe of 21, a mirage

[Figure: a slot machine paying out on a single lucky spin, with eight trade tickets and a towering in-sample per-trade Sharpe of 21 that collapses to 0.13 out of sample: a rare-trade lottery winner selected by a naive objective.]

Now let a naive objective loose on one concrete draw of this market: seed 6. Full disclosure on the seed: it was neither the first draw nor a random one. We scanned seeds for a starkly degenerate per-trade winner and picked this one, precisely so the mechanism is impossible to miss. The kind of outcome it shows is thoroughly typical (as the 600-seed tally at the end of this act shows, the per-trade objective picks a sub-5%-exposure lottery in 56% of all seeds), but seed 6's magnitude sits at the extreme end of that distribution. Read it as an especially stark instance of a common failure, not a median one. We optimize the per-trade Sharpe: the Sharpe ratio computed on only the bars where the strategy is actually in a position. This is an extremely natural thing to report. "When it trades, how good are its trades?" It feels like it isolates skill from idleness. It does the opposite.

Here is the strategy the per-trade Sharpe crowns on seed 6:

Threshold $\theta = 0.005$ — it trades only on the very tiniest signals.
Market exposure 0.4% — it is flat 99.6% of the time.
Eight trades. Eight. Over 2000 bars.
In-sample per-trade annualized Sharpe: 21.09.
In-sample full-timeline Sharpe: 0.82.
Out-of-sample full-timeline Sharpe: 0.13.

The per-trade metric reads 21.09 — a number no real strategy has ever posted, the kind of figure that would get a fund launched. And it is a complete mirage. Those eight trades happened to catch a few favorable moves; measured over just those eight bars, the ratio of mean to standard deviation is astronomical. But on the full timeline — where the strategy is flat 99.6% of the time — that "edge" contributes essentially nothing: a full-timeline Sharpe of 0.82 in sample, collapsing to 0.13 on fresh data. The winner the objective selected is, for all trading purposes, flat.

And it is not even a real edge at that threshold. Recall the market: the edge lives in the band $|s|\le 1$ , and $\theta = 0.005$ sits at the dead center where the signal is weakest. The true out-of-sample curve at $\theta = 0.005$ is −0.01 — indistinguishable from zero (derived from the ground-truth curve). The search did not find a small real edge. It found eight lucky draws of noise and reported them as a Sharpe of 21.

This is the trap in miniature: the per-trade Sharpe rewards the strategy for trading as rarely as possible, because the fewer bars you stand on, the easier it is for a couple of them to be lucky, and the metric never once asks "but were you actually in the market?" Seed 6's magnitude is cherry-picked — we went looking for a stark one — but its kind is not. Across all 600 seeds the per-trade Sharpe selects a degenerate winner (one that barely trades, or loses out of sample) in 57% of them, and a sub-5%-exposure lottery specifically in 56%. The typical degenerate pick is far tamer than seed 6's Sharpe of 21: averaged over all 600 seeds, the per-trade objective's winner has an in-sample per-trade Sharpe of 4.58 and mean exposure 0.286 — still flat most of the time, just not 99.6% flat. Seed 6 dramatizes the mechanism; the 56% is the part that should worry you. More than half the time, this everyday metric hands you a lottery ticket and calls it a strategy.
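The gap between seed 6's per-trade Sharpe of 21.09 and its in-sample full-timeline Sharpe of 0.82 is pure arithmetic: the same eight returns, diluted by 1992 flat bars. The sketch below reconstructs it. The eight trade returns are hypothetical (we do not have the seed-6 values); they are rescaled so that their mean-to-standard-deviation ratio matches the article's 21.09 annualized, and the full-timeline result depends only on that ratio, the trade count, and $T$.

```python
import numpy as np

T, N_ACTIVE, PERIODS = 2000, 8, 252
target = 21.09 / np.sqrt(PERIODS)          # per-trade Sharpe per bar, from the article

# Hypothetical trade returns: any eight non-constant numbers work here.
raw = np.array([0.3, -0.4, 1.1, 0.2, -0.9, 0.7, 0.0, -1.0])
z = (raw - raw.mean()) / raw.std(ddof=1)   # mean 0, sample std 1
trades = z + target                        # mean = target, sample std = 1

strat = np.zeros(T)
strat[:N_ACTIVE] = trades                  # zero return on the other 1992 bars

ann = np.sqrt(PERIODS)
sharpe_active = ann * trades.mean() / trades.std(ddof=1)
sharpe_full = ann * strat.mean() / strat.std(ddof=1)
print(f"per-trade Sharpe    : {sharpe_active:.2f}")
print(f"full-timeline Sharpe: {sharpe_full:.2f}")
```

The full-timeline figure comes out at about 0.82, matching the in-sample number reported for seed 6. The per-trade metric and the full-timeline metric are looking at the same eight bars; only one of them counts the rest of the timeline.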

Act 3 — The statistical truth: six objectives, 600 seeds

[Figure: six objective functions as contestants on a leaderboard. The per-trade Sharpe glows brightest in sample but is stamped degenerate 57 percent of the time, while the exposure-aware objectives hold steady out of sample: a great in-sample number does not mean a good selection.]

One seed proves nothing; it only illustrates. To measure an objective we have to ask what it selects on average, across many independent markets, and score that selection on data the search never saw. So: 600 seeds, each an independent draw of the market; on each, run the 80-threshold search under each objective; record the exposure, the in-sample and out-of-sample Sharpe of whatever it picked, and whether that pick was degenerate.

Objective	Mean exposure	In-sample Sharpe	Out-of-sample Sharpe	IS→OOS drop (abs.)	Degenerate
Raw PnL	0.859	1.76	1.61	0.15	0.0%
Full-timeline Sharpe	0.740	1.82	1.71	0.11	0.0%
Per-trade Sharpe	0.286	1.00	0.70	0.30	57%
Exposure floor ( $e_{\min}=0.20$ )	0.740	1.82	1.71	0.11	0.0%
conf_k shrinkage ( $k=40$ )	0.523	1.54	1.31	0.23	20.7%
Robust (floor + conf_k)	0.675	1.78	1.70	0.08	0.2%

The "IS→OOS drop" column is the absolute fall in annualized Sharpe from in-sample to out-of-sample (e.g. $1.00\to0.70$ is a drop of 0.30), not a percentage. And notice the "Exposure floor" row is byte-for-byte identical to "Full-timeline Sharpe": that is not a coincidence, and Act 5 explains why.
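For readers who want to reproduce the shape of this table, here is a minimal sketch of the experiment loop for the three naive objectives. It is not the article's script. The threshold grid (`THETAS`), the helper names, the 20-seed default, and the choice to measure the 5% exposure test on out-of-sample data are all assumptions of ours, so the printed numbers will differ from the table.

```python
import numpy as np

ANN = np.sqrt(252)
THETAS = np.geomspace(0.005, 3.0, 80)      # ASSUMED grid; the article only says 80 thresholds

def gen_dataset(T, rng, beta=0.30, band=1.0, tail_df=4):
    s = rng.standard_normal(T)
    eps = rng.standard_t(tail_df, T) / np.sqrt(tail_df / (tail_df - 2.0))
    return s, beta * np.where(np.abs(s) <= band, s, 0.0) + eps

def sharpe(x):
    """Annualized Sharpe; NaN if it cannot be estimated."""
    if x.size < 2 or x.std(ddof=1) == 0:
        return np.nan
    return ANN * x.mean() / x.std(ddof=1)

def metrics(s, r, theta):
    pos = np.where(np.abs(s) <= theta, np.sign(s), 0.0)
    strat, active = pos * r, pos != 0.0
    return dict(exposure=active.mean(), pnl=strat.sum(),
                sharpe_full=sharpe(strat), sharpe_active=sharpe(strat[active]))

def select(s, r, key):
    """Threshold with the highest in-sample score under objective `key`."""
    scores = np.array([metrics(s, r, th)[key] for th in THETAS])
    scores = np.where(np.isnan(scores), -np.inf, scores)   # unestimable never wins
    return THETAS[int(np.argmax(scores))]

def run(n_seeds=20, T=2000):
    out = {key: [] for key in ("pnl", "sharpe_full", "sharpe_active")}
    for seed in range(n_seeds):
        rng = np.random.default_rng(seed)
        s_is, r_is = gen_dataset(T, rng)       # in-sample draw
        s_oos, r_oos = gen_dataset(T, rng)     # independent out-of-sample draw
        for key in out:
            theta = select(s_is, r_is, key)
            m = metrics(s_oos, r_oos, theta)
            degenerate = m["exposure"] < 0.05 or not (m["sharpe_full"] > 0)
            out[key].append((theta, m["exposure"], m["sharpe_full"], degenerate))
    return out

results = run()
for key, rows in results.items():
    a = np.array(rows, dtype=float)
    print(f"{key:14s} exposure={a[:, 1].mean():.3f}  "
          f"OOS Sharpe={np.nanmean(a[:, 2]):.2f}  degenerate={a[:, 3].mean():.0%}")
```

The only thing that changes between the three runs is the `key` passed to `select`. Everything else, including the data and the candidate grid, is shared.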

Three facts jump out, and each is a lesson.

The per-trade Sharpe is the only naive objective that degenerates. Its mean exposure is 0.286, so it selects strategies that are flat most of the time, and its in-sample Sharpe of 1.00 falls by 0.30 to an out-of-sample 0.70, the worst of the field. Note the tell: its averaged in-sample number (1.00) is not even impressive, yet on an individual seed it can report a per-trade figure as extreme as seed 6's 21. The mean washes out because the lucky windows point in random directions; what survives to out-of-sample is only 0.70, and 57% of the individual selections are degenerate by the definition above (under 5% exposure, or a non-positive out-of-sample Sharpe).

Exposure-aware objectives are naturally safe. Raw PnL and full-timeline Sharpe never degenerate (0.0%). The reason is structural: both are measured over the entire timeline, so a strategy that is flat 99.6% of the time earns almost nothing under them. You cannot game a full-timeline metric by trading rarely — sitting flat is directly and automatically penalized, because flat bars are in the denominator. This is the single most important idea in the article, and we return to it in Act 6.

Raw PnL is safe but not optimal — it over-exposes. Look closely: raw PnL's mean exposure is 0.859, the highest of all, and its out-of-sample Sharpe (1.61) is a notch below full-timeline Sharpe (1.71) and the true optimum (1.77). PnL rewards being in the market, so the search pushes $\theta$ too high (on seed 6, raw PnL picks $\theta=1.84$ versus the optimal 1.04), dragging in out-of-band bars that carry no edge and only add noise. It does not blow up — but it drifts past the real optimum in the opposite direction from the per-trade trap. Different objective, different bias, same lesson: the metric chose the strategy.
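The over-exposure follows directly from the market's definition. A short derivation of ours, using only the formulas for $r_t$ and $\text{pos}_t$ above and the fact that $\varepsilon_t$ is zero-mean and independent of $s_t$, gives the expected per-bar strategy return:

$\mathbb{E}\big[\text{pos}_t\, r_t\big] = \beta\,\mathbb{E}\Big[\,|s_t|\cdot \mathbf{1}\big[\,|s_t| \le \min(\theta, 1)\,\big]\Big] = \beta\sqrt{2/\pi}\,\Big(1 - e^{-\min(\theta,1)^2/2}\Big).$

For $\theta \ge 1$ this is constant at $0.3 \times 0.798 \times 0.393 \approx 0.094$ per bar. Widening the threshold past the band adds no expected PnL, only extra bars of zero-mean, unit-variance noise. In any one sample that extra noise sums to a positive number for some thresholds above 1, and a PnL maximizer picks whichever of them collected the most. Full-timeline Sharpe pays for the added variance in its denominator, so it has a reason to stop near the band edge; raw PnL has none.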

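The three rows we have not discussed yet (the exposure floor, conf_k, and the robust combination of the two) are the repair. They are the subject of Act 5, after Act 4 explains why the trap's numbers cannot be trusted in the first place.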

Act 4 — Why eight trades can never be trusted

[Figure: a single Sharpe point estimate with an enormous confidence-interval error bar, because it rests on only eight observations, beside a ruler labeled minimum track record length: a ratio measured on a handful of trades is statistically meaningless.]

Before repairing the trap, it is worth being precise about why eight trades produce a Sharpe of 21 that means nothing — because the fix follows directly from the reason.

A Sharpe ratio is an estimate, and estimates have error bars. Andrew Lo's 2002 result gives the standard error of a Sharpe ratio estimated from $T$ observations, under the most generous possible assumption (IID Gaussian returns):

$\operatorname{SE}\big(\widehat{SR}\big) \approx \sqrt{\frac{1 + \widehat{SR}^{\,2}/2}{T}}.$

The error shrinks only as $1/\sqrt{T}$ . Feed it the trap. The per-trade Sharpe on seed 6 is $21.09$ annualized, which is $1.33$ per observation, computed on $T = 8$ bars. The standard error is

$\operatorname{SE} \approx \sqrt{\frac{1 + 1.33^2/2}{8}} \approx 0.49 \ \text{per observation} \ = \ \textbf{7.7 annualized}$

(derived from Lo's formula). The point estimate is $21.09$ ; its one-sigma error bar is of order $\pm 7.7$ — read this as an illustrative order-of-magnitude, not a calibrated confidence interval, since the formula assumes IID Gaussian returns that our fat-tailed $t_4$ noise violates. Even so, the message is unmistakable: the "Sharpe of 21" is a number drawn from a distribution so wide it carries essentially no information — and that is the charitable calculation, because Mertens' extension shows that fat tails and skew only inflate the standard error further. A rare-trade backtest's Sharpe is less trustworthy than its point value in every direction at once: too few observations, and the wrong distribution.
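The same arithmetic as a few lines of code, so the calculation can be rerun for other trade counts. This is a direct transcription of the IID formula above; the function name is ours, and the 2000-observation comparison is an added illustration.

```python
import math

def sharpe_standard_error(sr_per_obs, n_obs):
    """Lo (2002) IID standard error of a Sharpe ratio, per observation."""
    return math.sqrt((1.0 + 0.5 * sr_per_obs**2) / n_obs)

ANN = math.sqrt(252)
sr = 21.09 / ANN                              # seed-6 per-trade Sharpe, per observation
se_8 = sharpe_standard_error(sr, 8) * ANN
se_2000 = sharpe_standard_error(sr, 2000) * ANN
print(f"SE with 8 observations   : {se_8:.1f} annualized")
print(f"SE with 2000 observations: {se_2000:.2f} annualized")
```

Going from 8 to 2000 observations shrinks the error bar by a factor of $\sqrt{250} \approx 16$, which is the whole difference between a per-trade estimate and a full-timeline one.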

This is exactly what the Minimum Track Record Length formalizes (Bailey & López de Prado, 2012). It inverts the question — how many observations do I need before I'm allowed to believe a Sharpe of this size at confidence $p$ ? —

$\text{MinTRL} = 1 + \Big[\,1 - \hat\gamma_3\,\widehat{SR} + \tfrac{\hat\gamma_4 - 1}{4}\,\widehat{SR}^{\,2}\,\Big]\left(\frac{Z_{1-p}}{\widehat{SR} - SR^*}\right)^{2},$

turning "trust few-trade backtests less" into an explicit, checkable number of observations. Here $\hat\gamma_3$ and $\hat\gamma_4$ are the skewness and kurtosis of the returns, $SR^*$ is the benchmark Sharpe ratio the estimate must beat, and $Z_{1-p}$ is the standard-normal quantile for the chosen confidence (it enters squared, so its sign is immaterial). The deep point for objective design is this: a good objective should enforce a minimum track record from the inside, rather than leaving a human to notice, after the fact, that the winner rests on eight observations. The per-trade Sharpe does the opposite: it is maximized by driving the observation count toward the minimum. Any objective whose optimum sits at "as few trades as possible" is, by construction, an objective that seeks out its own least reliable estimate.
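To put a number on it, here is a small sketch of the MinTRL formula applied to this market's true optimum. The function name and defaults are ours. The Gaussian case (skewness 0, kurtosis 3) and the kurtosis-9 case are illustrative inputs, not estimates from the article's data, and all Sharpe ratios passed in are per observation, not annualized.

```python
import math
from statistics import NormalDist

def min_track_record_length(sr, sr_benchmark=0.0, skew=0.0, kurt=3.0, confidence=0.95):
    """MinTRL in observations; `sr` and `sr_benchmark` are per-observation Sharpe ratios."""
    if sr <= sr_benchmark:
        return math.inf                       # no sample size can confirm an inferior Sharpe
    z = NormalDist().inv_cdf(confidence)      # enters squared, so Z_p and Z_{1-p} agree
    shape = 1.0 - skew * sr + (kurt - 1.0) / 4.0 * sr**2
    return 1.0 + shape * (z / (sr - sr_benchmark)) ** 2

sr_true = 1.77 / math.sqrt(252)               # the market's true optimum, per bar
n_gauss = min_track_record_length(sr_true)
n_fat = min_track_record_length(sr_true, kurt=9.0)   # hypothetical heavier-tailed returns
print(f"Gaussian returns: {n_gauss:.0f} observations")
print(f"kurtosis 9      : {n_fat:.0f} observations")
```

Even the genuine edge in this market needs on the order of 220 observations before it can be distinguished from zero at 95% confidence. The trap's winner offers eight.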

Two failures compound in the trap, and naming both tells us how to fix it. First, small-sample noise: eight observations cannot pin down any ratio. Second, selection: those eight bars were not handed to us — the search chose the threshold that landed on them, partly because they were lucky. The search is a maximizer; it will always find the corner of the space where noise happens to look like signal. You cannot out-clever this with a better point estimate. You have to change what "best" means so the lucky corner is not the maximum.

Act 5 — The repair: an exposure floor and a trade-count shrinkage

[Figure: a control panel with two knobs labeled exposure floor and conf_k shrinkage, both lifting an out-of-sample performance surface onto a high flat plateau while a degeneracy indicator goes dark: two independent fixes that together recover the true optimum.]

We have two named diseases — trades too rarely and rests on too few observations — so we write two cures, each aimed at one.

Cure 1: an exposure floor. The simplest possible fix. Reject outright any strategy that is not in the market at least $e_{\min}$ of the time — if you barely trade, your score is $-\infty$ and the search cannot select you. But there is an honest subtlety in what you floor, and it is the quiet lesson of this whole article. As a standalone objective we floored full-timeline Sharpe, and on this market that changes nothing at all: full-timeline Sharpe's own winner already sits at ~74% exposure, so a 20% floor never once binds. That is exactly why the "exposure floor" and "full-timeline Sharpe" rows in the tables above are byte-for-byte identical — bolt a floor onto an already-safe metric and you have simply re-derived full Sharpe. The floor only does visible work when it is guarding a metric that would otherwise sprint to the corner: a per-trade metric, as in the robust objective below. In other words, "require exposure" and "measure on the full timeline" are, on this data, two names for the same intervention.
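A minimal sketch of the standalone floored objective, to make the "never binds" point concrete. The name `obj_floor_full` is hypothetical, and the candidate rows are made-up illustrative numbers, not results from the article.

```python
import math

def obj_full_sharpe(m):
    return m["sharpe_full"]

def obj_floor_full(m, e_min=0.20):            # hypothetical name for the floored objective
    """Full-timeline Sharpe, rejecting strategies below the exposure floor."""
    return -math.inf if m["exposure"] < e_min else m["sharpe_full"]

# Illustrative candidates (made-up numbers, not results from the article).
candidates = [
    {"theta": 0.005, "exposure": 0.004, "sharpe_full": 0.82},
    {"theta": 0.50,  "exposure": 0.38,  "sharpe_full": 1.40},
    {"theta": 1.04,  "exposure": 0.70,  "sharpe_full": 1.85},
    {"theta": 1.84,  "exposure": 0.93,  "sharpe_full": 1.60},
]

best_plain = max(candidates, key=obj_full_sharpe)
best_floor = max(candidates, key=obj_floor_full)
print(best_plain["theta"], best_floor["theta"])
```

The floor removes the low-exposure candidate from contention, but that candidate was never going to win under full-timeline Sharpe anyway, so both objectives return the same pick.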

Cure 2: a trade-count shrinkage — "conf_k". For when you are stuck with a per-trade metric and want a soft correction instead of a hard cutoff: discount the Sharpe continuously by how many trades it rests on. Multiply by $n/(n+k)$ , where $n$ is the trade count and $k$ is a fixed "confidence constant" — a trades-equivalent prior strength chosen before the search:

$\text{score}(\theta) = \widehat{SR}(\theta)\cdot \frac{n(\theta)}{n(\theta) + k}.$

As $n \to 0$ the score is crushed to zero regardless of how large the raw Sharpe is; as $n \to \infty$ the score converges to the raw Sharpe. This is the same corrective logic as MinTRL and the standard error of Act 4 (shrink a small-sample estimate toward zero, with less shrinkage as the sample grows), folded directly into the objective instead of applied as a post-hoc filter. The closest named precedent is Van Tharp's System Quality Number ($\text{SQN} = \sqrt{N}\cdot \overline{\text{trade}}/\sigma_{\text{trade}}$), which likewise makes a per-trade quality metric scale with the trade count $N$, though the functional form differs ($\sqrt{N}$ grows without bound, whereas $n/(n+k)$ saturates at 1). In shape ours is a Bayesian precision-weighted, empirical-Bayes-style shrinkage; it is our construction for this problem, not a named estimator lifted from the literature.

def obj_active_sharpe(m):                  # the trap: Sharpe on only the active bars
    return m["sharpe_active"]

def _shrink(n, conf_k):                     # trade-count shrinkage n / (n + k)
    return n / (n + conf_k) if (n + conf_k) > 0 else 0.0

def obj_confk(m, conf_k=40.0):              # few trades -> little credit
    return m["sharpe_active"] * _shrink(m["n_trades"], conf_k)

def obj_robust(m, e_min=0.20, conf_k=40.0): # both cures at once
    if m["exposure"] < e_min:               # floor: reject strategies that barely trade
        return -np.inf
    return m["sharpe_active"] * _shrink(m["n_trades"], conf_k)

Now the honest part: how much floor, how much shrinkage? Sweep both and read the whole surface. Each cell is the mean out-of-sample Sharpe across 200 seeds (a one-third subset of the 600, to keep the two-dimensional sweep cheap) with the degeneracy rate beside it:

$e_{\min}$ \ conf_k	$k=0$	$k=40$	$k=80$
0.00	0.66 (59.5%)	1.26 (22.5%)	1.47 (11.5%)
0.05	1.43 (10.0%)	1.53 (6.0%)	1.60 (4.0%)
0.10	1.64 (1.5%)	1.65 (1.0%)	1.67 (1.0%)
0.20	1.71 (0.0%)	1.71 (0.0%)	1.71 (0.0%)
0.35	1.73 (0.0%)	1.73 (0.0%)	1.73 (0.0%)

Mean out-of-sample annualized Sharpe, with degeneracy rate in parentheses. Top-left cell $(0,0)$ is the raw per-trade Sharpe — no floor, no shrinkage: OOS 0.66, degenerate 59.5%. That is the same objective as Act 3's per-trade row, which read 0.70 / 57%; the small gap is purely the seed set — this sweep uses 200 seeds, the Monte Carlo used all 600. Same metric, smaller sample.

The surface tells a clean story in three readings.

Each cure works alone. Move right along the top row (add shrinkage, no floor): OOS climbs $0.66 \to 1.26 \to 1.47$ and degeneracy falls $59.5\% \to 22.5\% \to 11.5\%$ . Move down the left column (add floor, no shrinkage): OOS climbs $0.66 \to 1.43 \to 1.64 \to 1.71$ and degeneracy falls $59.5\% \to 10\% \to 1.5\% \to 0\%$ . Either knob, turned alone, independently lifts out-of-sample performance and kills degeneracy. The exposure floor is the stronger single lever here, because it attacks the trap's defining feature — near-zero exposure — head on.

Together they reach the plateau — and the plateau is just full-timeline Sharpe. By $e_{\min} = 0.20$ the row is flat at OOS 1.71 with 0% degeneracy across every shrinkage level; push to $e_{\min}=0.35$ and it inches to 1.73. But look hard at what that 1.71 is: it is the exact score plain full-timeline Sharpe posts in Act 3 with no floor and no shrinkage at all. At their best, the retrofits do not beat full-timeline Sharpe — they reconstruct it. And the fully repaired robust objective does not even quite get there: across all 600 seeds it lands at OOS 1.70 with a residual 0.17% degeneracy, a hair under full Sharpe's 1.71 / 0% — it is weakly dominated by the simpler metric. A modest middle setting, $e_{\min}=0.10$ with $k=40$ , reaches OOS 1.65 at 1% degeneracy — handy if a per-trade metric is forced on you, but never a reason to prefer one.

The exact numbers are scale-dependent — the shape is the result. The specific values $e_{\min}=0.20$ , $k=40$ that fully repair this market are tuned to this data-generating process; on a different market with different trade frequencies and tail thickness, the plateau sits elsewhere. What generalizes is not the coordinates but the surface: a monotone lift in both directions, degeneracy driven to zero, a plateau at the truth. You find your own coordinates by sweeping, exactly as above.
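A sweep of this kind is short to write. The sketch below is our own minimal version, not the article's script: the threshold grid, the 20-seed default, and the use of out-of-sample exposure in the degeneracy test are assumptions, so expect the same qualitative surface but different cell values.

```python
import numpy as np

ANN = np.sqrt(252)
THETAS = np.geomspace(0.005, 3.0, 80)          # ASSUMED grid, not the article's exact one

def gen_dataset(T, rng, beta=0.30, band=1.0, tail_df=4):
    s = rng.standard_normal(T)
    eps = rng.standard_t(tail_df, T) / np.sqrt(tail_df / (tail_df - 2.0))
    return s, beta * np.where(np.abs(s) <= band, s, 0.0) + eps

def sharpe(x):
    if x.size < 2 or x.std(ddof=1) == 0:
        return np.nan
    return ANN * x.mean() / x.std(ddof=1)

def candidate_table(s, r):
    """Rows: exposure, active-bar count, per-trade Sharpe, full Sharpe (one column per theta)."""
    rows = []
    for th in THETAS:
        pos = np.where(np.abs(s) <= th, np.sign(s), 0.0)
        strat, active = pos * r, pos != 0.0
        rows.append((active.mean(), active.sum(), sharpe(strat[active]), sharpe(strat)))
    return np.array(rows, dtype=float).T

def sweep(floors, ks, n_seeds=20, T=2000):
    oos = np.zeros((len(floors), len(ks)))
    degen = np.zeros_like(oos)
    for seed in range(n_seeds):
        rng = np.random.default_rng(seed)
        expo, n, sr_active, _ = candidate_table(*gen_dataset(T, rng))        # in-sample
        expo_oos, _, _, sr_full_oos = candidate_table(*gen_dataset(T, rng))  # out-of-sample
        for i, e_min in enumerate(floors):
            for j, k in enumerate(ks):
                score = sr_active * n / np.maximum(n + k, 1e-12)   # n / (n + k) shrinkage
                score = np.where(np.isnan(score) | (expo < e_min), -np.inf, score)
                w = int(np.argmax(score))                          # in-sample winner
                pick = sr_full_oos[w]
                oos[i, j] += np.nan_to_num(pick) / n_seeds
                degen[i, j] += float(expo_oos[w] < 0.05 or not (pick > 0)) / n_seeds
    return oos, degen

floors, ks = (0.0, 0.05, 0.10, 0.20, 0.35), (0, 40, 80)
oos, degen = sweep(floors, ks)
for e_min, o_row, d_row in zip(floors, oos, degen):
    print(f"{e_min:.2f}  " + "  ".join(f"{o:.2f} ({d:.0%})" for o, d in zip(o_row, d_row)))
```

The candidate metrics are computed once per seed and reused for every cell, which is what keeps a two-dimensional sweep cheap: only the scoring rule changes between cells.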

Put both cures together — the robust objective, floor 0.20 plus conf_k 40 — and return to seed 6. The trap crowned $\theta = 0.005$ , eight trades, a full-timeline OOS Sharpe of 0.13. The robust objective instead selects $\theta = 0.979$ : market exposure 0.66, 447 trades, out-of-sample annualized Sharpe 1.77. That $\theta = 0.979$ is one grid point below the true optimum $\theta = 1.04$ , so it recovers a near-optimal, well-exposed threshold rather than landing on the bullseye — its single-seed out-of-sample Sharpe (1.77) happens to coincide with the population optimum. Same data, same search, same 80 candidate thresholds. Only the definition of "best" changed, and it moved the winner from a flat eight-trade mirage to the real, well-exposed edge — which is, tellingly, the very same threshold plain full-timeline Sharpe picks on this seed.

One caution the sweep makes explicit: conf_k alone is not enough on this market. At $k=40$ with no floor, degeneracy is still 22.5% across seeds — and on seed 6 specifically, conf_k alone picks $\theta = 0.015$ , 35 trades, an out-of-sample Sharpe of −0.06. Thirty-five trades survive a shrinkage of $35/(35+40) \approx 0.47$ with enough score left to win. The exposure floor is what closes that last gap, because it targets the trap's true signature — being flat — directly, rather than trusting trade count as a proxy for it.
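The arithmetic behind this caution is worth seeing once. The sketch below uses only numbers quoted in the article (trade counts of 8, 35, and 447, $k = 40$, and the seed-6 per-trade Sharpe of 21.09); the `shrink` helper mirrors `_shrink` above, and the break-even calculation is our own illustration.

```python
def shrink(n, k=40.0):
    """Trade-count shrinkage factor n / (n + k)."""
    return n / (n + k) if (n + k) > 0 else 0.0

for n in (8, 35, 447):
    print(f"n={n:>3}  factor={shrink(n):.3f}")

lottery_score = 21.09 * shrink(8)              # seed-6 eight-trade winner after shrinkage
needed = lottery_score / shrink(447)           # raw per-trade Sharpe a 447-trade rival needs
print(f"shrunk lottery score: {lottery_score:.2f}, rival must exceed {needed:.2f}")
```

Even after losing five-sixths of its score, the eight-trade lottery keeps about 3.5, so a 447-trade candidate would need a raw per-trade Sharpe above roughly 3.8 to outrank it. By our own back-of-envelope population calculation, the per-trade Sharpe near the true optimum is only about 2.1, which is why shrinkage at $k = 40$ cannot do the job without a floor.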

Act 6 — The deeper lesson: measure on the whole timeline

[Figure: two ways of scoring the same strategy. A narrow spotlight illuminates only the few bars a strategy traded; a floodlight measures the entire timeline, flat stretches included: exposure-aware objectives cannot be gamed by rare lucky trades.]

Step back from the repair and notice what actually separated the safe objectives from the trap. It was not sophistication. Raw PnL and full-timeline Sharpe are simpler than the per-trade Sharpe, and they never degenerated — 0% across 600 seeds — with no floor, no shrinkage, no tuning at all.

The dividing line is a single property: what window does the metric measure? The per-trade Sharpe measures only the bars the strategy chose to stand on — a self-selected window the search can shrink at will. Full-timeline Sharpe and total PnL measure the entire timeline, flat bars included. And you cannot make a full-timeline metric large by trading rarely, because every hour you sit flat is an hour in the denominator earning nothing. The exposure floor and conf_k are, in the end, just ways of retrofitting the per-trade metric with the exposure-awareness that full-timeline metrics have for free — and the sweep already told us the ceiling of that retrofit: at its best it matches full-timeline Sharpe (OOS 1.70 vs 1.71), never beats it. If you are free to choose the window, choose the whole timeline and skip the retrofit entirely.

So the design principle, stated plainly:

Design the objective so it cannot be gamed by rare lucky trades. You have three tools, in rough order of preference:

1. Measure on the full timeline. This is the default, and it should almost never be departed from. Full-timeline Sharpe and total return are exposure-aware by construction: idleness is automatically penalized because flat bars count. If you find yourself reporting a metric computed on "only the bars we were active," stop and ask what the search will do with the freedom to choose those bars.
2. Require exposure. If you must use an activity-conditional metric, floor the exposure so the search cannot select a strategy that barely trades. This is the strongest single lever against the specific trap.
3. Shrink by trade count. Discount any ratio by $n/(n+k)$ so a Sharpe resting on a handful of observations earns a fraction of the credit of one resting on thousands. This is the objective-level enforcement of the Minimum Track Record Length: a number from few observations is unreliable (Act 4), so an honest objective prices that unreliability in rather than trusting a human to catch it later.

None of this makes the search smarter. It makes the target honest, so that when the search does exactly what it always does — find the maximum — the maximum is a strategy you actually want.

Honesty notes

Three caveats, stated plainly, because a controlled study earns its conclusions only by naming its limits.

The market is synthetic, and deliberately so. A standard-normal signal, a linear edge confined to $|s|\le 1$ , fat-tailed Student- $t$ ( $4$ ) noise — chosen for controlled ground truth, not for market realism. We can only prove that an objective picks the wrong strategy by running it on data where we know which strategy is right. Real markets are non-stationary, autocorrelated, and regime-shifting. The fat tails are a realistic ingredient we kept, but — contrary to a natural first guess, and contrary to an earlier draft of this very article — they are not what powers the trap: a Gaussian-noise control (300 seeds) degenerates 55.7% of the time versus 57.0% here, with out-of-sample Sharpe 0.71 versus 0.70. The trap is a small-sample-plus-selection artifact that survives with or without heavy tails. The deliverable is the diagnosis and the repair pattern, not a strategy and not a universal constant.
The repairing values are scale-dependent. The specific floor $e_{\min}=0.20$ and shrinkage $k=40$ that fully close the trap are fitted to this data-generating process — its trade frequencies, its tail thickness, its edge size. On other data the plateau moves. What transfers is the shape of the sweep surface (monotone lift in both knobs, degeneracy to zero, a plateau at the truth) and the method for finding your own coordinates: sweep both and read the surface, do not copy the numbers.
conf_k is our construction, not a named estimator. The trade-count shrinkage $\widehat{SR}\cdot n/(n+k)$ is a Bayesian precision-weighted, empirical-Bayes-style device we built for this problem. Its rationale is grounded in the Lo/Mertens standard-error result and the Bailey–López de Prado MinTRL, and its closest named relative is Van Tharp's System Quality Number ($\sqrt{N}\cdot \overline{\text{trade}}/\sigma_{\text{trade}}$, a different functional form), but we do not claim $n/(n+k)$ itself appears under a name in the literature. Its companion cures, the exposure floor and full-timeline measurement, are standard practice stated precisely. And note which objectives were already safe here: raw PnL and full-timeline Sharpe never needed repair, because they are exposure-aware to begin with, so much so that flooring full-timeline Sharpe reduces to full-timeline Sharpe, its winner already clearing any reasonable floor. The trap is specifically the per-trade (active) Sharpe, and even the fully repaired per-trade objective only matches full-timeline Sharpe (OOS 1.70 vs 1.71) and never beats it. The primary lesson is not the repair; it is to measure on the full timeline in the first place.

Takeaways

The objective is the decision, not a formality. A search does not find "a good strategy" — it finds the maximizer of whatever scalar you handed it, and different scalars select wildly different strategies on identical data. Choosing the objective is choosing the strategy; everything downstream is bookkeeping. That is Goodhart's law: the moment your metric becomes the search's target, the search will exploit every gap between it and what you meant.
The per-trade Sharpe is a trap. Measured on only the bars a strategy trades, it is maximized by trading as rarely as possible — the fewer observations, the easier a couple of lucky moves inflate the ratio (a Gaussian control confirms fat tails are not required; it is a small-sample-plus-selection effect). Across 600 seeds it picks a sub-5%-exposure lottery in 56% of them and degenerates in 57%; the typical degenerate pick averages an in-sample per-trade Sharpe of 4.58. On one deliberately stark seed it crowned an eight-trade, 0.4%-exposure strategy with an in-sample per-trade Sharpe of 21.09 collapsing to a full-timeline out-of-sample Sharpe of 0.13. A ratio built on eight observations has a standard error of order ±7.7 annualized (Act 4) — it was never information.
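The ±7.7 figure can be reproduced with a few lines of arithmetic. The sketch below assumes Lo's IID approximation for the standard error of a Sharpe ratio and an annualization factor of $\sqrt{252}$; neither is stated in this section, so treat both as our assumptions about what Act 4 computes.

```python
import math


def sharpe_se_annualized(sr_annual: float, n_obs: int,
                         periods_per_year: int = 252) -> float:
    """Approximate standard error of an annualized Sharpe ratio.

    Per period, the IID approximation is sqrt((1 + SR^2 / 2) / n);
    the result is scaled back up by sqrt(periods_per_year).
    """
    scale = math.sqrt(periods_per_year)
    sr_period = sr_annual / scale
    se_period = math.sqrt((1.0 + 0.5 * sr_period ** 2) / n_obs)
    return se_period * scale


# Eight observations, in-sample per-trade Sharpe of 21.09.
print(round(sharpe_se_annualized(21.09, 8), 1))  # 7.7
```

Under those assumptions a two-standard-error band around 21.09 is roughly 15 Sharpe units wide on each side, which is why the number carries so little information.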
Exposure-aware objectives are naturally safe. Raw PnL and full-timeline Sharpe never degenerated (0%), because they measure the entire timeline, so idleness is automatically penalized. You cannot game a full-timeline metric by trading rarely. Raw PnL's only flaw is the opposite bias: it over-exposes (mean exposure 0.859, OOS 1.61 versus the true 1.77), pushing $\theta$ past the optimum in order to spend more time in the market.
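A small sketch shows why the full timeline penalizes idleness. The eight trade returns are made-up illustrative numbers, not values from the paper, and the $\sqrt{252}$ annualization is an assumption.

```python
import math
from statistics import mean, pstdev

ANN = math.sqrt(252)  # assumed annualization factor


def sharpe(xs: list[float]) -> float:
    """Annualized mean over standard deviation; 0.0 if there is no variance."""
    sd = pstdev(xs)
    return ANN * mean(xs) / sd if sd > 0 else 0.0


n_bars = 1000
# Hypothetical returns on the only eight bars the strategy trades.
trade_returns = [0.010, 0.012, 0.009, 0.011, 0.010, 0.013, 0.008, 0.011]
# The same strategy on the full timeline: flat bars contribute a zero return.
full_timeline = trade_returns + [0.0] * (n_bars - len(trade_returns))

exposure = len(trade_returns) / n_bars
active_sr = sharpe(trade_returns)   # per-trade (active) Sharpe
full_sr = sharpe(full_timeline)     # full-timeline Sharpe

print(f"exposure={exposure:.3f} active={active_sr:.1f} full={full_sr:.2f}")
```

The flat bars drag the mean down in proportion to exposure, while the standard deviation falls only with roughly its square root, so the full-timeline ratio is far smaller than the per-trade one for the same eight trades.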
The repairs work, but they only get you back to full-timeline Sharpe. An exposure floor and a trade-count (conf_k) shrinkage each independently lift out-of-sample Sharpe and drive degeneracy toward zero; together they reach a plateau. But that plateau is full-timeline Sharpe: degeneracy falls from 59.5% at $(e_{\min},k)=(0,0)$ to 0% by $e_{\min}=0.20$, and OOS Sharpe climbs from 0.66 to 1.71, the exact number plain full-timeline Sharpe posts unaided. The fully repaired robust objective actually lands a hair under it (OOS 1.70, 0.17% degeneracy), so it is weakly dominated. On seed 6 the robust objective recovers a near-optimal, well-exposed threshold ($\theta = 0.979$, one grid point below the true $1.04$; 447 trades; OOS Sharpe 1.77). Treat conf_k as a fallback for when a per-trade metric is forced on you, not as an upgrade over measuring the full timeline. The exact coordinates are scale-dependent; the shape of the surface is the transferable result.
Design the objective so it cannot be gamed by rare lucky trades. In order of preference: measure on the full timeline (default), require exposure (strongest single lever), shrink by trade count (objective-level Minimum Track Record Length). None of these makes the search cleverer — they make the target honest, so that the maximum the search inevitably finds is a strategy you would actually trade.
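The two repair knobs can be sketched as a single objective and swept on a toy market. Everything specific here is an assumption made for illustration: the edge size, the bar count, the threshold grid, the rule "trade in the signal's direction while $|s|\le\theta$", and the helper names. It is not the paper's script and will not reproduce its numbers.

```python
import math
import random

ANN = math.sqrt(252)  # assumed annualization factor


def make_market(n_bars: int, seed: int, edge: float = 0.05):
    """Toy market: N(0,1) signal, linear edge inside |s| <= 1, t(4) noise."""
    rng = random.Random(seed)
    signals, returns = [], []
    for _ in range(n_bars):
        s = rng.gauss(0.0, 1.0)
        # Student-t(4) draw: standard normal over sqrt(chi-square(4) / 4).
        noise = rng.gauss(0.0, 1.0) / math.sqrt(rng.gammavariate(2.0, 2.0) / 4.0)
        signals.append(s)
        returns.append((edge * s if abs(s) <= 1.0 else 0.0) + noise)
    return signals, returns


def sharpe(xs):
    m = sum(xs) / len(xs)
    var = sum((x - m) ** 2 for x in xs) / len(xs)
    return ANN * m / math.sqrt(var) if var > 0 else 0.0


def robust_objective(active, n_bars, e_min, k):
    """Per-trade Sharpe with an exposure floor and n / (n + k) shrinkage."""
    n = len(active)
    if n < 2 or n / n_bars < e_min:
        return float("-inf")  # below the floor: never eligible to win
    return sharpe(active) * n / (n + k)


def search(signals, returns, e_min, k, thetas):
    """Return (score, theta, exposure) of the best threshold on the grid."""
    best = None
    for theta in thetas:
        active = [math.copysign(1.0, s) * r
                  for s, r in zip(signals, returns) if abs(s) <= theta]
        score = robust_objective(active, len(signals), e_min, k)
        if best is None or score > best[0]:
            best = (score, theta, len(active) / len(signals))
    return best


signals, returns = make_market(n_bars=2000, seed=6)
thetas = [0.05 * i for i in range(1, 61)]  # 0.05 .. 3.00
results = {(e_min, k): search(signals, returns, e_min, k, thetas)
           for e_min in (0.0, 0.20) for k in (0.0, 40.0)}

for (e_min, k), (score, theta, expo) in results.items():
    print(f"e_min={e_min:.2f} k={k:>4.0f} -> theta={theta:.2f} exposure={expo:.3f}")
```

The cell $(0, 0)$ is the unrepaired per-trade Sharpe; the other three switch on the floor, the shrinkage, or both. Reading how the selected $\theta$ and exposure move across the cells is the small-scale version of reading the sweep surface, and the same loop with a full-timeline `sharpe` as the score gives the default to compare against.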

A parameter search is an obedient genie. It grants precisely the wish you phrase, not the one you meant — and "maximize this metric" is the wish. Phrase it as a per-trade Sharpe and it will conjure eight lucky trades and call them a fortune. Phrase it so that sitting flat is punished and a handful of trades earns only a handful of credit, and the same genie, on the same data, hands you the real edge. The strategy you deploy was chosen the moment you typed the objective. Choose it on purpose.

The full experiment — the synthetic market, the six objectives, the 600-seed Monte Carlo, and the repair-sweep surface, every number regenerable from one deterministic script — is in the companion paper at objective-design.marketmaker.cc, with code and data at github.com/suenot/objective-design-degeneracy.

This is one front in the same war our other studies fight from different angles. The Deflated Sharpe Ratio prices the winner of a search after multiple testing — where this article asks whether the objective picked the right strategy at all, DSR asks whether the strategy it picked beats what luck alone would produce. The forthcoming probability of backtest overfitting study attacks the same selection bias from the resampling side, scoring the procedure rather than the winner. And the look-ahead bias taxonomy catalogs the other great manufacturer of fake Sharpe — a leak from the future — which produces the identical symptom (a glorious backtest that dies live) through an entirely different mechanism. Objective design, deflation, overfitting probability, look-ahead: four names for one discipline — not being fooled by your own backtest.

Read More

The Probability of Backtest Overfitting: Did Your Search Beat a Coin Flip?

The Deflated Sharpe Ratio: How Many of Your Backtest 'Winners' Survive Multiple Testing?

Walk-Forward Optimization: The Only Honest Strategy Test