The Backtest Speed Ladder: 298x on a Laptop CPU, Identical PnL to the Last Trade

Part of the "Backtests Without Illusions" series.

📄 This article grew into a research paper. One path-dependent backtest kernel is implemented five ways — from naive pandas up to a parallel numba kernel — with every rung cross-checked to produce identical per-combo PnL, so the only thing that differs is speed. Read the paper online (interactive version + PDF) at speed-ladder.marketmaker.cc, code and data at github.com/suenot/backtest-speed-ladder.

Seventy seconds. That is how long the naive reference implementation takes to sweep 80 parameter combinations of one moving-average strategy over 150,000 bars: pandas rolling().apply() for the indicators, a plain Python loop for the trades. It is the profile a huge fraction of real-world research code runs on, because it is the profile that falls out of writing the strategy the obvious way.

The same sweep, on the same laptop, producing the same PnL for every combination down to the last trade: 0.23 seconds.

The gap between those two numbers — a measured 298x — is the subject of this article. Not one percentage point of it came from new hardware. No GPU was involved (none is even available on this machine in the CUDA sense). Every rung of the ladder is the same strategy, the same data, the same fees, the same trade count, verified by an equivalence gate that fails the whole benchmark if any implementation's per-combo results diverge. What changed is only how the work is expressed: what runs in the interpreter, what runs compiled, and what runs in parallel. And because a deliberately slow baseline can flatter any headline number, one more figure up front: even against a competent vectorized numpy implementation — the code a strong numpy programmer would ship — the finished engine is still about 13x faster.

When a parameter search is slow, the reflex is to reach for bigger hardware — a GPU, a cluster, a cloud budget. The measured reality of this experiment points somewhere much less glamorous: the bottleneck was the engine (an interpreted inner loop doing per-window Python calls) and the orchestration (running independent combos serially on one core). Both are fixable in an afternoon, on the machine you already own, with zero change to the results.

Here is the whole ladder up front. Everything below is the anatomy of each step.

Rung	Implementation	Wall time	Speedup	Combos/s
M0	pandas: `rolling.apply` + Python bar loop	69.92 s	1.0x	1.1
M1	numpy: sliding-window WMA + vectorized trades	3.07 s	22.7x	26.0
M2	numba: `@njit` WMA + `@njit` event loop	1.98 s	35.3x	40.4
M3	numba `prange`: threads across combos	0.32 s	217.6x	248.9
M4	process pool + numba: processes across combos	0.23 s	297.9x	340.9

Apple M2 Max (12 cores), Python 3.14.6, numpy 2.4.3, numba 0.64.0, BLAS (Accelerate) pinned to one thread so the single-threaded rungs are genuinely single-core. 150,000 bars × 80 combos, best-of-3 wall time, JIT warm-up excluded. All rungs — the pandas baseline included — timed in full and verified to produce identical per-combo PnL and trade counts on all 80 combos.

One kernel, five implementations

Figure: Five rungs of one staircase: the same backtest kernel climbing from a 70-second pandas baseline to a 0.23-second parallel numba run, each step verified to produce identical PnL.

To make a speed comparison mean anything, the thing being computed has to be pinned down exactly, and every implementation has to be proven to compute it. So the experiment fixes one strategy kernel and holds it constant across all five rungs.

The kernel is an HMA/HMA3 cross — a stop-and-reverse system on two Hull-style moving averages. The building block is the weighted moving average:

$\mathrm{WMA}_p(x)_i = \frac{\sum_{j=1}^{p} j \cdot x_{i-p+j}}{\sum_{j=1}^{p} j}$

The Hull Moving Average composes three of them to cut lag:

$\mathrm{HMA}_n(x) = \mathrm{WMA}_{\lfloor\sqrt{n}\rceil}\Big(2\,\mathrm{WMA}_{\lfloor n/2\rceil}(x) - \mathrm{WMA}_{n}(x)\Big)$

and HMA3 is a smoother sibling built from WMAs at roughly $n/6$, $n/4$ and $n/2$, smoothed once more. Per parameter combination that is seven WMA passes over six distinct window lengths: a real indicator stack, not a toy.
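As a minimal sketch of the two definitions above (not the benchmarked code), the WMA can be written straight from its formula and the HMA composed from three calls. The names `wma` and `hma` and the use of Python's `round` for the period rounding are assumptions made for illustration; HMA3 is left out because its periods are only given approximately here.

```python
import math
import numpy as np

def wma(x: np.ndarray, period: int) -> np.ndarray:
    """WMA with linear weights 1..period; the newest bar gets the largest weight."""
    w = np.arange(1, period + 1, dtype=np.float64)
    out = np.full(len(x), np.nan)
    for i in range(period - 1, len(x)):
        out[i] = np.dot(x[i - period + 1 : i + 1], w) / w.sum()
    return out

def hma(x: np.ndarray, n: int) -> np.ndarray:
    """Hull MA: WMA_sqrt(n) applied to 2*WMA_(n/2) - WMA_n."""
    half = max(1, int(round(n / 2)))        # rounding rule is an assumption
    sq = max(1, int(round(math.sqrt(n))))   # rounding rule is an assumption
    return wma(2.0 * wma(x, half) - wma(x, n), sq)

ramp = np.arange(8, dtype=np.float64)   # straight-line input 0, 1, ..., 7
w3 = wma(ramp, 3)                       # first two entries are NaN (window not full)
h4 = hma(ramp, 4)                       # NaN until both WMA layers are defined
```

On a straight-line input, a WMA of period 3 trails the input by 2/3 of a bar, while this sketch's `hma(ramp, 4)` reproduces the input wherever it is defined. That is the lag cancellation the three-WMA composition is built for.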

The trading rule is deliberately, usefully stateful: direction is long when HMA is below HMA3 and short otherwise; open a position on the first defined direction; on every cross, close the position, book PnL minus a 0.09% round-trip fee, and reverse. The position carries across bars — what you do at bar $i$ depends on state accumulated since the last cross. This path dependence is the whole point of the experiment: it is the property that makes backtests different from generic dataframe pipelines, and (as we will measure) it complicates the GPU question — though not, it turns out, in the way the folklore says.

The rest of the setup, so you can judge the numbers:

Data: 150,000 bars of synthetic geometric Brownian motion, seeded (seed=42). Performance here is bound by array size and window lengths, not by which price path you feed it — and a synthetic series makes the whole experiment deterministic and reproducible by anyone.
Grid: 80 distinct HMA lengths spread over $[6, 200]$ — so the sweep contains both cheap short-window combos and expensive long-window ones, like a real grid does.
Timing: wall-clock, best-of-3 per rung, with JIT compilation warmed outside the timer and pool workers warmed before the clock starts. Every rung — the pandas baseline included — is timed in full across all 80 combos. BLAS (Apple's Accelerate) is pinned to a single thread, so the single-threaded rungs are genuinely single-core: the numpy rung is not quietly multithreading its matvecs behind the comparison's back.
Equivalence gate: after timing, every rung's per-combo (PnL, trade count) vector is compared against the reference — trade counts must match exactly, PnL to within an absolute $10^{-6}$ percentage points. The committed run reports all_ok: true for every rung, the pandas baseline included, on all 80 combos. If this gate fails, there is no benchmark — there are just five programs computing five different things at five different speeds, which is how a lot of "our engine is 100x faster" claims quietly work.
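The gate itself is small. The sketch below is one way to express the rule just described, assuming each rung returns a pair of per-combo arrays (PnL, trade count); the function name `equivalence_gate` and that return layout are illustrative assumptions, not the harness's actual interface.

```python
import numpy as np

PNL_ATOL = 1e-6  # absolute tolerance on PnL, in percentage points

def equivalence_gate(reference, candidate, atol=PNL_ATOL):
    """True only if trade counts match exactly and PnL matches within atol.

    Both arguments are (pnl_array, trade_count_array) pairs, one entry per combo.
    """
    ref_pnl, ref_ntr = (np.asarray(a) for a in reference)
    cand_pnl, cand_ntr = (np.asarray(a) for a in candidate)
    if ref_pnl.shape != cand_pnl.shape or ref_ntr.shape != cand_ntr.shape:
        return False
    trades_ok = np.array_equal(ref_ntr, cand_ntr)
    # A NaN on either side makes the comparison False, so it fails the gate.
    pnl_ok = bool(np.all(np.abs(ref_pnl - cand_pnl) <= atol))
    return trades_ok and pnl_ok

# First entry is the fingerprint quoted in the text; the second is made up.
ref = (np.array([-5165.58, 12.5]), np.array([57029, 410]))
tiny_drift = (ref[0] + 1e-9, ref[1].copy())
one_trade_off = (ref[0].copy(), ref[1] + np.array([0, 1]))
```

Here `equivalence_gate(ref, tiny_drift)` passes and `equivalence_gate(ref, one_trade_off)` fails, which is the asymmetry the gate is meant to enforce: floating-point noise is tolerated, a different trade sequence is not.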

One number from the equivalence block is worth a moment of honesty: the fingerprint for the first combo is a PnL of −5165.58 percentage points across 57,029 trades. That is not a strategy result to be embarrassed about — it is the shortest HMA length (6) flipping on nearly every wiggle of a random walk and paying 0.09% each time, exactly as it should. It is a correctness fingerprint, not a tradable backtest. Do not read alpha into it; read determinism into it — five implementations landing on the same 57,029 trades and the same PnL to six decimals is what "identical" means here.

With that established, every speedup below is pure speed. Nothing was approximated away.

Rung M0: the naive pandas profile — 69.9 s

Figure: Anatomy of the naive pandas baseline: a rolling.apply window spawning a Python lambda call for every one of 150,000 bars while the interpreter loop crawls beneath it.

The baseline is not a strawman. It is the code you get when you write a WMA the way the pandas documentation suggests and the event loop the way the strategy description reads:

import numpy as np
import pandas as pd

FEE = 0.09  # round-trip fee in percentage points (0.09%)

def pd_wma(s: pd.Series, period: int) -> np.ndarray:
    w = np.arange(1, period + 1, dtype=np.float64)
    w /= w.sum()
    return s.rolling(period).apply(lambda x: np.dot(x, w), raw=True).to_numpy()

def run_pandas_one(close, length):
    h, h3 = pd_hma(close, length), pd_hma3(close, length)  # 7 rolling.apply WMAs
    total, ntr, prev_dir, entry, pos = 0.0, 0, 0, 0.0, 0
    for i in range(len(close)):                            # Python bar loop
        if np.isnan(h[i]) or np.isnan(h3[i]):
            continue
        d = 1 if h[i] < h3[i] else -1
        if prev_dir == 0:
            prev_dir, pos, entry = d, d, close[i]
            continue
        if d != prev_dir:                                  # cross: close + reverse
            pnl = ((close[i] - entry) if pos == 1
                   else (entry - close[i])) / entry * 100 - FEE
            total += pnl
            ntr += 1
            pos, entry, prev_dir = d, close[i], d
    return total, ntr

Why is this slow? Not because pandas is "bad" — because of where the iteration lives. rolling(period).apply(lambda ...) is a Python-level loop wearing a vectorized costume. For every one of 150,000 bars, pandas materializes a window, crosses the C/Python boundary, invokes a Python callable, and boxes the result. Even with raw=True (which at least hands the lambda a bare ndarray instead of a Series), the per-call interpreter overhead dwarfs the ~dozens-to-hundreds of FLOPs the window actually needs. Multiply by seven WMA passes per combo, and the indicator stack alone is millions of interpreter round-trips. Then the bar loop runs another 150,000 interpreted iterations per combo, each doing bounds-checked indexing on numpy scalars, boxing floats, and dispatching dynamically on types the interpreter re-discovers every single time.

The result: 69.92 s for the sweep, about 0.87 s per combo, a throughput of 1.1 combos per second. On an 80-combo grid you shrug and wait a minute. The problem is that nobody runs 80-combo grids for long — and this cost scales linearly forever. We will come back to that.

Rung M1: numpy — stop calling Python in a loop — 3.07 s, 22.7x

The first rung up eliminates both interpreter loops at once, and it is worth separating the two tricks because they have very different generality.

The indicator side is the easy, fully general one. A weighted moving average over all windows is just a matrix–vector product against a strided view of the input — no copies, one BLAS call:

def vec_wma(x: np.ndarray, period: int) -> np.ndarray:
    w = np.arange(1, period + 1, dtype=np.float64)
    win = np.lib.stride_tricks.sliding_window_view(x, period)  # zero-copy view
    out = np.full(len(x), np.nan)
    out[period - 1:] = win @ w / w.sum()                       # one matvec
    return out

sliding_window_view builds a (n − p + 1, p) view of the same memory, and win @ w computes every window's dot product in compiled code. The roughly one million lambda invocations per combo (seven passes over 150,000 bars) become seven library calls.
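Both claims, the zero-copy view and the agreement with the plain definition, are cheap to check. The sketch below repeats `vec_wma` and compares it with a deliberately dumb two-loop reference on a seeded random walk; `loop_wma` and the test series are illustrative assumptions.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def vec_wma(x, period):
    w = np.arange(1, period + 1, dtype=np.float64)
    win = sliding_window_view(x, period)          # shape (n - period + 1, period)
    out = np.full(len(x), np.nan)
    out[period - 1:] = win @ w / w.sum()          # one matrix-vector product
    return out

def loop_wma(x, period):
    """Dumb reference: the WMA definition as two explicit loops."""
    out = np.full(len(x), np.nan)
    wsum = period * (period + 1) / 2.0
    for i in range(period - 1, len(x)):
        s = 0.0
        for j in range(period):
            s += x[i - period + 1 + j] * (j + 1)
        out[i] = s / wsum
    return out

rng = np.random.default_rng(42)
x = 100.0 + np.cumsum(rng.normal(size=2_000))     # hypothetical test series
win = sliding_window_view(x, 5)
same_memory = np.shares_memory(win, x)            # the view copies nothing
matches = np.allclose(vec_wma(x, 5), loop_wma(x, 5), equal_nan=True)
```

The two versions sum in a different order, so they agree to floating-point tolerance rather than bit for bit, which is why the comparison uses `np.allclose`.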

The trade side is the interesting one, because the event loop is stateful — and yet, for this kernel, it vectorizes. The insight is that the position at any bar depends only on the sign of HMA − HMA3, not on any trade outcome. State never feeds back into decisions. So the entire loop collapses into "find the sign flips, gather prices at those indices":

# Excerpt from the body of the per-combo function. idx is assumed to hold the
# indices of the bars where both h and h3 are defined (not NaN).
d = np.where(h[idx] < h3[idx], 1, -1)             # direction per valid bar
flips = np.flatnonzero(np.diff(d) != 0) + 1       # bars where it crosses
cross = idx[np.concatenate(([0], flips))]         # entry/exit indices
side  = d[np.concatenate(([0], flips))]
entries, exits, s = close[cross[:-1]], close[cross[1:]], side[:-1]
pnl = np.where(s == 1, (exits - entries) / entries,
               (entries - exits) / entries) * 100 - FEE
return float(pnl.sum()), int(pnl.size)
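To see the collapse work end to end, the sketch below wraps the flip-index logic in a function and checks it against the bar-by-bar loop. It is a self-contained illustration under stated assumptions: `loop_trades` and `vec_trades` are hypothetical names, and the two indicator arrays are noisy stand-ins rather than real HMA and HMA3 series.

```python
import numpy as np

FEE = 0.09  # round-trip fee in percentage points

def loop_trades(close, h, h3):
    """Bar-by-bar stop-and-reverse reference."""
    total, ntr, prev_dir, entry = 0.0, 0, 0, 0.0
    for i in range(len(close)):
        if np.isnan(h[i]) or np.isnan(h3[i]):
            continue
        d = 1 if h[i] < h3[i] else -1
        if prev_dir == 0:                      # first defined direction: open
            prev_dir, entry = d, close[i]
            continue
        if d != prev_dir:                      # cross: close and reverse
            move = (close[i] - entry) if prev_dir == 1 else (entry - close[i])
            total += move / entry * 100 - FEE
            ntr += 1
            prev_dir, entry = d, close[i]
    return total, ntr

def vec_trades(close, h, h3):
    """Same result from sign flips, with no per-bar Python loop."""
    idx = np.flatnonzero(~np.isnan(h) & ~np.isnan(h3))   # bars with both defined
    if idx.size == 0:
        return 0.0, 0
    d = np.where(h[idx] < h3[idx], 1, -1)
    marks = np.concatenate(([0], np.flatnonzero(np.diff(d) != 0) + 1))
    cross, side = idx[marks], d[marks]
    entries, exits, s = close[cross[:-1]], close[cross[1:]], side[:-1]
    pnl = np.where(s == 1, exits - entries, entries - exits) / entries * 100 - FEE
    return float(pnl.sum()), int(pnl.size)

rng = np.random.default_rng(42)
n = 5_000
close = 100.0 * np.exp(np.cumsum(rng.normal(0.0, 0.01, n)))
h = close + rng.normal(0.0, 0.5, n)      # hypothetical stand-ins for HMA / HMA3
h3 = close + rng.normal(0.0, 0.5, n)
h[:10] = np.nan                          # warm-up regions, as real indicators have
h3[:25] = np.nan

ref, fast = loop_trades(close, h, h3), vec_trades(close, h, h3)
```

Trade counts should agree exactly and PnL to within summation-order noise; that comparison is a miniature of the equivalence gate described earlier.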

3.07 s, a 22.7x speedup, 26.0 combos per second — on one core, with BLAS pinned to a single thread. This rung deserves a label: it is the competent baseline, the implementation a strong numpy programmer would ship, and the fair yardstick for everything above it. But two honest caveats travel with this rung.

First, this vectorization is a strategy-specific analytical rewrite, not a mechanical transformation. It exists because the kernel is stop-and-reverse with no stops, no trailing exits, no position sizing that depends on running PnL. Add a stop-loss, the most ordinary feature imaginable, and the exit at bar $i$ changes which entry exists at bar $j > i$, state feeds back into the path, and the closed form evaporates. Most production kernels live on the wrong side of that line.

Second, this is the rung where correctness goes to die. The flip-index bookkeeping (+1 here, [:-1] there, the first-direction seeding) is exactly the kind of code that produces off-by-one execution bugs — the same species of bug our look-ahead taxonomy showed can manufacture a Sharpe of 15 from noise. The equivalence gate is not a formality on this rung; it is the only reason to trust it. Clever vectorized rewrites without an equivalence check against a dumb reference implementation are how engines drift away from the strategy they claim to test.

Rung M2: numba — compile the loop you actually want to write — 1.98 s, 35.3x

Figure: A Python event loop passing through the numba JIT compiler and emerging as tight machine code: the same branchy bar-by-bar logic, compiled instead of interpreted.

Rung M2 takes the opposite philosophy: instead of contorting the algorithm to fit vectorized primitives, write the naive loops — and compile them. Numba (Lam, Pitrou & Seibert, 2015) JIT-compiles a numeric subset of Python through LLVM into machine code:

import numpy as np
from numba import njit, prange

@njit(cache=True)
def nb_wma(x, period):
    n = x.shape[0]
    out = np.full(n, np.nan)
    wsum = period * (period + 1) / 2.0
    for i in range(period - 1, n):        # the "slow" loop, now machine code
        s = 0.0
        for j in range(period):
            s += x[i - period + 1 + j] * (j + 1)
        out[i] = s / wsum
    return out

@njit(cache=True)
def nb_sweep(close, half, full, sq, p3, p2, pi, fee):
    h  = nb_wma(2.0 * nb_wma(close, half) - nb_wma(close, full), sq)
    a  = 3.0 * nb_wma(close, p3) - nb_wma(close, p2) - nb_wma(close, pi)
    h3 = nb_wma(a, pi)
    # ... the stateful event loop follows here (textually the M0 loop),
    # ending with the return of (total, ntr)

The event loop inside nb_sweep is textually the M0 loop. Branches, continue, state carried in locals — all of it. Under @njit those locals live in registers, the branches are real jump instructions, and the per-iteration cost drops from microseconds of interpreter dispatch to nanoseconds.

1.98 s — 35.3x over pandas, but only about 1.6x over numpy (derived: 3.07/1.98). That modest step is itself instructive: numpy's inner loops were already compiled, so numba's win on the feature math is limited to skipping window materialization and intermediate arrays. The transformative part is elsewhere:

The event loop is free now — and "free" is measured, not rhetorical. M1 spent its cleverness making the trade logic vectorizable. M2 makes that cleverness unnecessary — the naive, auditable, easy-to-modify loop runs at machine speed. Timing the feature stage separately from the trade loop inside this compiled kernel attributes 99.3% of its time to the WMA feature math and just 0.7% to the stateful event loop. You can add a stop-loss tomorrow without a research project — and hold on to that split; it re-decides the GPU argument below.
It unlocks the next two rungs. A compiled, GIL-releasing, allocation-light kernel is the unit of work that parallel orchestration needs. You cannot productively parallelize M0 — twelve copies of slow are still slow, just warmer.

One methodological note: numba compiles on first call, and that compilation (hundreds of milliseconds) must not be inside the timer — the harness warms the JIT on a 500-bar slice before measuring, and cache=True persists compiled kernels across process launches. Benchmarks that "forget" this detail produce numba numbers that are either unfairly bad (cold compile included) or unreproducible.
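A timing helper that follows this discipline fits in a dozen lines. The sketch below is a generic illustration, not the benchmark's harness: `best_of` is a hypothetical name, and the warm-up callable stands in for "call the compiled kernel once on a small slice".

```python
import time

def best_of(fn, *args, repeats=3, warmup=None):
    """Best-of-N wall time for fn(*args); the warm-up runs outside the timer."""
    if warmup is not None:
        warmup()                      # e.g. trigger JIT compilation on a small slice
    best = float("inf")
    for _ in range(repeats):
        t0 = time.perf_counter()
        fn(*args)
        best = min(best, time.perf_counter() - t0)
    return best

warm_calls = []                       # records that the warm-up ran exactly once
elapsed = best_of(sum, range(100_000), warmup=lambda: warm_calls.append("warm"))
```

Taking the minimum rather than the mean is a deliberate choice: it reports what the code can do when the machine is not busy with something else, which is the quantity a ladder of implementations should be compared on.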

Rung M3: prange — the parallelism you already had — 0.32 s, 217.6x

Figure: Eighty independent parameter combos fanned out across twelve CPU cores: performance and efficiency cores pulling unequal window lengths in parallel.

Here is the observation that makes mass parameter search special: the 80 combos are completely independent. No shared state, no ordering, no communication. This is embarrassingly parallel work that rungs M0–M2 were running on one core out of twelve, out of pure habit.

Numba makes the fix nearly syntactic — swap the combo loop's range for prange:

@njit(parallel=True, cache=True)
def nb_sweep_all(close, params, fee):
    N = params.shape[0]
    totals = np.empty(N, dtype=np.float64)
    ntrs = np.empty(N, dtype=np.int64)
    for k in prange(N):                    # threads across combos
        t, ntr = nb_sweep(close, params[k, 0], params[k, 1], params[k, 2],
                          params[k, 3], params[k, 4], params[k, 5], fee)
        totals[k] = t
        ntrs[k] = ntr
    return totals, ntrs

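Because nb_sweep is compiled in nopython mode, its body never touches Python objects, so the threads inside the parallel region do not contend for the GIL, and numba's threading layer fans the iterations across all 12 cores. The read-only close array is shared by all threads at zero cost.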

0.32 s — 217.6x over pandas, 248.9 combos per second. The step over single-threaded M2 is about 6.2x on 12 cores (derived: 1.98/0.32), and the shortfall from "ideal 12x" is worth being honest about rather than hiding: the M2 Max's 12 cores are 8 performance + 4 efficiency cores, so the nominal ceiling was never 12x; the 80 combos have wildly unequal costs (a length-6 HMA is far cheaper than a length-200 one), so threads finish ragged; and each kernel call allocates its intermediate arrays from a shared allocator. Parallel speedups on real machines look like this. Anyone quoting clean Nx-on-N-cores for heterogeneous tasks is measuring something synthetic.

Rung M4: a process pool for the last third — 0.23 s, 297.9x

The final rung replaces threads with processes — same compiled kernel, orchestrated by a ProcessPoolExecutor:

from concurrent.futures import ProcessPoolExecutor

with ProcessPoolExecutor(max_workers=12, initializer=_init_worker,
                         initargs=(close,)) as ex:          # ship data ONCE
    list(ex.map(_warmup_worker, range(12 * 3)))             # JIT-warm every worker
    results = list(ex.map(_run_one_combo, grid, chunksize=1))

0.23 s — 297.9x over pandas, 340.9 combos per second. Read that throughput again: this laptop is now running roughly 340 full 150,000-bar backtests per second, each computing seven weighted moving averages and simulating tens of thousands of stateful trades.

The edge over prange is real but modest — about 1.4x (derived: 0.32/0.23) — and the plausible mechanics are scheduling and memory isolation: with chunksize=1 the pool hands out combos one at a time, so the ragged mix of cheap and expensive windows load-balances dynamically across the asymmetric cores, and each worker process gets its own allocator, sidestepping contention on the per-combo temporaries. We report these as mechanics consistent with the measurement, not as separately proven facts.
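The load-balancing half of that explanation can be illustrated with a toy model. The sketch below compares handing out contiguous chunks up front against handing out one task at a time to whichever worker is free. Everything in it is an assumption made for illustration: the grid is a stand-in, per-combo cost is taken as proportional to window length, the workers are identical, and nothing here describes how numba's threading layer actually schedules `prange`.

```python
import heapq

WORKERS = 12
# Hypothetical stand-in for the grid: 80 lengths spread evenly over [6, 200],
# with per-combo cost assumed proportional to the window length.
costs = [round(6 + k * (200 - 6) / 79) for k in range(80)]

def static_makespan(costs, workers):
    """Contiguous chunks fixed up front; the slowest chunk sets the finish time."""
    base, extra = divmod(len(costs), workers)
    finish, start = [], 0
    for w in range(workers):
        size = base + (1 if w < extra else 0)
        finish.append(sum(costs[start:start + size]))
        start += size
    return max(finish)

def dynamic_makespan(costs, workers):
    """One task at a time to whichever worker frees up first (like chunksize=1)."""
    free_at = [0.0] * workers
    heapq.heapify(free_at)
    for c in costs:
        t = heapq.heappop(free_at)      # earliest-available worker takes the task
        heapq.heappush(free_at, t + c)
    return max(free_at)

ideal = sum(costs) / WORKERS            # perfect balance, unreachable in general
static = static_makespan(costs, WORKERS)
dynamic = dynamic_makespan(costs, WORKERS)
print(f"static / ideal = {static / ideal:.2f}, dynamic / ideal = {dynamic / ideal:.2f}")
```

In this model the fixed chunking leaves the worker holding the longest windows far behind the rest, while one-at-a-time dispatch lands much closer to the ideal. It says nothing about allocator contention or the performance/efficiency core split, so it supports the mechanism as plausible, not as measured.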

Processes are not free, and the harness pays their costs honestly outside the timer where they are one-time costs (worker startup, shipping close to every worker via the initializer, per-worker JIT warm-up) — because in a real search those costs amortize over thousands of combos, not eighty. The honest general guidance: prange is simpler and usually enough; a process pool wins when tasks are chunky, the grid is large, or your per-combo work holds the GIL somewhere numba can't reach.

And with that, the ladder factors into a clean summary. From M0 to M2 — the engine: 35.3x on a single core, from moving iteration out of the interpreter. From M2 to M4 — the orchestration: another 8.4x (derived: 1.98/0.23), from using the cores that were already there. Multiplied: 298x. No new hardware, identical results. And measured from the competent M1 baseline instead of the naive one, the finished engine still stands about 13x higher (derived: 3.07/0.23) — the ladder is not an artifact of picking a slow starting point.
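The factoring is plain arithmetic on the ladder table, reproduced in the sketch below. The ratios are taken from the table's speedup column rather than from the rounded wall times, which would give slightly different values.

```python
# Speedups over the pandas baseline, copied from the ladder table.
speedup = {"M0": 1.0, "M1": 22.7, "M2": 35.3, "M3": 217.6, "M4": 297.9}

engine = speedup["M2"] / speedup["M0"]          # interpreter -> compiled, one core
orchestration = speedup["M4"] / speedup["M2"]   # one core -> all cores
vs_numpy = speedup["M4"] / speedup["M1"]        # against the competent baseline

print(f"engine {engine:.1f}x, orchestration {orchestration:.1f}x, "
      f"product {engine * orchestration:.1f}x, vs numpy {vs_numpy:.1f}x")
```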

Why not a GPU — the honest version

Figure: A GPU sitting idle beside a saturated CPU: batchable moving-average math left on the CPU because a sweep of eighty combos and a quarter of a second is too narrow and too short to pay for the trip.

"Just port it to a GPU" is the most common response to a slow parameter sweep, so this experiment measures the two numbers that conversation should start from — and neither supports the lazy version of either answer.

The roofline model (Williams, Waterman & Patterson, 2009) classifies a kernel by its arithmetic intensity — FLOPs per byte moved. For the WMA feature stack in this sweep, counting $2p$ FLOPs per bar per window of length $p$ against one 8-byte read per bar, the whole 80-combo sweep works out to about 6.2 GFLOP over 576 MB streamed:

$I = \frac{6.21 \times 10^9\ \text{FLOP}}{5.76 \times 10^8\ \text{bytes}} \approx 10.78\ \frac{\text{FLOP}}{\text{byte}}$

(That is the idealized count over the six distinct WMA windows per combo; counting the seven passes as actually executed gives 11.07 FLOP/byte. Same conclusion either way.)
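The byte count and the resulting intensity can be re-derived in a few lines. In the sketch below the FLOP total is taken from the text as given, since it depends on the exact window lengths in the grid, which are not listed here. The byte count assumes one 8-byte read per bar for each of the six distinct windows of each combo, which is consistent with the 576 MB quoted above.

```python
BARS = 150_000
COMBOS = 80
BYTES_PER_READ = 8           # one float64 per bar per window
DISTINCT_WINDOWS = 6         # per combo, idealized count
TOTAL_FLOP = 6.21e9          # sweep total quoted in the text, not recomputed

def wma_flops(period, bars=BARS):
    """FLOPs for one WMA pass under the 2p-per-bar counting convention."""
    return 2 * period * bars

bytes_streamed = BARS * BYTES_PER_READ * DISTINCT_WINDOWS * COMBOS
intensity = TOTAL_FLOP / bytes_streamed      # FLOP per byte

print(f"{bytes_streamed / 1e6:.0f} MB streamed, {intensity:.2f} FLOP/byte")
```

Under the same convention a single length-200 pass costs `wma_flops(200)` FLOPs, sixty million, against 1.2 MB read. Long windows pull the intensity up, which is why the sweep as a whole ends up above 10 FLOP/byte.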

That number matters because of what it rules out: the popular claim that backtest math is "memory-bound, so GPUs can't help" is false here. At ~10.8 FLOP/byte the feature math is decidedly compute-ish — well past the ridge point where typical hardware stops being bandwidth-limited. A GPU absolutely could batch 80 combos × 7 WMA passes into a handful of large kernels and chew through the arithmetic. If the feature stack were the whole problem, the GPU case would be respectable.

The second measured number kills the other lazy answer — the one we would have reached for ourselves. Timing the feature stage separately from the trade loop inside the compiled kernel gives a split of 99.3% features, 0.7% event loop. The tempting argument — "backtests have a stateful, branchy event loop, and that is what blocks the GPU" — is quantitatively wrong here: the CPU spends essentially all of its time in exactly the part a GPU could batch. Recast 80 combos × 7 WMA passes as large batched convolutions and you have a perfectly reasonable tensor workload. So the honest question is not whether the work could go to a GPU — most of it could. The question is whether the trip pays, and for this sweep it does not, for two specific reasons:

1. The exploitable width is 80 combos — and a GPU is a width machine. The one honest axis of parallelism in a parameter sweep is the grid itself: within a combo, the 150,000-bar path is sequential. A GPU wants tens of thousands of independent work items to fill its lanes and hide latency; this sweep offers eighty. Twelve CPU cores already saturate that width — that is literally what rungs M3–M4 measured. For the combo counts where a GPU's width would even start to engage, the CPU ladder is already delivering hundreds of full backtests per second.

2. The whole job is 0.23 seconds. At M4 speed a combo costs about 2.9 ms (derived: 0.23 s / 80). Against that budget, kernel-launch latencies and device synchronization points are not amortizable rounding errors — they are a material fraction of the job. (On this unified-memory Apple machine, host-to-device transfer is a minor concern; on a discrete-GPU CUDA box it joins the bill as well.) The classic GPU win amortizes fixed overheads over enormous batches of work; a sub-second sweep never produces one.

And the event loop? It is the one part that would not batch — serial, branchy, path-dependent, a loop-carried dependency 150,000 bars long that no hardware can parallelize within a combo, with exactly the divergent branches SIMT lanes hate. A GPU port would leave it on the CPU or run it one lane per combo. But at 0.7% of the kernel, it is an Amdahl term too small to decide anything. It is the part that wouldn't go; it is not the reason not to go. (Recall from rung M1 that for feedback-free kernels the loop can even be analytically vectorized — the rewrite you lose the moment the strategy grows a stop.)

One platform footnote for completeness: on this machine (Apple Silicon) the GPU path would be MLX or PyTorch-MPS, not CUDA — cupy and the CUDA ecosystem simply do not apply — and either would require rewriting the hot path in a tensor dialect just to attempt the experiment. That is a real cost with, per the analysis above, no identified payoff for this sweep's shape. The GPU discussion here is analytical, grounded in the measured arithmetic intensity and the measured feature/loop split, and we label it as such: no CUDA run was performed because none was possible on the disclosed hardware.

The summary sentence we would defend in review: nearly all of this work could go to a GPU; this sweep is too narrow and too short for the trip to pay. And read that in both directions — it is not a write-off. The batched "big-matrix" reformulation — recasting the sweep as large tensor operations across thousands of combos at once, or a genuinely feedback-free kernel that batches end to end — is a real and promising direction that deserves a dedicated study, not a dismissal. At 80 combos and 0.23 seconds, it simply hasn't earned the ticket yet. If your workload has that width, the arithmetic changes, and you should redo it, not quote us.

Where the real bottleneck is: engine and orchestration

Figure: The real bottleneck revealed: an hourglass where the engine and the orchestration of thousands of parameter combos choke the flow, not the hardware underneath.

Eighty combos is a demonstration grid. Real parameter search is where these factors stop being academic, because grids grow multiplicatively: four parameters at ten values each is $10^4$ combos; add walk-forward validation with a dozen folds and you are at $1.2 \times 10^5$ full backtests before you have explored anything. This is the curse of dimensionality, and it is why search strategy — Optuna, coordinate descent, Sobol — gets so much attention: smarter search visits fewer points.

But the ladder exposes the other, less discussed half of the equation: the cost per visited point. Extrapolating the measured throughputs linearly (combos are independent, so this is arithmetic, not modeling):

Grid size	At M0 (1.1 combos/s)	At M4 (340.9 combos/s)
10,000 combos	~2.4 hours	~30 seconds
100,000 combos	~24 hours	~5 minutes
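The table is a one-line division per cell. The sketch below reproduces it, taking the M0 throughput from the measured wall time (80 combos in 69.92 s) and the M4 throughput as reported; `sweep_seconds` is a hypothetical helper name.

```python
def sweep_seconds(n_combos, combos_per_s):
    """Linear extrapolation: independent combos, constant throughput."""
    return n_combos / combos_per_s

M0_RATE = 80 / 69.92     # combos per second on the pandas baseline
M4_RATE = 340.9          # combos per second at the top of the ladder

for n in (10_000, 100_000):
    slow_h = sweep_seconds(n, M0_RATE) / 3600
    fast_s = sweep_seconds(n, M4_RATE)
    print(f"{n:>7} combos: {slow_h:5.1f} h at M0, {fast_s:6.1f} s at M4")
```

The extrapolation assumes the per-combo cost mix stays the same as in the 80-combo grid; a grid skewed toward long windows would be slower on every rung.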

The same experiment that is an overnight batch job on the naive engine is an interactive query on the tuned one. That difference compounds in a way wall-clock tables understate: at 5 minutes per sweep you iterate — you re-run with a fixed leak, you add a fold, you widen the grid, you test the idea that came to you at lunch. At 24 hours per sweep, you don't. The engine's speed sets the research loop's tempo, and the research loop's tempo is the actual product.

There is an Amdahl's-law reading of the whole ladder, too:

$S = \frac{1}{(1 - p) + p / s}$

Speeding up a stage that accounts for a fraction $p$ of the runtime by a factor $s$ yields an overall speedup $S$ that stays bounded by everything else you left slow. The ladder respected that ordering: the 35.3x engine gain attacked the term that dominated (interpreted iteration, in the feature stack and the loop alike), and the 8.4x orchestration gain attacked the term that dominated after that (eleven idle cores). The feature/loop split is the same lesson in miniature: we could not have named the GPU argument's real shape without measuring where the time actually went. Profile, then optimize, in that order. The same logic governs the data layer upstream of the engine: our Polars vs pandas benchmarks found the identical pattern (10–3500x on grouped rolling pipelines) for the load-and-transform half of the stack, and the same hybrid conclusion: columnar engines for the pipeline, a compiled kernel for the path-dependent simulation.
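Plugging the measured feature/loop split into Amdahl's formula makes the "too small to decide anything" point concrete. The sketch below is paper arithmetic only: it treats the 99.3% feature share as the accelerated fraction and ignores launch overhead and the width limit discussed earlier, so it gives a ceiling, not a prediction.

```python
def amdahl(p, s):
    """Overall speedup when a fraction p of the runtime is accelerated by factor s."""
    return 1.0 / ((1.0 - p) + p / s)

FEATURE_SHARE = 0.993    # measured share of the compiled kernel spent in WMA math

for s in (2, 10, 100, float("inf")):
    print(f"features {s}x faster -> kernel {amdahl(FEATURE_SHARE, s):.1f}x faster")
```

Even with the feature math made infinitely fast, the serial 0.7% caps the kernel at roughly 143x, so the event loop leaves plenty of headroom on paper. The reasons the trip does not pay for this sweep are the width and the time budget, not that term.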

Two honesty notes to close the loop on generality. First, this experiment is deliberately self-contained and synthetic — seeded data, one kernel, one disclosed machine — so anyone can reproduce the phenomenon deterministically; wall-clock numbers will differ on your hardware, but the equivalence and the ladder's direction will not. Second, the phenomenon is not an artifact of the synthetic setup: our production HMA engine's benchmark (bench_param_sweep.py, run on real exchange data with the full production fee and fill model) shows the same ladder shape, with the numba path landing roughly 100–200x above the naive pandas profile. The self-contained experiment exists so you don't have to take our production numbers on faith.

Takeaways

The ladder is 298x, and it factors: 35.3x engine × 8.4x orchestration. Moving iteration out of the interpreter (pandas → numba) and spreading independent combos across cores (one → twelve) multiplied into a speedup of roughly two and a half orders of magnitude on an unchanged laptop. 69.92 s → 0.23 s; 1.1 → 340.9 combos/s. And it is not a slow-baseline artifact: against the competent vectorized numpy implementation, the finished engine is still ~13x.
Demand equivalence before you admire speed. Every rung here produces identical per-combo PnL and trade counts, gated automatically on all 80 combos (absolute $10^{-6}$ tolerance on PnL, exact on trades). A fast engine that computes something subtly different is not fast — it is wrong at high throughput, and vectorized rewrites are where the wrongness usually sneaks in.
@njit beats clever vectorization for stateful logic. The numpy rung needed a strategy-specific closed form that dies the moment you add a stop-loss. The numba rung compiles the naive, auditable loop — same speed class, none of the fragility, and it is the unit that parallelizes.
The GPU answer is "not for this sweep" — for reasons you should be able to name. The feature math is compute-ish (10.78 FLOP/byte) and it is 99.3% of the compiled kernel, so neither "backtests are memory-bound" nor "the stateful loop dominates" survives measurement. The honest reasons are width and budget: 80 combos of exploitable parallelism that 12 CPU cores already saturate, and a 0.23 s total job that launch and synchronization overhead would eat. The batched big-matrix reformulation at real width remains a promising direction, not a refuted one.
Engine speed is research tempo. At naive-engine throughput, a 100,000-backtest search is a day; at ladder-top throughput it is five minutes. Before buying hardware or renting a cluster, check whether your bottleneck is silicon at all — ours was a lambda inside rolling.apply and eleven idle cores.

The full experiment — all five implementations, the equivalence harness, the roofline computation, and every number in this article regenerable from one deterministic script — is in the companion paper at speed-ladder.marketmaker.cc, with code and data at github.com/suenot/backtest-speed-ladder.

The sweep that took seventy seconds takes a quarter of one. Same trades, same PnL, same laptop. The GPU you were about to requisition can wait; the interpreter loop you were about to ship cannot.
