The IPC Tax: Put the Backtest Engine Behind a Socket and Lose 13%

Part of the "Backtests Without Illusions" series.

📄 This article grew into a research paper. One path-dependent backtest kernel is ported line-for-line from numba to Rust and called across a process/language boundary four ways, with an equivalence gate confirming identical per-combo PnL — plus isolated measurements of the pure IPC latency curve, the serialization tax, and the spawn cost. Read the paper online (interactive version + PDF) at ipc-tax.marketmaker.cc, code and data at github.com/suenot/ipc-tax.

Every backtest engine that gets fast eventually provokes the same conversation. Ours arrived on schedule. The speed ladder had just taken an 80-combo parameter sweep from 69.9 seconds of pandas down to about 2 seconds of single-threaded numba, and the natural next itch was: why stop at a Python JIT? Rewrite the kernel in Rust. Make it a proper engine service — one compiled binary behind a socket, callable from every research script, every language, and the live trader too. One kernel, one truth, no duplicated logic.

And then the counter-argument arrives, also on schedule: the moment you leave the process, IPC eats you. The data must be serialized, shipped across a boundary, deserialized; every call pays syscalls and context switches; your beautiful Rust kernel will spend its life waiting on a pipe. Stay in-process. Everyone knows this.

This article measures the thing everyone knows, and the measurement is more interesting than either side of the argument. The folk belief — "a faster cross-language engine loses to in-process numba because IPC kills you" — turns out to be wrong in general and right only under specific conditions. Crossing the boundary once, in raw bytes, costs about 2 milliseconds on a two-second job: a rounding error. The tax is not in the boundary. It is in how you cross it — and the three ways engine services usually get deployed in the wild (a JSON API, a call per unit of work, a process spawn per call) are each, measurably, a piece of the disaster the folklore predicts.

Here is the whole experiment up front. Everything below is the anatomy of each line.

| Architecture | What crosses the boundary per sweep | Wall time | vs in-process |
| --- | --- | --- | --- |
| in-process numba | nothing: a direct call | 2.010 s | 1.00x |
| Rust server, batched (Unix socket) | one round-trip: the whole series + all 80 param sets | 2.276 s | 1.13x |
| Rust server, batched, `get_unchecked` kernel | same single round-trip; a bounds-check-free kernel variant (see the verdict) | 2.337 s | 1.16x |
| Rust server, chatty (Unix socket) | 80 round-trips: the series re-shipped per combo | 2.383 s | 1.19x |
| Rust spawn (stdin/stdout) | process spawn + one piped request | 2.300 s | 1.14x |

Apple M2 Max, Python 3.14.6, numpy 2.4.3, numba 0.64.0, rustc 1.94.0 (release build, zero external crates). 150,000 bars × 80 combos, 0.09% round-trip fee, seed 42; the close series is 1,200,000 bytes (1.2 MB) on the wire. Median of 10 runs per architecture; min–max spreads stay within ~2%. All five run the same HMA/HMA3 stop-and-reverse sweep, and an equivalence gate confirms that both Rust kernel variants' per-combo (PnL, trade count) results match numba exactly — fingerprint PnL −5165.58 across 57,029 trades, byte-identical to the speed-ladder study's numba kernel on the same seed. We are comparing boundaries, not implementations.

Read the batched row carefully, because it carries the whole thesis. The Rust-over-a-socket architecture is 1.13x slower than in-process numba — 266 ms behind on the full sweep (derived: 2.276 − 2.010). The folk story says those milliseconds are IPC. They are not. About 2 ms of that gap is the boundary — the entire 1.2 MB close series shipped in, results shipped back, measured directly. The other ~264 ms is that our naive Rust kernel simply computes the sweep about 13% slower than the numba kernel (derived: 2.276 s minus ~2 ms of boundary ≈ 2.274 s of Rust compute, vs 2.010 s for numba). Rust-the-language did not lose to Python-the-language; one scalar LLVM-compiled loop lost a codegen race to another — and we could not even pin the loss on the obvious suspect: a bounds-check-free get_unchecked build of the same kernel came out no faster (2.337 s; the verdict section dissects this). The socket had almost nothing to do with any of it.

Hold both halves of that sentence. The boundary is nearly free when crossed correctly — and "rewrite it in Rust" buys you a deployment boundary, not an automatic compute win. Both facts run against popular instinct, and both are in the table.

One kernel, two languages, four boundaries

The workload is deliberately the same one the speed ladder pinned down, so the two studies anchor to one another. The kernel is an HMA/HMA3 cross: a stop-and-reverse system on two Hull-style moving averages, with seven weighted-moving-average passes per parameter combination plus a stateful bar-by-bar event loop that carries a position, books PnL minus a 0.09% round-trip fee on every cross, and reverses. The data is 150,000 bars of seeded synthetic geometric Brownian motion (seed=42); the grid is 80 HMA lengths spread over $[6, 200]$. The in-process reference is the ladder's single-threaded numba rung, re-measured for this study: 1.98 s there, 2.010 s here. Same kernel, same machine, reassuringly boring.

The cross-language engine is a line-for-line port of that numba kernel to Rust — same loops, same NaN handling, same fee arithmetic — compiled in release mode with no external crates, so the whole experiment stays dependency-free and reproducible. It speaks a deliberately minimal binary protocol: one length-prefixed frame each way, everything little-endian.

request:  [u32 body_len][body]
body:     [u8 opcode][u32 n_bars][u32 n_combos]
          [n_bars × f64 close][n_combos × 6 × i64 params]

opcode 0 = sweep : reply = [n_combos × f64 pnl][n_combos × i64 trades]
opcode 1 = echo  : reply = the close array, verbatim

The echo opcode is the study's scalpel: a round-trip of controllable size that computes nothing, so the pure boundary cost can be measured in isolation — serialization, syscalls, socket transit, deserialization, and nothing else.
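To make the reply layout concrete, here is a minimal sketch of decoding a sweep reply on the Python side. It assumes the reply body has already been stripped of its 4-byte length prefix. The function name `parse_sweep_reply` is hypothetical and not taken from the repository; only the byte layout comes from the protocol above.

```python
import struct

def parse_sweep_reply(body: bytes, n_combos: int):
    """Decode [n_combos x f64 pnl][n_combos x i64 trades], little-endian."""
    expected = n_combos * 16  # 8 bytes of PnL + 8 bytes of trade count per combo
    if len(body) != expected:
        raise ValueError(f"reply is {len(body)} bytes, expected {expected}")
    values = struct.unpack(f"<{n_combos}d{n_combos}q", body)
    pnl = list(values[:n_combos])
    trades = list(values[n_combos:])
    return pnl, trades

# Round-trip a synthetic two-combo reply (illustrative values, not measured results).
reply = struct.pack("<2d2q", 1.5, -2.25, 10, 20)
pnl, trades = parse_sweep_reply(reply, 2)
print(pnl, trades)
```

The length check matters more than it looks: with a fixed-width binary layout, a size mismatch is the cheapest available signal that client and server disagree about `n_combos`.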

Five measured architectures — four boundary patterns plus one kernel variant:

in_process — call the numba kernel directly. No boundary. The reference.
rust_batch_unix — a persistent Rust server on a Unix domain socket. One round-trip ships the entire close series plus all 80 parameter sets; Rust computes every combo; one reply comes back. The chunky call.
rust_batch_unchecked — the same batched boundary, but the kernel indexes with get_unchecked (no bounds checks in the hot path). It exists to test a specific hypothesis about the compute gap; the verdict section spends it.
rust_chatty_unix — the same server, but one round-trip per combo, the 1.2 MB series re-shipped every time. The naive RPC-per-unit-of-work architecture.
rust_spawn_stdin — spawn the binary per sweep and pipe the request over stdin. The "shell out to a CLI engine" pattern; pays process creation.

And the equivalence gate, without which none of this would mean anything: after timing, each Rust variant's per-combo (PnL, trade count) vector is compared against numba's, with trade counts required to match exactly and PnL to within an absolute tolerance of $10^{-6}$. The committed run reports all_ok: true for both the safe-indexing and the get_unchecked builds. The first-combo fingerprint (PnL −5165.58 percentage points across 57,029 trades) matches the speed-ladder study's numba kernel digit for digit, which pins both papers to the same kernel on the same seed. Cross-language ports are precisely where silent divergence loves to live: a fee applied before instead of after the percent conversion, a NaN comparison that branches differently, an off-by-one in a window. It is the same species of bug our look-ahead taxonomy showed can manufacture a Sharpe of 15 from noise. A benchmark of two engines that compute different things is not a benchmark; it is two unrelated programs racing.

With equivalence established, every difference in the table above is boundary and compute — nothing else.
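The gate described above can be sketched in a few lines. This is an illustration under stated assumptions, not the repository's implementation: the function name `equivalence_gate` and the list-of-pairs input shape are hypothetical, and the second combo in the demo is made up to show a failing case.

```python
import math

def equivalence_gate(ref, other, pnl_abs_tol=1e-6):
    """Compare per-combo (pnl, trades) pairs; return (all_ok, mismatching indices)."""
    if len(ref) != len(other):
        return False, list(range(max(len(ref), len(other))))
    bad = []
    for k, ((pnl_a, trades_a), (pnl_b, trades_b)) in enumerate(zip(ref, other)):
        trades_ok = trades_a == trades_b  # trade counts must match exactly
        # PnL must agree to an absolute tolerance; a NaN on either side fails the gate.
        pnl_ok = math.isclose(pnl_a, pnl_b, rel_tol=0.0, abs_tol=pnl_abs_tol)
        if not (trades_ok and pnl_ok):
            bad.append(k)
    return not bad, bad

# First pair uses the article's fingerprint; the second pair is invented for the demo.
numba_result = [(-5165.58, 57029), (12.0, 40)]
rust_result = [(-5165.58 + 1e-9, 57029), (12.0, 41)]
print(equivalence_gate(numba_result, rust_result))
```

Returning the failing indices rather than a bare boolean is a deliberate choice in this sketch: when a port diverges, the first question is which parameter combination broke.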

What crossing actually costs: the echo curve

[Figure: the measured cost of a boundary crossing. A latency curve flat at fourteen microseconds for tiny payloads, bending upward only past ten thousand floats, reaching two milliseconds for the full 1.2-megabyte series.]

Start with the scalpel. The echo op round-trips a payload of $n$ floats through the Rust server — Python builds the frame, the server parses all $n$ floats, re-encodes them, and ships them back. Both directions pay serialization, syscalls, and socket transit. Here is the measured curve (medians over 10 runs):

| Payload (floats) | Bytes each way | Round-trip |
| --- | --- | --- |
| 1 | 8 | 14.1 µs |
| 100 | 800 | 16.4 µs |
| 1,000 | 8,000 | 18.1 µs |
| 10,000 | 80,000 | 192.5 µs |
| 100,000 | 800,000 | 1,367.3 µs |
| 150,000 | 1,200,000 | 2,043.4 µs |

Two structural facts live in this table.

First, the floor. A round-trip carrying essentially nothing — 8 bytes — costs 14 µs. That is the irreducible price of making a call at all over this transport: two write syscalls, two read syscalls, kernel socket machinery, scheduler wake-ups. Note how flat the curve is at the left: from 1 float to 1,000 floats the cost barely moves (14.1 → 18.1 µs). Below about 8 KB you are paying for the call, not the bytes. This number — the latency floor — is the single most important constant in the whole study, and we will build the break-even arithmetic on it below.

Second, the slope. Past ~10,000 floats the curve goes bandwidth-bound and roughly linear. The full 1.2 MB series — 2.4 MB moved in total, out and back, including a full parse and re-encode of 150,000 floats on the Rust side — costs 2,043.4 µs. That works out to an effective ~1.2 GB/s through the whole naive stack (derived: 2.4 MB / 2.04 ms) — a Unix domain socket with length-prefixed frames and a byte-by-byte float parser, no zero-copy tricks, no shared memory, nothing clever.

A reasonable model of a single crossing, with both constants measured and $b$ the payload size in bytes each way:

$T_{\text{call}}(b) \;\approx\; \underbrace{14\ \mu\text{s}}_{\text{floor}} \;+\; \underbrace{\frac{2b}{1.2\ \text{GB/s}}}_{\text{payload, both ways}}$

Now put the headline number in context. The full sweep takes 2.010 s in-process. Shipping its entire dataset across the boundary and back costs ~2.0 ms — about 0.1% of the job (derived: 2.0434 ms / 2.010 s). If you cross once, in raw bytes, the boundary is a rounding error. That is the half of the folk belief that dies first: the fear was never about anything this cheap.
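As a sanity check on that two-constant model, the sketch below evaluates it against the measured echo rows. The constants are the article's (14 µs floor, ~1.2 GB/s effective throughput); the function name `t_call_us` is introduced here for illustration.

```python
FLOOR_US = 14.0            # measured tiny-payload round-trip, in microseconds
BANDWIDTH_BPS = 1.2e9      # effective throughput derived in the text, bytes/second

def t_call_us(payload_bytes: int) -> float:
    """Modelled round-trip in µs for a payload of this many bytes each way."""
    return FLOOR_US + 2 * payload_bytes / BANDWIDTH_BPS * 1e6

# (floats, measured round-trip in µs) from the echo table
measured = [(1, 14.1), (100, 16.4), (1_000, 18.1),
            (10_000, 192.5), (100_000, 1367.3), (150_000, 2043.4)]

for n_floats, observed in measured:
    model = t_call_us(8 * n_floats)
    print(f"{n_floats:>7} floats: model {model:8.1f} µs, measured {observed:8.1f} µs")
```

The model tracks the two largest payloads to within about 2% and is rougher around the knee of the curve (1,000 to 10,000 floats), which is what one would expect from a straight line fitted to its endpoints.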

The Rust side of that crossing is about as unglamorous as systems code gets — adapted from engine/src/main.rs:

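use std::io::{Read, Write};

// Read one frame: a 4-byte little-endian length, then that many body bytes.
// Returns None on EOF or any I/O error.
fn read_frame<R: Read>(r: &mut R) -> Option<Vec<u8>> {
    let mut len_buf = [0u8; 4];
    r.read_exact(&mut len_buf).ok()?;
    let len = u32::from_le_bytes(len_buf) as usize;
    let mut body = vec![0u8; len];
    r.read_exact(&mut body).ok()?;
    Some(body)
}

// Write one frame: length prefix, body, then flush. Panics on I/O error.
fn write_frame<W: Write>(w: &mut W, body: &[u8]) {
    w.write_all(&(body.len() as u32).to_le_bytes()).unwrap();
    w.write_all(body).unwrap();
    w.flush().unwrap();
}

// the server is a loop: read frame -> compute -> write frame
// (excerpt from inside main; `listener` and `serve_stream` are not shown here)
for stream in listener.incoming() {
    serve_stream(stream.unwrap());
}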
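The client side of the same framing can be sketched in Python. This is not the repository's client: the helper names are hypothetical, and the demo talks to a stand-in echo peer over `socket.socketpair()` rather than to the Rust engine, so it runs without any server.

```python
import socket
import struct
import threading

def recv_exact(sock: socket.socket, n: int) -> bytes:
    """Read exactly n bytes, looping because recv may return fewer."""
    buf = bytearray()
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("peer closed mid-frame")
        buf += chunk
    return bytes(buf)

def send_frame(sock, body: bytes) -> None:
    # 4-byte little-endian length prefix, then the body
    sock.sendall(struct.pack("<I", len(body)) + body)

def recv_frame(sock) -> bytes:
    (length,) = struct.unpack("<I", recv_exact(sock, 4))
    return recv_exact(sock, length)

# Demo: a stand-in echo peer on a socketpair, not the Rust engine.
client, server = socket.socketpair()

def echo_once():
    send_frame(server, recv_frame(server))

worker = threading.Thread(target=echo_once)
worker.start()
payload = struct.pack("<3d", 1.0, 2.0, 3.0)
send_frame(client, payload)
reply = recv_frame(client)
worker.join()
client.close()
server.close()
print(reply == payload)
```

The `recv_exact` loop is the one piece that cannot be skipped: a stream socket is free to hand back a large frame in several pieces, so a single `recv` call is only safe for payloads small enough to arrive whole.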

One honest scope note before moving on: all boundary numbers in this study are a Unix domain socket on one host. The engine also speaks TCP (with TCP_NODELAY), but we did not measure it; loopback TCP sits somewhat above these floors, and an actual network hop is a different regime entirely — milliseconds of floor, not microseconds. Everything here is therefore the near-best case for crossing a boundary this way. Which makes the taxes measured next all the more damning: they are what you pay on top of that, by choice.

The serialization tax: 1348x for choosing JSON

[Figure: two encodings of the same 150,000-float array side by side. A raw-bytes memcpy measured in microseconds against a JSON text encoding towering three orders of magnitude taller.]

Here is where the folk belief about "IPC overhead" turns out to be a mislabeling. We measured the cost of encoding the same 150,000-float close series three ways — the exact payload every architecture above ships:

| Encoding | Time to encode 1.2 MB of floats | vs raw |
| --- | --- | --- |
| raw bytes (`.tobytes()`) | 49.1 µs | 1.0x |
| pickle | 29.8 µs | 0.6x |
| JSON (`json.dumps(close.tolist())`) | 66,243 µs | 1348x |

The raw path is a memcpy wearing a function call:

import struct
import numpy as np

def build_request(opcode, close, params):
    # body = [u8 opcode][u32 n_bars][u32 n_combos][close as f64][params as i64], little-endian
    body = bytes([opcode]) + struct.pack("<II", len(close), len(params))
    body += close.astype("<f8").tobytes()      # 150,000 floats -> 1.2 MB in 49 µs
    body += np.asarray(params, dtype="<i8").reshape(-1).tobytes()
    return struct.pack("<I", len(body)) + body  # length-prefixed frame

(Pickle lands even slightly cheaper than our raw path because astype pays a dtype-conversion copy even when the dtype already matches; both are memcpy-class and both are rounding errors. The binary family as a whole lives three orders of magnitude below the text family.)

And the text path is what nearly every "let's make the engine a microservice" deployment actually ships:

body = json.dumps({"op": "sweep", "close": close.tolist(), "params": params})

Sixty-six milliseconds. To encode. json.dumps(close.tolist()) boxes every float into a Python object, then renders each one as decimal text — 150,000 heap allocations and 150,000 float-to-string conversions where the raw path did one block copy. And the wire payload inflates too (a float64 costs 8 bytes in binary and roughly two to three times that as decimal text — we did not even charge for the extra transit).

Now scale it the way a real deployment does. That 66 ms is one encode, one side, one call. A JSON service pays encode and decode, on both sides of the boundary, on every call. A single batched call over JSON would burn ~3.3% of the entire sweep's compute budget on client-side encoding alone (derived: 66 ms / 2.010 s). Put JSON under the chatty architecture — one call per combo, the pattern below — and the client-side encoding alone costs 80 × 66 ms = 5.3 s: more than two and a half times the entire useful job (derived), before a single byte moves and before the server parses anything.

This is the actual "IPC tax" most teams have measured in production without knowing it. It was never inter-process communication. It was text serialization of numeric arrays — a self-inflicted 1348x on the boundary's cheapest component. The columnar world learned this lesson years ago, and it is the same one our Polars vs pandas study kept running into from the data-pipeline side: formats like Arrow exist precisely so that array data can cross process and language boundaries as raw columnar bytes, not as text. If your engine service speaks JSON for price arrays, no socket tuning will save you — the protocol is the bottleneck.
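A rough way to reproduce the shape of this comparison without any dependencies is sketched below. It is an assumption-laden stand-in for the article's harness: it uses the standard library's `array.array` instead of a numpy array and a seeded random series instead of the GBM close series, so absolute timings will differ from the table above. The ordering of the binary family against the text family is the part it is meant to show.

```python
import json
import pickle
import random
import time
from array import array

N = 150_000
rng = random.Random(42)
# Synthetic stand-in series (not the article's GBM data): 150,000 float64 values.
close = array("d", (100.0 * (1.0 + rng.random()) for _ in range(N)))

def median_us(fn, runs=5):
    """Median wall time of fn() in microseconds over a few runs."""
    samples = []
    for _ in range(runs):
        t0 = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - t0) * 1e6)
    return sorted(samples)[len(samples) // 2]

encoders = {
    "raw": lambda: close.tobytes(),
    "pickle": lambda: pickle.dumps(close, protocol=pickle.HIGHEST_PROTOCOL),
    "json": lambda: json.dumps(close.tolist()).encode(),
}

sizes = {name: len(fn()) for name, fn in encoders.items()}
times = {name: median_us(fn) for name, fn in encoders.items()}
for name in encoders:
    print(f"{name:>6}: {sizes[name]:>9,} bytes, {times[name]:>10.1f} µs")
```

Printing the encoded size next to the time makes the second cost visible as well: the text encoding is not only slower to produce, it is also larger on the wire.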

Chatty vs chunky: Fowler's law, measured

[Figure: a chunky architecture shipping one large framed payload across the boundary once, beside a chatty architecture making eighty small round-trips that each drag the full dataset along.]

Martin Fowler's First Law of Distributed Object Design — "don't distribute your objects" — comes with a corollary he spelled out in the same breath: if you must cross a boundary, the interface has to be coarse-grained, because a remote call costs orders of magnitude more than a local one. Every distributed-systems veteran nods along. Almost nobody has a number for their own workload. Here is ours.

The chunky and chatty architectures run the same server, same protocol, same data — only the call granularity differs:

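# chunky: one round-trip carries the series and all n parameter sets
srv.call(0, close, params)

# chatty: n round-trips, the full series re-shipped with each single parameter set
[srv.call(0, close, [params[k]]) for k in range(n)]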

Chunky: 2.276 s (1.13x). Chatty: 2.383 s (1.19x), which is 107 ms slower (derived: 2.383 − 2.276). To be precise about what that delta is and is not: the echo curve gives a naive prediction for it. The chatty run ships the full series 79 extra times, and each one-way ship costs roughly half the 2,043 µs full-payload round-trip, for about 81 ms in total. That lands some 25% below the measured 107 ms; the remainder is per-call request building and framing on the Python side, which the echo prediction does not include. Either way it comes to ~1.4 ms per extra crossing (derived: 107 / 79). The replies are negligible at 16 bytes per combo.

Two readings of that 107 ms, and both matter.

The lenient reading: it is only ~4.5% of the wall, not a catastrophe. True — and worth understanding why the folklore's disaster failed to materialize here. Each chatty call still carries 25,130 µs of real compute (one combo's worth — the measured in-process per-combo cost), so the per-call boundary overhead of ~1.4 ms stays an order of magnitude below the per-call work. Chatty architectures are not fatal when each call is genuinely heavy. They become fatal as granularity shrinks — which is the break-even section's whole subject.

The damning reading: this tax was entirely voluntary, and it scales with call count × payload. The chatty pattern re-ships the dataset on every call for one reason only: the service is stateless, so every request must carry all context. That is the default shape of a naive "sweep endpoint" — and of essentially every REST microservice ever sketched on a whiteboard. A stateful server — load the series once, then send 48-byte parameter frames — would put each per-combo call near the tiny-payload end of the echo curve: about 16 µs per call, roughly 1.3 ms for all 80 (derived from the echo floor; analytical, not separately measured). The chatty penalty would not shrink; it would vanish. The lesson is precise: the problem is not making many calls — it is re-shipping state because the protocol pretends every call is the first.

Preload the data. Ship parameters. Cross the boundary with intent, not with the whole world in your suitcase every time.
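The two estimates in this section, the echo-based prediction for the chatty penalty and the analytical figure for a stateful server, are plain arithmetic on the echo table. The sketch below reproduces them. It uses the 100-float echo row (16.4 µs) as the nearest measured stand-in for a 48-byte parameter frame, which is an assumption; the stateful server itself was not built or timed.

```python
FULL_ROUND_TRIP_US = 2043.4   # echo of the full 1.2 MB series, from the echo table
SMALL_ROUND_TRIP_US = 16.4    # echo of 100 floats (800 B), nearest row to a 48-byte frame
N_COMBOS = 80
MEASURED_DELTA_MS = 107.0     # chatty minus chunky wall time (2.383 s - 2.276 s)

# Chatty re-ships the series one way on 79 extra calls; take one way as half a round-trip.
predicted_ms = (N_COMBOS - 1) * (FULL_ROUND_TRIP_US / 2) / 1000
shortfall = 1 - predicted_ms / MEASURED_DELTA_MS

# Stateful server: 80 params-only calls near the small-payload end of the curve.
stateful_ms = N_COMBOS * SMALL_ROUND_TRIP_US / 1000

print(f"echo-based prediction: {predicted_ms:.1f} ms ({shortfall:.0%} below measured)")
print(f"stateful estimate:     {stateful_ms:.2f} ms for all {N_COMBOS} calls")
```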

The spawn cost: renting the engine by the call

[Figure: an engine binary being spawned from scratch for a single request. Process creation, loader, and pipe setup stacked as a fixed toll booth in front of a short stretch of useful work.]

The third deployment pattern is the oldest: no server at all. Spawn the engine binary, pipe one request over stdin, read the reply from stdout, let it die. Every shell scripter's instinct, every "just call the CLI from Python" integration, every hyperparameter framework configured to launch a binary per trial.

Measured: 2.300 s (1.14x), about 24 ms over the persistent-server batch (derived: 2.300 − 2.276). Those 24 milliseconds buy a fork/exec, the dynamic loader, pipe setup, and process teardown. Note that this measurement is close to the floor for the pattern: a small dependency-free native binary, warm in the page cache. Spawning anything with a runtime (a JVM, a Python interpreter with imports) costs far more; we did not measure those here, but the direction is not in doubt.

The structure of this tax is what matters: it is fixed per call, indifferent to how much work the call carries. Amortized over a full 80-combo sweep, 24 ms is about 1% — noise. Respawn per combo and the same constant becomes 80 × ~24 ms ≈ 1.9 s — essentially the entire useful job burned on process creation (derived; analytical). Respawn per bar and the arithmetic does not bear writing down.

Fixed cost, fine granularity: pick one. The pattern that pays a spawn is only sane when the spawn is rare and the payload behind it is enormous — exactly like our one-spawn-per-sweep measurement, and exactly unlike the way per-symbol-subprocess architectures end up being used once the symbol count grows.
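For readers who have not wired this pattern up, the spawn-per-call shape looks roughly like the sketch below. The child here is a Python one-liner that echoes stdin to stdout, used only so the example runs anywhere; it is a stand-in for the engine binary, and because it starts an interpreter its spawn time will sit well above the ~24 ms measured for the native binary.

```python
import subprocess
import sys
import time

# Stand-in child: a Python one-liner that echoes stdin to stdout.
# A real deployment would launch the engine binary here instead.
CHILD = [sys.executable, "-c",
         "import sys; sys.stdout.buffer.write(sys.stdin.buffer.read())"]

def call_by_spawn(request: bytes) -> bytes:
    """Spawn the child, pipe one request in, read one reply, let it exit."""
    done = subprocess.run(CHILD, input=request, capture_output=True, check=True)
    return done.stdout

request = b"\x01" + bytes(1024)
t0 = time.perf_counter()
reply = call_by_spawn(request)
elapsed_ms = (time.perf_counter() - t0) * 1e3
print(f"one spawned call: {elapsed_ms:.1f} ms, echoed {len(reply)} bytes")
```

Timing a single call like this on your own machine gives the fixed constant to multiply by your call count, which is the only number the amortization argument needs.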

The break-even arithmetic: a floor is a hurdle rate

[Figure: break-even arithmetic on a balance. Fourteen microseconds of boundary floor on one side weighed against the compute each call carries, with per-combo calls far above water and per-bar calls drowned.]

Everything measured so far compresses into one design rule, and the rule is arithmetic, not opinion.

Every boundary crossing costs at least the latency floor — 14 µs here, the tiny-payload echo round-trip, and close to the best this transport offers. That floor is a hurdle rate: a call across the boundary is only worth making if the compute it ships clears the hurdle by a comfortable multiple. Define the granularity ratio

$G \;=\; \frac{T_{\text{compute per call}}}{T_{\text{floor}}}$

and the boundary's share of your wall time is roughly $1/(1+G)$ — with payload transit on top if the call also carries data.

Now run the sweep's numbers through it. The measured in-process cost of one combo is 25,130 µs. At per-combo granularity:

$G \;=\; \frac{25{,}130\ \mu\text{s}}{14\ \mu\text{s}} \;\approx\; 1795$

Per-combo calls sit ~1,795x above the floor — the boundary claims well under a tenth of a percent per call. This is why even the chatty architecture only lost 107 ms: at this workload's granularity, every crossing pattern that doesn't re-ship data or speak text is safely amortized. Combo-level, fold-level, sweep-level calls are all deep in the cheap zone.

Now flip to the opposite extreme. This one is an illustrative cross-workload extrapolation — not a variant of our sweep, but a workload shape that genuinely exists in the wild: the engine is consulted per bar. A live-style per-tick engine service; a gRPC-per-bar signal stream; a "strategy server" polled once for every one of 150,000 bars. The useful compute per bar in this kernel is 25,130 µs / 150,000 ≈ 0.17 µs (derived) — each call would carry about 1/84 of its own boundary cost in useful work (derived: the 14.05 µs floor over 0.168 µs of compute). The total is worse than the ratio sounds:

$150{,}000 \ \text{calls} \times 14\ \mu\text{s} \;\approx\; \mathbf{2.1\ s\ of\ pure\ IPC}$

— more than the entire 2.010 s in-process job, spent before the remote engine computes a single number, and it would remain 2.1 s even if the engine on the other side were infinitely fast (derived: 150,000 × 14 µs). No compute advantage survives a granularity that fine. And recall this floor is a Unix socket on one host; make that per-bar call to a service across a network and the floor grows by two to three orders of magnitude, on 150,000 calls.

[Figure: the same-machine boundary floor as an implementation choice. A Python-over-Unix-socket round-trip at fourteen microseconds towering over a shared-memory ring crossing at thirty-nine nanoseconds, three orders of magnitude apart.]

One more honest calibration, because 14 µs is not a law of physics either — it is the price of our transport: a Python client, a kernel socket, syscalls in both directions. A purpose-built same-machine transport goes far lower. ZigBolt — our open-source Zig messaging bus for HFT workloads, benchmarked natively on this same machine — does a shared-memory ring round-trip in about 39 ns mean (one-way p50 of 10/20/30 ns at 64/256/1024-byte messages). That is roughly 360x below our socket floor (derived: 14.05 µs / 39 ns). The comparison is deliberately apples-to-oranges, and we flag it as such: our 14 µs is a Python-client socket round-trip, ZigBolt's 39 ns is native Zig over shared memory, so the gap conflates transport and runtime. Read it not as a race between the two but as the range the same-machine floor can occupy: about three orders of magnitude, chosen by implementation. This is the old Lightweight RPC lesson (Bershad et al., 1990) in modern dress — same-machine crossings are dominated by protocol machinery, and they collapse when the transport is built for the same-machine case. The break-even arithmetic above does not change shape; the hurdle just moves. At a 39 ns floor, even per-bar granularity would clear it (150,000 × 39 ns ≈ 5.9 ms, derived) — which is precisely how HFT systems can afford boundaries that a REST service cannot.

This is the whole break-even story in one sentence: the boundary does not care how fast your engine is; it charges per crossing, so the variables you control are how much work each crossing carries — and what the crossing is made of. Batch per sweep and $G$ is over a hundred thousand. Batch per combo, $G \approx 1795$ — still fine. Call per bar over a socket, $G < 1$ — the architecture is dead before the first optimization, and no rewrite of the engine, in Rust or anything else, can resurrect it.
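The three granularities can be run through the hurdle-rate formula directly. The sketch below uses only constants quoted in the text and ignores payload transit, so it prices the floor alone; the function name `granularity` is introduced here for illustration.

```python
FLOOR_US = 14.0             # measured tiny-payload round-trip
SWEEP_US = 2.010e6          # in-process wall time for the whole sweep
COMBO_US = 25_130.0         # measured in-process cost of one combo
N_BARS = 150_000

def granularity(compute_us: float, floor_us: float = FLOOR_US):
    """Return (G, boundary share of wall time) for one call."""
    g = compute_us / floor_us
    return g, 1.0 / (1.0 + g)

for label, compute_us in [("per sweep", SWEEP_US),
                          ("per combo", COMBO_US),
                          ("per bar", COMBO_US / N_BARS)]:
    g, share = granularity(compute_us)
    print(f"{label:>9}: G = {g:>12,.3f}, boundary share = {share:.2%}")
```

Swapping `FLOOR_US` for the floor of a different transport is the whole exercise for a new deployment: the formula stays the same and only the hurdle moves.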

Where the 1.13x actually lives — and the verdict

[Figure: the 266-millisecond gap dissected. A sliver of two milliseconds labeled as the boundary next to a large slab of measured codegen difference between two scalar compiled kernels, with the folk belief crossed out.]

Time to dissect the headline gap honestly, because it carries the study's most counterintuitive finding.

The batched Rust architecture trails in-process numba by 266 ms (derived: 2.276 − 2.010). The measured boundary components: one full-payload round trip at ~2.0 ms, raw serialization at 49 µs, frame headers at a handful of bytes — call the entire boundary bill ~2 ms. Over 99% of the gap is therefore not the boundary at all. It is compute: stripped of IPC, the Rust server spends ~2.274 s doing the sweep that numba does in 2.010 s — the naive Rust kernel is about 13% slower at raw compute (derived).

That deserves an unflinching paragraph, because "rewrite it in Rust and it'll be faster" is as much folk belief as "IPC will kill you." Both kernels bottom out in LLVM — numba lowers Python bytecode through it, rustc lowers MIR through it — and both most likely run as scalar loops: the WMA's inner sum is a floating-point reduction, which LLVM will not auto-vectorize without the fast-math reassociation license that numba's @njit defaults do not grant and our port does not request. So the ~13% is a measured codegen gap between two scalar LLVM-compiled loops — and rather than assert a cause, we tested the obvious one. The natural suspect is Rust's safe indexing: the hot WMA loop bounds-checks every array access, where numba's @njit compiles with bounds checking off. So we built an equivalence-verified variant of the same kernel on get_unchecked — no bounds checks anywhere in the hot path — and timed it as a fifth architecture. It did not close the gap: 2.337 s (1.16x), marginally slower than the bounds-checked build's 2.276 s. Hypothesis tested, hypothesis rejected. The honest state of knowledge: the ~13% is real and reproducible (medians over 10 runs, spreads within ~2%), and currently unattributed — some difference in allocation behavior, loop structure, or instruction scheduling that only assembly-level profiling would settle. The lesson survives intact: naive Rust is not automatically faster than good numba, and a language boundary purchased on the assumption of a free compute win can arrive with a compute loss attached. A tuned Rust kernel — preallocated buffers, explicit SIMD, threads across combos — could still flip the sign. But that is a compute question, to be settled by profiling and kernel work, and this study's question is the boundary. The boundary's answer: crossed once, in bytes, it costs ~0.1%.
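The decomposition of the headline gap is short enough to check by hand, or with the few lines below. The inputs are the medians from the results table and the rounded ~2 ms boundary bill; nothing here is newly measured.

```python
NUMBA_S = 2.010        # in-process numba, median wall time
RUST_BATCH_S = 2.276   # Rust server, batched, median wall time
BOUNDARY_S = 0.002     # one full-payload round-trip plus raw encoding, rounded

gap_s = RUST_BATCH_S - NUMBA_S
rust_compute_s = RUST_BATCH_S - BOUNDARY_S
compute_share = (gap_s - BOUNDARY_S) / gap_s
slowdown = rust_compute_s / NUMBA_S - 1

print(f"gap: {gap_s * 1e3:.0f} ms, of which compute: {compute_share:.1%}")
print(f"Rust compute: {rust_compute_s:.3f} s, {slowdown:.1%} slower than numba")
```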

So assemble the full verdict, every clause of it measured above.

A cross-language engine service wins when all of these hold:

The compute advantage is real — measured on your kernel, not assumed from the language's reputation. (Ours was −13% until proven otherwise — and the first "obvious" explanation for that deficit died in testing.)
You cross coarsely — one call per sweep or per fold, thousands of multiples above the 14 µs floor, the way the batch architecture's 1.13x total (~0.1% boundary) demonstrates.
You speak binary — length-prefixed raw arrays, Arrow, anything memcpy-class at 49 µs per 1.2 MB; never text at 66,243 µs.
The data is preloaded — a stateful server takes params-only calls at the ~16 µs end of the echo curve instead of re-shipping megabytes.

It loses when deployed the way engine services usually are:

A JSON/REST microservice — pays the 1348x serialization tax on every call, both directions; under chatty granularity that is 5.3 s of encoding on a 2 s job.
RPC per unit of work — per combo it costs 107 ms here and survives only because each call carries 25,130 µs of compute; per bar it is ~2.1 s of pure IPC before any work happens, on a 2.0 s job.
A spawn per call — ~24 ms of fixed cost each time, harmless once per sweep, nearly two seconds when paid per combo.

Which is to say: the architectures that fail are not exotic. JSON REST engine, per-symbol subprocess, gRPC-per-tick — that is a fair census of how "let's factor out the backtest engine" actually gets built. The folk belief is empirically well-founded as a description of common practice and empirically wrong as a law of nature. The boundary was never the problem. The default ways of crossing it are.

One argument for the boundary deserves its own sentence, because it is the reason we ran this study at all. A single compiled kernel behind a well-designed boundary can serve the research sweep and the live trading loop — the same binary, the same arithmetic, bit for bit. Our backtest-live parity study catalogued how research and production engines drift apart when they are two codebases; an engine service is the strongest structural cure for that drift, and this study prices the cure honestly: done right, about 0.1% of wall time and an equivalence gate to prove nothing changed in translation. That trade — a dedicated process boundary in exchange for one-kernel parity — is, on these numbers, a bargain. Done wrong, the same idea ships a 1348x serialization tax to production with your PnL riding on top of it.

Takeaways

The boundary is nearly free; the folk belief fails measurement. Round-tripping the entire 1.2 MB close series through a Unix socket — full parse and re-encode included — costs 2,043.4 µs, about 0.1% of the 2.010 s job (derived). The batched Rust-over-socket architecture lands at 1.13x total, and ~99% of even that gap is not IPC.
"Rewrite it in Rust" is a compute claim — verify it before buying the boundary. Our line-for-line Rust port computes ~13% slower than the numba kernel (derived: 2.274 s vs 2.010 s) — a reproducible codegen gap between two scalar LLVM-compiled loops that remains unattributed: we tested the obvious suspect and rejected it, since an equivalence-verified get_unchecked build with no bounds checks came out no faster (2.337 s vs 2.276 s). Naive Rust is not automatically faster; a tuned kernel may well be — measure, then decide.
The real tax is text. Encoding 150,000 floats as JSON costs 66,243 µs vs 49.1 µs raw — 1348x, paid per direction, per call, on both sides. A chatty JSON deployment burns 5.3 s of encoding on a 2 s job (derived). Speak binary across boundaries: raw frames, Arrow — never json.dumps on a price array.
Chatty vs chunky is measurable, and statelessness is the culprit. Per-combo calls that re-ship the data: 1.19x vs the batch's 1.13x (+107 ms, derived; the echo curve's one-way prediction of ~81 ms lands ~25% below it, the rest being per-call framing). A preloaded stateful server would take the same 80 calls at ~16 µs each — about 1.3 ms total (derived from the echo floor). Ship parameters, not the dataset.
Respect the floor — and know that the floor is a choice. Our Python-over-Unix-socket crossing floors at 14 µs; per-combo granularity clears it ~1,795x over (25,130 µs of compute per call) — safe. A per-bar pattern (an illustrative cross-workload extreme: a live per-tick engine, not this sweep) would pay 150,000 × 14 µs ≈ 2.1 s of pure IPC on a 2.0 s job (derived) — dead on arrival even with an infinitely fast engine. Spawning per call adds a fixed ~24 ms (derived). And a purpose-built shared-memory transport like ZigBolt round-trips in ~39 ns natively on this machine — ~360x below our socket floor (derived; native Zig vs a Python client, so read it as the range the floor can occupy, not a race).
Cross once, in bytes, with the data already there — and the boundary buys you parity for ~0.1%. One kernel serving research and live, gated by an equivalence check (PnL −5165.58, 57,029 trades, identical across languages and across both Rust builds), is the honest case for an engine service. The dishonest cases — JSON, chatty, spawn-per-call — are the ones that gave IPC its reputation.

The full experiment — the Rust engine, the wire protocol, the echo and serialization harnesses, the equivalence gate, and every number in this article regenerable from one deterministic script — is in the companion paper at ipc-tax.marketmaker.cc, with code and data at github.com/suenot/ipc-tax.

The socket was never the problem. Two milliseconds for the whole dataset, round trip — the folklore was off by three orders of magnitude, and in both directions at once: too pessimistic about bytes, too forgiving of text. Cross the boundary like it costs something, and it won't.
