ZigBolt: Why We Built Our Own Aeron in Zig and Hit 20 Nanoseconds Per Message
Lock-free ring buffers, zero-copy codecs, Raft cluster — all in pure Zig, all open source.
If you work in algorithmic trading or market making, you know the price of every microsecond. One extra context switch — and your order arrives second. One JVM GC pause — and the market maker on the other side has already updated the quote. In a world where money is measured in nanoseconds, messaging infrastructure isn't a boring pipe between services — it's a competitive advantage.
We built ZigBolt — a messaging system for high-frequency trading written in Zig. From scratch. No JVM, no garbage collector, no Media Driver, no XML configs. And we got 20 nanoseconds p50 latency on an SPSC ring buffer and 30 nanoseconds on IPC via shared memory.
This article covers why we needed it, how it works inside, and why Zig.
TL;DR
- ZigBolt — open-source (MIT) messaging system for HFT in pure Zig
- 20 ns p50 on SPSC, 30 ns p50 on IPC — faster than Aeron's published numbers
- Zero-copy codec runs at 0 ns (compile-time generation, runtime is just a pointer cast)
- No GC, no JVM, no Media Driver — the library embeds directly into your application
- Raft cluster, archive, sequencer — all included
- FFI bindings for Rust, Python, Go, TypeScript, C — work in whatever language you prefer
The Problem: Why Aeron Is Great But Not Enough
Aeron from Real Logic is the de facto standard for low-latency messaging in capital markets. Dozens of HFT firms use it, it's battle-tested, and it has excellent architecture. But Aeron has a fundamental problem, and its name is JVM.
JVM Safepoints: The Invisible Enemy
Even if you carefully put all data in off-heap memory, even if you disabled GC ergonomics and set GuaranteedSafepointInterval=300000 — the JVM still occasionally stops all threads at a safepoint. This isn't a bug, it's an architectural decision: the JVM needs safepoints for deoptimization, biased locking, and stack walking.
In practice it looks like this: your thread sends messages at p50 = 200 ns, and suddenly p99.9 spikes to 50 µs. For no apparent reason. Because one of the JVM threads decided it was time.
Media Driver: An Extra Hop
Aeron works through a Media Driver — a separate process (or embedded JVM) that routes messages between publisher and subscriber via shared memory. This gives nice isolation but adds at least one extra hop:
Aeron: App → shm → Media Driver → shm → socket → NIC
ZigBolt: App → ring buffer → io_uring → NIC
Each hop means extra nanoseconds, extra cache misses, extra unpredictability.
SBE: A Separate Build Step
Simple Binary Encoding — the standard FIX codec for financial messages. In the Aeron ecosystem, it's a separate Java utility that generates code from XML schemas. A separate dependency, a separate build step, a separate set of problems.
The Solution: ZigBolt
We asked ourselves: what if we took Aeron's best ideas — triple-buffered log, lock-free ring buffers, Raft cluster — and implemented them in a language that:
- Has no runtime overhead (no GC, no safepoints)
- Allows code generation at compile time (comptime)
- Trivially integrates with C libraries (DPDK, io_uring)
- Compiles to a ~100 KB binary
That language is Zig.
Architecture
┌─────────────────────────────────────────────────────────┐
│ Publisher/Subscriber API (typed generic wrappers) │
├─────────────────────────────────────────────────────────┤
│ Transport Layer (channel factory & lifecycle) │
├─────────────────────────────────────────────────────────┤
│ IPC Channel (shared memory) │ UDP Channel (network) │
├─────────────────────────────────────────────────────────┤
│ WireCodec (comptime, zero-copy) │ SBE Encoder/Decoder │
├─────────────────────────────────────────────────────────┤
│ Ring Buffers (SPSC/MPSC) │ LogBuffer (triple-buffered) │
├─────────────────────────────────────────────────────────┤
│ Archive (replay) │ Sequencer (total order) │ Raft (HA) │
└─────────────────────────────────────────────────────────┘
Seven layers, each usable independently. Just need an SPSC ring buffer for IPC between two processes? Take it. Need a full cluster with Raft consensus and archiving? Also there.
Benchmarks: Numbers, Not Words

Real benchmark results over 10 million iterations (Apple Silicon / macOS):
SPSC Ring Buffer
| Message Size | p50 | p99 | p99.9 | Throughput |
|---|---|---|---|---|
| 8 bytes | 20 ns | 30 ns | 120 ns | 42.8M msg/s |
| 32 bytes | 30 ns | 50 ns | 150 ns | 28.5M msg/s |
| 64 bytes | 50 ns | 60 ns | 320 ns | 17.6M msg/s |
| 256 bytes | 30 ns | 50 ns | 50 ns | 29.5M msg/s |
IPC Channel (shared memory)
| Message Size | p50 | p99 | p99.9 | Throughput |
|---|---|---|---|---|
| 64 bytes | 30 ns | 40 ns | 40 ns | 35.7M msg/s |
| 256 bytes | 40 ns | 40 ns | 170 ns | 27.4M msg/s |
| 1024 bytes | 90 ns | 260 ns | 900 ns | 9.9M msg/s |
LogBuffer (Aeron-style triple-buffered)
| Message Size | p50 | p99 | p99.9 | Throughput |
|---|---|---|---|---|
| 32 bytes | 30 ns | 40 ns | 320 ns | 33.6M msg/s |
| 64 bytes | 30 ns | 30 ns | 160 ns | 38.0M msg/s |
| 256 bytes | 30 ns | 40 ns | 60 ns | 31.1M msg/s |
WireCodec (comptime zero-copy)
| Operation | Latency | Throughput |
|---|---|---|
| Encode (32 bytes) | 0 ns | inlined memcpy |
| Decode (32 bytes) | ~0.4 ns | 2.7 billion msg/s |
Yes, you read that right: encoding takes zero nanoseconds. Because WireCodec(T) validates the struct at compile time and turns encode/decode into a plain @memcpy or pointer cast. Runtime overhead = zero.
For comparison: Aeron's published IPC round-trip (RTT) latency is ~250 ns. Our one-way latency is 30 ns, so even doubled to a ~60 ns round trip, we're roughly 4x faster.
How It Works Inside
Lock-Free SPSC: Simplicity as Virtue

A single-producer single-consumer ring buffer is the simplest and fastest lock-free data structure. The writer moves head, the reader moves tail, no CAS needed — acquire/release atomics suffice.
The key trick is cache-line padding. If head and tail sit in the same cache line, every update to one counter invalidates the cache for the other core (false sharing). The fix:
// Head (write position) — on its own cache line
head: std.atomic.Value(usize) align(128) = .init(0),
// 128 bytes padding — guaranteed isolation
_pad0: [128 - @sizeOf(std.atomic.Value(usize))]u8 = @splat(0),
// Tail (read position) — on its own cache line
tail: std.atomic.Value(usize) align(128) = .init(0),
128 bytes, not 64: Apple Silicon uses 128-byte cache lines, and on x86 the adjacent-line prefetcher pulls cache lines in pairs, so 64-byte spacing can still cause false sharing. We play it safe.
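The fields above are only the counters; the full push/pop protocol can be sketched in a few dozen lines. This is a minimal illustrative SPSC over fixed-size `u64` slots (ZigBolt's buffer carries variable-length messages, and these names are ours, not its API), showing why acquire/release atomics suffice without CAS:

```rust
use std::cell::UnsafeCell;
use std::sync::atomic::{AtomicUsize, Ordering};

const CAP: usize = 8; // power of two, so `idx & (CAP - 1)` wraps cheaply

pub struct Spsc {
    head: AtomicUsize, // next write index, advanced only by the producer
    tail: AtomicUsize, // next read index, advanced only by the consumer
    slots: [UnsafeCell<u64>; CAP],
}

// Safe because each slot is written by exactly one thread at a time:
// the producer before publishing head, the consumer after loading head.
unsafe impl Sync for Spsc {}

impl Spsc {
    pub fn new() -> Self {
        Spsc {
            head: AtomicUsize::new(0),
            tail: AtomicUsize::new(0),
            slots: std::array::from_fn(|_| UnsafeCell::new(0)),
        }
    }

    pub fn try_push(&self, v: u64) -> bool {
        let head = self.head.load(Ordering::Relaxed); // we own head
        let tail = self.tail.load(Ordering::Acquire); // see consumer progress
        if head - tail == CAP {
            return false; // full
        }
        unsafe { *self.slots[head & (CAP - 1)].get() = v; }
        self.head.store(head + 1, Ordering::Release); // publish the slot
        true
    }

    pub fn try_pop(&self) -> Option<u64> {
        let tail = self.tail.load(Ordering::Relaxed); // we own tail
        let head = self.head.load(Ordering::Acquire); // see producer writes
        if head == tail {
            return None; // empty
        }
        let v = unsafe { *self.slots[tail & (CAP - 1)].get() };
        self.tail.store(tail + 1, Ordering::Release); // free the slot
        Some(v)
    }
}

fn main() {
    let q = Spsc::new();
    assert!(q.try_push(42));
    assert_eq!(q.try_pop(), Some(42));
    assert_eq!(q.try_pop(), None);
}
```

The indices grow monotonically and are only masked on access, so fullness is simply `head - tail == CAP`; no separate "count" field and no compare-and-swap anywhere on the path.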
WireCodec: Comptime Instead of Code Generation

In the Java/C++ world, binary codecs require a separate step: write a schema, run a code generator, get code, compile. In Zig, all of this happens at compile time:
const TickMsg = packed struct {
symbol_id: u32,
price: i64,
quantity: u32,
side: u8,
_reserved: u56 = 0, // 7 bytes of explicit padding; packed structs take integers, not arrays. Total: 32 bytes.
timestamp: u64,
};
const Codec = WireCodec(TickMsg);
// Encode — just a 32-byte memcpy. Inlines to 1-2 instructions.
Codec.encode(&msg, buf[0..Codec.wire_size]);
// Decode — pointer cast. Zero copies.
const tick = Codec.decode(buf[0..Codec.wire_size]);
The Zig compiler validates at comptime:
- The struct is packed (no padding holes)
- Size is a multiple of 8 bytes (alignment for SIMD)
- All fields are primitive types
If something's wrong — compile error, not a runtime exception at 3 AM in production.
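The same "fail the build, not production" idea exists outside Zig. Here is a rough Rust analogue of those comptime checks, using a hypothetical `QuoteMsg` (not a ZigBolt type): `const` assertions are evaluated by the compiler, so a layout mistake is a compile error:

```rust
// Hypothetical fixed-layout wire message; field names are illustrative.
#[repr(C, packed)]
struct QuoteMsg {
    instrument_id: u32,
    bid: i64,
    ask: i64,
    _reserved: [u8; 4], // explicit padding, no hidden holes
}

// Compile-time guards: fixed 24-byte wire size, multiple of 8 for SIMD.
// Change any field width and the build fails before anything ships.
const _: () = assert!(std::mem::size_of::<QuoteMsg>() == 24);
const _: () = assert!(std::mem::size_of::<QuoteMsg>() % 8 == 0);

fn main() {
    println!("wire size = {} bytes", std::mem::size_of::<QuoteMsg>());
}
```

Zig's comptime goes further (it can generate the encode/decode functions themselves, not just check sizes), but the failure mode is the same: the invalid schema never compiles.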
IPC via Shared Memory
Two processes map the same file in /dev/shm. Publisher writes to the ring buffer, subscriber reads. No sockets, no system calls on the hot path:
// Publisher
const channel = try IpcChannel.create("/market-data", .{
.term_length = 1024 * 1024, // 1 MB
});
channel.publish(&msg_bytes, msg_type_id);
// Subscriber (another process)
const channel = try IpcChannel.open("/market-data", .{
.term_length = 1024 * 1024,
});
const count = channel.poll(handler_fn, 10);
The entire path from publish() to handler_fn invocation in the subscriber — 30 nanoseconds for a 64-byte message.
NAK-Based Reliability for UDP
For network transport, ZigBolt uses receiver-driven retransmission. The receiver tracks gaps in sequence numbers via bitmap and sends NAK (negative acknowledgement) to the sender. Plus AIMD congestion control — TCP-like slow start and congestion avoidance — to avoid flooding the network.
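The gap-detection logic described above can be sketched compactly. This is an illustrative model, not ZigBolt's implementation: it tracks missing sequence numbers in a `BTreeSet` for clarity, where a real receiver would use a fixed bitmap over the in-flight window:

```rust
use std::collections::BTreeSet;

// Receiver-side gap tracker: detects holes in the sequence number space
// and reports which sequences should be NAKed back to the sender.
struct GapTracker {
    next_expected: u64,
    missing: BTreeSet<u64>, // stand-in for a bitmap over the window
}

impl GapTracker {
    fn new() -> Self {
        GapTracker { next_expected: 0, missing: BTreeSet::new() }
    }

    // Process one arriving packet; returns the sequences to NAK.
    fn on_packet(&mut self, seq: u64) -> Vec<u64> {
        if seq >= self.next_expected {
            // Everything between the expected and the arrived seq is a gap.
            let gaps: Vec<u64> = (self.next_expected..seq).collect();
            self.missing.extend(gaps.iter().copied());
            self.next_expected = seq + 1;
            gaps
        } else {
            // An old sequence arriving again is a retransmission filling a hole.
            self.missing.remove(&seq);
            Vec::new()
        }
    }
}

fn main() {
    let mut t = GapTracker::new();
    assert!(t.on_packet(0).is_empty());     // in order, nothing to NAK
    assert_eq!(t.on_packet(3), vec![1, 2]); // 1 and 2 were skipped: NAK them
    assert!(t.on_packet(1).is_empty());     // retransmission fills one hole
    assert_eq!(t.missing.len(), 1);         // only seq 2 still outstanding
}
```

Receiver-driven NAKs keep the sender stateless about individual receivers on the happy path: retransmission work is only done when a gap is actually observed.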
Raft Cluster: When You Need Consistency
For cases where losing a message is unacceptable (e.g., a matching engine), ZigBolt includes full Raft consensus:
- Leader election with configurable timeout (150-300 ms)
- Log replication — the leader replicates every message to followers
- Write-ahead log with CRC32 validation and crash recovery
- Snapshots — so the WAL doesn't grow forever
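The 150-300 ms election timeout above is randomized per node, which is what makes Raft elections converge: followers rarely time out simultaneously, so split votes are rare. A minimal sketch of that idea (a tiny xorshift PRNG stands in for a real RNG; function names are ours, not ZigBolt's):

```rust
// Deterministic toy PRNG; a real node would reseed per election round.
fn xorshift(state: &mut u64) -> u64 {
    *state ^= *state << 13;
    *state ^= *state >> 7;
    *state ^= *state << 17;
    *state
}

// Pick an election timeout uniformly-ish in [150, 300] ms.
// Each node seeds differently, so timeouts are spread across the window.
fn election_timeout_ms(node_id: u64) -> u64 {
    let mut s = node_id.wrapping_mul(0x9E37_79B9_7F4A_7C15) | 1;
    150 + xorshift(&mut s) % 151
}

fn main() {
    for id in 0..5u64 {
        let t = election_timeout_ms(id);
        assert!((150..=300).contains(&t));
        println!("node {id}: timeout {t} ms");
    }
}
```

Whichever follower's timer fires first starts the election and usually wins before the others even notice the leader is gone.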
Archive: Record and Replay
All messages can be recorded to a segmented on-disk archive. Then — replay from any position by time or sequence number. Built-in LZ4-style compression with no external dependencies. Sparse index for fast lookup within segments.
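The sparse-index lookup works like this: the archive stores a checkpoint (sequence number, byte offset) every N messages, replay binary-searches for the last checkpoint at or before the target, then scans forward. A sketch under those assumptions (the tuple layout is illustrative, not ZigBolt's on-disk format):

```rust
// index: sorted (sequence, byte_offset) checkpoints, one per N messages.
// Returns the byte offset to start scanning from, or None if the target
// precedes the first checkpoint in this segment.
fn replay_start(index: &[(u64, u64)], target_seq: u64) -> Option<u64> {
    match index.binary_search_by_key(&target_seq, |&(seq, _)| seq) {
        Ok(i) => Some(index[i].1),      // exact checkpoint hit
        Err(0) => None,                 // before the first checkpoint
        Err(i) => Some(index[i - 1].1), // scan forward from the previous one
    }
}

fn main() {
    // Hypothetical segment: a checkpoint every 1000 messages.
    let idx = [(0, 0), (1000, 65_536), (2000, 131_072)];
    assert_eq!(replay_start(&idx, 1500), Some(65_536));
    assert_eq!(replay_start(&idx, 2000), Some(131_072));
    assert_eq!(replay_start(&idx, 42), Some(0));
}
```

The index stays tiny (one entry per N messages instead of one per message), at the cost of scanning at most N-1 records after the seek, which is a standard space/latency trade-off for append-only logs.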
Total-Order Sequencer
For market making across multiple venues, it's critical that all events have a global order. The sequencer takes N input streams and merges them into one, assigning monotonically increasing sequence numbers. Every participant sees the same sequence of events.
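The core of a total-order sequencer fits in a few lines. This is an illustrative model (names are ours, not ZigBolt's API): events from N venues arrive in any interleaving, and each accepted event is stamped with the next global sequence number, so every downstream consumer replays the identical order:

```rust
struct Sequencer {
    next_seq: u64,
}

impl Sequencer {
    fn new() -> Self {
        Sequencer { next_seq: 0 }
    }

    // Stamp an inbound event; the (seq, venue, payload) triple is what
    // gets published on the single merged output stream.
    fn stamp<'a>(&mut self, venue: &'a str, payload: u64) -> (u64, &'a str, u64) {
        let seq = self.next_seq;
        self.next_seq += 1;
        (seq, venue, payload)
    }
}

fn main() {
    let mut s = Sequencer::new();
    // Two venues interleave; the global order is whatever the sequencer saw.
    assert_eq!(s.stamp("venue_a", 42), (0, "venue_a", 42));
    assert_eq!(s.stamp("venue_b", 7), (1, "venue_b", 7));
    assert_eq!(s.stamp("venue_a", 43), (2, "venue_a", 43));
}
```

The hard part in production is not the counter but making the sequencer a single serialization point without making it a bottleneck, which is exactly where the lock-free MPSC ring buffer from the layer below comes in.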
Why Zig, Not Rust/C/C++?
We chose between four candidates. Here's an honest comparison:
| Criterion | Zig | C/C++ | Rust | Java (Aeron) |
|---|---|---|---|---|
| GC / runtime overhead | None | None | None | JVM safepoints, GC |
| Comptime code generation | Native | Macros/templates | proc macros | None |
| C interop (DPDK, io_uring) | Trivial @cImport | Native | FFI/bindgen | JNI overhead |
| SIMD | @Vector, built-in | Intrinsics | packed_simd (unstable) | Vectorization hints |
| Cross-compilation | Built-in | CMake hell | cargo target | N/A |
| Build time | Seconds | Minutes (C++) | Minutes | Seconds + JVM startup |
| Hidden control flow | None | Exceptions, implicit casts | Panics in unwrap | Exceptions |
Zig gave us a unique combination: C-level performance + safety during development + comptime metaprogramming (codecs, lookup tables, protocol state machines — all generated at compile time) + trivial integration with DPDK, liburing, ef_vi via @cImport.
And a Zig binary weighs ~100 KB. Versus 20+ MB for a JVM-based solution.
Bindings: Work in Your Language
ZigBolt compiles to a shared library with a C-ABI, and we have ready-made bindings for five languages:
TypeScript / Node.js
import { IpcChannel } from "@zigbolt/node";
const channel = IpcChannel.create({
name: "/my-market-data",
termLength: 1024 * 1024,
});
const msg = Buffer.from("BTC/USDT 42000.50", "utf-8");
channel.publish(msg, 1);
Rust
use zigbolt::IpcChannel;
let ch = IpcChannel::create("/my-channel", 64 * 1024).unwrap();
ch.publish(b"hello", 1).unwrap();
let sub = IpcChannel::open("/my-channel", 64 * 1024).unwrap();
sub.poll(|data, msg_type_id| {
println!("got {} bytes, type={}", data.len(), msg_type_id);
}, 10);
Python
from zigbolt import IpcChannel
ch = IpcChannel.create("/market-data", term_length=1024*1024)
ch.publish(b"tick data here", msg_type_id=1)
Plus Go and plain C. The same shared memory channel is accessible from all languages simultaneously — publisher in Zig, subscriber in Python, monitoring in Go. They all read the same mmap region.
SBE Codec: FIX-Compatible Messages
For financial protocols, ZigBolt includes a full SBE (Simple Binary Encoding) codec with compile-time schemas. Built-in message types:
- NewOrderSingle — order submission
- ExecutionReport — execution report
- MarketDataIncrementalRefresh — incremental market data update
- MassQuote — mass quoting
- Heartbeat — connectivity check
- Logon — authentication
No external code generator, no XML. Everything is described with Zig structs and validated at compile time.
Wire Protocol: Aeron Compatibility
ZigBolt implements Aeron-compatible wire protocol flyweights:
- DataHeaderFlyweight — data frames
- StatusMessage — flow control
- NAK — negative acknowledgement
- Setup, RTT, Error — service frames
This means ZigBolt can coexist with existing Aeron infrastructure. Migration doesn't have to be a big bang.
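For a sense of what a flyweight walks over: per Aeron's published protocol specification, the data frame header is 32 bytes, and a flyweight reads fields in place over the raw buffer instead of deserializing into an object. A hedged Rust sketch of that layout (field names are ours; consult the Aeron spec before relying on it):

```rust
// Aeron-style data frame header, 32 bytes, little-endian on the wire.
#[repr(C, packed)]
struct DataHeader {
    frame_length: i32,   // total frame length, including this header
    version: u8,
    flags: u8,           // begin/end-of-fragment flags
    frame_type: u16,     // 0x01 marks a data frame
    term_offset: i32,    // position of this frame within the term
    session_id: i32,
    stream_id: i32,
    term_id: i32,
    reserved_value: i64, // application-reserved trailer
}

// A flyweight only works if the layout is exactly what the wire expects.
const _: () = assert!(std::mem::size_of::<DataHeader>() == 32);

fn main() {
    println!("header size = {} bytes", std::mem::size_of::<DataHeader>());
}
```

Because the struct is a byte-for-byte image of the frame, "parsing" a header is a pointer cast plus field reads, which is the same zero-copy trick WireCodec uses for application messages.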
What's Next
ZigBolt is currently at version 0.2.1. The core is stable, benchmarks are reproducible, bindings work. Coming soon:
- io_uring backend — zero-copy network transport on Linux 6.0+ (IORING_OP_SEND_ZC)
- DPDK / AF_XDP — kernel bypass for when every microsecond counts
- Multi-Raft — sharding by instrument/strategy
- Columnar archive — Apache Arrow/Parquet integration for analytics
- Hugepage support — pre-faulted 2MB/1GB hugepages to minimize TLB misses
Try It
- Website: zigbolt-landing.vercel.app
- Docs: zigbolt-landing.vercel.app/getting-started/introduction/
- Source code: github.com/suenot/zigbolt
- License: MIT
Build from source (zig build), run benchmarks (zig build bench), connect via FFI from any language. If you have Zig 0.15.1 and a couple of minutes — try the ping-pong benchmark and compare with your current solution.
Links:
- ZigBolt Landing: zigbolt-landing.vercel.app
- GitHub: github.com/suenot/zigbolt
- Aeron (for comparison): github.com/real-logic/aeron | our Aeron overview
- Zig language: ziglang.org
- Marketmaker.cc: marketmaker.cc
Citation
@software{soloviov2026zigbolt,
author = {Soloviov, Eugen},
title = {ZigBolt: Why We Built Our Own Aeron in Zig and Hit 20 Nanoseconds Per Message},
year = {2026},
url = {https://marketmaker.cc/en/blog/post/zigbolt-zig-messaging-hft},
version = {0.2.1},
description = {How and why we built an ultra-low-latency messaging system for HFT from scratch in Zig. No JVM, no GC, no surprises.}
}
MarketMaker.cc Team
Quantitative research and strategy