The Kernel IR

The kernel IR is the intermediate representation used for all computation kernels. It is a doubly-linked list of 32-byte OpNodes stored in an arena (the Slab allocator).

Data Structure

pub struct Kernel {
    pub ops: Slab<OpId, OpNode>,
    pub head: OpId,
    pub tail: OpId,
    // ...
}

pub struct OpNode {
    pub prev: OpId,  // u32
    pub next: OpId,  // u32
    pub op: Op,      // 24 bytes (enum + payload)
}

The Slab<OpId, OpNode> is a Vec<OpNode> with a free-list. OpId is a u32 index — random access is O(1).
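
To illustrate the Vec-plus-free-list idea, here is a minimal sketch. SimpleSlab and its methods are hypothetical names chosen for this example and are not the real Slab API; the sketch also uses Option slots to mark freed entries, which the real allocator may well handle differently.

```rust
// Hypothetical sketch of a slab: a Vec whose freed slots are recycled.
// `SimpleSlab`, `push`, `remove`, and `get` are illustrative names only.

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct OpId(pub u32);

pub struct SimpleSlab<T> {
    slots: Vec<Option<T>>, // None marks a freed slot (an assumption of this sketch)
    free: Vec<u32>,        // indices of freed slots, reused before the Vec grows
}

impl<T> SimpleSlab<T> {
    pub fn new() -> Self {
        Self { slots: Vec::new(), free: Vec::new() }
    }

    /// Store a value, reusing a freed slot when one is available.
    pub fn push(&mut self, value: T) -> OpId {
        if let Some(i) = self.free.pop() {
            self.slots[i as usize] = Some(value);
            OpId(i)
        } else {
            self.slots.push(Some(value));
            OpId((self.slots.len() - 1) as u32)
        }
    }

    /// Free a slot and return its value; the index goes onto the free-list.
    pub fn remove(&mut self, id: OpId) -> Option<T> {
        let value = self.slots.get_mut(id.0 as usize)?.take();
        if value.is_some() {
            self.free.push(id.0);
        }
        value
    }

    /// O(1) random access: a bounds check plus an index into the Vec.
    pub fn get(&self, id: OpId) -> Option<&T> {
        self.slots.get(id.0 as usize)?.as_ref()
    }
}
```

In this sketch, removing an entry never shifts its neighbors, so every other OpId stays valid, and the next push reuses the freed index instead of growing the Vec.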

Unfolding

Before any optimization passes run, the kernel IR is unfolded: LoadView, StoreView, and Move ops are converted to direct index arithmetic (Load, Store with computed indices). After unfolding, all ops are fixed-size inline entries in the arena: no Box, no vtables, no per-op indirection.
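
The kind of index arithmetic that replaces a view can be sketched on the host side. The function below is a hypothetical illustration, not the unfolding pass itself: the real pass would emit IR ops that perform this arithmetic inside the kernel, and view_offset, shape, and strides are names assumed for this example.

```rust
/// Map a linear index into a view (row-major over `shape`) to an offset
/// in the underlying buffer, using per-axis `strides`.
/// Hypothetical helper: it computes on the host what unfolded IR would
/// compute with div/mod/mul/add ops.
pub fn view_offset(mut linear: usize, shape: &[usize], strides: &[usize]) -> usize {
    assert_eq!(shape.len(), strides.len());
    let mut offset = 0;
    // Peel coordinates off from the innermost (last) axis outward.
    for (&dim, &stride) in shape.iter().zip(strides).rev() {
        offset += (linear % dim) * stride;
        linear /= dim;
    }
    offset
}
```

For example, a transposed view of a contiguous 2x3 buffer has shape [3, 2] and strides [1, 3]; walking its linear indices 0 through 5 yields buffer offsets 0, 3, 1, 4, 2, 5.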

The IR is in SSA form, except for Loop, If, and Define ops (which can carry mutable state).

Op Variants

Arithmetic

Op::Cast { x: OpId, dtype: DType }
Op::Unary { x: OpId, uop: UOp }
Op::Binary { x: OpId, y: OpId, bop: BOp }
Op::Mad { x: OpId, y: OpId, z: OpId }
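
As a rough illustration of how these ops refer to their operands by id, here is a tiny evaluator over a simplified op set. MiniOp, BOp, and eval are hypothetical names for this sketch, it uses plain usize indices in place of OpId, and it assumes Mad means multiply-add (x * y + z), as the name suggests.

```rust
#[derive(Clone, Copy)]
pub enum BOp {
    Add,
    Mul,
}

// Simplified stand-in for Op; operands are indices of earlier ops.
pub enum MiniOp {
    Const(f32),
    Binary { x: usize, y: usize, bop: BOp },
    Mad { x: usize, y: usize, z: usize }, // assumed to mean x * y + z
}

/// Evaluate ops in order; each op may only reference earlier results,
/// so one value per op is enough (the SSA property).
pub fn eval(ops: &[MiniOp]) -> Vec<f32> {
    let mut vals: Vec<f32> = Vec::with_capacity(ops.len());
    for op in ops {
        let v = match op {
            MiniOp::Const(c) => *c,
            MiniOp::Binary { x, y, bop } => match bop {
                BOp::Add => vals[*x] + vals[*y],
                BOp::Mul => vals[*x] * vals[*y],
            },
            MiniOp::Mad { x, y, z } => vals[*x] * vals[*y] + vals[*z],
        };
        vals.push(v);
    }
    vals
}
```

With constants 2, 3, and 4 at indices 0 to 2, a Mad over (0, 1, 2) evaluates to 2 * 3 + 4 = 10.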

Memory

Op::Define { dtype, scope, ro, len }
Op::Load { src, index, layout }
Op::Store { dst, x, index, layout }
Op::Const(Constant)

Control Flow

Op::Loop { len: Dim }
Op::EndLoop
Op::If { condition: OpId }
Op::EndIf
Op::Barrier { scope }
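
The variant names suggest that regions are delimited by paired markers in the flat op list rather than by nested blocks. Assuming that reading, a pass can recover the nesting with a stack. Marker and match_regions below are hypothetical names for this sketch.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Marker {
    Loop,
    EndLoop,
    If,
    EndIf,
    Other, // any op that does not open or close a region
}

/// Pair each region opener with its closer as (open_index, close_index).
/// Returns None if the markers are unbalanced or mismatched.
pub fn match_regions(ops: &[Marker]) -> Option<Vec<(usize, usize)>> {
    let mut stack: Vec<(usize, Marker)> = Vec::new();
    let mut pairs = Vec::new();
    for (i, &op) in ops.iter().enumerate() {
        match op {
            Marker::Loop | Marker::If => stack.push((i, op)),
            Marker::EndLoop | Marker::EndIf => {
                let (open, kind) = stack.pop()?;
                let expected = if op == Marker::EndLoop {
                    Marker::Loop
                } else {
                    Marker::If
                };
                if kind != expected {
                    return None;
                }
                pairs.push((open, i));
            }
            Marker::Other => {}
        }
    }
    if stack.is_empty() {
        Some(pairs)
    } else {
        None
    }
}
```

Inner regions close first, so for the sequence Loop, Other, If, Other, EndIf, EndLoop the result is the pairs (2, 4) and then (0, 5).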

Indexing

Op::Index { len, scope, axis }

Hardware Accelerators

Op::Wmma { dims, layout, dtype, a, b, c }

Vectorization

Op::Vectorize { ops: Vec<OpId> }
Op::Devectorize { vec: OpId, idx }

View (before unfolding)

Op::Move { x: OpId, mop: Box<MoveOp> }
Op::Reduce { x: OpId, rop, n_axes }

Memory Layouts and Scopes

pub enum MemLayout {
    Scalar,
    Vector(u8),
    Tile { x, y, stride },
}

pub enum Scope {
    Global,
    Local,
    Register,
}

Backend Codegen

Because the IR is already flat and fully lowered by the time it reaches a backend, codegen takes only two steps:

  1. deSSA — resolve SSA references to physical registers/memory
  2. Linear pass — walk the op linked list once, emitting instructions

No further optimizations, no complex lowering.
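
The linear pass can be sketched as a single walk over the next links. Everything here is hypothetical: MiniOp, Node, emit, the NIL sentinel, and the emitted text format are assumptions for illustration, and the sketch skips deSSA by naming each result after its op index.

```rust
const NIL: u32 = u32::MAX; // hypothetical sentinel meaning "no neighbor"

// Simplified stand-in for Op.
pub enum MiniOp {
    Const(i32),
    Add { x: u32, y: u32 },
}

pub struct Node {
    pub prev: u32,
    pub next: u32,
    pub op: MiniOp,
}

/// Walk the list once from `head`, emitting one line per op.
/// List order comes from the `next` links, not from storage order.
pub fn emit(nodes: &[Node], head: u32) -> Vec<String> {
    let mut out = Vec::new();
    let mut id = head;
    while id != NIL {
        let node = &nodes[id as usize];
        let line = match &node.op {
            MiniOp::Const(c) => format!("r{id} = {c};"),
            MiniOp::Add { x, y } => format!("r{id} = r{x} + r{y};"),
        };
        out.push(line);
        id = node.next;
    }
    out
}
```

Because the walk follows links rather than array positions, ops inserted by earlier passes can live anywhere in the arena and still be emitted in program order.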

Debugging

Set ZYX_DEBUG=8 to print the kernel IR:

r18: i32 = def global, len=4
r44: u32 = gidx0    // 0..=0
r19: i32 = r18[r1]  // 0..=3 load