Introduction
This book documents the internals of zyx — a machine learning library and compiler.
Unlike traditional ML frameworks that separate the eager execution graph from the autograd graph, zyx uses a single unified graph for everything. This design eliminates duplication, enables seamless kernel fusion, and keeps the implementation lean — tensors are only 4 bytes and the graph uses ~10 node types.
Who This is For
This book is for developers who want to understand how zyx works under the hood: the architecture decisions, the optimization passes, the backend system, and how pieces fit together.
The Architecture at a Glance
User Code (Tensor API)
│
▼
Tensor Graph ─── Autograd (reverse-mode on same graph)
│
▼
Kernelizer (greedy fusion of graph nodes)
│
▼
Kernel IR (linked-list of ops)
│
▼
Optimization Passes (always-on + autotune)
│
▼
Backend Codegen (C, CUDA, OpenCL, Vulkan, WGPU, HIP)
Every tensor operation builds a graph node. When you call .realize() or .item(), the graph is traversed bottom-up, compatible nodes are fused into kernels, the kernels are optimized, and finally compiled to native code for the target device.
Why This Design
Most deep learning libraries use two separate graphs:
- A compute graph for eager execution
- An autograd graph for backpropagation
Zyx uses one graph for both. This means:
- The autograd system doesn’t need its own graph infrastructure — it reuses the same nodes
- Kernel fusion works across operation boundaries without special handling
- The implementation is debuggable (one graph to inspect, not two)
- Memory overhead is minimal: each graph node is ~16 bytes
The trade-off is that evaluation is lazy — you must call realize() to trigger computation. But this laziness enables optimizations that eager execution cannot: kernel fusion, dead code elimination, and cross-operation constant folding.
Architecture Overview
The zyx pipeline transforms high-level tensor operations into device-specific machine code.
Tensor API ──► Graph ──► Kernelizer ──► Kernel IR ──► Opt Passes ──► Backend Codegen
                                                          │
                                             Autotune (clone + evaluate)  └── deSSA + linear pass
Pipeline Stages
1. Tensor API
The user creates tensors and applies operations:
extern crate zyx;
use zyx::{DType, Tensor, ZyxError};
fn main() -> Result<(), ZyxError> {
let x = Tensor::randn([1024, 1024], DType::F32)?;
let y = x.relu();
let z = y * 2.0;
Ok(())
}
These calls do not compute anything. Each operation appends a node to the graph and returns a lightweight Tensor handle.
2. The Graph
The graph opset was taken from tinygrad, with changes to make it even smaller. This is the minimal set of operations that can express ALL linear algebra operations and ALL PyTorch ops — by stacking these nodes:
| Variant | Meaning |
|---|---|
| Const | A constant value baked into kernels |
| Leaf | A tensor stored on device |
| Expand | Broadcast a dimension |
| Permute | Transpose axes |
| Reshape | Change shape without changing data |
| Pad | Pad with zeros |
| Reduce | Sum, max, etc. |
| Cast | Change dtype |
| Unary | Element-wise: relu, exp, sin, etc. |
| Binary | Element-wise: add, mul, etc. |
Each node is ~16 bytes, plus a 4-byte reference count and slab metadata overhead. Still small enough that 10,000 nodes cost ~200 KB.
The graph is stored in a Slab — a dense array with free-list tracking. TensorId is a u32 index into this slab, making tensor handles 4 bytes.
3. The Kernelizer
When realize() is called, the kernelizer traverses the graph bottom-up and fuses compatible nodes into kernels. The kernelizer uses heuristics to decide where kernel boundaries go — it’s not a simple rule. A reduce node used by multiple downstream nodes does not necessarily force a split. If two downstream nodes are both expand ops, that may force fusion. Element-wise chains will almost always fuse into one kernel.
View operations (reshape, expand, permute, pad) are unfolded into index arithmetic in the kernel, becoming “free” — they don’t create separate operations.
4. The Kernel IR
After unfolding, the DAG is converted into a linear structure — a doubly-linked list of ops stored in an arena (the Slab<OpId, OpNode>). Each OpNode is 32 bytes stored inline in the arena — no Box, no vtables, no indirection.
OpId is a u32 index into the slab — random access is O(1). The IR is SSA, except for loops and Define ops (which can be mutable).
The design goal of the small opset: optimizations are easy to write. If you understand the IR, you can add a new pass in an afternoon.
5. Optimization Passes
Optimization passes work on the linear IR. Kernels are cloned and each variant is evaluated separately — no egraphs. The cost function evaluates thousands of variants per second.
Optimization passes emit IR that is deliberately simple to lower to backend instructions. A backend just does deSSA + a single linear pass over the IR. No complex backend-specific lowering.
The autotune system explores the search space by:
- Starting with the initial kernel
- Applying one optimization variant
- Hashing the kernel to detect duplicates
- Evaluating with the cost function (or launching and timing)
- Repeating by combining optimization sequences
- Selecting the best variant
There is no fixed number of optimization passes zyx aims to have. The design goal was a small opset in the graph and IR so that passes are easy to write. More passes will be added over time.
6. Backend Codegen
Each backend converts the stabilized kernel IR into target code. Since the IR is designed for this, codegen is a straight line: deSSA, then one pass over the ops emitting instructions. No further optimizations, no complex lowering.
Backends are dispatched via enums rather than dyn Backend trait objects, which would require downcasting (ugly in Rust) to reach backend-specific functionality:
pub enum Device {
C(CDevice),
CUDA(CUDADevice),
OpenCL(OpenCLDevice),
Vulkan(VulkanDevice),
WGPU(WGPUDevice),
HIP(HIPDevice),
Dummy(DummyDevice),
// Tenstorrent, etc.
}
Backends are compiled into the library and selected at runtime; WGPU and Tenstorrent sit behind Cargo feature flags.
7. Runtime and Scheduler (Current)
The scheduler picks a device based on free memory and compute capacity. It handles cross-device data transfers and tracks async execution via events.
Note: A new scheduler is under development in search.rs and search2.rs. It will use an e-graph-like budget-guided exhaustive fusion enumeration, including costs for memory movement. The current schedule.rs will be replaced with something much more powerful.
Debugging the Pipeline
| ZYX_DEBUG | Output |
|---|---|
| 1 | Backend selection and hardware info |
| 2 | Performance info during realize |
| 4 | Kernels generated by the kernelizer |
| 8 | Kernel IR (before GPU execution) |
| 16 | Generated backend code (PTX, OpenCL C, etc.) |
| 32 | Autotune exploration |
Key Design Decisions
- One graph for everything: autograd and computation share the same graph. No need to specify which tensors require gradients.
- Inline ops: all ops live in the arena as flat 32-byte entries. No Box, no vtables, no indirection. Passes allocate their own working data (hash maps, vecs) as needed.
- Linear IR: linked list of fixed-size nodes. Optimizations traverse front-to-back or back-to-front.
- Backend codegen is trivial: the hard work is in the IR-level optimization passes.
- Explicit GradientTape: prevents graph node deletion instead of building a separate graph.
The Tensor
The Tensor is the primary user-facing type in zyx. It is designed to be a lightweight handle — only 4 bytes.
pub struct Tensor {
pub(super) id: TensorId, // u32 index into the graph slab
}
Design Choices
Why 4 Bytes?
Most ML frameworks have heavyweight tensor objects. PyTorch’s Tensor is a TensorImpl* with shape, stride, dtype, device, storage, and autograd metadata — easily 100+ bytes. In zyx, all metadata lives in the graph, not the tensor handle. The tensor is just an index.
Reference Counting
Tensors are reference-counted via the global RT (a Mutex<Runtime>):
impl Clone for Tensor {
fn clone(&self) -> Self {
RT.lock().retain(self.id);
Tensor { id: self.id }
}
}
If we used Arc instead, we would still need Mutex for the Runtime — Tensor(id, Arc<Mutex<Runtime>>). The current approach avoids the Arc overhead and keeps Tensor at 4 bytes. Since every tensor operation already locks the runtime to append a graph node, there’s no additional lock contention from reference counting.
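The handle-plus-global-runtime pattern can be sketched in isolation. The block below is a minimal illustration, not zyx’s code: Handle, with_rt, ref_count, and the Runtime stand-in (which only holds reference counts) are hypothetical names, and the standard library Mutex is assumed.

```rust
use std::collections::HashMap;
use std::sync::Mutex;

// Hypothetical stand-in for the runtime: only reference counts are modeled.
struct Runtime {
    rcs: HashMap<u32, u32>,
}

// Global runtime behind a mutex, created lazily on first use.
static RT: Mutex<Option<Runtime>> = Mutex::new(None);

fn with_rt<R>(f: impl FnOnce(&mut Runtime) -> R) -> R {
    let mut guard = RT.lock().unwrap();
    f(guard.get_or_insert_with(|| Runtime { rcs: HashMap::new() }))
}

// A 4-byte handle: just an index, no pointer and no Arc.
struct Handle {
    id: u32,
}

impl Handle {
    fn new(id: u32) -> Self {
        with_rt(|rt| *rt.rcs.entry(id).or_insert(0) += 1);
        Handle { id }
    }
}

impl Clone for Handle {
    fn clone(&self) -> Self {
        // Cloning only bumps the count stored in the runtime.
        with_rt(|rt| *rt.rcs.get_mut(&self.id).unwrap() += 1);
        Handle { id: self.id }
    }
}

impl Drop for Handle {
    fn drop(&mut self) {
        with_rt(|rt| {
            let rc = rt.rcs.get_mut(&self.id).unwrap();
            *rc -= 1;
            if *rc == 0 {
                rt.rcs.remove(&self.id);
            }
        });
    }
}

fn ref_count(id: u32) -> u32 {
    with_rt(|rt| rt.rcs.get(&id).copied().unwrap_or(0))
}

fn main() {
    let a = Handle::new(7);
    let b = a.clone();
    assert_eq!(ref_count(7), 2);
    drop(b);
    assert_eq!(ref_count(7), 1);
    assert_eq!(std::mem::size_of::<Handle>(), 4);
}
```

The handle stays at 4 bytes because the count lives next to the node in the runtime, not inside the handle.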
Lazy Evaluation
Tensor operations don’t compute anything. They build graph nodes:
extern crate zyx;
use zyx::{DType, Tensor, ZyxError};
fn main() -> Result<(), ZyxError> {
let x = Tensor::randn([1024, 1024], DType::F32)?;
let y = x.relu();
let z = y.tanh();
// This triggers the whole pipeline:
Tensor::realize(vec![&z])?;
Ok(())
}
The key insight: since operations just append to the graph, a training loop that builds the same graph structure every iteration produces the same kernels each time. Those kernels hit the kernel cache, so the loop gets the benefit of caching without explicit compilation steps.
Construction Methods
Tensors can be created from:
extern crate zyx;
use zyx::{DType, Tensor, ZyxError};
fn main() -> Result<(), ZyxError> {
let t = Tensor::from([1.0f32, 2.0, 3.0]);
let t = Tensor::randn([1024, 1024], DType::F32)?;
let t = Tensor::uniform([1024, 1024], -1.0f32..1.0)?;
let ones = Tensor::ones([3, 3], DType::F32);
let zeros = Tensor::zeros([3, 3], DType::F32);
Ok(())
}
Tensors can also be created from files on disk, in which case the data is loaded lazily.
The Immutability Rule
Tensors are immutable — there is no in-place mutation:
extern crate zyx;
use zyx::{DType, Tensor, ZyxError};
fn main() -> Result<(), ZyxError> {
let x = Tensor::randn([3, 3], DType::F32)?;
let x_plus_one = &x + 1.0; // new tensor, no mutation
Ok(())
}
This makes autograd simpler (no mutation to track) and eliminates backpropagation errors from in-place modifications.
The Graph
The graph is the heart of zyx. Every operation a user performs builds a node in this graph. The graph is shared between computation and autograd — there is only one.
Data Structure
The graph is stored in the Runtime:
pub struct Runtime {
pub graph: Graph,
// ...
}
pub struct Graph {
pub nodes: Slab<TensorId, (u32, Node)>,
pub gradient_tape: Option<Set<TensorId>>,
pub shapes: Map<TensorId, Box<[Dim]>>,
// ...
}
Slab Allocator
The Slab<TensorId, (u32, Node)> is a dense array with free-list tracking. Insertion is O(1) amortized, and iteration is cache-friendly. Each node is a (reference_count, Node) pair stored inline.
A TensorId is just a u32 index into this slab. This is why Tensor is 4 bytes — it’s an index, not a pointer.
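To make the “dense array with free-list tracking” idea concrete, here is a minimal sketch of such a container. It is an assumption-laden toy: the real Slab is keyed by a typed id and stores (reference_count, Node) pairs, while this version uses a plain u32 index and a generic payload.

```rust
// Hypothetical minimal slab: a dense Vec plus a list of reusable slots.
struct Slab<T> {
    items: Vec<Option<T>>,
    free: Vec<u32>,
}

impl<T> Slab<T> {
    fn new() -> Self {
        Slab { items: Vec::new(), free: Vec::new() }
    }

    // O(1) amortized: reuse a freed slot if one exists, else grow the Vec.
    fn insert(&mut self, value: T) -> u32 {
        if let Some(id) = self.free.pop() {
            self.items[id as usize] = Some(value);
            id
        } else {
            self.items.push(Some(value));
            (self.items.len() - 1) as u32
        }
    }

    // Removing a value puts its slot on the free list.
    fn remove(&mut self, id: u32) -> Option<T> {
        let value = self.items.get_mut(id as usize)?.take();
        if value.is_some() {
            self.free.push(id);
        }
        value
    }

    fn get(&self, id: u32) -> Option<&T> {
        self.items.get(id as usize)?.as_ref()
    }
}

fn main() {
    let mut slab = Slab::new();
    let a = slab.insert("relu");
    let b = slab.insert("exp");
    assert_eq!((a, b), (0, 1));
    assert_eq!(slab.remove(a), Some("relu"));
    // The freed slot is reused, so indices stay dense.
    let c = slab.insert("tanh");
    assert_eq!(c, 0);
    assert_eq!(slab.get(b), Some(&"exp"));
}
```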
Node Types
The graph opset was derived from tinygrad, with changes to make it even smaller. By stacking these types, zyx can express ALL linear algebra operations and ALL PyTorch ops:
pub enum Node {
Const { value: Constant },
Leaf { dtype: DType },
Expand { x: TensorId },
Permute { x: TensorId },
Reshape { x: TensorId },
Pad { x: TensorId },
Reduce { x: TensorId, rop: BOp },
Cast { x: TensorId, dtype: DType },
Unary { x: TensorId, uop: UOp },
Binary { x: TensorId, y: TensorId, bop: BOp },
ToDevice { x: TensorId, device: DeviceId },
Custom(Box<CustomKernel>),
}
Lifecycle with GradientTape
Without a gradient tape, realized graph nodes are immediately released:
realize(x) → compute x → replace x's node with Leaf → release x's inputs
With a gradient tape alive, realized nodes that the tape references are preserved:
realize(x) with tape → compute x → if tape.contains(x), keep node → else replace with Leaf
This is how autograd works on the same graph: the gradient tape prevents node deletion, so when you later call tape.gradient(), it can traverse the graph backward.
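The two lifecycles above differ in a single branch. The sketch below models that branch with a toy two-variant Node and a hypothetical finish_realize function; it is only meant to show the decision, not how zyx releases inputs or buffers.

```rust
use std::collections::HashSet;

// Toy node type: a realized Leaf or a unary op on another node.
#[derive(Debug, Clone, PartialEq)]
enum Node {
    Leaf,
    Unary { x: u32 },
}

// After `id` has been computed: keep the node if a live tape tracks it,
// otherwise collapse it to a Leaf so its inputs can be released.
fn finish_realize(nodes: &mut [Node], id: u32, tape: Option<&HashSet<u32>>) {
    let keep = tape.map_or(false, |t| t.contains(&id));
    if !keep {
        nodes[id as usize] = Node::Leaf;
    }
}

fn main() {
    let graph = vec![Node::Leaf, Node::Unary { x: 0 }];

    // No tape: the realized node becomes a Leaf.
    let mut no_tape = graph.clone();
    finish_realize(&mut no_tape, 1, None);
    assert_eq!(no_tape[1], Node::Leaf);

    // Tape tracking node 1: the node survives for a later backward pass.
    let mut tape = HashSet::new();
    tape.insert(1u32);
    let mut with_tape = graph.clone();
    finish_realize(&mut with_tape, 1, Some(&tape));
    assert_eq!(with_tape[1], Node::Unary { x: 0 });
}
```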
Graph Size
The graph is designed to stay small. With ~16 bytes per node + 4 bytes reference count, a training iteration with 10,000 operations costs ~200 KB. When the tape is dropped at the end of each iteration, the graph shrinks back to baseline.
Debugging the Graph
Inspect the graph at runtime:
Tensor::plot_graph(&[&output], "graph")?;
Or with environment variables:
ZYX_DEBUG=8 cargo run # prints kernel IR during compilation
The Kernelizer
The kernelizer (kernelize.rs) is the bridge between the high-level tensor graph and the low-level kernel IR. It traverses the computation graph and fuses compatible nodes into kernels.
When the Kernelizer Runs
The kernelizer is invoked during realize():
pub fn realize(x: &[&Tensor]) -> Result<(), ZyxError> {
// 1. Lock the runtime
// 2. Identify which tensors need evaluation
// 3. Topological sort based on reference counts
// 4. Call kernelizer to build kernels
// 5. Optimize and compile each kernel
// 6. Execute on device
// 7. Release unused nodes
}
Fusion Logic
The kernelizer processes graph nodes bottom-up in topological order. Each graph node type adds ops to the kernel its input lives in — fusion is the default, splitting happens only when needed.
Per-Node Type Behavior
| Graph Node | Kernel Decision |
|---|---|
| Unary, Cast | Always fuse: add op to the input’s kernel |
| Expand, Permute, Reshape, Pad | Add Move op to the input’s kernel (free after unfolding) |
| Binary | Merge both input kernels into one |
| Reduce | Add reduce op to the input’s kernel |
| Const | Create a new kernel with a constant |
When Splitting Happens
The key splitting decision is in duplicate_or_store(). When a kernel has multiple outputs (a tensor used by >1 downstream), the kernelizer chooses:
- If preceded by a reduce (expensive to recompute) → store the intermediate result to global memory, create a new load kernel for the next consumer
- If NOT preceded by a reduce (cheap to recompute) → duplicate the kernel so each consumer gets its own copy
This is a cost heuristic, not a simple rule: is recomputing cheaper than a global memory store+load?
Stores also trigger automatically when a graph node is requested as the final output (to_eval), creating a natural boundary there.
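The splitting rules described above reduce to a small decision function. The following is a simplified sketch under stated assumptions: split_decision and the Decision enum are hypothetical names, and the three boolean-style inputs stand in for whatever state the real duplicate_or_store() inspects.

```rust
#[derive(Debug, PartialEq)]
enum Decision {
    Keep,      // single consumer: keep fusing
    Duplicate, // cheap to recompute: each consumer gets its own copy
    Store,     // write to global memory and start a new load kernel
}

fn split_decision(n_consumers: u32, preceded_by_reduce: bool, is_requested_output: bool) -> Decision {
    if is_requested_output {
        // Requested outputs are always stored.
        Decision::Store
    } else if n_consumers <= 1 {
        Decision::Keep
    } else if preceded_by_reduce {
        // Recomputing a reduce costs more than a store plus a load.
        Decision::Store
    } else {
        Decision::Duplicate
    }
}

fn main() {
    assert_eq!(split_decision(1, true, false), Decision::Keep);
    assert_eq!(split_decision(2, false, false), Decision::Duplicate);
    assert_eq!(split_decision(2, true, false), Decision::Store);
    assert_eq!(split_decision(1, false, true), Decision::Store);
}
```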
Practical Outcome
Each kernel tends to center around one reduce loop, with all fused element-wise ops grouped before and after it. A chain of element-wise ops always fuses into one kernel. Reduce-heavy graphs may end up with each reduce in its own kernel separated by store/load boundaries.
The Kernelizer Struct
struct Kernelizer<'a> {
must_keep_nodes: Set<TensorId>,
pending_stores: Set<TensorId>,
realized_nodes: Set<TensorId>,
kernels: Slab<KMKernelId, Kernel>,
visited: Map<TensorId, (KMKernelId, OpId)>,
rcs: Map<TensorId, u32>,
graph: &'a Graph,
// ...
}
The visited map tracks which graph nodes have been converted to kernel ops. When a graph node is already in visited, the kernelizer uses the existing kernel result instead of recomputing — this is how shared subgraphs are handled.
Building Kernel Ops
Each graph node type maps to kernel IR operations:
| Graph Node | Kernel IR |
|---|---|
| Const | Op::Const |
| Leaf | Op::Define (global memory) |
| Unary | Op::Unary |
| Binary | Op::Binary or Op::Mad |
| Reduce | Op::Reduce |
| Reshape | Op::Move (view unfolding) |
| Expand | Op::Move (view unfolding) |
| Permute | Op::Move (view unfolding) |
| Pad | Op::Move (view unfolding) |
View operations (reshape, expand, permute, pad) are unfolded into index arithmetic rather than becoming separate ops. This is how they become “free” — the index computation is inlined into the load/store operations.
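One common way to turn view operations into index arithmetic is a shape-and-strides description, sketched below. This is an illustration of the general technique, not zyx’s unfolding code: the View struct and its methods are hypothetical, and only permute and expand are modeled.

```rust
// Hypothetical view description: logical shape plus one stride per axis.
#[derive(Debug, Clone)]
struct View {
    shape: Vec<usize>,
    strides: Vec<usize>,
}

impl View {
    // Row-major view over a contiguous buffer.
    fn contiguous(shape: &[usize]) -> Self {
        let mut strides = vec![1; shape.len()];
        for i in (0..shape.len().saturating_sub(1)).rev() {
            strides[i] = strides[i + 1] * shape[i + 1];
        }
        View { shape: shape.to_vec(), strides }
    }

    // Permute reorders axes; no data moves.
    fn permute(&self, axes: &[usize]) -> Self {
        View {
            shape: axes.iter().map(|&a| self.shape[a]).collect(),
            strides: axes.iter().map(|&a| self.strides[a]).collect(),
        }
    }

    // Expand broadcasts a size-1 axis by giving it stride 0.
    fn expand(&self, axis: usize, len: usize) -> Self {
        assert_eq!(self.shape[axis], 1);
        let mut v = self.clone();
        v.shape[axis] = len;
        v.strides[axis] = 0;
        v
    }

    // Map a flat index in the view to an offset in the underlying buffer.
    fn offset(&self, mut flat: usize) -> usize {
        let mut off = 0;
        for (&dim, &stride) in self.shape.iter().zip(&self.strides).rev() {
            off += (flat % dim) * stride;
            flat /= dim;
        }
        off
    }
}

fn main() {
    // Transposing a 2x3 buffer only changes which offsets are read.
    let t = View::contiguous(&[2, 3]).permute(&[1, 0]);
    let offs: Vec<usize> = (0..6).map(|i| t.offset(i)).collect();
    assert_eq!(offs, [0, 3, 1, 4, 2, 5]);

    // Broadcasting a 1x3 row to 2x3 rereads the same three elements.
    let row = View::contiguous(&[1, 3]).expand(0, 2);
    let offs: Vec<usize> = (0..6).map(|i| row.offset(i)).collect();
    assert_eq!(offs, [0, 1, 2, 0, 1, 2]);
}
```

The load in the generated kernel would use offset(i) directly, so no separate transpose or broadcast kernel is needed.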
Kernel Caching
pub struct KernelCache {
pub cache: Map<KernelId, Kernel>,
}
After a kernel is built and optimized, its hash is computed. If the same kernel was compiled before, the cached compiled program is reused. The cache persists across realize() calls, so repeated graph patterns (like training loop iterations) hit the cache on the second iteration.
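The hash-then-look-up flow can be sketched with the standard library alone. Everything here is a stand-in: ToyKernel is just a list of op names, ToyCache maps a 64-bit hash to a program id, “compiling” is a counter increment, and hash collisions are ignored for brevity.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

// Hypothetical kernel: a list of op names is enough to hash.
#[derive(Hash)]
struct ToyKernel {
    ops: Vec<&'static str>,
}

#[derive(Default)]
struct ToyCache {
    programs: HashMap<u64, usize>, // kernel hash -> program id
    compiles: usize,               // how many times we actually compiled
}

impl ToyCache {
    fn get_or_compile(&mut self, kernel: &ToyKernel) -> usize {
        let mut hasher = DefaultHasher::new();
        kernel.hash(&mut hasher);
        let key = hasher.finish();
        if let Some(&program) = self.programs.get(&key) {
            return program; // cache hit: nothing is recompiled
        }
        self.compiles += 1; // stand-in for optimize + compile
        let program = self.compiles;
        self.programs.insert(key, program);
        program
    }
}

fn main() {
    let mut cache = ToyCache::default();
    // Two training iterations that build the same kernel.
    let step = || ToyKernel { ops: vec!["load", "relu", "store"] };
    let first = cache.get_or_compile(&step());
    let second = cache.get_or_compile(&step());
    assert_eq!(first, second);
    assert_eq!(cache.compiles, 1);
}
```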
The Kernel IR
The kernel IR is the intermediate representation used for all computation kernels. It is a doubly-linked list of 32-byte OpNodes stored in an arena (the Slab allocator).
Data Structure
pub struct Kernel {
pub ops: Slab<OpId, OpNode>,
pub head: OpId,
pub tail: OpId,
// ...
}
pub struct OpNode {
pub prev: OpId, // u32
pub next: OpId, // u32
pub op: Op, // 24 bytes (enum + payload)
}
The Slab<OpId, OpNode> is a Vec<OpNode> with a free-list. OpId is a u32 index — random access is O(1).
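A doubly-linked list stored in an arena lets a pass insert an op in the middle of a kernel without shifting anything. The sketch below shows the relinking with a toy two-variant Op; the NIL sentinel, the plain Vec (no free list), and the method names are assumptions of this example.

```rust
const NIL: u32 = u32::MAX; // sentinel for "no neighbor" (assumed for this sketch)

#[derive(Debug, Clone, Copy, PartialEq)]
enum Op {
    Const(i64),
    Neg(u32),
}

struct OpNode {
    prev: u32,
    next: u32,
    op: Op,
}

struct Kernel {
    ops: Vec<OpNode>,
    head: u32,
    tail: u32,
}

impl Kernel {
    fn new() -> Self {
        Kernel { ops: Vec::new(), head: NIL, tail: NIL }
    }

    fn push_back(&mut self, op: Op) -> u32 {
        let id = self.ops.len() as u32;
        self.ops.push(OpNode { prev: self.tail, next: NIL, op });
        if self.tail == NIL {
            self.head = id;
        } else {
            self.ops[self.tail as usize].next = id;
        }
        self.tail = id;
        id
    }

    // Insert `op` directly before `at` by relinking neighbors.
    fn insert_before(&mut self, at: u32, op: Op) -> u32 {
        let id = self.ops.len() as u32;
        let prev = self.ops[at as usize].prev;
        self.ops.push(OpNode { prev, next: at, op });
        self.ops[at as usize].prev = id;
        if prev == NIL {
            self.head = id;
        } else {
            self.ops[prev as usize].next = id;
        }
        id
    }

    // Execution order: follow `next` links from the head.
    fn order(&self) -> Vec<u32> {
        let mut out = Vec::new();
        let mut cur = self.head;
        while cur != NIL {
            out.push(cur);
            cur = self.ops[cur as usize].next;
        }
        out
    }
}

fn main() {
    let mut k = Kernel::new();
    let a = k.push_back(Op::Const(2));
    let c = k.push_back(Op::Neg(a));
    let b = k.insert_before(c, Op::Const(3));
    // Storage order is [a, c, b]; execution order follows the links.
    assert_eq!(k.order(), vec![a, b, c]);
    assert_eq!(k.ops[b as usize].op, Op::Const(3));
    assert_eq!(k.tail, c);
}
```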
Unfolding
Before any optimization passes run, the kernel IR is unfolded — LoadView, StoreView, and Move ops are converted to direct index arithmetic (Load, Store with computed indices). After unfolding, all ops are fixed-size inline entries in the arena — no Box, no vtables, no per-op indirection.
The IR is in SSA form, except for Loop, If, and Define ops (which can carry mutable state).
Op Variants
Arithmetic
Op::Cast { x: OpId, dtype: DType }
Op::Unary { x: OpId, uop: UOp }
Op::Binary { x: OpId, y: OpId, bop: BOp }
Op::Mad { x: OpId, y: OpId, z: OpId }
Memory
Op::Define { dtype, scope, ro, len }
Op::Load { src, index, layout }
Op::Store { dst, x, index, layout }
Op::Const(Constant)
Control Flow
Op::Loop { len: Dim }
Op::EndLoop
Op::If { condition: OpId }
Op::EndIf
Op::Barrier { scope }
Indexing
Op::Index { len, scope, axis }
Hardware Accelerators
Op::Wmma { dims, layout, dtype, a, b, c }
Vectorization
Op::Vectorize { ops: Vec<OpId> }
Op::Devectorize { vec: OpId, idx }
View (before unfolding)
Op::Move { x: OpId, mop: Box<MoveOp> }
Op::Reduce { x: OpId, rop, n_axes }
Memory Layouts and Scopes
pub enum MemLayout {
Scalar,
Vector(u8),
Tile { x, y, stride },
}
pub enum Scope {
Global,
Local,
Register,
}
Backend Codegen
Because the IR is designed for it, backend codegen is trivial:
- deSSA — resolve SSA references to physical registers/memory
- Linear pass — walk the op linked list once, emitting instructions
No further optimizations, no complex lowering.
Debugging
Set ZYX_DEBUG=8 to print the kernel IR:
r18: i32 = def global, len=4
r44: u32 = gidx0 // 0..=0
r19: i32 = r18[r1] // 0..=3 load
Optimization Passes
Optimization passes operate on the linear kernel IR. They are divided into always-on passes (run before every compilation) and autotuned passes (searched at runtime for the best variant).
The design goal of the small opset is that new passes are easy to write. There is no fixed number zyx aims to have — passes will be added over time as performance opportunities are identified.
Always-On Optimizations
These run in a fixed pipeline before every kernel compilation:
pub fn run_always_on_optimizations(&mut self) {
self.constant_folding();
self.move_constants_to_beginning();
self.loop_invariant_code_motion();
self.common_subexpression_elimination();
self.fold_accs();
self.delete_empty_loops();
self.dead_code_elimination();
}
Constant Folding
Evaluates expressions where all inputs are constants. For example, 2 + 3 becomes 5 and sin(0.0) becomes 0.0.
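As a rough illustration of the idea, the sketch below folds constants in a toy SSA list where operands are indices of earlier ops. The Op enum and the fold_constants function are hypothetical and much smaller than the real IR.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum Op {
    Const(f32),
    Load(u32),         // opaque runtime value, never folded
    Add(usize, usize), // operands are indices of earlier ops
    Sin(usize),
}

// One forward sweep is enough: operands precede their users, so a folded
// result is already a Const by the time a later op looks at it.
fn fold_constants(ops: &mut [Op]) {
    for i in 0..ops.len() {
        let folded = match ops[i] {
            Op::Add(a, b) => match (ops[a], ops[b]) {
                (Op::Const(x), Op::Const(y)) => Some(x + y),
                _ => None,
            },
            Op::Sin(a) => match ops[a] {
                Op::Const(x) => Some(x.sin()),
                _ => None,
            },
            _ => None,
        };
        if let Some(v) = folded {
            ops[i] = Op::Const(v);
        }
    }
}

fn main() {
    let mut ops = vec![
        Op::Const(2.0),
        Op::Const(3.0),
        Op::Add(0, 1), // 2 + 3
        Op::Const(0.0),
        Op::Sin(3),    // sin(0.0)
        Op::Load(0),
        Op::Add(2, 5), // depends on a load, stays as is
    ];
    fold_constants(&mut ops);
    assert_eq!(ops[2], Op::Const(5.0));
    assert_eq!(ops[4], Op::Const(0.0));
    assert_eq!(ops[6], Op::Add(2, 5));
}
```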
Motion of Constants to Beginning
Moves all constant definitions to the start of the kernel. This creates better fusion opportunities and simplifies register allocation.
Loop-Invariant Code Motion (LICM)
Hoists operations that produce the same value on every iteration to before the loop.
Common Subexpression Elimination (CSE)
Reuses results of identical subexpressions by detecting them through hashing.
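Hash-based CSE can be sketched as a single pass that maps every op to the first op computing the same expression. The toy Op enum and cse_canonical_ids below are hypothetical; in particular, treating two identical loads as equal is only safe here because the toy IR has no stores.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
enum Op {
    Load(u32),
    Add(usize, usize), // operands are indices of earlier ops
    Mul(usize, usize),
}

// For each op, return the index of the canonical op with the same value.
fn cse_canonical_ids(ops: &[Op]) -> Vec<usize> {
    let mut seen: HashMap<Op, usize> = HashMap::new();
    let mut canon = Vec::with_capacity(ops.len());
    for (i, &op) in ops.iter().enumerate() {
        // Rewrite operands to canonical ids first, so expressions built
        // on top of duplicates also hash equally.
        let key = match op {
            Op::Add(a, b) => Op::Add(canon[a], canon[b]),
            Op::Mul(a, b) => Op::Mul(canon[a], canon[b]),
            other => other,
        };
        let id = *seen.entry(key).or_insert(i);
        canon.push(id);
    }
    canon
}

fn main() {
    let ops = [
        Op::Load(0),
        Op::Load(1),
        Op::Add(0, 1),
        Op::Add(0, 1), // duplicate of op 2
        Op::Mul(2, 3), // becomes Mul(2, 2) after canonicalization
    ];
    assert_eq!(cse_canonical_ids(&ops), vec![0, 1, 2, 2, 4]);
}
```

Ops whose canonical id differs from their own index are redundant and can be left for dead code elimination once their users are rewired.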
Accumulator Folding
Simplifies accumulator update patterns for more efficient reduction code generation.
Delete Empty Loops
Removes loops with zero iterations.
Dead Code Elimination
Removes ops whose results are never used. This is always the final pass — it ensures backends never receive unreferenced ops.
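A backward liveness sweep is one simple way to implement this kind of pass. The sketch below uses a toy IR where Store is the only op with a side effect; live_ops and the Op enum are hypothetical names for illustration.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum Op {
    Const(i64),
    Add(usize, usize), // operands are indices of earlier ops
    Store(usize),      // side effect, so always live
}

// Returns one liveness flag per op; ops flagged false can be deleted.
fn live_ops(ops: &[Op]) -> Vec<bool> {
    let mut live = vec![false; ops.len()];
    // Walk backward: users are visited before the ops they depend on.
    for i in (0..ops.len()).rev() {
        if let Op::Store(_) = ops[i] {
            live[i] = true;
        }
        if live[i] {
            match ops[i] {
                Op::Add(a, b) => {
                    live[a] = true;
                    live[b] = true;
                }
                Op::Store(x) => live[x] = true,
                Op::Const(_) => {}
            }
        }
    }
    live
}

fn main() {
    let ops = [
        Op::Const(1),
        Op::Const(2),
        Op::Add(0, 1),
        Op::Const(9),  // only feeds the dead Add below
        Op::Add(3, 3), // result never used
        Op::Store(2),
    ];
    assert_eq!(live_ops(&ops), vec![true, true, true, false, false, true]);
}
```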
Autotuned Passes
The autotune system clones the kernel, applies optimization variants, and evaluates each separately. No egraphs — just clone, transform, hash, evaluate. The cost function can evaluate thousands of variants per second.
const AVAILABLE_OPTIMIZATIONS: [OptConfigFn; 6] = [
Kernel::opt_reassociate_commutative,
Kernel::opt_split_global_to_local,
Kernel::opt_upcast,
Kernel::opt_register_blocking,
Kernel::opt_tiled_reduce,
Kernel::opt_split_loop,
];
Reassociation
Reorders commutative operations to create more fusion opportunities.
Split Global to Local
Adjusts block and thread dimensions for better memory access patterns. For example, a kernel with block_dim = 1024, thread_dim = 1 becomes block_dim = 32, thread_dim = 32. This enables coalesced memory access on GPU.
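The 1024 × 1 to 32 × 32 example is a pure re-indexing: every work item keeps its original flat index, computed from the block and thread ids. A small sketch of that arithmetic, with hypothetical function names and a divisibility requirement that this example assumes for simplicity:

```rust
// Split a flat launch of `total` work items into (block_dim, thread_dim).
fn split_launch(total: usize, thread_dim: usize) -> Option<(usize, usize)> {
    if thread_dim == 0 || total % thread_dim != 0 {
        return None; // this sketch only handles exact splits
    }
    Some((total / thread_dim, thread_dim))
}

// Recover the original flat index from a (block, thread) pair.
fn global_index(block: usize, thread: usize, thread_dim: usize) -> usize {
    block * thread_dim + thread
}

fn main() {
    let (block_dim, thread_dim) = split_launch(1024, 32).unwrap();
    assert_eq!((block_dim, thread_dim), (32, 32));

    // Every original index is covered exactly once, in order.
    let mut indices = Vec::new();
    for block in 0..block_dim {
        for thread in 0..thread_dim {
            indices.push(global_index(block, thread, thread_dim));
        }
    }
    assert_eq!(indices, (0..1024).collect::<Vec<usize>>());
}
```

Neighboring threads in a block map to neighboring indices, which is the property coalesced access relies on.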
Upcast Vectorization
Expands scalar operations to vector operations (e.g., 4-wide SIMD on CPU, wider on GPU).
Register Blocking
Unrolls tree reductions and coarsens global threads so each thread processes multiple elements, increasing computational intensity and register reuse.
Tiled Reduction
Implements multi-stage reduction: threads reduce into registers, workgroups reduce into local memory, then global atomics.
Loop Splitting
Splits large loops into chunks for better register pressure and instruction-level parallelism.
How Autotuning Searches
- Start with the initial kernel and run always-on optimizations
- Apply ONE optimization variant and run always-on optimizations
- Hash the kernel — skip if already visited
- Evaluate with cost function (or launch and time)
- Repeat by combining with existing optimization sequences
- Select the best variant based on actual timing or cost estimate
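The search steps above can be sketched as a loop over cloned kernels with a visited set. This is a toy under loud assumptions: the Kernel has just two tunable fields, the two optimizations and the cost model are invented for the example, a HashSet stands in for kernel hashing, and the always-on passes are omitted.

```rust
use std::collections::HashSet;

// Toy kernel with two tunable knobs (hypothetical).
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
struct Kernel {
    tile: u32,
    unroll: u32,
}

// An optimization returns a transformed copy, or None if not applicable.
type Opt = fn(&Kernel) -> Option<Kernel>;

fn double_tile(k: &Kernel) -> Option<Kernel> {
    if k.tile < 64 { Some(Kernel { tile: k.tile * 2, ..*k }) } else { None }
}

fn double_unroll(k: &Kernel) -> Option<Kernel> {
    if k.unroll < 8 { Some(Kernel { unroll: k.unroll * 2, ..*k }) } else { None }
}

// Invented cost model: cheapest at tile = 32, unroll = 4.
fn cost(k: &Kernel) -> u32 {
    k.tile.abs_diff(32) + 4 * k.unroll.abs_diff(4)
}

fn autotune(initial: Kernel, opts: &[Opt], budget: usize) -> Kernel {
    let mut visited: HashSet<Kernel> = HashSet::new();
    visited.insert(initial);
    let mut frontier = vec![initial];
    let mut best = initial;
    while let Some(k) = frontier.pop() {
        for opt in opts {
            if visited.len() >= budget {
                return best;
            }
            // Apply ONE optimization to a copy of the kernel.
            if let Some(variant) = opt(&k) {
                // Skip variants that were already evaluated.
                if !visited.insert(variant) {
                    continue;
                }
                if cost(&variant) < cost(&best) {
                    best = variant;
                }
                // Keep it so later steps can combine it with other opts.
                frontier.push(variant);
            }
        }
    }
    best
}

fn main() {
    let opts: [Opt; 2] = [double_tile, double_unroll];
    let start = Kernel { tile: 1, unroll: 1 };
    let best = autotune(start, &opts, 1000);
    assert_eq!(best, Kernel { tile: 32, unroll: 4 });
}
```

A real search would evaluate variants with the cost function or by timing a launch, as described above; the budget parameter here simply caps how many distinct variants are visited.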
Correctness Guarantee
No optimization is required for correctness. All tests must pass no matter which sequence of optimizations (including the empty sequence) is applied.
If an optimization breaks a test, the optimization is buggy — not the sequence ordering. The verify pass (verify.rs) checks internal IR consistency in debug mode.
Backend System
Zyx supports multiple hardware backends through an enum dispatch system. Backends are compiled into the library (WGPU and Tenstorrent only when their Cargo features are enabled) and selected at runtime.
Enum Dispatch
Backends use enums instead of trait objects (dyn Backend). Trait objects would require downcasting to access backend-specific functionality, which is ugly in Rust.
pub enum Device {
C(CDevice),
CUDA(CUDADevice),
OpenCL(OpenCLDevice),
Vulkan(VulkanDevice),
WGPU(WGPUDevice),
HIP(HIPDevice),
Dummy(DummyDevice),
}
pub enum MemoryPool {
Host(HostMemoryPool),
Disk(DiskMemoryPool),
C(CMemoryPool),
CUDA(CUDAMemoryPool),
OpenCL(OpenCLMemoryPool),
Vulkan(VulkanMemoryPool),
WGPU(WGPUMemoryPool),
HIP(HIPMemoryPool),
Dummy(DummyMemoryPool),
}
Each method matches on the variant and delegates:
impl Device {
pub fn compile(&self, kernel: &Kernel, debug: DebugMask) -> Result<ProgramId, ZyxError> {
match self {
Device::C(dev) => dev.compile(kernel, debug),
Device::CUDA(dev) => dev.compile(kernel, debug),
// ...
}
}
}
Backend Codegen is Trivial
The optimization passes do the hard work. Backend codegen is:
- deSSA — resolve SSA references to physical registers or memory locations
- Linear pass — walk the op linked list once, emitting target instructions
No further optimizations, no complex backend-specific lowering. The IR emits directly to the target language.
Initialization
Backends are initialized at startup via initialize_backends():
pub fn initialize_backends(config, memory_pools, devices, debug) {
host::initialize_pool(memory_pools, debug);
disk::initialize_pool(memory_pools, debug);
dummy::initialize_device(&config.dummy, ...);
c::initialize_device(&config.c, ...);
cuda::initialize_device(&config.cuda, ...);
opencl::initialize_device(&config.opencl, ...);
hip::initialize_device(&config.hip, ...);
vulkan::initialize_device(&config.vulkan, ...);
wgpu::initialize_device(&config.wgpu, ...);
#[cfg(feature = "tenstorrent")]
tenstorrent::initialize_device(&config.tenstorrent, ...);
}
Each backend tries to initialize. Failure (missing driver, no hardware) causes it to be skipped silently. If all backends fail, the program exits with an error.
Current Backends
| Backend | Source | Target | Runtime |
|---|---|---|---|
| C | c.rs | C99 (compiled to .so) | Clang/GCC |
| CUDA | cuda.rs | CUDA C (compiled to SASS) | CUDA driver via libloading |
| HIP | hip.rs | HIP | ROCm via libloading |
| OpenCL | opencl.rs | OpenCL C | OpenCL runtime via libloading |
| Vulkan | vulkan.rs | SPIR-V | Vulkan via ash crate |
| WGPU | wgpu.rs | SPIR-V | WGPU (feature: wgpu) |
| Dummy | dummy.rs | — | No hardware needed (fake device) |
All backends except WGPU and Tenstorrent are compiled in by default. WGPU requires --features wgpu. Tenstorrent requires --features tenstorrent.
Device Configuration in Config File
Each backend can be enabled/disabled and configured:
{
"c": { "enabled": true },
"cuda": { "device_ids": [0] },
"opencl": { "platform_ids": [] },
"dummy": { "enabled": false }
}
If a section is missing or the config file doesn’t exist, defaults are used (most backends enabled).
Device Selection
The scheduler picks a device at realize time:
- If DeviceId::AUTO, sort devices by free compute capacity (descending)
- If a specific device is requested, try it first
- Pick the first device with enough free memory for all required tensors
- If no device has enough memory, return an allocation error
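The selection steps above can be sketched as a sort followed by a first-fit search. The DeviceInfo struct, pick_device, and the use of Option for the requested device (None playing the role of DeviceId::AUTO) are assumptions of this example; it also assumes that a requested device without enough memory falls back to the remaining devices, which the text does not state.

```rust
#[derive(Debug, Clone)]
struct DeviceInfo {
    id: u32,
    free_compute: u64,
    free_memory: u64,
}

fn pick_device(
    devices: &[DeviceInfo],
    requested: Option<u32>, // None stands in for DeviceId::AUTO
    required_memory: u64,
) -> Result<u32, String> {
    let mut order: Vec<&DeviceInfo> = devices.iter().collect();
    // Highest free compute first; the sort is stable, so ties keep input order.
    order.sort_by(|a, b| b.free_compute.cmp(&a.free_compute));
    if let Some(id) = requested {
        // Move the requested device to the front (false sorts before true).
        order.sort_by_key(|d| d.id != id);
    }
    order
        .into_iter()
        .find(|d| d.free_memory >= required_memory)
        .map(|d| d.id)
        .ok_or_else(|| format!("no device has {} bytes free", required_memory))
}

fn main() {
    let devices = vec![
        DeviceInfo { id: 0, free_compute: 10, free_memory: 100 },
        DeviceInfo { id: 1, free_compute: 50, free_memory: 10 },
        DeviceInfo { id: 2, free_compute: 30, free_memory: 500 },
    ];
    // AUTO: device 1 is fastest but too small, so device 2 wins.
    assert_eq!(pick_device(&devices, None, 50), Ok(2));
    // An explicit request is tried first.
    assert_eq!(pick_device(&devices, Some(0), 50), Ok(0));
    // Nothing fits: allocation error.
    assert!(pick_device(&devices, None, 1000).is_err());
}
```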
Autograd
Zyx implements automatic differentiation through an explicit GradientTape. Unlike other frameworks, there is no separate autograd graph — it uses the same graph as computation.
The Key Insight
In most frameworks, autograd requires a separate graph because the eager execution engine discards intermediate results. Zyx’s lazy graph keeps all nodes alive until realize() is called, and the gradient tape simply prevents their deletion.
GradientTape API
extern crate zyx;
use zyx::{DType, GradientTape, Tensor, ZyxError};
fn main() -> Result<(), ZyxError> {
let tape = GradientTape::new();
let x = Tensor::randn([2, 3], DType::F32)?;
let y = Tensor::randn([2, 3], DType::F32)?;
let z = x.relu() * y.tanh();
let grads = tape.gradient(&z, vec![&x, &y]);
// grads[0] = gradient of z w.r.t. x
// grads[1] = gradient of z w.r.t. y
Ok(())
}
The gradient() method consumes the tape; use gradient_persistent() to keep it alive for higher-order derivatives.
No “requires_grad”
There’s no requires_grad flag on tensors. The tape records the entire graph; when you call gradient(), you specify which tensors you want gradients for:
extern crate zyx;
use zyx::{DType, GradientTape, Tensor, ZyxError};
fn main() -> Result<(), ZyxError> {
let tape = GradientTape::new();
let x = Tensor::randn([2, 3], DType::F32)?;
let y = Tensor::randn([2, 3], DType::F32)?;
let z = y.exp();
let grads = tape.gradient(&z, vec![&x]); // None — z doesn't depend on x
Ok(())
}
This is more flexible — you don’t need to decide at tensor creation time which tensors will be differentiated.
Higher-Order Derivatives
Using gradient_persistent(), the tape stays alive and you can compute higher-order derivatives:
// Higher-order derivatives example: currently blocked by
// an autograd subtraction overflow bug (autograd.rs:94).
Memory Efficiency
Since the graph is lazy, intermediate tensors needed for backpropagation are not held in memory until realize() is called for the gradient computation. The gradient tape stores only TensorId values — not the actual data. When the tape is dropped, all tape-preserved nodes are released.
Module System
The module system provides a way to group tensors (parameters) into neural network layers. It’s defined by the Module trait and powered by #[derive(Module)].
The Module Trait
pub trait Module {
fn iter(&self) -> impl Iterator<Item = &Tensor>;
fn iter_mut(&mut self) -> impl Iterator<Item = &mut Tensor>;
fn iter_tensors(&self) -> impl Iterator<Item = (String, &Tensor)>;
fn iter_tensors_mut(&mut self) -> impl Iterator<Item = (String, &mut Tensor)>;
fn realize(&self) -> Result<(), ZyxError>;
fn save(&self, path: impl AsRef<Path>) -> Result<(), ZyxError>;
fn set_params(&mut self, params: &mut HashMap<String, Tensor>);
}
#[derive(Module)]
The #[derive(Module)] macro (from zyx-derive) generates the trait implementation, collecting all tensor fields recursively. This works with nested modules:
#[derive(Module)]
struct Linear {
weight: Tensor,
bias: Tensor,
}
#[derive(Module)]
struct MLP {
layer1: Linear,
layer2: Linear,
layer3: Linear,
}
Using Modules
#[derive(Module)]
struct SimpleNet {
linear1: Linear,
linear2: Linear,
}
// Returns a Result because mse_loss() and realize_all() are fallible and use `?`.
fn train_step(model: &mut SimpleNet, optim: &mut SGD, x: &Tensor, target: &Tensor) -> Result<f32, ZyxError> {
    let tape = GradientTape::new();
    let output = model.forward(x);
    let loss = output.mse_loss(target)?;
    let grads = tape.gradient(&loss, &model);
    optim.update(model, grads);
    Tensor::realize_all()?;
    Ok(loss.item())
}
The tape.gradient(&loss, &model) call passes the model itself as the sources. The autograd system iterates over model.iter() to get all parameters.
Serialization
Modules can save and load parameters in safetensors format:
model.save("model.safetensors")?;
let mut params = Tensor::load_safetensors("model.safetensors")?;
model.set_params(&mut params);
Runtime and Scheduler
The runtime (runtime.rs) is the global state of zyx — the graph, devices, buffers, and kernel cache all live here.
The Runtime
pub(crate) struct Runtime {
pub graph: Graph,
pub devices: Slab<DeviceId, Device>,
pub pools: Slab<PoolId, MemoryPool>,
pub buffer_map: Map<TensorId, BufferId>,
pub events: Map<BTreeSet<BufferId>, Event>,
pub kernel_cache: KernelCache,
pub rng: Rng,
pub autotune_config: AutotuneConfig,
pub debug: DebugMask,
pub training: bool,
}
The runtime is stored in a global Mutex<Runtime>:
static RT: Mutex<Runtime> = Mutex::new(Runtime::new());
Every tensor operation locks this mutex, appends a graph node (microseconds), and releases. The lock is never held during computation — only during graph manipulation.
Memory Pools
Each backend provides a MemoryPool for allocating device buffers:
pub enum MemoryPool {
Host(HostMemoryPool),
Disk(DiskMemoryPool),
C(CMemoryPool),
CUDA(CUDAMemoryPool),
// ...
}
The Scheduler (Current)
The current scheduler (schedule.rs) selects a device for kernel execution by calculating required memory, sorting devices by free compute capacity, and picking the first with enough free memory. It handles cross-device transfers via events.
New Scheduler (In Development)
A new scheduling approach is under development in search.rs and search2.rs. It will use an e-graph-like budget-guided exhaustive fusion enumeration, including costs for memory movement operations — replacing the current simple heuristics with exploration of all fusion configurations within a cost budget.
Async Execution
Events track kernel completion:
events: Map<BTreeSet<BufferId>, Event>
Lazy Device I/O
Tensors can reference data on disk without loading it:
let t = Tensor::from_safetensors("model.safetensors", "layer.weight")?;
The disk pool keeps file offset information. Data is loaded lazily when the tensor needs to be realized on a compute device.
Configuration
Zyx is configured through a JSON file, environment variables, and API calls.
Config File
The runtime reads $XDG_CONFIG_HOME/zyx/config.json (or ~/.config/zyx/config.json by default):
{
"c": { "enabled": true },
"cuda": { "device_ids": [0] },
"opencl": { "platform_ids": [0] },
"autotune": {
"enabled": true,
"max_configs": 100,
"timeout_ms": 5000
}
}
Backend Control
| Config | Effect |
|---|---|
| "c": { "enabled": true } | Enable CPU backend (off by default) |
| "cuda": { "device_ids": [] } | Disable all CUDA devices |
| "opencl": { "platform_ids": [] } | Disable all OpenCL |
| "dummy": { "enabled": true } | Enable dummy backend |
Environment Variables
Debug Flags
Set ZYX_DEBUG as a bitmask:
| Value | Output |
|---|---|
| 1 | Backend selection and hardware info |
| 2 | Performance info during realize |
| 4 | Kernels generated by the kernelizer |
| 8 | Kernel IR (before GPU execution) |
| 16 | Generated backend code |
| 32 | Autotune exploration |
ZYX_DEBUG=8 cargo run # print kernel IR
ZYX_DEBUG=16 cargo run # print generated code
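Because ZYX_DEBUG is a bitmask, values from the table can be added together; 24 (8 + 16), for instance, should select both the kernel IR and the generated code. The sketch below decodes a mask against the table above. The FLAGS array and enabled_outputs are hypothetical helpers for illustration, not part of zyx.

```rust
// Bit values and descriptions copied from the table above.
const FLAGS: [(u32, &str); 6] = [
    (1, "backend selection and hardware info"),
    (2, "performance info during realize"),
    (4, "kernels generated by the kernelizer"),
    (8, "kernel IR"),
    (16, "generated backend code"),
    (32, "autotune exploration"),
];

// List the outputs a given mask turns on.
fn enabled_outputs(mask: u32) -> Vec<&'static str> {
    FLAGS
        .iter()
        .filter(|&&(bit, _)| (mask & bit) != 0)
        .map(|&(_, name)| name)
        .collect()
}

fn main() {
    assert_eq!(enabled_outputs(24), vec!["kernel IR", "generated backend code"]);
    assert_eq!(enabled_outputs(63).len(), 6);
    for name in enabled_outputs(8 | 16) {
        println!("{name}");
    }
}
```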
Colorless Output
AGENT=1 ZYX_DEBUG=8 cargo test
API Configuration
Tensor::manual_seed(42); // deterministic RNG
Tensor::set_training(true); // enable training mode
Tensor::set_training(false); // disable training mode