The Plan for 2025
Disclaimer
This post contains forward-looking, aspirational, statements. While these forward-looking statements represent our current plans, they are subject to risks and uncertainties that could cause actual results to differ materially. You are cautioned not to place undue reliance on these forward-looking statements, which reflect our opinions only as of the date of this post.
Update
Within a few months of 2025, I had already abandoned many of these goals and started on new projects. This post is left up for historical purposes.
Intro
LLMs built on dense matrix multiplies are difficult to understand and control. Capabilities cannot be easily ablated from them. They are also too expensive for large-scale beam search to be cost-effective. We need to melt them down into combinational logic, in which form they can be better understood, controlled, and edited.
Current video game hair rendering is unacceptably slow and of poor quality. We need pixel-level control. Drawing more triangles is not the correct path.
Scalar instructions are too slow, FPGAs are not commodity hardware, and data transfer over PCIe is painfully slow. Therefore, we need to make massively parallel combinational logic fast and flexible on CPU and GPU. To achieve this, we need bitslicing. To make bitslicing fast and flexible, we need fast bit matrix transposes, so that we can transition quickly between the bitsliced and scalar worlds. To make combinational logic flexible, scalable, and runtime loadable, we need combinational logic as data, interpreted at runtime. However, interpretation has inherent performance overhead, so we need to support both combinational logic as Rust source code compiled to machine code at compile time, and combinational logic as data, loaded and interpreted at runtime.
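The core bitslicing idea fits in a few lines: store one logical bit of 64 independent inputs in each u64 word, and a single bitwise instruction then evaluates a gate across all 64 lanes at once. A minimal sketch (names are mine, not the crate's), using a one-bit full adder:

```rust
// Bitslicing sketch: each u64 holds one bit position for 64 independent
// "lanes". A one-bit full adder, evaluated across all 64 lanes at once
// with a handful of XOR/AND/OR instructions instead of 64 separate adds.
pub fn full_adder(a: u64, b: u64, cin: u64) -> (u64, u64) {
    let sum = a ^ b ^ cin;                // sum bit, all lanes
    let cout = (a & b) | (cin & (a ^ b)); // carry-out bit, all lanes
    (sum, cout)
}
```

Lane 0 of `full_adder(0b11, 0b01, 0)` computes 1+1+0 (sum 0, carry 1), while lane 1 simultaneously computes 1+0+0 (sum 1, carry 0). With AVX-512 the same code runs 512 lanes per word instead of 64.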
A
Improve the bitslicing crate, primarily targeting AVX-512, but with support for other architectures. (Mostly done; slowly polishing as I use it.)
B
Make non-square bit matrix transposes fast on AVX-512 and, ideally, ARM. (Most of the work is done; I just need to package and test.) Benchmark on zen4, zen5, and ARM Neoverse V2.
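For the square case, the shape of the problem is the classic recursive block-swap transpose; a scalar 8x8 sketch on a u64 (a stand-in for the wider AVX-512/NEON kernels, not the crate's actual code) looks like:

```rust
// 8x8 bit-matrix transpose on a u64 (bit index = 8*row + col), via the
// well-known recursive block-swap: exchange 4x4 sub-blocks across the
// diagonal, then 2x2 sub-blocks, then single bits. The AVX-512 version
// is the same idea at 512x512 granularity.
pub fn transpose8x8(mut x: u64) -> u64 {
    let mut t;
    t = (x ^ (x >> 28)) & 0x0000_0000_F0F0_F0F0; // 4x4 sub-blocks
    x ^= t ^ (t << 28);
    t = (x ^ (x >> 14)) & 0x0000_CCCC_0000_CCCC; // 2x2 sub-blocks
    x ^= t ^ (t << 14);
    t = (x ^ (x >> 7)) & 0x00AA_00AA_00AA_00AA;  // single bits
    x ^= t ^ (t << 7);
    x
}
```

A matrix with row 0 all ones transposes to a matrix with column 0 all ones, and applying the function twice is the identity.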
C
Figure out structured containers in Rust. I may be able to use Generic Associated Types, or I may have to hack together my own system.
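As a sketch of what the GAT route might look like (all names here are hypothetical): one trait lets each logical cell type declare its bitsliced storage layout, generic over the word type.

```rust
// Hypothetical sketch of structured containers via a Generic Associated
// Type: a cell type declares its bitsliced storage layout, generic over
// the word type W (u64 on scalar, a SIMD vector on AVX-512).
pub trait BitSliced {
    type Lanes<W>;
}

// Example cell: one "alive" bit plus a 4-bit "energy" counter. In the
// bitsliced layout, each logical bit becomes one whole word of lanes.
pub struct Cell;

pub struct CellLanes<W> {
    pub alive: W,
    pub energy: [W; 4],
}

impl BitSliced for Cell {
    type Lanes<W> = CellLanes<W>;
}
```

Generic code can then name `<Cell as BitSliced>::Lanes<u64>` without knowing the concrete layout, which is the property the containers need.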
D
Write a good combinational-logic-as-data framework and runtime. (Depends on C) (I’ve already built this ~2 times; time for another attempt and, hopefully, an actually good job this time.) It needs to compile the AIG to small LUTs so as to amortize the instruction loading cost. Understand the performance compared with combinational logic as code. Build logic simplification tooling.
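A naive baseline for the interpreter half might look like the following (my own sketch, not the planned framework): an AND-Inverter Graph stored as a flat array of nodes, with invert flags packed into the low bit of each literal, interpreted over bitsliced u64 words so one pass evaluates the circuit for 64 inputs in parallel.

```rust
// Sketch of an AND-Inverter Graph as data. A literal is
// (value_index << 1) | invert_flag; every node is one AND of two
// literals. Values are bitsliced u64 words, so interpreting the graph
// once evaluates it for 64 independent inputs.
#[derive(Clone, Copy)]
pub struct AigNode {
    pub lhs: u32,
    pub rhs: u32,
}

// Resolve a literal: look up its value word, applying the invert flag.
pub fn literal(values: &[u64], lit: u32) -> u64 {
    let v = values[(lit >> 1) as usize];
    if lit & 1 == 1 { !v } else { v }
}

/// `inputs` seeds values[0..n]; each node appends one AND result.
pub fn eval(nodes: &[AigNode], inputs: &[u64]) -> Vec<u64> {
    let mut values = inputs.to_vec();
    for n in nodes {
        let out = literal(&values, n.lhs) & literal(&values, n.rhs);
        values.push(out);
    }
    values
}
```

As a sanity check, x XOR y = NOT(NOT(x AND NOT y) AND NOT(NOT x AND y)): three AND nodes plus invert flags on the literals. Compiling runs of such nodes into small LUTs, as the plan says, is what amortizes the per-node interpretation cost.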
E
Build a batch processing framework that supports filtering and compaction. Depends on A and B. Optionally, add support for combinational logic as data. (Depends on D)
F
Build a cellular automata combinational logic framework. (Depends on A) Each cell contains structured bits. (Depends on C) On each tick, each cell reads from the 3x3 window of state and updates its own state. This will require two 512x512 transposes (for each of the however many state bits). Initially support a 512x512 space, then add support for multiples of 512x512 tiles. Optionally support combinational logic as data. (Depends on D)
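One tick of such a framework, shrunk to a 64x64 tile with a single state bit and a hard-coded XOR-of-window rule (a toy stand-in for the structured, logic-as-data version), might look like:

```rust
// One CA tick on a 64x64 tile with one state bit per cell, rows stored
// as u64. The rule here is a toy: next state = XOR of the 3x3 window.
// Horizontal neighbors come from shifts, vertical ones from adjacent
// rows; out-of-bounds neighbors read as zero (no wraparound).
pub fn tick(state: &[u64; 64]) -> [u64; 64] {
    let mut next = [0u64; 64];
    for r in 0..64 {
        let mut acc = 0u64;
        for dr in [-1i64, 0, 1] {
            let rr = r as i64 + dr;
            if !(0..64).contains(&rr) {
                continue;
            }
            let row = state[rr as usize];
            acc ^= row ^ (row << 1) ^ (row >> 1); // left, center, right
        }
        next[r] = acc;
    }
    next
}
```

A single live cell spreads to its full 3x3 window in one tick. The real framework would run arbitrary per-cell logic over structured bits, at 512x512 with AVX-512 words.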
G
Build a simple 2D platformer with bitsliced cellular automata graphics. (Depends on B and F) I should be able to get respectable performance on zen 4/5, but ideally port to GPU. Scale to 4K 60FPS if possible.
H
Learn Rust-GPU and Ash. Understand the performance characteristics of streaming data transfer: NVMe to CPU to GPU RAM to GPU LDS, and back out to CPU and NVMe.
I
Port bitslicing to GPU and build good bit matrix transposition on GPU.
J
Understand the performance characteristics of non-coalesced writes of bytes, or ideally bits, on GPUs, primarily RDNA-3. Depends on H. Each stream processor will be calculating pixel indices into a 2D buffer of bytes, or ideally bits. They will be writing identical values into the cells, so clobbered writes are a non-issue. If mutable buffers are shared across multiple compute units, performance will likely be poor, and we may get nasty memory issues. Can we write individual bits without horrible performance/coherence issues? Does it help if we can stay purely within a CU's LDS? Is it faster to have each stream processor write into its own section of a buffer of indices, sharded by pixel row, and then have a serial pass write the indices into the bit arrays of each row? The issue is how to handle the ragged arrays of indices on GPU.
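The row-sharded option is easy to prototype on CPU first (this is my sketch of the idea, not GPU code): producers append x-coordinates into per-row shards, so no mutable bit buffer is ever shared, and a serial pass then sets the bits.

```rust
// CPU sketch of the row-sharded scatter: each producer appends pixel
// x-coordinates into its row's shard (no shared mutable bit buffer),
// then a serial pass ORs the bits into one u64 per 64-pixel row.
// Duplicate hits are harmless since the values written are identical.
pub fn scatter_to_bits(hits: &[(usize, usize)], rows: usize) -> Vec<u64> {
    let mut shards: Vec<Vec<usize>> = vec![Vec::new(); rows];
    for &(y, x) in hits {
        shards[y].push(x); // on GPU: per-producer append, sharded by row y
    }
    let mut bits = vec![0u64; rows];
    for (y, xs) in shards.iter().enumerate() {
        for &x in xs {
            bits[y] |= 1 << x; // serial bit set, no cross-CU contention
        }
    }
    bits
}
```

The Vec-of-Vecs is exactly the ragged-array problem the paragraph raises: on GPU it would have to become fixed-capacity per-row segments or an atomically bumped append buffer.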
K
Build a real time hair graphics engine with basic hair physics. (conditional on J actually working out)
L
Melt an LLM into combinational logic, ideally a byte-level RNN. Produce an And-Inverter Graph (AIG). (Depends on D)
M
Build a force directed graph layout tool. (Depends on H) The graphs will be directed and acyclic, with a well-defined ordering of inputs pinned to two sides of the display area and a well-defined ordering of outputs pinned to the other two sides. Each node will have exactly two inputs and an unknown set of outputs. It needs to scale to multiple billions of nodes. Also implement realtime visualization. Port to GPU if needed.
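With inputs and outputs pinned to the boundary, even a plain barycentric relaxation step (a cheap stand-in for a full force-directed solver, sketched here with my own names) is a reasonable starting point: each free node moves to the average of its neighbors, which, iterated, converges to a Tutte-style embedding.

```rust
// One barycentric relaxation step: pinned boundary nodes (the ordered
// inputs/outputs) stay fixed; every free node moves to the average
// position of its graph neighbors. `adj[i]` lists node i's neighbors.
pub fn relax(pos: &mut [(f32, f32)], pinned: &[bool], adj: &[Vec<usize>]) {
    let snapshot = pos.to_vec(); // read old positions, write new ones
    for i in 0..pos.len() {
        if pinned[i] || adj[i].is_empty() {
            continue;
        }
        let (mut sx, mut sy) = (0.0, 0.0);
        for &j in &adj[i] {
            sx += snapshot[j].0;
            sy += snapshot[j].1;
        }
        let n = adj[i].len() as f32;
        pos[i] = (sx / n, sy / n);
    }
}
```

Each step is embarrassingly parallel over nodes, which is what makes a GPU port plausible at billions of nodes.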
N
Apply M to the result of L, and hopefully understand/control how the LLM behaves. (Depends on M and L)
O
Use bitsliced combinational-logic-as-data to evaluate the AIG from L fast on many inputs in parallel. Benchmark 512 wide on zen4 and 128 wide on Neoverse V2. Is single core performance <100ms per evaluation for a respectably good LLM? (Depends on L and D)
P
Put the LLM inference behind a high performance API. Allow the client to execute the state machine of the RNN on 128 (or 512, with AVX-512) data lanes in parallel. Allow the client to load and store lanes of the SPMD machine. (Depends on O)
Q
Write a (probably web based?) graphical front end which calls P and performs speculative execution/very wide beam search, guided by the user.
R
Build some integration of the interactive beam search (Q) and the graph visualization tool (M) to allow the user to edit the program/machine state interactively?
S
Somehow charge money for Q???
T
Learn the ComfyUI stable diffusion pipeline.
U
Integrate the hair physics of K into the cellular automata graphics of F, and integrate into the CA game.
V
Figure out procedural music generation.
W
Publish the CA game on Steam. (Depends on U and V)
X
Make art for the qualia.moe LP/VN using diffusion. (Depends on T)
Y
Build the full qualia.moe. (Almost certainly not going to get this done in 2025.)
Z
(Personal goals) Switch from Arch to NixOS. Switch to Alacritty. Switch from Xmonad to something wayland based, perhaps Smithay?