Researchers achieve 2700x speedup in FPGA-based Kolmogorov-Arnold Networks
New architecture leverages B-spline locality for sub-microsecond inference and online learning, outperforming traditional multi-layer perceptrons in hardware efficiency.
Researchers led by Aarush Gupta have demonstrated a novel implementation of Kolmogorov-Arnold Networks (KANs) on Field-Programmable Gate Arrays (FPGAs), achieving sub-microsecond latency and a 2700x speedup over previous KAN-FPGA systems. By leveraging the univariate activation functions and B-spline locality inherent to KANs, the architecture maps efficiently to FPGA lookup tables, enabling stable online learning with over 50,000 parameters at nanosecond speeds. This approach allows for real-time gradient updates entirely on the hardware, outperforming traditional multi-layer perceptrons in hardware efficiency and convergence across benchmarks such as function approximation and non-stationary control. The findings are detailed in a 2026 ACM/SIGDA Best Paper titled "KANELÉ: Kolmogorov–Arnold Networks for Efficient LUT-based Evaluation" and an ICML 2026 preprint.
Most modern machine learning workloads rely on graphics processing units (GPUs) for their high throughput in parallel execution. However, GPUs incur significant overhead from instruction scheduling and dynamic memory access, making them unsuitable for applications requiring ultra-low latency. Field-programmable gate arrays offer a solution by allowing the design of custom digital circuits where neural networks are implemented directly as logic rather than sequential instructions. This hardware-level co-design is critical for extremely specialized workloads where nanosecond-level response times are mandatory.
The core innovation lies in how KANs replace the learnable weights and fixed activation functions of multi-layer perceptrons with learnable univariate functions. These functions map naturally to lookup tables on FPGAs: a table for a single input quantized to b bits needs 2^b entries, whereas a table for a function of d such inputs would need 2^(bd), which is the exponential scaling that makes multivariate functions impractical to tabulate. Because each KAN activation is defined over a small, finite domain, the entire input range can be covered during quantization. The network can therefore be trained in software and then deployed as a fixed model for inference on the FPGA, where activations are computed in parallel using lookup tables and summed via an adder tree.
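To make the lookup-table idea concrete, here is a minimal software sketch of one quantized KAN layer. It is an illustration under stated assumptions, not the researchers' implementation: the function names, the 6-bit input resolution, the [-1, 1] domain, and the stand-in activations (sin and a square) are all hypothetical choices made for this example.

```python
import math

def table_entries(bits, n_inputs):
    """Entries needed to tabulate a function of n_inputs values, each quantized to `bits` bits."""
    return 2 ** (bits * n_inputs)

def build_lut(fn, lo, hi, bits):
    """Tabulate a univariate function on 2**bits evenly spaced points in [lo, hi]."""
    n = 1 << bits
    step = (hi - lo) / (n - 1)
    return [fn(lo + i * step) for i in range(n)]

def quantize(x, lo, hi, bits):
    """Map x to the nearest table index, clamping to the covered domain."""
    n = 1 << bits
    idx = round((x - lo) / (hi - lo) * (n - 1))
    return max(0, min(n - 1, idx))

def kan_layer(xs, luts, lo, hi, bits):
    """One KAN layer: output j is the sum over inputs i of luts[j][i][index(x_i)].

    On an FPGA the lookups would run in parallel and the sum would be an
    adder tree; here both are ordinary sequential Python.
    """
    idxs = [quantize(x, lo, hi, bits) for x in xs]
    return [sum(row[i][idxs[i]] for i in range(len(xs))) for row in luts]

# Hypothetical 2-input, 1-output layer with stand-in "learned" activations.
LO, HI, BITS = -1.0, 1.0, 6
luts = [[build_lut(math.sin, LO, HI, BITS),
         build_lut(lambda t: t * t, LO, HI, BITS)]]
out = kan_layer([0.5, -0.25], luts, LO, HI, BITS)

# Two separate 6-bit tables hold 2 * 64 entries in total, while a single
# table over both inputs jointly would need table_entries(6, 2) entries.
```

In this sketch the result approximates sin(0.5) + (-0.25)^2 up to quantization error, and the table count grows linearly with the number of inputs rather than exponentially.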
Beyond static inference, the research addresses the need for online learning in non-stationary environments, such as quantum control or nuclear fusion, where models must adapt in real time. The system performs gradient updates on the FPGA itself, enabling stable online learning with more than 50,000 parameters at nanosecond speeds. Unlike CPUs or GPUs, which fetch weights from memory, the FPGA implements the gradient update logic as a dedicated parallel circuit that directly modifies the memory storing the spline coefficients. This eliminates the latency of moving data between processing units and memory.
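The in-place update described above can be sketched in software as follows. This is a simplified illustration, not the circuit from the paper: it assumes a single activation built from degree-1 B-splines (piecewise-linear "hat" functions) on [0, 1], a hypothetical fixed-point format with 12 fractional bits, a power-of-two learning rate so that scaling is a bit shift, and a toy target function x^2. All names and constants are invented for the example.

```python
FRAC_BITS = 12            # assumed fixed-point format: 12 fractional bits
SCALE = 1 << FRAC_BITS

def to_fixed(x):
    """Convert a float to the assumed fixed-point integer representation."""
    return int(round(x * SCALE))

def hat_weights(x, n_coef):
    """Degree-1 B-spline on a uniform grid over [0, 1]: index and two weights."""
    pos = min(max(x, 0.0), 1.0) * (n_coef - 1)
    i = min(int(pos), n_coef - 2)
    frac = pos - i
    return i, to_fixed(1.0 - frac), to_fixed(frac)

def predict(coef, x):
    """Fixed-point activation value: weighted sum of the two active coefficients."""
    i, w0, w1 = hat_weights(x, len(coef))
    return (coef[i] * w0 + coef[i + 1] * w1) >> FRAC_BITS

def online_step(coef, x, target, lr_shift=2):
    """One in-place SGD step on squared error; learning rate is 2**-lr_shift.

    Only coef[i] and coef[i + 1] are written, mirroring a circuit that
    modifies the coefficient memory directly instead of moving weights
    to and from a separate processor.
    """
    i, w0, w1 = hat_weights(x, len(coef))
    err = to_fixed(target) - predict(coef, x)
    coef[i] += (err * w0) >> (FRAC_BITS + lr_shift)
    coef[i + 1] += (err * w1) >> (FRAC_BITS + lr_shift)
    return err

# Stream samples of a toy target, f(x) = x**2, one at a time.
coef = [0] * 17
for t in range(4000):
    x = (t * 0.618033988749895) % 1.0   # deterministic, well-spread inputs
    online_step(coef, x, x * x)
```

Python integers are unbounded, so this sketch ignores the overflow and saturation behavior that a fixed-width hardware design would have to handle.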
The stability of this online learning process comes from the properties of the B-spline basis functions used in KANs. These functions exhibit locality: for any given input, only a small subset of basis functions is non-zero (at most k+1 for splines of degree k), so each gradient update touches only a few coefficients, even under fixed-point quantization. Furthermore, the bounded nature of B-splines keeps activations and gradients within predictable ranges, which reduces quantization error and improves learning stability compared to multi-layer perceptrons. The researchers report that the system outperforms traditional multi-layer perceptrons in hardware efficiency and convergence on benchmarks including function approximation, qubit readout, and non-stationary control.
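The locality and boundedness properties can be checked numerically with the standard Cox-de Boor recursion for B-spline basis functions. The sketch below is illustrative only: the cubic degree, the 8-interval grid on [0, 1], and the helper names are assumptions made for this example, not parameters taken from the paper.

```python
def bspline_basis(i, k, t, x):
    """Cox-de Boor recursion: value of the i-th degree-k basis function at x."""
    if k == 0:
        return 1.0 if t[i] <= x < t[i + 1] else 0.0
    left = right = 0.0
    if t[i + k] != t[i]:
        left = (x - t[i]) / (t[i + k] - t[i]) * bspline_basis(i, k - 1, t, x)
    if t[i + k + 1] != t[i + 1]:
        right = ((t[i + k + 1] - x) / (t[i + k + 1] - t[i + 1])
                 * bspline_basis(i + 1, k - 1, t, x))
    return left + right

DEGREE = 3   # assumed cubic splines
GRID = 8     # assumed number of intervals on [0, 1]

# Uniform knot vector, extended by DEGREE knots past each end of [0, 1].
knots = [(j - DEGREE) / GRID for j in range(GRID + 2 * DEGREE + 1)]
n_basis = len(knots) - DEGREE - 1

def active_basis(x):
    """Return {index: value} for the basis functions that are non-zero at x."""
    vals = [bspline_basis(i, DEGREE, knots, x) for i in range(n_basis)]
    return {i: v for i, v in enumerate(vals) if v > 0.0}

# The gradient of the activation with respect to coefficient i is just the
# basis value B_i(x), so the non-zero entries below are the only
# coefficients a gradient step at x = 0.3 would modify.
grad = active_basis(0.3)
```

With these settings, `grad` has DEGREE + 1 entries out of `n_basis` coefficients, each value lies between 0 and 1, and the values sum to 1, which is why the gradient magnitudes stay in a range that is known before deployment.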


