    In this tutorial, we'll introduce an experimental concept in neural network design known as Kolmogorov-Arnold Networks (KANs), and explore their potential integration with transformer architectures, thus creating a "KANFormer."

    KANs are inspired by the Kolmogorov-Arnold representation theorem and represent a significant departure from traditional neural networks. Rather than using fixed activation functions and linear weights, KANs incorporate adjustable univariate functions, or splines, at each connection.
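    To make that idea concrete, here is a minimal sketch (my own illustration, not code from this tutorial) of a single KAN-style edge: a learnable univariate function, approximated here by piecewise-linear interpolation over a fixed grid. Real KAN implementations such as pykan use B-splines plus a base activation, but the principle is the same: the nonlinearity itself is trainable.

    import torch
    import torch.nn as nn

    class LearnableEdgeFunction(nn.Module):
        # One KAN-style edge phi(x): the values at the grid points are learnable
        # parameters, and phi is evaluated by linear interpolation between them.
        def __init__(self, grid_min=-2.0, grid_max=2.0, num_points=11):
            super().__init__()
            self.register_buffer("grid", torch.linspace(grid_min, grid_max, num_points))
            self.coef = nn.Parameter(torch.zeros(num_points))

        def forward(self, x):
            x = x.clamp(self.grid[0], self.grid[-1])
            # find the grid interval each input falls into, then interpolate
            idx = torch.searchsorted(self.grid, x).clamp(1, len(self.grid) - 1)
            x0, x1 = self.grid[idx - 1], self.grid[idx]
            y0, y1 = self.coef[idx - 1], self.coef[idx]
            t = (x - x0) / (x1 - x0)
            return y0 + t * (y1 - y0)

    phi = LearnableEdgeFunction()
    print(phi(torch.randn(8)).shape)  # torch.Size([8])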

    This tutorial will focus on the practical aspects of implementing and testing KANs within transformers, rather than delving into the deep mathematical foundations of KANs. Our aim is to provide hands-on experience with KANs and see how well they perform in place of traditional MLPs within the transformer.

    Let's get started.

    

    

    What we'll cover

    • The idea
    • The MLP
    • Splines
    • A high-level comparison between MLPs and KANs
    • The KANFormer
    • The architecture
    • The task
    • Sweeps
    • Comparison to a Stock GPT-2 Transformer

    

    The idea

    The idea of replacing the multi-layer perceptron (MLP) blocks in transformers with KAN layers is based on the hypothesis that the adaptive flexibility of KANs' learnable activation functions may capture patterns that the fixed activations of an MLP cannot.
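    As a rough picture of what that swap looks like in code, here is a sketch (my own, with illustrative names such as KANFormerBlock and kan_layer, not the tutorial's actual implementation) of a pre-norm transformer block whose feed-forward MLP has been replaced by a KAN module:

    import torch.nn as nn

    class KANFormerBlock(nn.Module):
        # A standard pre-norm transformer block, except that the usual
        # Linear -> activation -> Linear feed-forward is replaced by a KAN
        # module passed in as kan_layer (e.g. a KAN mapping d_model -> d_model).
        def __init__(self, d_model, n_heads, kan_layer):
            super().__init__()
            self.ln1 = nn.LayerNorm(d_model)
            self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            self.ln2 = nn.LayerNorm(d_model)
            self.ffn = kan_layer  # the only change from a stock GPT-2-style block

        def forward(self, x):
            h = self.ln1(x)
            x = x + self.attn(h, h, h, need_weights=False)[0]
            h = self.ln2(x)
            b, s, d = h.shape
            # most KAN implementations expect 2-D (samples, features) input,
            # so tokens are flattened before the KAN and reshaped afterwards
            x = x + self.ffn(h.reshape(b * s, d)).reshape(b, s, d)
            return x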

    Dual linear program


    The dual of a given linear program (LP) is another LP that is derived from the original (the primal) LP in the following schematic way:

    • Each variable in the primal LP becomes a constraint in the dual LP;
    • Each constraint in the primal LP becomes a variable in the dual LP;
    • The objective direction is reversed – maximum in the primal becomes minimum in the dual and vice versa (see the symmetric-form pair below).
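    In the standard symmetric form, for example, this correspondence reads (a conventional textbook formulation, added here for illustration):

    \[
    \begin{aligned}
    \text{Primal:}\quad & \max_{x}\ c^{\mathsf T}x \quad \text{s.t. } Ax \le b,\ x \ge 0, \\
    \text{Dual:}\quad & \min_{y}\ b^{\mathsf T}y \quad \text{s.t. } A^{\mathsf T}y \ge c,\ y \ge 0.
    \end{aligned}
    \]

    Each of the \(n\) primal variables \(x_j\) gives one of the \(n\) dual constraints (the \(j\)-th row of \(A^{\mathsf T}y \ge c\)), and each of the \(m\) primal constraints gives one of the \(m\) dual variables \(y_i\).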

    The weak duality theorem states that the objective value of the dual LP at any feasible solution is always a bound on the objective of the primal LP at any feasible solution (upper or lower bound, depending on whether it is a maximization or minimization problem). In fact, this bounding property holds for the optimal values of the dual and primal LPs.
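    In the symmetric form above, weak duality follows in one line: for any primal-feasible \(x\) and dual-feasible \(y\),

    \[
    c^{\mathsf T}x \;\le\; (A^{\mathsf T}y)^{\mathsf T}x \;=\; y^{\mathsf T}Ax \;\le\; y^{\mathsf T}b,
    \]

    where the first inequality uses \(A^{\mathsf T}y \ge c\) and \(x \ge 0\), and the second uses \(Ax \le b\) and \(y \ge 0\).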

    The strong duality theorem states that, moreover, if the primal has an optimal solution then the dual has an optimal solution too, and the two optima are equal.[1]

    These theorems belong to a larger class of duality theorems in optimization. The strong duality theorem is one of the cases in which the duality gap (the gap between the optimum of the primal and the optimum of the dual) is zero.
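    A small worked example (the numbers are illustrative, not from the original text): take the primal

    \[
    \max\ 3x_1 + 2x_2 \quad \text{s.t. } x_1 + x_2 \le 4,\ x_1 \le 2,\ x_1, x_2 \ge 0,
    \]

    whose optimum is \(10\) at \((x_1, x_2) = (2, 2)\). Its dual is

    \[
    \min\ 4y_1 + 2y_2 \quad \text{s.t. } y_1 + y_2 \ge 3,\ y_1 \ge 2,\ y_1, y_2 \ge 0,
    \]

    whose optimum is also \(10\), attained at \((y_1, y_2) = (2, 1)\); the duality gap is zero, as strong duality promises.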

    Example: Encouraging linearity

    In cases where we don’t know how deep a KAN should be, one strategy is to start from small models and gradually make them wider/deeper until we find the minimal model that performs the task well. Another strategy is to start from a large enough model and prune it down. This Jupyter notebook demonstrates cases where we go for the second strategy. Besides sparsity along the width, we also want activation functions to be linear (‘shortcut’ along the depth).

    There are two relevant tricks:

    1. Set the base function ‘base_fun’ to be linear;

    2. Penalize spline coefficients. When spline coefficients are zero, the activation function is linear. (Both tricks are shown in a sketch after the without-tricks run below.)

    As a toy example, we fit \(f(x)=\sin(\pi x)\). Although we know a [1,1] KAN suffices, we suppose we don’t know that and use a [1,1,1,1] KAN instead.

    Without tricks

    from kan import *
    import torch

    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    print(device)

    # create dataset f(x) = sin(pi*x). This task can be achieved by a [1,1] KAN
    f = lambda x: torch.sin(torch.pi * x[:, [0]])
    dataset = create_dataset(f, n_var=1, device=device)

    # the noise_scale value was missing in the original snippet; 0.3 is an assumed placeholder
    model = KAN(width=[1,1,1,1], grid=5, k=3, seed=0, noise_scale=0.3, device=device)

    # train without any linearity-encouraging tricks (model.train in older pykan versions)
    model.fit(dataset, opt="LBFGS", steps=20)
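    For comparison, here is a sketch of how the two tricks listed above could be applied with pykan. The parameter names (base_fun, lamb, lamb_coef) and the prune() call are as I recall them from the pykan API and may differ between library versions, so treat this as an assumption rather than the notebook's exact code.

    # with tricks (sketch): linear base function + penalty on spline coefficients
    model = KAN(width=[1,1,1,1], grid=5, k=3, seed=0,
                base_fun='identity',  # trick 1: linear base function
                                      # (older pykan versions take a callable, e.g. base_fun=lambda x: x)
                device=device)

    # trick 2: lamb turns on regularization and lamb_coef weights the penalty on
    # spline coefficients, pushing them towards zero and the activations towards linear
    model.fit(dataset, opt="LBFGS", steps=20, lamb=1e-2, lamb_coef=10.0)

    # optionally prune away insignificant nodes/edges afterwards
    model = model.prune()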