⏱️ Say “Good Bye” to Matrix Multiplication

The Problem

Matrix multiplication (MatMul) is computationally expensive and dominates the overall cost in training and inference for large language models (LLMs). The problem escalates as models scale up, increasing both memory usage and execution time.

The Solution

The authors propose a MatMul-free approach for LLMs that replaces MatMul operations with lightweight alternatives:

MatMul-free Dense Layers with Ternary Weights
Instead of using traditional dense layers that rely on MatMul operations, they introduce BitLinear layers with ternary weights. Because each weight is restricted to the values -1, 0, or +1, every multiplication collapses into an addition, a subtraction, or a skipped term.

Let's illustrate the benefit of ternary weights with a small example. If the weights are floating-point numbers, we have to perform a multiplication for every weight and then add up the results. In contrast, when the weights take only the values {-1, 0, +1}, no multiplication is needed: entries paired with 0 are dropped, entries paired with -1 have their sign flipped, and everything else is simply added.
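To make this concrete, here is a tiny sketch in PyTorch. The numbers are made up purely for illustration; the point is that the ternary dot product needs no multiplications at all.

```python
import torch

# Hypothetical 4-element example (values chosen for illustration only).
x = torch.tensor([0.7, -1.2, 3.0, 0.5])        # input activations
w_float = torch.tensor([0.9, -0.4, 0.0, 1.3])  # full-precision weights
w_ternary = torch.tensor([1, 0, -1, 1])        # ternary weights in {-1, 0, +1}

# Full-precision dot product: one multiplication per element, then additions.
y_float = (x * w_float).sum()

# Ternary dot product: no multiplications needed.
# Keep entries where w = +1, flip the sign where w = -1, drop entries where w = 0.
y_ternary = x[w_ternary == 1].sum() - x[w_ternary == -1].sum()

print(y_float.item(), y_ternary.item())
```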

Hardware-efficient Fused BitLinear Layer
To further enhance efficiency, the authors propose a fused operation that combines multiple steps, such as normalization and quantization, into a single GPU kernel. This minimizes data movement between different levels of memory, a common source of inefficiency on GPUs.
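As a rough mental model, the sketch below lists the steps such a fused kernel combines: RMSNorm, activation quantization, then the ternary-weight projection. This is plain PyTorch for readability only; it does not perform the actual kernel fusion, and the quantization scheme shown (absmax to 8 bits) is an assumption for illustration.

```python
import torch

def fused_bitlinear_sketch(x, w_ternary, eps=1e-6, bits=8):
    """Conceptual sketch of a fused BitLinear forward pass.

    The paper's point is fusing these steps into one GPU kernel so that
    intermediates stay in fast on-chip memory; this version only shows
    the sequence of operations, not the fusion itself.
    """
    # 1) RMSNorm: normalize activations before quantization.
    x = x * torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + eps)

    # 2) Quantize activations to low precision (absmax quantization, assumed here).
    q_max = 2 ** (bits - 1) - 1
    scale = q_max / x.abs().max(dim=-1, keepdim=True).values.clamp(min=eps)
    x_q = (x * scale).round().clamp(-q_max, q_max) / scale

    # 3) "MatMul" with ternary weights: in hardware only additions and
    #    subtractions are needed; written as a matmul here for readability.
    return x_q @ w_ternary.T

# Usage with made-up shapes: a batch of 2 vectors, 8 -> 4 features.
x = torch.randn(2, 8)
w = torch.randint(-1, 2, (4, 8)).float()   # ternary weight matrix
print(fused_bitlinear_sketch(x, w).shape)  # torch.Size([2, 4])
```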

OK, but how is this used in a model architecture?

The authors propose a new architecture that eliminates matrix multiplication (MatMul) in two main parts of the model: token mixing and channel mixing.

Token Mixer
For processing sequences of tokens (words or subwords in text), the model uses a modified version of the Gated Recurrent Unit (GRU). The standard GRU is a type of recurrent neural network (RNN) that processes data sequentially and is designed to handle dependencies in sequences. In this modified GRU, all operations are converted to element-wise products and additions, which are simpler and less resource-intensive than MatMul.
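The snippet below sketches what such a recurrence looks like: the hidden state is updated with element-wise products and additions only, with no matrix multiplication over the hidden state. The gate names and the exact update rule are a simplification for illustration, not the paper's precise formulation; in the real model the per-token inputs f_t, c_t, and g_t would come from BitLinear projections.

```python
import torch

def elementwise_gru_step(h_prev, f_t, c_t, g_t):
    """One step of a GRU-style recurrence using only element-wise operations.

    f_t, c_t, g_t are per-token vectors produced by (ternary) projections of
    the input; the gating below is an illustrative simplification.
    """
    f = torch.sigmoid(f_t)               # forget/update gate
    h = f * h_prev + (1.0 - f) * c_t     # blend old state and candidate, element-wise
    return torch.sigmoid(g_t) * h        # element-wise output gate

# Made-up usage: hidden size 4, walking over 3 time steps.
h = torch.zeros(4)
for _ in range(3):
    f_t, c_t, g_t = torch.randn(4), torch.randn(4), torch.randn(4)
    h = elementwise_gru_step(h, f_t, c_t, g_t)
print(h.shape)  # torch.Size([4])
```

Note that nothing in the recurrence multiplies the hidden state by a weight matrix, which is what makes the token mixer MatMul-free.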

Channel Mixer
For handling the transformations within each token's feature representation, the model employs a Gated Linear Unit (GLU) built from BitLinear layers (a short sketch follows the list below).

  • Gated Linear Unit: GLUs are used to add non-linear transformations and gating mechanisms to the network, helping it learn more complex patterns. The gating mechanism controls which parts of the input to pass through and which to suppress.

  • BitLinear Layers: These layers use ternary weights (values restricted to -1, 0, +1) instead of traditional full-precision weights. This simplification allows the model to perform its projections at much lower computational cost, as described previously.
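Putting the two ideas together, here is a minimal sketch of a GLU channel mixer whose projections use ternary weights. The layer names, the SiLU gate, and the simple absolute-mean scaling are assumptions chosen for readability; the paper's BitLinear layers also fold in normalization and activation quantization as described above.

```python
import torch
import torch.nn as nn

def ternarize(w):
    """Illustrative ternary quantization: scale, then round to {-1, 0, +1}."""
    scale = w.abs().mean().clamp(min=1e-6)
    return (w / scale).round().clamp(-1, 1), scale

class BitLinearGLU(nn.Module):
    """Sketch of a GLU channel mixer whose projections use ternary weights."""

    def __init__(self, dim, hidden):
        super().__init__()
        self.gate = nn.Linear(dim, hidden, bias=False)
        self.up = nn.Linear(dim, hidden, bias=False)
        self.down = nn.Linear(hidden, dim, bias=False)

    def forward(self, x):
        def bitlinear(layer, x):
            w_t, scale = ternarize(layer.weight)
            return (x @ w_t.T) * scale   # add/sub only, once weights are ternary
        g = torch.nn.functional.silu(bitlinear(self.gate, x))  # gating branch
        u = bitlinear(self.up, x)                               # value branch
        return bitlinear(self.down, g * u)                      # gate element-wise, project back

# Made-up usage: a batch of 2 tokens with 8-dimensional features.
x = torch.randn(2, 8)
print(BitLinearGLU(dim=8, hidden=16)(x).shape)  # torch.Size([2, 8])
```

The gating (g * u) is purely element-wise, so the only heavy operations left are the ternary-weight projections themselves.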

Results

The proposed MatMul-free model shows competitive performance with state-of-the-art transformers. Key results include:

  • Memory Reduction: Up to 61% reduction in memory usage during training.

  • Inference Efficiency: Inference memory consumption is reduced by more than 10× with optimized kernels.

  • Hardware Implementation: Custom FPGA implementation shows further efficiency improvements, processing billion-parameter models at low power (13W).
