BitNet b1.58
BitNet b1.58 is a variant of a class of Large Language Models known as 1-bit LLMs. 1-bit LLMs aim for high efficiency without sacrificing performance. BitNet b1.58 does this by storing its weights with a very small set of values: -1, 0, or 1.
BitNet b1.58 uses about 1.58 bits per weight (instead of the more typical 16 or 32 bits), because a weight that can take one of three values carries log2(3) ≈ 1.58 bits of information. This results in significant improvements in cost-effectiveness while maintaining performance comparable to that of higher-precision models.
Here's a breakdown of how it works:
- Ternary Quantization: At its core, BitNet b1.58 employs a method called ternary quantization. This means that model weights are restricted to only three possible values: -1, 0, or 1. By limiting the number of possible values, the model drastically reduces its memory footprint.
- Mixed Precision: The 1.58 figure is not an average over weights stored at different precisions. It is the information content of a single ternary weight, log2(3) ≈ 1.58 bits. Precision is mixed between weights and activations instead: in the BitNet b1.58 paper, weights are ternary while activations are quantized to 8 bits.
- BitLinear Layer: BitNet b1.58 is built on the BitNet architecture, which replaces the standard linear layer with a custom layer called BitLinear. This BitLinear layer is specifically designed to work with the ternary weights, enabling efficient training and inference.
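The ternary quantization step can be sketched in a few lines. This is a minimal illustration, assuming the absmean scheme described in the BitNet b1.58 paper (divide by the mean absolute weight, round, then clip to [-1, 1]); the function name `absmean_ternary_quantize` and the sample matrix are hypothetical, and a real BitLinear layer also handles activations and training, which this sketch leaves out.

```python
def absmean_ternary_quantize(weights, eps=1e-8):
    """Quantize a 2-D list of floats to {-1, 0, 1} with one scale per matrix."""
    flat = [w for row in weights for w in row]
    # Scale: mean absolute value of all weights in the matrix.
    gamma = sum(abs(w) for w in flat) / len(flat)
    quantized = [
        # Round to the nearest integer, then clip into [-1, 1].
        [max(-1, min(1, round(w / (gamma + eps)))) for w in row]
        for row in weights
    ]
    return quantized, gamma


# Hypothetical 2x3 weight matrix, for illustration only.
weights = [[0.9, -0.05, 0.4], [-1.2, 0.02, -0.3]]
ternary, scale = absmean_ternary_quantize(weights)
print(ternary)  # [[1, 0, 1], [-1, 0, -1]]
print(round(scale, 4))  # 0.4783
```

Small weights collapse to 0 and large ones saturate at -1 or 1; the returned scale is kept so that outputs can be rescaled after the matrix multiplication.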
The benefits of BitNet b1.58 include:
- Reduced Memory Footprint: As mentioned earlier, using ternary weights significantly reduces the memory required by the model. This makes it possible to run LLMs on devices with lower memory resources.
- Lower Power Consumption: Because BitNet b1.58 requires less memory and performs simpler computations due to the ternary weights, it also consumes less power to operate. This makes it ideal for deploying LLMs on battery-powered devices or in cloud environments where reducing energy consumption is a priority.
- Faster Inference: Multiplying by -1, 0, or 1 reduces to subtracting, skipping, or adding a value, so matrix multiplication with ternary weights needs almost no multiplications. This enables BitNet b1.58 to deliver faster inference than traditional LLMs, which translates to quicker response times for tasks like text generation or machine translation.
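The memory and compute benefits above can be made concrete with some rough arithmetic. The sketch below is an illustration under stated assumptions: the 3B parameter count is a hypothetical example, the memory figure counts weights only (no activations, embeddings, or packing overhead, so it is a lower bound), and `weight_memory_gib` and `ternary_dot` are made-up helper names, not part of any BitNet library.

```python
import math


def weight_memory_gib(num_params, bits_per_weight):
    """Memory needed for the weights alone, in GiB."""
    return num_params * bits_per_weight / 8 / 2**30


def ternary_dot(ternary_row, activations):
    """Dot product with a ternary weight row using only adds and subtracts."""
    total = 0.0
    for w, x in zip(ternary_row, activations):
        if w == 1:
            total += x
        elif w == -1:
            total -= x
        # w == 0 contributes nothing, so the activation is skipped.
    return total


bits_ternary = math.log2(3)  # information content of one three-valued weight
params = 3_000_000_000       # hypothetical 3B-parameter model

print(f"{bits_ternary:.3f} bits per ternary weight")
print(f"FP16 weights:    {weight_memory_gib(params, 16):.2f} GiB")
print(f"ternary weights: {weight_memory_gib(params, bits_ternary):.2f} GiB")
print(ternary_dot([1, 0, -1, 1], [0.5, 2.0, 1.5, 0.25]))  # -0.75
```

In practice ternary weights are packed into whole bits (for example, 2 bits per weight), so real storage sits somewhat above the log2(3) bound, but the order-of-magnitude gap to FP16 remains.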
BitNet b1.58 represents a significant advancement in LLM technology by offering a more efficient and sustainable approach to large language models.
According to the paper, BitNet b1.58 matches a full-precision (i.e., FP16 or BF16) Transformer LLM with the same model size and training tokens in terms of both perplexity and end-task performance, while being significantly more cost-effective in terms of latency, memory, throughput, and energy consumption.
For more information:
https://www.reddit.com/r/mlscaling/comments/1b3e5ym/bitnet_b158_every_single_parameter_or_weight_of/
https://github.com/ggerganov/llama.cpp/issues/5761
https://huggingface.co/papers/2402.17764