Tag architecture

2025-01-30

18582m Academic

How has DeepSeek improved the Transformer architecture? | Epoch AI

epoch.ai/gradient-updates/how-has-deepseek-improved-the-transformer-architecture

DeepSeek v3, an open-weight model achieving state-of-the-art benchmark performance with significantly less training compute than comparable models, incorporates three major architectural improvements to the vanilla Transformer.

First, Multi-head Latent Attention (MLA) addresses the prohibitive cost of the Key-Value (KV) cache in long-context inference. Instead of projecting the input directly to keys and values, it factors that projection into two matrices with a lower-dimensional latent vector in between. This amounts to a low-rank compression of the KV cache shared across all attention heads, which drastically reduces cache size while maintaining quality, unlike less effective methods such as grouped-query attention.
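The low-rank idea can be sketched in a few lines of plain Python. This is a minimal illustration and not DeepSeek's implementation: the dimensions, the matrix names (`w_down`, `w_up_k`, `w_up_v`) and the helper functions are all assumptions made for the example, and details such as positional encoding are left out.

```python
import random

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def rand_matrix(rows, cols, rng):
    """Random Gaussian matrix standing in for learned weights."""
    return [[rng.gauss(0, 1) for _ in range(cols)] for _ in range(rows)]

# Hypothetical sizes, chosen only to keep the example small.
d_model, n_heads, head_dim, d_latent = 64, 4, 16, 8

rng = random.Random(0)
w_down = rand_matrix(d_latent, d_model, rng)             # input -> shared latent
w_up_k = rand_matrix(n_heads * head_dim, d_latent, rng)  # latent -> keys, all heads
w_up_v = rand_matrix(n_heads * head_dim, d_latent, rng)  # latent -> values, all heads

def cache_entry(x):
    """Per-token cache entry: only the low-dimensional latent is stored."""
    return matvec(w_down, x)

def expand(latent):
    """Rebuild the keys and values of every head from a cached latent."""
    return matvec(w_up_k, latent), matvec(w_up_v, latent)

x = [rng.gauss(0, 1) for _ in range(d_model)]  # one token's input vector
latent = cache_entry(x)
keys, values = expand(latent)

# Floats cached per token: separate keys and values for every head vs. one latent.
full_cache_floats = 2 * n_heads * head_dim
latent_cache_floats = len(latent)
```

With these made-up sizes the per-token cache shrinks from 128 floats to 8, at the price of an extra matrix multiplication when keys and values are needed.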

Second, DeepSeekMoE introduces several Mixture-of-Experts (MoE) innovations to mitigate "routing collapse", where the router sends most tokens to a few experts. Auxiliary loss terms are replaced with expert-specific bias terms that are adjusted dynamically to keep the load balanced without compromising model performance. In addition, Shared Experts are always routed to, and load balancing applies only to the specialized "routed experts". This lets the model store common information efficiently without forcing a uniform distribution across all experts.
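A toy simulation can show how per-expert bias terms rebalance routing without an auxiliary loss. Everything here is an assumption made for illustration: the random affinity scores, the built-in skew towards expert 0, the fixed step size, and the rule of nudging each bias by the sign of its load error. Shared experts are left out, since they would be selected unconditionally and would not take part in balancing.

```python
import random

def route_top_k(scores, bias, k):
    """Select k experts by biased score; the bias only affects selection."""
    ranked = sorted(range(len(scores)),
                    key=lambda e: scores[e] + bias[e], reverse=True)
    return ranked[:k]

def update_bias(bias, load, step_size):
    """Lower the bias of overloaded experts, raise it for underloaded ones."""
    mean_load = sum(load) / len(load)
    for e in range(len(bias)):
        if load[e] > mean_load:
            bias[e] -= step_size
        elif load[e] < mean_load:
            bias[e] += step_size

def simulate(n_experts=4, k=1, n_tokens=256, n_steps=200,
             step_size=0.01, seed=0):
    """Route batches of random tokens and adjust the biases after each batch."""
    rng = random.Random(seed)
    # Hypothetical skew: expert 0 always gets a 0.5 affinity advantage.
    skew = [0.5] + [0.0] * (n_experts - 1)
    bias = [0.0] * n_experts
    history = []
    for _ in range(n_steps):
        load = [0] * n_experts
        for _ in range(n_tokens):
            scores = [rng.random() + skew[e] for e in range(n_experts)]
            for e in route_top_k(scores, bias, k):
                load[e] += 1
        history.append(load)
        update_bias(bias, load, step_size)
    return history, bias

history, bias = simulate()
print("first batch load:", history[0])
print("last batch load: ", history[-1])
print("final biases:    ", [round(b, 2) for b in bias])
```

In this setup expert 0 starts out receiving most of the tokens, and its bias drifts negative until the per-batch loads are roughly even.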

Third, Multi-token Prediction lets the model predict both the next token and the one after it in a single forward pass, by feeding the residual stream vector from the first prediction into an additional Transformer block. This provides a multi-token prediction objective during training, which improves performance, and it enables speculative decoding at inference time, which can nearly double inference speed.
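The speed-up from speculative decoding can be illustrated with a toy greedy decoder. This is a sketch under simplifying assumptions, not the article's or DeepSeek's code: `next_token` and `draft_second` are hypothetical stand-ins for the main prediction head and the extra block, and each loop iteration models one forward pass that scores both the last verified position and the drafted position.

```python
def speculative_decode(next_token, draft_second, prompt, n_new):
    """Greedy decoding with one drafted token per modelled forward pass.

    next_token(seq)   -> the true next token after seq (main head)
    draft_second(seq) -> a guess for the token after the next one (extra block)
    Returns (sequence, number_of_forward_passes).
    """
    seq = list(prompt)
    target = len(prompt) + n_new
    pending = None  # drafted guess for the upcoming token, not yet verified
    passes = 0
    while len(seq) < target:
        passes += 1
        true_next = next_token(seq)
        seq.append(true_next)
        if pending is not None and pending == true_next:
            # Draft accepted: the same pass also scored the position after
            # the drafted token, so a second token comes for free.
            seq.append(next_token(seq))
        # Draft the token that follows the last verified one.
        pending = draft_second(seq[:-1])
    return seq[:target], passes

# Toy "model": the next token is the previous one plus 1, modulo 10.
def count_up(seq):
    return (seq[-1] + 1) % 10

def perfect_draft(seq):
    return (seq[-1] + 2) % 10  # always guesses correctly

def useless_draft(seq):
    return -1                  # never matches a real token

seq_fast, passes_fast = speculative_decode(count_up, perfect_draft, [0], 9)
seq_slow, passes_slow = speculative_decode(count_up, useless_draft, [0], 9)
print(seq_fast, passes_fast)
print(seq_slow, passes_slow)
```

Both runs produce the same sequence, because every emitted token is checked against the main head. Only the number of passes differs: if a fraction p of drafts is accepted, each pass yields about 1 + p tokens, which is where a near-doubling comes from when p is close to 1.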