Multiple studies have shown that BERT is remarkably robust to pruning, yet few, if any, of its components retain high importance across downstream tasks. Contrary to this received wisdom, we demonstrate that pre-trained Transformer encoders are surprisingly fragile to the removal of a very small number of scaling factors and biases in the output layer normalization (<0.0001% of model weights). These are high-magnitude normalization parameters that emerge early in pre-training and appear consistently at the same dimensional position throughout the model. They are present in all six models of the BERT family that we examined, and removing them significantly degrades both MLM perplexity and downstream task performance. Our results suggest that layer normalization plays a much more important role than is usually assumed.
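The ablation the abstract describes can be illustrated with a minimal NumPy sketch (not the paper's code): a layer normalization with one hypothetical high-magnitude scale/bias dimension, which is then "removed" by resetting that dimension's parameters to their identity values. The outlier index and magnitudes below are made up for illustration.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-12):
    # Normalize over the last dimension, then apply the learned
    # per-dimension scale (gamma) and shift (beta).
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

rng = np.random.default_rng(0)
hidden = 8                       # toy hidden size (BERT-base uses 768)
x = rng.normal(size=(2, hidden))

gamma = np.ones(hidden)
beta = np.zeros(hidden)
gamma[3] = 20.0                  # hypothetical high-magnitude outlier dimension
beta[3] = -5.0

baseline = layer_norm(x, gamma, beta)

# "Disabling" the outlier parameters: reset them to identity (scale 1, bias 0),
# analogous to pruning just those two weights out of the whole model.
gamma_pruned, beta_pruned = gamma.copy(), beta.copy()
gamma_pruned[3], beta_pruned[3] = 1.0, 0.0
pruned = layer_norm(x, gamma_pruned, beta_pruned)

diff = np.abs(baseline - pruned)
print(diff.max(axis=0))          # only dimension 3 changes, and by a lot
```

Because the affine part of layer normalization is element-wise, zeroing two parameters perturbs exactly one coordinate of every token representation; the abstract's finding is that downstream layers are nevertheless highly sensitive to that single coordinate.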
— Read at arxiv.org/abs/2105.06990