Multiple studies have shown that BERT is remarkably robust to pruning, yet few, if any, of its components retain high importance across downstream tasks. Contrary to this received wisdom, we demonstrate that pre-trained Transformer encoders are surprisingly fragile to the removal of a very small number of scaling factors and biases in the output layer normalization (<0.0001% of model weights). These are high-magnitude normalization parameters that emerge early in pre-training and appear consistently at the same dimensional position throughout the model. They are present in all six models of the BERT family that we examined, and removing them significantly degrades both MLM perplexity and downstream task performance. Our results suggest that layer normalization plays a much more important role than is usually assumed.
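The ablation the abstract describes can be illustrated with a minimal NumPy sketch (not the paper's code): a layer normalization with one hypothetical high-magnitude scale/bias dimension, which is then "removed" by resetting that dimension's parameters to their identity values. The outlier index and magnitudes below are made up for illustration.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-12):
    # Normalize over the last dimension, then apply the learned
    # per-dimension scale (gamma) and shift (beta).
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

rng = np.random.default_rng(0)
hidden = 8                       # toy hidden size (BERT-base uses 768)
x = rng.normal(size=(2, hidden))

gamma = np.ones(hidden)
beta = np.zeros(hidden)
gamma[3] = 20.0                  # hypothetical high-magnitude outlier dimension
beta[3] = -5.0

baseline = layer_norm(x, gamma, beta)

# "Disabling" the outlier parameters: reset them to identity (scale 1, bias 0),
# analogous to pruning just those two weights out of the whole model.
gamma_pruned, beta_pruned = gamma.copy(), beta.copy()
gamma_pruned[3], beta_pruned[3] = 1.0, 0.0
pruned = layer_norm(x, gamma_pruned, beta_pruned)

diff = np.abs(baseline - pruned)
print(diff.max(axis=0))          # only dimension 3 changes, and by a lot
```

Because the affine part of layer normalization is element-wise, zeroing two parameters perturbs exactly one coordinate of every token representation; the abstract's finding is that downstream layers are nevertheless highly sensitive to that single coordinate.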
— Read at arxiv.org/abs/2105.06990