
[2104.06644] Masked Language Modeling and the Distributional Hypothesis: Order Word Matters Pre-training for Little

Data Platforms

A possible explanation for the impressive performance of masked language model (MLM) pre-training is that such models have learned to represent the syntactic structures prevalent in classical NLP pipelines. In this paper, we propose a different explanation: MLMs succeed on downstream tasks almost entirely due to their ability to model higher-order word co-occurrence statistics. To demonstrate this, we pre-train MLMs on sentences with randomly shuffled word order, and show that these models still achieve high accuracy after fine-tuning on many downstream tasks — including on tasks specifically designed to be challenging for models that ignore word order. — Read at arxiv.org/abs/2104.06644
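As a rough illustration of the shuffling setup the abstract describes, the sketch below permutes the tokens of each pre-training sentence before the usual MLM masking step. The whitespace tokenization, the `shuffle_sentence` helper, and the fixed seed are illustrative assumptions, not the paper's actual preprocessing code.

```python
import random


def shuffle_sentence(sentence: str, rng: random.Random) -> str:
    """Return the sentence with its tokens in random order.

    Whitespace tokenization is an assumption for illustration; the paper's
    preprocessing may operate on different token units.
    """
    tokens = sentence.split()
    rng.shuffle(tokens)
    return " ".join(tokens)


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the example output is reproducible
    corpus = [
        "the cat sat on the mat",
        "masked language models predict missing tokens",
    ]
    # The shuffled sentences would then feed a standard MLM pipeline
    # (mask a fraction of tokens and train the model to predict them).
    for sentence in corpus:
        print(shuffle_sentence(sentence, rng))
```

The point of the setup is that shuffling destroys word order while leaving word co-occurrence statistics within each sentence intact, which is exactly the signal the paper argues MLMs rely on.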

