Techniques & Methods

Masked Language Modeling

Masked language modeling (MLM), introduced as a pre-training objective in BERT, randomly selects a fraction of the input tokens (15% in the original BERT setup), replaces most of them with a special [MASK] token, and trains the model to predict the original tokens. Because the model can use tokens on both sides of each masked position, it learns bidirectional context.
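The masking step can be sketched in a few lines of plain Python. This is a minimal illustration, not BERT's actual preprocessing code: the function name `mask_tokens` and its parameters are hypothetical, tokens are plain strings rather than subword IDs, and the 80/10/10 split (mask, random token, unchanged) is a detail taken from the BERT paper rather than from this entry.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15, vocab=None, seed=None):
    """Corrupt a token list for MLM training (illustrative sketch).

    Returns (corrupted, labels). labels[i] holds the original token at
    positions the model must predict, and None everywhere else.
    """
    rng = random.Random(seed)
    vocab = vocab or tokens  # assumption: fall back to the input as a toy vocabulary
    corrupted, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            labels.append(tok)  # prediction target: the original token
            roll = rng.random()
            if roll < 0.8:
                corrupted.append(MASK)               # 80%: replace with [MASK]
            elif roll < 0.9:
                corrupted.append(rng.choice(vocab))  # 10%: replace with a random token
            else:
                corrupted.append(tok)                # 10%: leave unchanged
        else:
            labels.append(None)  # not selected, so no loss at this position
            corrupted.append(tok)
    return corrupted, labels

tokens = "the cat sat on the mat".split()
corrupted, labels = mask_tokens(tokens, seed=0)
print(corrupted)
print(labels)
```

During training, the loss would be computed only at positions where `labels` is not None, so the model is never rewarded for simply copying unmasked input.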

MLM produces contextual representations that work well for classification, named entity recognition (NER), and question answering. It differs from causal (autoregressive) language modeling, used in GPT, where each prediction sees only the left context. As a result, MLM models such as BERT tend to be stronger at understanding tasks, while GPT-style models are better suited to text generation.
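The difference in available context can be made concrete with a small sketch. The helper `visible_positions` is a hypothetical name introduced here for illustration; it lists which positions' original tokens can inform the prediction at a target position under each objective, and it deliberately ignores real attention-mask tensors.

```python
def visible_positions(seq_len, target, causal):
    """Positions whose original tokens can inform the prediction at `target`."""
    if causal:
        # Causal / autoregressive: only tokens strictly to the left.
        return list(range(target))
    # Masked LM: every position except the masked one itself.
    return [i for i in range(seq_len) if i != target]

# Six tokens, predicting the token at index 2.
print(visible_positions(6, 2, causal=True))   # [0, 1]
print(visible_positions(6, 2, causal=False))  # [0, 1, 3, 4, 5]
```

The causal variant can generate text left to right because it never depends on future tokens, whereas the masked variant needs the full sequence up front.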

Authority Links

BERT Paper — arXiv

Original BERT paper introducing masked language modeling.

Masked LM — Wikipedia

How BERT and MLM transformed NLP understanding tasks.

Related Terms

Techniques & Methods

Pre-training

Initial phase where a model learns general representations from large datasets before task-specific fine-tuning.

Model Components

Transformer

A neural-network architecture, introduced by Vaswani et al. in 2017, that uses self-attention and parallel computation across all sequence positions — the foundation under virtually every frontier language and multimodal model in production today.

Model Components

Language Model

AI system that assigns probabilities to sequences of words and can generate coherent text.

Techniques & Methods

Training

Teaching a model to make accurate predictions by exposing it to large datasets.
