Techniques & Methods
Distributed Training
Distributed training parallelizes computation across many accelerators using data parallelism (splitting batches across devices) and model parallelism (splitting model layers across devices). Pipeline parallelism further staggers computation to maximize GPU utilization.
Training frontier LLMs like GPT-4 or Llama required thousands of GPUs running for months. Frameworks like DeepSpeed, Megatron-LM, and PyTorch FSDP enable efficient distributed training at scale by managing communication overhead between devices.
Authority Links
Related Terms
Techniques & Methods
Training
Teaching a model to make accurate predictions by exposing it to large datasets.
Model Components
Large Language Model (LLM)
A transformer-based neural network with billions to trillions of parameters, trained on broad text corpora to predict the next token and able to generate, summarize, classify, and reason over natural language.
Techniques & Methods
Pre-training
Initial phase where a model learns general representations from large datasets before task-specific fine-tuning.
Model Components
Parameter
A learnable variable within a model whose value is adjusted during training to minimize prediction error.

