Techniques & Methods

Distributed Training

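Distributed training spreads the computation of a training run across many accelerators. Data parallelism gives every device a full copy of the model and splits each batch across them, synchronizing gradients at every step. Model parallelism splits the model itself across devices, either by layer or within layers (tensor parallelism), when the model is too large for one device's memory. Pipeline parallelism places groups of layers on different devices as stages and feeds them micro-batches, so that stages work concurrently instead of sitting idle.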
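The data-parallel case can be illustrated with a toy sketch in plain Python. It is not how any particular framework implements it: the "devices" are simulated as list slices, the model is a hypothetical one-variable linear regression, and the averaging step stands in for the all-reduce that real hardware would perform. The point it shows is that averaging gradients from equal-sized shards gives the same result as computing the gradient on the full batch.

```python
# Toy data parallelism: each simulated "device" computes gradients on its own
# shard, then the gradients are averaged (the role an all-reduce plays on
# real hardware). All names here are illustrative, not a library API.

def grad_mse(w, b, xs, ys):
    """Gradient of mean squared error for the model y = w*x + b."""
    n = len(xs)
    gw = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
    gb = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
    return gw, gb

def data_parallel_grad(w, b, xs, ys, num_devices):
    """Split the batch into equal shards, compute local gradients, average them."""
    assert len(xs) % num_devices == 0, "sketch assumes equal-sized shards"
    shard = len(xs) // num_devices
    local = [
        grad_mse(w, b, xs[i * shard:(i + 1) * shard], ys[i * shard:(i + 1) * shard])
        for i in range(num_devices)
    ]
    gw = sum(g[0] for g in local) / num_devices
    gb = sum(g[1] for g in local) / num_devices
    return gw, gb

xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0, 11.0, 13.0, 15.0]  # generated from y = 2x + 1

full = grad_mse(0.5, 0.0, xs, ys)                       # single-device gradient
sharded = data_parallel_grad(0.5, 0.0, xs, ys, num_devices=4)  # 4 simulated devices
print(full, sharded)  # the two gradients agree up to floating-point rounding
```

Because every device ends the step with the same averaged gradient, all model copies stay identical after the optimizer update, which is what lets data parallelism scale the batch without changing the model.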

Training frontier LLMs such as GPT-4 or Llama can require thousands of GPUs running for weeks or months. Frameworks such as DeepSpeed, Megatron-LM, and PyTorch FSDP make training at this scale practical by sharding model state across devices and reducing the communication overhead between them.
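To give a rough sense of why communication overhead matters, the sketch below estimates how many bytes each device sends when gradients are synchronized with a ring all-reduce. Ring all-reduce is an assumption made here for illustration; the frameworks above choose among several collective algorithms and also overlap communication with computation. The function name and the example figures (7 billion parameters, 2 bytes per gradient value, 8 devices) are hypothetical.

```python
def ring_allreduce_bytes_per_device(num_params, bytes_per_param, num_devices):
    """Bytes each device sends in one ring all-reduce over the full gradient.

    A ring all-reduce runs a reduce-scatter followed by an all-gather; each
    phase sends (N - 1) / N of the buffer, giving 2 * (N - 1) / N in total.
    """
    if num_devices < 2:
        return 0.0  # a single device has nothing to synchronize
    grad_bytes = num_params * bytes_per_param
    return 2 * (num_devices - 1) / num_devices * grad_bytes

# Hypothetical example: 7e9 parameters, 2-byte gradients, 8 devices.
sent = ring_allreduce_bytes_per_device(7_000_000_000, 2, 8)
print(f"{sent / 1e9:.1f} GB sent per device per step")
```

The per-device traffic approaches twice the gradient size as the device count grows, so it is the size of the model, not the number of devices, that dominates the cost of each synchronization under this assumption.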

Authority Links

Distributed Computing — Wikipedia

Foundations of distributed systems applicable to AI training.

DeepSpeed

Microsoft's deep learning optimization library for distributed training.

Related Terms

Techniques & Methods

Training

Teaching a model to make accurate predictions by exposing it to large datasets.

Model Components

Large Language Model (LLM)

A transformer-based neural network with billions to trillions of parameters, trained on broad text corpora to predict the next token and able to generate, summarize, classify, and reason over natural language.

Techniques & Methods

Pre-training

Initial phase where a model learns general representations from large datasets before task-specific fine-tuning.

Model Components

Parameter

A learnable variable within a model whose value is adjusted during training to minimize prediction error.
