Techniques & Methods

Bandit Optimization

Bandit optimization, named after the slot machine ("one-armed bandit"), seeks to maximize cumulative reward when the payoff of each action is uncertain and can only be learned by trying it. At each step, the algorithm must decide whether to exploit the best-known option so far or to explore less-tried options that might turn out to be better.
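The explore/exploit decision can be sketched with an epsilon-greedy rule, one of the simplest bandit strategies. This is a minimal illustration, not a reference implementation: the function name `epsilon_greedy`, the Bernoulli (0 or 1) reward model, and the parameters `true_means`, `steps`, `epsilon`, and `seed` are all assumptions made for the example.

```python
import random

def epsilon_greedy(true_means, steps=5000, epsilon=0.1, seed=0):
    """Simulate epsilon-greedy on hypothetical Bernoulli arms.

    true_means: assumed success probability of each arm (unknown to the agent).
    Returns (pull counts per arm, total reward collected).
    """
    rng = random.Random(seed)
    n_arms = len(true_means)
    counts = [0] * n_arms    # how often each arm was pulled
    values = [0.0] * n_arms  # running mean reward observed per arm
    total = 0
    for _ in range(steps):
        if rng.random() < epsilon:
            # Explore: pick an arm uniformly at random.
            arm = rng.randrange(n_arms)
        else:
            # Exploit: pick the arm with the best estimate so far
            # (ties go to the lowest index).
            arm = max(range(n_arms), key=lambda a: values[a])
        reward = 1 if rng.random() < true_means[arm] else 0
        counts[arm] += 1
        # Incremental update of the running mean for this arm.
        values[arm] += (reward - values[arm]) / counts[arm]
        total += reward
    return counts, total

counts, total = epsilon_greedy([0.2, 0.8])
print(counts, total)
```

With `epsilon=0.1`, roughly one step in ten is spent exploring; the rest go to whichever arm currently looks best. A larger `epsilon` learns about weak arms faster but spends more pulls on them.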

Bandit algorithms such as UCB, Thompson Sampling, and Epsilon-Greedy are used in A/B testing, recommendation systems, ad placement, and hyperparameter tuning. In tuning, they can use a trial budget more efficiently than grid search, which evaluates every configuration equally, because they shift trials toward configurations that look promising.
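To illustrate how trials drift toward promising options, here is a small sketch of the UCB1 rule, where each arm stands in for one candidate (for example, one hyperparameter configuration). The function name `ucb1`, the Bernoulli reward model, and the parameters are assumptions for illustration only; real tuning systems typically use more elaborate variants.

```python
import math
import random

def ucb1(true_means, steps=5000, seed=0):
    """Simulate UCB1 on hypothetical Bernoulli arms; return pull counts per arm.

    true_means: assumed success probability of each arm (unknown to the agent).
    """
    rng = random.Random(seed)
    n_arms = len(true_means)
    counts = [0] * n_arms    # how often each arm was pulled
    values = [0.0] * n_arms  # running mean reward observed per arm
    for t in range(1, steps + 1):
        if t <= n_arms:
            # Pull every arm once so each has an initial estimate.
            arm = t - 1
        else:
            # Score = estimated mean + confidence bonus; the bonus shrinks
            # as an arm is pulled more, so rarely tried arms get revisited.
            arm = max(
                range(n_arms),
                key=lambda a: values[a] + math.sqrt(2 * math.log(t) / counts[a]),
            )
        reward = 1 if rng.random() < true_means[arm] else 0
        counts[arm] += 1
        values[arm] += (reward - values[arm]) / counts[arm]
    return counts

print(ucb1([0.2, 0.5, 0.8]))
```

A grid search over the same three candidates would split the 5000 trials evenly. The sketch instead concentrates pulls on the arm with the highest observed reward while still sampling the others occasionally.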

Authority Links

Multi-Armed Bandit — Wikipedia

Theory and algorithms for multi-armed bandit optimization.

IBM — Bandit Algorithms

Bandit methods in the context of RL and recommendation systems.

Related Terms

Techniques & Methods

Reinforcement Learning

An agent learns by taking actions in an environment and receiving rewards or penalties.

Techniques & Methods

Proximal Policy Optimization (PPO)

RL algorithm that keeps training stable by constraining the size of each policy update.

Techniques & Methods

Offline Reinforcement Learning

Learning optimal policies from fixed historical datasets without interacting with a live environment.

Techniques & Methods

Evaluation Metrics

Quantitative measures used to assess how well an AI model performs on a task.
