Techniques & Methods
Bandit Optimization
Bandit optimization, named after the slot machine (the "one-armed bandit") and the multi-armed bandit problem built on it, seeks to maximize cumulative reward when the outcomes of actions are uncertain. At each step, the algorithm must decide whether to exploit the best option found so far or explore less-tried options that might turn out to be better.
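The explore-or-exploit decision can be illustrated with a minimal epsilon-greedy sketch. Everything specific here is an assumption made for illustration: the function name `epsilon_greedy`, the Bernoulli (win/lose) rewards, the three arm probabilities, and the 10% exploration rate.

```python
import random


def epsilon_greedy(true_means, steps=5000, epsilon=0.1, seed=0):
    """Run epsilon-greedy on Bernoulli arms; return (pull counts, value estimates)."""
    rng = random.Random(seed)
    n_arms = len(true_means)
    counts = [0] * n_arms       # number of times each arm was pulled
    estimates = [0.0] * n_arms  # running mean reward observed per arm

    for _ in range(steps):
        if rng.random() < epsilon:
            # Explore: try an arm chosen uniformly at random.
            arm = rng.randrange(n_arms)
        else:
            # Exploit: pick the arm with the highest estimate so far.
            arm = max(range(n_arms), key=lambda a: estimates[a])
        # Hypothetical environment: reward 1 with the arm's hidden probability.
        reward = 1.0 if rng.random() < true_means[arm] else 0.0
        counts[arm] += 1
        # Incremental update of the sample mean for the pulled arm.
        estimates[arm] += (reward - estimates[arm]) / counts[arm]

    return counts, estimates


# Illustrative arm probabilities (assumed, not from any real system).
eg_counts, eg_estimates = epsilon_greedy([0.2, 0.5, 0.7])
print(eg_counts, [round(e, 2) for e in eg_estimates])
```

With a small `epsilon`, most pulls go to whichever arm currently looks best, while the occasional random pull keeps the estimates of the other arms from going stale.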
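Bandit algorithms such as UCB (Upper Confidence Bound), Thompson Sampling, and Epsilon-Greedy are used in A/B testing, recommendation systems, ad placement, and hyperparameter tuning. In hyperparameter tuning they can be more sample-efficient than grid search, because they shift trials toward promising configurations as evidence accumulates instead of spending the same budget on every configuration.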
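As a rough illustration of how a bandit algorithm concentrates trials on promising options, here is a sketch of UCB1 over a handful of candidate configurations. The function name `ucb1`, the Bernoulli "trial succeeded" reward, and the success rates are assumptions chosen for the example; a real tuning setup would replace the simulated reward with an actual evaluation.

```python
import math
import random


def ucb1(true_means, steps=5000, seed=0):
    """Run UCB1 on Bernoulli arms; return how often each arm was pulled."""
    rng = random.Random(seed)
    n_arms = len(true_means)
    counts = [0] * n_arms   # trials spent on each configuration
    means = [0.0] * n_arms  # running mean reward per configuration

    for t in range(1, steps + 1):
        if t <= n_arms:
            # Try every configuration once so each has an initial estimate.
            arm = t - 1
        else:
            # Choose the highest upper confidence bound: the mean plus a
            # bonus that shrinks as a configuration accumulates trials.
            arm = max(
                range(n_arms),
                key=lambda a: means[a] + math.sqrt(2 * math.log(t) / counts[a]),
            )
        # Hypothetical evaluation: success with the configuration's hidden rate.
        reward = 1.0 if rng.random() < true_means[arm] else 0.0
        counts[arm] += 1
        means[arm] += (reward - means[arm]) / counts[arm]

    return counts


# Four hypothetical configurations with assumed success rates.
ucb_counts = ucb1([0.30, 0.45, 0.50, 0.70])
print(ucb_counts)
```

A grid search with the same budget would spend `steps / 4` trials on each configuration regardless of results; UCB1 instead keeps sampling weaker configurations only as often as the remaining uncertainty justifies.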
Related Terms
Reinforcement Learning
An agent learns by taking actions in an environment and receiving rewards or penalties.
Proximal Policy Optimization (PPO)
RL algorithm that stabilizes training by constraining the size of each policy update, typically through a clipped objective.
Offline Reinforcement Learning
Learning optimal policies from fixed historical datasets without interacting with a live environment.
Evaluation Metrics
Quantitative measures used to assess how well an AI model performs on a task.

