
Reinforcement Learning

Guide for reinforcement learning systems where agents learn through environment interaction and reward feedback

Paste into your CLAUDE.md or agent config


Core Philosophy

Reinforcement learning is the branch of machine learning where an agent learns to make decisions by interacting with an environment and receiving feedback in the form of rewards or penalties. Unlike supervised learning, there is no labeled dataset — the agent must discover optimal behavior through trial and error. The fundamental challenge is balancing exploration of unknown actions with exploitation of known rewarding strategies.

RL is uniquely suited to sequential decision-making problems where the optimal action depends on the current state and has consequences that unfold over time. The agent must learn not just what is immediately rewarding but what leads to the best long-term cumulative outcome.
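That long-term objective is the discounted cumulative return. A minimal sketch of how it is computed (the function name and discount value here are illustrative, not from any particular library):

```python
def discounted_return(rewards, gamma=0.99):
    """Discounted cumulative return: G = r_0 + gamma*r_1 + gamma^2*r_2 + ..."""
    g = 0.0
    for r in reversed(rewards):  # fold from the last step backwards
        g = r + gamma * g
    return g

# A reward received three steps in the future is worth gamma**3 today:
print(discounted_return([0.0, 0.0, 0.0, 1.0], gamma=0.9))  # approximately 0.729
```

The discount factor gamma (typically 0.9 to 0.999) controls how much the agent values future reward relative to immediate reward; gamma near 0 makes it myopic, gamma near 1 makes it far-sighted.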

Key Techniques

  • Q-Learning: A model-free algorithm that learns the value of state-action pairs directly from experience. The agent maintains a table or function approximator mapping states and actions to expected future rewards.
  • Policy Gradient Methods: Directly optimize the policy function that maps states to actions, using gradient ascent on expected reward. Better suited for continuous action spaces than value-based methods.
  • Actor-Critic: Combines policy gradient (actor) with value function estimation (critic) to reduce variance while maintaining the ability to handle continuous actions.
  • Reward Shaping: Designing intermediate reward signals that guide the agent toward desired behavior without changing the optimal policy. Critical for environments with sparse natural rewards.
  • Model-Based RL: Learning a model of the environment's dynamics and using it to plan ahead, reducing the number of real interactions needed.
  • Multi-Armed Bandits: The simplest RL formulation where there is no state transition — just repeated choice among options with unknown payoffs. Foundation for understanding exploration-exploitation tradeoffs.
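The Q-learning update can be sketched end to end on a toy problem. The following is a minimal tabular example on a hypothetical 5-state chain environment with epsilon-greedy exploration; the environment, constants, and names are all illustrative:

```python
import random

# Toy deterministic chain MDP: states 0..4, action 1 moves right, action 0
# moves left; reaching state 4 yields reward 1 and ends the episode.
N_STATES, ACTIONS = 5, (0, 1)
ALPHA, GAMMA, EPSILON = 0.5, 0.9, 0.2

def step(state, action):
    next_state = min(state + 1, N_STATES - 1) if action == 1 else max(state - 1, 0)
    reward = 1.0 if next_state == N_STATES - 1 else 0.0
    return next_state, reward, next_state == N_STATES - 1

q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}

random.seed(0)
for _ in range(200):
    s, done = 0, False
    while not done:
        # Epsilon-greedy: explore with probability EPSILON, else exploit.
        if random.random() < EPSILON:
            a = random.choice(ACTIONS)
        else:
            a = max(ACTIONS, key=lambda act: q[(s, act)])
        s2, r, done = step(s, a)
        # Q-learning update: move Q(s, a) toward r + gamma * max_a' Q(s', a').
        target = r + GAMMA * max(q[(s2, act)] for act in ACTIONS)
        q[(s, a)] += ALPHA * (target - q[(s, a)])
        s = s2

# The greedy policy should now move right in every non-terminal state.
policy = {s: max(ACTIONS, key=lambda act: q[(s, act)]) for s in range(N_STATES - 1)}
print(policy)
```

Note that the reward only appears at the terminal state, yet the value backs up through the table one state per visit; this is the mechanism that lets Q-learning credit actions whose payoff is several steps away.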

Best Practices

  • Start with the simplest algorithm that could work. Try tabular Q-learning or contextual bandits before jumping to deep RL.
  • Design reward functions carefully. Misaligned rewards produce agents that optimize for the wrong objective, often in surprising ways.
  • Normalize rewards and observations to stabilize training. For example, scale observations with a running mean and standard deviation, and rescale or clip rewards so value targets and gradients stay in a consistent range.
  • Use experience replay to break correlations in sequential data and improve sample efficiency.
  • Monitor training with multiple metrics beyond reward: episode length, value function estimates, policy entropy, and success rate on evaluation episodes.
  • Train with multiple random seeds. RL results are notoriously high-variance.
  • Separate exploration policy from evaluation policy to get accurate performance estimates during training.
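The experience-replay practice above can be sketched as a small uniform buffer (class and parameter names are illustrative, not from any specific framework):

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity store of (state, action, reward, next_state, done) tuples."""

    def __init__(self, capacity=10_000):
        self.buffer = deque(maxlen=capacity)  # oldest transitions are evicted

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform random sampling decorrelates the minibatch from episode order.
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)

buf = ReplayBuffer(capacity=100)
for t in range(50):
    buf.push(t, t % 2, 0.0, t + 1, False)
print(len(buf), len(buf.sample(8)))  # 50 8
```

The bounded deque also implements the usual recency window: as the policy improves, stale transitions from early, poor behavior are gradually evicted.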

Common Patterns

  • Sim-to-Real Transfer: Train the agent in a simulated environment where interactions are cheap and fast, then transfer the learned policy to the real world with domain randomization or fine-tuning.
  • Curriculum Learning: Start with easy tasks and progressively increase difficulty as the agent improves, avoiding the problem of sparse rewards in complex environments.
  • Hierarchical RL: Decompose complex tasks into sub-goals with separate policies for high-level planning and low-level execution.
  • Inverse RL: Learn a reward function from expert demonstrations rather than specifying it manually, useful when the desired behavior is easier to show than to formalize.
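Curriculum learning can be sketched as a promotion loop that hardens the task only once the agent clears a success threshold on the current level. Everything below is a toy stand-in: `skill` substitutes for real training progress and `evaluate` for rollouts in an actual environment:

```python
import random

random.seed(1)
skill = 0.0       # stand-in for agent competence, improved by "training"
difficulty = 0.0

def evaluate(difficulty, episodes=20):
    """Fraction of episodes solved at this difficulty (toy success model)."""
    p_solve = max(0.0, skill - difficulty + 1.0)
    return sum(random.random() < p_solve for _ in range(episodes)) / episodes

history = []
for _ in range(30):
    skill += 0.1                # placeholder for an actual training step
    success = evaluate(difficulty)
    history.append((difficulty, success))
    if success >= 0.8:          # promote only once the current level is mastered
        difficulty += 0.2
print(f"final difficulty reached: {difficulty:.1f}")
```

The key design choice is the promotion threshold: set it too low and the agent faces sparse rewards it cannot learn from; set it too high and training stalls on levels it has already mastered.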

Anti-Patterns

  • Using deep RL when a simpler optimization method would suffice. RL adds enormous complexity and sample requirements.
  • Designing reward functions that are easy to hack. Agents will find and exploit any shortcut that maximizes reward without achieving the intended goal.
  • Training without sufficient exploration. Premature convergence to suboptimal policies is the most common failure mode.
  • Ignoring the sim-to-real gap when transferring policies from simulation.
  • Using RL for problems with abundant labeled data where supervised learning would be more efficient and stable.
  • Evaluating on training environments only without testing generalization to novel situations.