You are currently viewing Gomoku Reinforcement Learning Agent

Gomoku Reinforcement Learning Agent

Developed autonomous game-playing agents for 9×9 Gomoku using two fundamentally different reinforcement learning paradigms: Double Deep Q-Networks (DDQN) and AlphaZero. Compared value-based and model-based tree search approaches, achieving superhuman performance through self-play learning without hardcoded domain knowledge.

Code: https://github.com/yash-joshi5379/Gomoku-rl

Project Overview:

Implemented end-to-end reinforcement learning systems to learn Gomoku strategy from zero knowledge. The project combined a custom game environment with two distinct training pipelines: a value-based DDQN agent trained through experience replay and curriculum learning against random/heuristic/self-play opponents, and an AlphaZero architecture coupling deep residual networks with Monte Carlo Tree Search. Conducted systematic reward tuning for DDQN across four iterations, identified and resolved a critical distribution gap in AlphaZero through tactical threat detection, and benchmarked all agents in a round-robin tournament revealing a stark hierarchy in playing strength.

Game Environment and Visualization:

  • Board Representation: 9×9 NumPy array with integer encoding (0=empty, 1=black, 2=white); enables efficient vectorized operations for occupancy checks, threat detection, and move generation
  • Win Detection: Scans only directions passing through most recently placed stone (horizontal, vertical, both diagonals), reducing redundant full-board checks while remaining logically correct
  • State Cloning: Full deep copy of board state, move history, and game result enables parallel tree search and counterfactual trajectory rollouts without modifying live game state
  • Pygame Visualization: Interactive GUI with highlighted last move, current player indicator, and mouse-to-board coordinate mapping; logs training metrics simultaneously to CSV and TensorBoard

Approach 1: Double Deep Q-Networks

  • Architecture: Residual convolutional network with 4 blocks (64 channels) operating on 3-channel input (agent stones, opponent stones, turn indicator); skip connections mitigate vanishing gradients and preserve global context across full board receptive field
  • Action Masking: Pre-move masking sets occupied squares to -10⁹ Q-values, preventing illegal move selection both during training and inference; eliminates systematic error from evaluating impossible state transitions
  • Stability Mechanisms: Experience replay buffer (50,000 capacity) breaks temporal correlations; target network synchronized every 2,000 steps decouples action selection from value evaluation (Double DQN), reducing upward Q-value bias
  • Symmetry Augmentation: 8-fold dihedral symmetry (4 rotations × 2 reflections) multiplies training data 8× without additional game generation cost; encourages rotationally-invariant position evaluation
  • Curriculum Learning: 1,000 random games → 1,000 heuristic games → variable self-play episodes; exponential ε-decay transitions from exploration to greedy exploitation across phases

DQN Reward Tuning Iterations:

  • Iteration 1: Rewards [-2.0 to 1.0] including BLOCK_FOUR=0.8, OPEN_THREE=0.08, WIN=1.0; agent achieved center control but failed critical blocking (settled at 50% self-play win rate)
  • Iteration 2: Increased WIN to 10.0 to prioritize victory; agent learned early 5-in-a-row patterns but pathologically short games (5-6 moves), unable to adapt against blocking opponents
  • Iteration 3: Normalized rewards to [-1.0, 1.0], increased three/four structure rewards (OPEN_THREE=0.4, OPEN_FOUR=0.7); improved defensive blocking and game length but still failed edge-case blocks
  • Iteration 4: Maintained iteration 3 rewards but doubled self-play episodes (4,000→8,000); exhibited stronger blocking on center lines but persistent diagonal weaknesses; defensive preference dominated over attacking

Approach 2: AlphaZero with Adaptations

  • Algorithm: Couples residual network (policy head + value head) with batched MCTS; network guides tree search via prior probabilities and position values; search results provide improved policy targets for network training
  • Hardware Adaptation: Scaled from 5,000 TPUs to single RTX 4090; reduced residual blocks (20→10), filters (256→128), MCTS simulations (800→400), batch size (4,096→256); achieved 4× speedup through batched parallel MCTS where all N games evaluate leaf states in single GPU forward pass
  • Threat Detection System: Rule-based intervention during training only (not at play time) with three-tier priority: immediate win (100%), immediate block (100%), open three detection (50% probability); forces network to encounter sparse-board defensive positions outside its self-play distribution
  • Critical 50% Randomization: Disabling open-three detection in 50% of nodes produces mixed training data (decisive + drawn games) rather than all 81-move draws; enables network to learn blocking patterns on sparse early-game boards matching human play scenarios
  • Learning Rate Scaling: Reduced from 0.2 (batch size 4,096) to 0.05 (batch size 256) following gradient step scaling; 0.2 caused loss flatline, 0.01 converged too slowly

AlphaZero Model Comparison:

  • Model A (400/30/180): 400 MCTS simulations, 30 games/iteration, 180 iterations, 13 hours training; final policy loss 2.06, value loss 0.46; places 94.1% probability on center square in opening; 100% win vs random
  • Model B (800/40/200): 800 simulations, 40 games/iteration, 200 iterations, 26 hours training; final policy loss 1.09 (52% reduction), value loss 0.13 (72% reduction); distributes opening probability equally across center-adjacent squares; stronger positional play and counter-attacking
  • Key Difference: Deeper MCTS in Model B produces more informative value targets (more frequent terminal state encounters); manifests as sharper move distributions (700-800/800 top-move visits vs more diffuse Model A)

Round-Robin Tournament Results:

  • Agents Tested: DQN-A (iteration 3), DQN-B (iteration 4), AlphaZero Model A, AlphaZero Model B, Rule-Based Heuristic
  • Win Rate Hierarchy: AZ-B (85.0%, 34-0-6) > AZ-A (72.5%, 29-5-6) > Heuristic (40.0%) > DQN-A (20.0%) > DQN-B (15.0%)
  • AlphaZero Dominance: Both models achieved 100% win rate against both DQN agents, winning in 13-17 moves; defeats manifest in early tactical blunders indicating DQN failure at threat detection
  • DQN Asymmetry: DQN-A won 8 games as Black, 0 as White (20% to 0%); DQN-B won 5 Black, 1 White (12.5% to 2.5%); demonstrates learned dependence on first-move advantage without defensive competence
  • Game Lengths: AZ-A vs AZ-B averaged 56 moves (longest tournament matchup, reflecting strong mutual defense); AlphaZero vs DQN averaged 13-17 moves (early tactical victory); DQN vs Heuristic averaged 45+ moves (similar skill levels)

Distribution Gap and Solution:

  • Root Cause Identified: Pure self-play at limited compute creates training distribution gap; early network does not produce linear column attacks, so neither player encounters such positions; network learns to play well against itself but fails against out-of-distribution strategies
  • Novel Solution: Tactical threat detection override with 50% randomization on open-three patterns; provides sharp policy targets during training without constraining network at play time; enables network to internalize blocking patterns learned from forced-move feedback
  • Impact Validation: Model without threat detection achieved low policy loss (2.45) but could not block trivial column attacks at play time; with threat detection, successfully blocked human attacks and won competitive games

Key Technical Insights:

  • Value-based methods (DDQN) struggle with sparse, distant reward signals and reward shaping sensitivity; diagonal pattern detection failure suggests limited receptive field or reward structure gap
  • Model-based tree search (AlphaZero) explores state space more systematically, naturally discovering blocking responses through search even with poor initial policy; scales better to complex board positions
  • Self-play convergence at limited compute requires targeted distribution interventions; threat detection bridges gap between training and evaluation environments without requiring orders of magnitude more compute

Collaboration:

Developed with team of four across game environment design, DQN implementation and tuning, AlphaZero architecture and training orchestration, and tournament benchmarking. Systematic reward iteration (4 DQN variants), architectural ablation (2 AlphaZero models), and neutral engine evaluation enabled fair comparison across fundamentally different learning paradigms.

Technologies: Python, PyTorch, NumPy, Pygame, reinforcement learning (DDQN, AlphaZero), Monte Carlo Tree Search, convolutional neural networks, experience replay, curriculum learning, self-play training, reward shaping, TensorBoard logging