Gomoku Reinforcement Learning Agent

Developed autonomous game-playing agents for 9×9 Gomoku using two fundamentally different reinforcement learning paradigms: Double Deep Q-Networks (DDQN) and AlphaZero. Compared value-based and model-based tree search approaches, achieving superhuman performance through self-play learning without hardcoded domain knowledge.

Code: https://github.com/yash-joshi5379/Gomoku-rl

Project Overview:

Implemented end-to-end reinforcement learning systems to learn Gomoku strategy from zero knowledge. The project combined a custom game environment with two distinct training pipelines: a value-based DDQN agent trained through experience replay and curriculum learning against random/heuristic/self-play opponents, and an AlphaZero architecture coupling deep residual networks with Monte Carlo Tree Search. Conducted systematic reward tuning for DDQN across four iterations, identified and resolved a critical distribution gap in AlphaZero through tactical threat detection, and benchmarked all agents in a round-robin tournament revealing a stark hierarchy in playing strength.

Game Environment and Visualization:

Board Representation: 9×9 NumPy array with integer encoding (0=empty, 1=black, 2=white); enables efficient vectorized operations for occupancy checks, threat detection, and move generation
Win Detection: Scans only directions passing through most recently placed stone (horizontal, vertical, both diagonals), reducing redundant full-board checks while remaining logically correct
State Cloning: Full deep copy of board state, move history, and game result enables parallel tree search and counterfactual trajectory rollouts without modifying live game state
Pygame Visualization: Interactive GUI with highlighted last move, current player indicator, and mouse-to-board coordinate mapping; logs training metrics simultaneously to CSV and TensorBoard

Approach 1: Double Deep Q-Networks

Architecture: Residual convolutional network with 4 blocks (64 channels) operating on 3-channel input (agent stones, opponent stones, turn indicator); skip connections mitigate vanishing gradients and preserve global context across full board receptive field
Action Masking: Pre-move masking sets occupied squares to -10⁹ Q-values, preventing illegal move selection both during training and inference; eliminates systematic error from evaluating impossible state transitions
Stability Mechanisms: Experience replay buffer (50,000 capacity) breaks temporal correlations; target network synchronized every 2,000 steps decouples action selection from value evaluation (Double DQN), reducing upward Q-value bias
Symmetry Augmentation: 8-fold dihedral symmetry (4 rotations × 2 reflections) multiplies training data 8× without additional game generation cost; encourages rotationally-invariant position evaluation
Curriculum Learning: 1,000 random games → 1,000 heuristic games → variable self-play episodes; exponential ε-decay transitions from exploration to greedy exploitation across phases

DQN Reward Tuning Iterations:

Iteration 1: Rewards [-2.0 to 1.0] including BLOCK_FOUR=0.8, OPEN_THREE=0.08, WIN=1.0; agent achieved center control but failed critical blocking (settled at 50% self-play win rate)
Iteration 2: Increased WIN to 10.0 to prioritize victory; agent learned early 5-in-a-row patterns but pathologically short games (5-6 moves), unable to adapt against blocking opponents
Iteration 3: Normalized rewards to [-1.0, 1.0], increased three/four structure rewards (OPEN_THREE=0.4, OPEN_FOUR=0.7); improved defensive blocking and game length but still failed edge-case blocks
Iteration 4: Maintained iteration 3 rewards but doubled self-play episodes (4,000→8,000); exhibited stronger blocking on center lines but persistent diagonal weaknesses; defensive preference dominated over attacking

Approach 2: AlphaZero with Adaptations

Algorithm: Couples residual network (policy head + value head) with batched MCTS; network guides tree search via prior probabilities and position values; search results provide improved policy targets for network training
Hardware Adaptation: Scaled from 5,000 TPUs to single RTX 4090; reduced residual blocks (20→10), filters (256→128), MCTS simulations (800→400), batch size (4,096→256); achieved 4× speedup through batched parallel MCTS where all N games evaluate leaf states in single GPU forward pass
Threat Detection System: Rule-based intervention during training only (not at play time) with three-tier priority: immediate win (100%), immediate block (100%), open three detection (50% probability); forces network to encounter sparse-board defensive positions outside its self-play distribution
Critical 50% Randomization: Disabling open-three detection in 50% of nodes produces mixed training data (decisive + drawn games) rather than all 81-move draws; enables network to learn blocking patterns on sparse early-game boards matching human play scenarios
Learning Rate Scaling: Reduced from 0.2 (batch size 4,096) to 0.05 (batch size 256) following gradient step scaling; 0.2 caused loss flatline, 0.01 converged too slowly

AlphaZero Model Comparison:

Model A (400/30/180): 400 MCTS simulations, 30 games/iteration, 180 iterations, 13 hours training; final policy loss 2.06, value loss 0.46; places 94.1% probability on center square in opening; 100% win vs random
Model B (800/40/200): 800 simulations, 40 games/iteration, 200 iterations, 26 hours training; final policy loss 1.09 (52% reduction), value loss 0.13 (72% reduction); distributes opening probability equally across center-adjacent squares; stronger positional play and counter-attacking
Key Difference: Deeper MCTS in Model B produces more informative value targets (more frequent terminal state encounters); manifests as sharper move distributions (700-800/800 top-move visits vs more diffuse Model A)

Round-Robin Tournament Results:

Agents Tested: DQN-A (iteration 3), DQN-B (iteration 4), AlphaZero Model A, AlphaZero Model B, Rule-Based Heuristic
Win Rate Hierarchy: AZ-B (85.0%, 34-0-6) > AZ-A (72.5%, 29-5-6) > Heuristic (40.0%) > DQN-A (20.0%) > DQN-B (15.0%)
AlphaZero Dominance: Both models achieved 100% win rate against both DQN agents, winning in 13-17 moves; defeats manifest in early tactical blunders indicating DQN failure at threat detection
DQN Asymmetry: DQN-A won 8 games as Black, 0 as White (20% to 0%); DQN-B won 5 Black, 1 White (12.5% to 2.5%); demonstrates learned dependence on first-move advantage without defensive competence
Game Lengths: AZ-A vs AZ-B averaged 56 moves (longest tournament matchup, reflecting strong mutual defense); AlphaZero vs DQN averaged 13-17 moves (early tactical victory); DQN vs Heuristic averaged 45+ moves (similar skill levels)

Distribution Gap and Solution:

Root Cause Identified: Pure self-play at limited compute creates training distribution gap; early network does not produce linear column attacks, so neither player encounters such positions; network learns to play well against itself but fails against out-of-distribution strategies
Novel Solution: Tactical threat detection override with 50% randomization on open-three patterns; provides sharp policy targets during training without constraining network at play time; enables network to internalize blocking patterns learned from forced-move feedback
Impact Validation: Model without threat detection achieved low policy loss (2.45) but could not block trivial column attacks at play time; with threat detection, successfully blocked human attacks and won competitive games