CS451/551 · Reinforcement Learning · Homework 2

Assignment 2: Reinforcement Learning

Mars Rover — Red Planet Rescue

R-7

In this project your Mars Rover agent will navigate treacherous Martian terrain — avoiding craters and collecting geological samples. You will implement Value Iteration and tabular Q-Learning (with epsilon-greedy exploration) to teach Rover-7 to act optimally under stochastic solar interference.

75 Points · 5 Questions · Python 3.10+

Compatibility & Prerequisites

This project has been tested and works on Linux, macOS, and Windows.

Install Pygame (optional)

pip install pygame

The autograder, self-check, and all tests run without Pygame. The GUI is purely for visualization and exploration.

Verify your setup

python mars_rover.py -m

If Pygame is installed, you'll see an animated Mars surface. Use arrow keys to control the rover. Without Pygame, text mode activates automatically.

python mars_rover.py -t   # Force text mode (no Pygame needed)

Learning Objectives

By the end of this project you should be able to:

  • Implement batch Value Iteration from the Bellman optimality equation (Q1).
  • Predict how discount, noise, and living reward shape the optimal policy (Q2).
  • Implement tabular Q-Learning with its sample-based update rule (Q3).
  • Balance exploration and exploitation using an ε-greedy strategy (Q4).
  • Recognize why tabular methods struggle to scale beyond small grids (Q5).

Project Structure

mars_rover_rl/
├── mars_rover.py                  # Entry point — run this to start
├── autograder.py                   # Official autograder (75 pts)
├── check.py                       # Self-check (3 maps, no grade)
├── project_params.py               # Project metadata
├── util.py                         # Data structures: Counter, sampling
│
├── env/                            # Environment package
│   ├── mdp.py                     # Abstract MDP interface
│   ├── environment.py             # Abstract Environment interface
│   ├── mars_grid.py               # MarsGrid MDP, layout loader
│   └── crawler.py                 # Crawler robot domain
│
├── agents/                         # Agent base classes
│   └── learning_agents.py         # ValueEstimationAgent, ReinforcementAgent
│
├── students/                       # ✏ YOUR CODE — only edit these
│   ├── value_iteration_agents.py  # ✏ Q1: Value Iteration
│   ├── qlearning_agents.py        # ✏ Q3, Q4, Q5: Q-Learning
│   └── analysis.py                # ✏ Q2: Policy parameters
│
├── display/                        # Visualization (optional)
│   ├── gui_display.py             # Pygame animated Mars display
│   ├── text_display.py            # ASCII terminal display
│   └── crawler_gui.py             # Pygame crawler display
│
├── grading/                        # Test framework
│   ├── grading.py                 # Grade tracker
│   ├── test_parser.py             # .test/.solution file parser
│   ├── test_classes.py            # Question/TestCase base classes
│   └── mars_test_classes.py       # Project-specific test logic
│
├── layouts/                        # 9 grid layout files (.lay)
│
└── test_cases/                     # 19 test cases (q1–q5)
    └── q1/ ... q5/                # CONFIG + .test + .solution per question

Files you will edit

You only need to modify three files inside the students/ directory. Every place you need to fill in is marked with a # *** YOUR CODE HERE *** comment. Do not modify any other files.

File                                 What you implement
students/value_iteration_agents.py   Value Iteration agent (Q1)
students/qlearning_agents.py         Q-Learning agent + ε-greedy exploration (Q3, Q4, Q5)
students/analysis.py                 Policy parameter choices for Canyon Grid (Q2)

Exact methods to implement

students/value_iteration_agents.py — 3 methods

Method                                       Q    Description
run_value_iteration()                        Q1   Main VI loop — batch Bellman updates for k iterations
compute_q_value_from_values(state, action)   Q1   Q(s,a) = Σ_{s'} T(s,a,s') · [R(s,a,s') + γ · V(s')]
compute_action_from_values(state)            Q1   Best action from current values (argmax over Q)

students/qlearning_agents.py — 5 methods

Class / Method                                             Q    Description
QLearningAgent.get_q_value(state, action)                  Q3   Return Q-value from Counter (default 0)
QLearningAgent.compute_value_from_q_values(state)          Q3   V(s) = max_a Q(s,a); return 0 if terminal
QLearningAgent.compute_action_from_q_values(state)         Q3   Best action — break ties randomly
QLearningAgent.get_action(state)                           Q4   ε-greedy: random with prob ε, else best
QLearningAgent.update(state, action, next_state, reward)   Q3   Q-learning update rule

students/analysis.py — 5 functions

Function       Q    Description
question2a()   Q2   Return (discount, noise, living_reward) → close exit, risk cliff
question2b()   Q2   Close exit, avoid cliff
question2c()   Q2   Distant depot, risk cliff
question2d()   Q2   Distant depot, avoid cliff
question2e()   Q2   Never terminate (wander forever)

Getting Started

Step 1 — Run the grid world manually

python mars_rover.py -m

Use arrow keys to move the rover. Press Enter/Space to extract at a sample site or crater. This helps you understand the grid mechanics, especially how solar interference makes you slip sideways.

Step 2 — Try different layouts

python mars_rover.py -m -g base_camp_grid
python mars_rover.py -m -g canyon_grid
python mars_rover.py -m -g maze_grid
python mars_rover.py --list-layouts        # See all 9 available grids

Step 3 — Understand the Mars Grid MDP

Every grid cell is one of:

  S       = Start
  ##      = Wall
  +1      = Sample (terminal)
  -1      = Crater (terminal)
  (empty) = Open terrain

Movement noise (solar interference): When the rover chooses to go north, it actually moves north with probability 1 − noise; with total probability noise it slips sideways, east or west with equal chance (at the default noise of 0.2: 80% north, 10% east, 10% west).

If the rover would hit a wall, it stays in place. At terminal cells (samples/craters), the only action is extract, which ends the episode.

Step 4 — Understand the key parameters

python mars_rover.py -a value --discount 0.9 --noise 0.2 --living-reward 0.0

Homework Tasks (Q1–Q5)

20 pts Q1 — Value Iteration

Implement ValueIterationAgent — an offline planner that computes optimal values using the Bellman equation:

V_{k+1}(s) = max_a  Σ_{s'} T(s,a,s') · [R(s,a,s') + γ · V_k(s')]

Implement three methods in students/value_iteration_agents.py:

  1. run_value_iteration() — the main loop (batch updates!)
  2. compute_q_value_from_values(state, action) — Q from V
  3. compute_action_from_values(state) — best action from V
Important

Use batch value iteration: create a new Counter for V_{k+1} each iteration. Compute ALL the new values from the old values, then replace. Do not update in place.
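
For reference, here is a minimal sketch of one possible run_value_iteration(). The MDP accessors (self.mdp.get_states(), self.mdp.get_possible_actions()) and the self.values / self.iterations attributes are assumptions about this project's interface, so adapt the names to what you find in the starter code:

import util

def run_value_iteration(self):
    # Batch sweep: build V_{k+1} in a fresh Counter while reading only V_k,
    # because compute_q_value_from_values() still reads the old self.values.
    for _ in range(self.iterations):
        new_values = util.Counter()                         # V_{k+1}, defaults to 0
        for state in self.mdp.get_states():                 # assumed MDP accessor
            actions = self.mdp.get_possible_actions(state)  # assumed MDP accessor
            if actions:                                     # terminal states keep V = 0
                new_values[state] = max(
                    self.compute_q_value_from_values(state, a) for a in actions)
        self.values = new_values                            # replace after the full sweep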

python mars_rover.py -a value -i 100 -k 10           # Run VI, 100 sweeps, 10 test episodes
python mars_rover.py -a value -i 5 -g base_camp_grid  # Watch 5 iterations converge
python autograder.py -q q1                              # Grade Q1
20 pts Q2 — Mission Profiles (Policies) · Depends on Q1

The Canyon Grid has a close exit (+1), a distant depot (+10), and a crater cliff (-10 each along the bottom):

[Canyon Grid diagram: start S, walls (##), a close +1 exit, a distant +10 depot, and a row of -10 craters along the bottom edge.]

Choose (discount, noise, living_reward) to produce each behavior. Edit students/analysis.py:

Function     Desired behavior
question2a   Prefer close exit (+1), risking the cliff
question2b   Prefer close exit (+1), avoiding the cliff
question2c   Prefer distant depot (+10), risking the cliff
question2d   Prefer distant depot (+10), avoiding the cliff
question2e   Avoid all exits and craters (never terminate)
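
Each function returns the three parameters as a single tuple. A hypothetical skeleton (the values shown are just the environment defaults, not answers):

def question2a():
    # Placeholder values: replace with parameters that produce the target behavior.
    discount = 0.9
    noise = 0.2
    living_reward = 0.0
    return (discount, noise, living_reward)
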
python mars_rover.py -g canyon_grid -a value --discount 0.9 --noise 0.2 --living-reward 0.0
python autograder.py -q q2
20 pts Q3 — Q-Learning

Implement a model-free agent that learns from experience. The Q-learning update rule:

Q(s,a) ← Q(s,a) + α · [r + γ · max_{a'} Q(s',a') - Q(s,a)]

Implement in students/qlearning_agents.py:

  1. get_q_value(state, action) — lookup from Counter
  2. compute_value_from_q_values(state) — max Q over actions
  3. compute_action_from_q_values(state) — best action (break ties randomly!)
  4. update(state, action, next_state, reward) — Q-learning update
Critical hint

Actions the rover has never tried have Q-value = 0 (Counter default). If all tried actions have negative Q-values, an untried action (Q=0) is the best choice. Don't skip unseen actions!
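
To make these pieces concrete, here is a minimal sketch of three of the methods. It assumes self.q_values is a util.Counter created in __init__, and that self.alpha, self.discount, and self.get_legal_actions(state) exist; treat those names as assumptions about the starter code, not as the official API:

import random

def get_q_value(self, state, action):
    # Counter returns 0 for never-tried (state, action) pairs, as required.
    return self.q_values[(state, action)]

def compute_action_from_q_values(self, state):
    legal_actions = self.get_legal_actions(state)    # assumed accessor
    if not legal_actions:
        return None                                  # terminal state: no action
    # Consider every legal action, including untried ones with Q = 0.
    best_q = max(self.get_q_value(state, a) for a in legal_actions)
    best_actions = [a for a in legal_actions
                    if self.get_q_value(state, a) == best_q]
    return random.choice(best_actions)               # break ties randomly

def update(self, state, action, next_state, reward):
    # Sample-based backup: move Q(s,a) toward r + γ · max_a' Q(s',a').
    target = reward + self.discount * self.compute_value_from_q_values(next_state)
    old_q = self.get_q_value(state, action)
    self.q_values[(state, action)] = old_q + self.alpha * (target - old_q)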

python mars_rover.py -a q -k 5 -m        # Manual Q-learning (5 episodes)
python mars_rover.py -a q -k 100          # Train 100 episodes
python autograder.py -q q3
8 pts Q4 — Epsilon-Greedy Exploration · Depends on Q3

Implement get_action(state) — the ε-greedy strategy:

  • With probability ε: take a random legal action (explore)
  • Otherwise: take the best action from Q-values (exploit)

Use util.flip_coin(self.epsilon) and random.choice(legal_actions).

Note

A random action may happen to be the best action — that's fine. Don't filter out the optimal action during exploration.
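
A minimal sketch, using the util.flip_coin and random.choice helpers named above and assuming legal actions come from self.get_legal_actions(state) (an assumed accessor):

import random
import util

def get_action(self, state):
    legal_actions = self.get_legal_actions(state)    # assumed accessor
    if not legal_actions:
        return None                                  # terminal state: no action
    if util.flip_coin(self.epsilon):                 # explore with probability ε
        return random.choice(legal_actions)          # may coincide with the best action
    return self.compute_action_from_q_values(state)  # exploit otherwise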

After implementing Q4, the Mars Crawler robot should also work without further code changes:

python -m env.crawler              # Run the crawler — should learn to crawl forward
python autograder.py -q q4
7 pts Q5 — Q-Learning Deployment on Mars · Depends on Q3 + Q4

No new code — this question stress-tests your Q3 + Q4 implementation. Deploy your Q-learner on the small_grid. Train for 2000 episodes, then test for 100. The autograder requires a win rate of ≥ 80%, and also verifies that the same Q-Learner trains the Crawler robot to move forward.

python mars_rover.py -a q -k 2010 -g small_grid -e 0.05 --alpha 0.2 --discount 0.8 -q
python autograder.py -q q5
Why larger grids strain tabular Q-Learning

On medium_grid or larger, tabular Q-learning struggles to converge: every cell is a separate state with its own Q-values, so the rover has no way to generalize that "craters are bad everywhere." This limitation motivates function approximation, a follow-up topic beyond the scope of this project.

Algorithm Reference

Value Iteration vs. Q-Learning

Property      Value Iteration (Q1)            Q-Learning (Q3)
Has model?    Yes — knows T(s,a,s') and R     No — learns from experience
Planning      Offline (before acting)         Online (while acting)
Updates       Sweeps over ALL states          Updates ONE (s,a) per step
Convergence   After enough iterations         After enough episodes

Common mistakes to avoid

  • Updating values in place during Value Iteration instead of doing batch sweeps (Q1).
  • Ignoring untried actions: their Q-value is 0, which beats any negative tried action (Q3).
  • Forgetting to break ties randomly in compute_action_from_q_values (Q3).
  • Filtering the best action out of the random choice during exploration (Q4).
  • Assuming states are (x, y) tuples; this breaks the crawler, whose states are (arm_index, hand_index) (Q5).

Testing Your Implementation

Self-Check (no grade)

python check.py              # Run all checks on 3 maps
python check.py -q q1        # Check only Q1

The self-check tests your code on small_grid, base_camp_grid, and canyon_grid. It shows ✅/❌ per check with diagnostic messages. No points or grades — purely for debugging.

Official Autograder (75 points)

python autograder.py                 # Grade all questions
python autograder.py -q q1           # Grade one question
python autograder.py --no-graphics   # Skip any display

The autograder tests across the 9 grid layouts with pre-computed reference solutions. Values must match within 0.01 tolerance. You can run it as often as you like to track your progress — running it does not submit anything.

What the autograder checks

Question   What it verifies
Q1         V(s), Q(s,a), and π(s) at iterations 0, 1, 2, and 100 on multiple grids
Q2         Policy at key cells matches the expected direction for each parameter tuple
Q3         Q-values after fixed training episodes match reference (with seeded randomness)
Q4         Best-action frequency falls in the expected range for a given ε; pure exploitation at ε=0
Q5         Win rate ≥ 80% on small_grid; crawler robot moves forward

Useful Command-Line Options

# Layout selection
python mars_rover.py -g canyon_grid
python mars_rover.py -g expedition_grid
python mars_rover.py --list-layouts

# Agent selection
python mars_rover.py -a value -i 100    # Value Iteration, 100 sweeps
python mars_rover.py -a q -k 500        # Q-Learning, 500 episodes
python mars_rover.py -a random           # Random agent

# Parameters
python mars_rover.py --discount 0.9 --noise 0.2 --living-reward -0.04
python mars_rover.py -e 0.3 --alpha 0.5  # Q-learning: ε and α

# Display
python mars_rover.py -t                  # Text mode (no Pygame)
python mars_rover.py -m                  # Manual control (arrow keys)
python mars_rover.py -q                  # Quiet (no transition output)

# GUI keyboard shortcuts
# V = show values   Q = show Q-values   P = show policy   ESC = quit

Troubleshooting

"Method not implemented" / NotImplementedError

You haven't filled in a # *** YOUR CODE HERE *** section. Replace the raise_not_defined() call with your implementation.

V(start) is 0 after Value Iteration

You're using in-place updates. Create a new Counter() for V_{k+1} each iteration, then replace self.values at the end of each sweep.

Q-Learning rover gets stuck

Check two things: (1) Are you breaking ties randomly in compute_action_from_q_values? (2) Are untried actions (Q=0) being considered? If all tried actions have negative Q, an untried action is best.

Q5 fails but Q3/Q4 pass

Q5 uses different learning parameters (ε=0.05, α=0.2, γ=0.8). Your Q-learning should work with any parameter values, not just the defaults.

Crawler doesn't move

Your Q-learning agent is too specific to the grid world. Make sure get_action, update, etc. don't assume anything about the state format — it's (arm_index, hand_index) for the crawler, not (x, y).

No graphics / Pygame errors

python mars_rover.py -t              # Text-only mode
python mars_rover.py -q              # Quiet mode (no display)

The autograder and check.py never require Pygame.

Points Breakdown

Question   Topic                                      Points
Q1         Value Iteration                            20
Q2         Policy Analysis (Canyon Grid parameters)   20
Q3         Q-Learning                                 20
Q4         Epsilon-Greedy Exploration                  8
Q5         Q-Learning Deployment + Crawler             7
Total                                                 75

Academic Integrity