CS451/551 · Reinforcement Learning · Homework 2

Assignment 2: Reinforcement Learning

Mars Rover — Red Planet Rescue

R-7

In this project your Mars Rover agent will navigate treacherous Martian terrain — avoiding craters and collecting geological samples. You will implement Value Iteration and tabular Q-Learning (with epsilon-greedy exploration) to teach Rover-7 to act optimally under stochastic solar interference.

75 Points · 5 Questions · Python 3.10+

Compatibility & Prerequisites

This project has been tested and works on Linux, macOS, and Windows.

Install Pygame (optional)

pip install pygame

The autograder, self-check, and all tests run without Pygame. The GUI is purely for visualization and exploration.

Verify your setup

python mars_rover.py -m

If Pygame is installed, you'll see an animated Mars surface. Use arrow keys to control the rover. Without Pygame, text mode activates automatically.

python mars_rover.py -t   # Force text mode (no Pygame needed)

Learning Objectives

By the end of this project you should be able to:

  • Implement batch Value Iteration from the Bellman optimality equation (Q1).
  • Predict how discount, noise, and living reward shape the optimal policy (Q2).
  • Implement tabular Q-Learning with its sample-based update rule (Q3).
  • Balance exploration and exploitation using an ε-greedy strategy (Q4).
  • Recognize why tabular methods struggle to scale beyond small grids (Q5).

Project Structure

mars_rover_rl/
├── mars_rover.py                  # Entry point — run this to start
├── autograder.py                   # Official autograder (75 pts)
├── check.py                       # Self-check (3 maps, no grade)
├── project_params.py               # Project metadata
├── util.py                         # Data structures: Counter, sampling
│
├── env/                            # Environment package
│   ├── mdp.py                     # Abstract MDP interface
│   ├── environment.py             # Abstract Environment interface
│   ├── mars_grid.py               # MarsGrid MDP, layout loader
│   └── crawler.py                 # Crawler robot domain
│
├── agents/                         # Agent base classes
│   └── learning_agents.py         # ValueEstimationAgent, ReinforcementAgent
│
├── students/                       # ✏ YOUR CODE — only edit these
│   ├── value_iteration_agents.py  # ✏ Q1: Value Iteration
│   ├── qlearning_agents.py        # ✏ Q3, Q4, Q5: Q-Learning
│   └── analysis.py                # ✏ Q2: Policy parameters
│
├── display/                        # Visualization (optional)
│   ├── gui_display.py             # Pygame animated Mars display
│   ├── text_display.py            # ASCII terminal display
│   └── crawler_gui.py             # Pygame crawler display
│
├── grading/                        # Test framework
│   ├── grading.py                 # Grade tracker
│   ├── test_parser.py             # .test/.solution file parser
│   ├── test_classes.py            # Question/TestCase base classes
│   └── mars_test_classes.py       # Project-specific test logic
│
├── layouts/                        # 9 grid layout files (.lay)
│
└── test_cases/                     # 19 test cases (q1–q5)
    └── q1/ ... q5/                # CONFIG + .test + .solution per question

Files you will edit

You only need to modify three files inside the students/ directory. Every place you need to fill in is marked with a # *** YOUR CODE HERE *** comment. Do not modify any other files.

File                                 What you implement
students/value_iteration_agents.py   Value Iteration agent (Q1)
students/qlearning_agents.py         Q-Learning agent + ε-greedy exploration (Q3, Q4, Q5)
students/analysis.py                 Policy parameter choices for Canyon Grid (Q2)

Exact methods to implement

students/value_iteration_agents.py — 3 methods

Method                                       Q    Description
run_value_iteration()                        Q1   Main VI loop — batch Bellman updates for k iterations
compute_q_value_from_values(state, action)   Q1   Q(s,a) = Σ_{s'} T(s,a,s') · [R(s,a,s') + γ · V(s')]
compute_action_from_values(state)            Q1   Best action from current values (argmax over Q)

students/qlearning_agents.py — 5 methods

Class / Method                                             Q    Description
QLearningAgent.get_q_value(state, action)                  Q3   Return Q-value from Counter (default 0)
QLearningAgent.compute_value_from_q_values(state)          Q3   V(s) = max_a Q(s,a); return 0 if terminal
QLearningAgent.compute_action_from_q_values(state)         Q3   Best action — break ties randomly
QLearningAgent.get_action(state)                           Q4   ε-greedy: random with prob ε, else best
QLearningAgent.update(state, action, next_state, reward)   Q3   Q-learning update rule

students/analysis.py — 5 functions

Function       Q    Description
question2a()   Q2   Return (discount, noise, living_reward) → close exit, risk cliff
question2b()   Q2   Close exit, avoid cliff
question2c()   Q2   Distant depot, risk cliff
question2d()   Q2   Distant depot, avoid cliff
question2e()   Q2   Never terminate (wander forever)

Getting Started

Step 1 — Run the grid world manually

python mars_rover.py -m

Use arrow keys to move the rover. Press Enter/Space to extract at a sample site or crater. This helps you understand the grid mechanics, especially how solar interference makes you slip sideways.

Step 2 — Try different layouts

python mars_rover.py -m -g base_camp_grid
python mars_rover.py -m -g canyon_grid
python mars_rover.py -m -g maze_grid
python mars_rover.py --list-layouts        # See all 9 available grids

Step 3 — Understand the Mars Grid MDP

Every grid cell is one of:

  S       = Start
  ##      = Wall
  +1      = Sample (terminal)
  -1      = Crater (terminal)
  (empty) = Open terrain

Movement noise (solar interference): When the rover chooses to go north, it actually moves north with probability 1 − noise; with total probability noise it slips sideways, east or west with equal chance (at the default noise of 0.2: 80% north, 10% east, 10% west).

If the rover would hit a wall, it stays in place. At terminal cells (samples/craters), the only action is extract, which ends the episode.

Step 4 — Understand the key parameters

python mars_rover.py -a value --discount 0.9 --noise 0.2 --living-reward 0.0

Homework Tasks (Q1–Q5)

20 pts Q1 — Value Iteration

Implement ValueIterationAgent — an offline planner that computes optimal values using the Bellman equation:

V_{k+1}(s) = max_a  Σ_{s'} T(s,a,s') · [R(s,a,s') + γ · V_k(s')]

Implement three methods in students/value_iteration_agents.py:

  1. run_value_iteration() — the main loop (batch updates!)
  2. compute_q_value_from_values(state, action) — Q from V
  3. compute_action_from_values(state) — best action from V
Important

Use batch value iteration: create a new Counter for V_{k+1} each iteration. Compute ALL the new values from the old values, then replace. Do not update in place.
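
For reference, here is a minimal sketch of one possible run_value_iteration(). The MDP accessors (self.mdp.get_states(), self.mdp.get_possible_actions()) and the self.values / self.iterations attributes are assumptions about this project's interface, so adapt the names to what you find in the starter code:

import util

def run_value_iteration(self):
    # Batch sweep: build V_{k+1} in a fresh Counter while reading only V_k,
    # because compute_q_value_from_values() still reads the old self.values.
    for _ in range(self.iterations):
        new_values = util.Counter()                         # V_{k+1}, defaults to 0
        for state in self.mdp.get_states():                 # assumed MDP accessor
            actions = self.mdp.get_possible_actions(state)  # assumed MDP accessor
            if actions:                                     # terminal states keep V = 0
                new_values[state] = max(
                    self.compute_q_value_from_values(state, a) for a in actions)
        self.values = new_values                            # replace after the full sweep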

python mars_rover.py -a value -i 100 -k 10           # Run VI, 100 sweeps, 10 test episodes
python mars_rover.py -a value -i 5 -g base_camp_grid  # Watch 5 iterations converge
python autograder.py -q q1                              # Grade Q1
20 pts Q2 — Mission Profiles (Policies) · Depends on Q1

The Canyon Grid has a close exit (+1), a distant depot (+10), and a crater cliff (-10 each along the bottom):

[Canyon Grid diagram: start S, walls (##), a close +1 exit, a distant +10 depot, and a row of -10 craters along the bottom edge.]

Choose (discount, noise, living_reward) to produce each behavior. Edit students/analysis.py:

Function     Desired behavior
question2a   Prefer close exit (+1), risking the cliff
question2b   Prefer close exit (+1), avoiding the cliff
question2c   Prefer distant depot (+10), risking the cliff
question2d   Prefer distant depot (+10), avoiding the cliff
question2e   Avoid all exits and craters (never terminate)
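
Each function returns the three parameters as a single tuple. A hypothetical skeleton (the values shown are just the environment defaults, not answers):

def question2a():
    # Placeholder values: replace with parameters that produce the target behavior.
    discount = 0.9
    noise = 0.2
    living_reward = 0.0
    return (discount, noise, living_reward)
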
python mars_rover.py -g canyon_grid -a value --discount 0.9 --noise 0.2 --living-reward 0.0
python autograder.py -q q2
20 pts Q3 — Q-Learning

Implement a model-free agent that learns from experience. The Q-learning update rule:

Q(s,a) ← Q(s,a) + α · [r + γ · max_{a'} Q(s',a') - Q(s,a)]

Implement in students/qlearning_agents.py:

  1. get_q_value(state, action) — lookup from Counter
  2. compute_value_from_q_values(state) — max Q over actions
  3. compute_action_from_q_values(state) — best action (break ties randomly!)
  4. update(state, action, next_state, reward) — Q-learning update
Critical hint

Actions the rover has never tried have Q-value = 0 (Counter default). If all tried actions have negative Q-values, an untried action (Q=0) is the best choice. Don't skip unseen actions!
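
To make these pieces concrete, here is a minimal sketch of three of the methods. It assumes self.q_values is a util.Counter created in __init__, and that self.alpha, self.discount, and self.get_legal_actions(state) exist; treat those names as assumptions about the starter code, not as the official API:

import random

def get_q_value(self, state, action):
    # Counter returns 0 for never-tried (state, action) pairs, as required.
    return self.q_values[(state, action)]

def compute_action_from_q_values(self, state):
    legal_actions = self.get_legal_actions(state)    # assumed accessor
    if not legal_actions:
        return None                                  # terminal state: no action
    # Consider every legal action, including untried ones with Q = 0.
    best_q = max(self.get_q_value(state, a) for a in legal_actions)
    best_actions = [a for a in legal_actions
                    if self.get_q_value(state, a) == best_q]
    return random.choice(best_actions)               # break ties randomly

def update(self, state, action, next_state, reward):
    # Sample-based backup: move Q(s,a) toward r + γ · max_a' Q(s',a').
    target = reward + self.discount * self.compute_value_from_q_values(next_state)
    old_q = self.get_q_value(state, action)
    self.q_values[(state, action)] = old_q + self.alpha * (target - old_q)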

python mars_rover.py -a q -k 5 -m        # Manual Q-learning (5 episodes)
python mars_rover.py -a q -k 100          # Train 100 episodes
python autograder.py -q q3
8 pts Q4 — Epsilon-Greedy Exploration · Depends on Q3

Implement get_action(state) — the ε-greedy strategy:

  • With probability ε: take a random legal action (explore)
  • Otherwise: take the best action from Q-values (exploit)

Use util.flip_coin(self.epsilon) and random.choice(legal_actions).

Note

A random action may happen to be the best action — that's fine. Don't filter out the optimal action during exploration.
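
A minimal sketch, using the util.flip_coin and random.choice helpers named above and assuming legal actions come from self.get_legal_actions(state) (an assumed accessor):

import random
import util

def get_action(self, state):
    legal_actions = self.get_legal_actions(state)    # assumed accessor
    if not legal_actions:
        return None                                  # terminal state: no action
    if util.flip_coin(self.epsilon):                 # explore with probability ε
        return random.choice(legal_actions)          # may coincide with the best action
    return self.compute_action_from_q_values(state)  # exploit otherwise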

After implementing Q4, the Mars Crawler robot should also work without further code changes:

python -m env.crawler              # Run the crawler — should learn to crawl forward
python autograder.py -q q4
7 pts Q5 — Q-Learning Deployment on Mars · Depends on Q3 + Q4

No new code — this question stress-tests your Q3 + Q4 implementation. Deploy your Q-learner on the small_grid. Train for 2000 episodes, then test for 100. The autograder requires a win rate of ≥ 80%, and also verifies that the same Q-Learner trains the Crawler robot to move forward.

python mars_rover.py -a q -k 2010 -g small_grid -e 0.05 --alpha 0.2 --discount 0.8 -q
python autograder.py -q q5
Why larger grids strain tabular Q-Learning

On medium_grid or larger, tabular Q-learning struggles to converge: every cell is a separate state with its own Q-values, so the rover has no way to generalize that "craters are bad everywhere." This limitation motivates function approximation, a follow-up topic beyond the scope of this project.

Algorithm Reference

Value Iteration vs. Q-Learning

Property      Value Iteration (Q1)            Q-Learning (Q3)
Has model?    Yes — knows T(s,a,s') and R     No — learns from experience
Planning      Offline (before acting)         Online (while acting)
Updates       Sweeps over ALL states          Updates ONE (s,a) per step
Convergence   After enough iterations         After enough episodes

Common mistakes to avoid

  • Updating values in place during Value Iteration instead of doing batch sweeps (Q1).
  • Ignoring untried actions: their Q-value is 0, which beats any negative tried action (Q3).
  • Forgetting to break ties randomly in compute_action_from_q_values (Q3).
  • Filtering the best action out of the random choice during exploration (Q4).
  • Assuming states are (x, y) tuples; this breaks the crawler, whose states are (arm_index, hand_index) (Q5).

Testing Your Implementation

Self-Check (no grade)

python check.py              # Run all checks on 3 maps
python check.py -q q1        # Check only Q1

The self-check tests your code on small_grid, base_camp_grid, and canyon_grid. It shows ✅/❌ per check with diagnostic messages. No points or grades — purely for debugging.

Official Autograder (75 points)

python autograder.py                 # Grade all questions
python autograder.py -q q1           # Grade one question
python autograder.py --no-graphics   # Skip any display

The autograder tests across the 9 grid layouts with pre-computed reference solutions. Values must match within 0.01 tolerance. You can run it as often as you like to track your progress — running it does not submit anything.

What the autograder checks

Question   What it verifies
Q1         V(s), Q(s,a), and π(s) at iterations 0, 1, 2, and 100 on multiple grids
Q2         Policy at key cells matches the expected direction for each parameter tuple
Q3         Q-values after fixed training episodes match reference (with seeded randomness)
Q4         Best-action frequency falls in the expected range for a given ε; pure exploitation at ε=0
Q5         Win rate ≥ 80% on small_grid; crawler robot moves forward

Useful Command-Line Options

# Layout selection
python mars_rover.py -g canyon_grid
python mars_rover.py -g expedition_grid
python mars_rover.py --list-layouts

# Agent selection
python mars_rover.py -a value -i 100    # Value Iteration, 100 sweeps
python mars_rover.py -a q -k 500        # Q-Learning, 500 episodes
python mars_rover.py -a random           # Random agent

# Parameters
python mars_rover.py --discount 0.9 --noise 0.2 --living-reward -0.04
python mars_rover.py -e 0.3 --alpha 0.5  # Q-learning: ε and α

# Display
python mars_rover.py -t                  # Text mode (no Pygame)
python mars_rover.py -m                  # Manual control (arrow keys)
python mars_rover.py -q                  # Quiet (no transition output)

# GUI keyboard shortcuts
# V = show values   Q = show Q-values   P = show policy   ESC = quit

Troubleshooting

"Method not implemented" / NotImplementedError

You haven't filled in a # *** YOUR CODE HERE *** section. Replace the raise_not_defined() call with your implementation.

V(start) is 0 after Value Iteration

You're using in-place updates. Create a new Counter() for V_{k+1} each iteration, then replace self.values at the end of each sweep.

Q-Learning rover gets stuck

Check two things: (1) Are you breaking ties randomly in compute_action_from_q_values? (2) Are untried actions (Q=0) being considered? If all tried actions have negative Q, an untried action is best.

Q5 fails but Q3/Q4 pass

Q5 uses different learning parameters (ε=0.05, α=0.2, γ=0.8). Your Q-learning should work with any parameter values, not just the defaults.

Crawler doesn't move

Your Q-learning agent is too specific to the grid world. Make sure get_action, update, etc. don't assume anything about the state format — it's (arm_index, hand_index) for the crawler, not (x, y).

No graphics / Pygame errors

python mars_rover.py -t              # Text-only mode
python mars_rover.py -q              # Quiet mode (no display)

The autograder and check.py never require Pygame.

Points Breakdown

Question   Topic                                      Points
Q1         Value Iteration                            20
Q2         Policy Analysis (Canyon Grid parameters)   20
Q3         Q-Learning                                 20
Q4         Epsilon-Greedy Exploration                  8
Q5         Q-Learning Deployment + Crawler             7
Total                                                 75

Academic Integrity