Compatibility & Prerequisites
This project has been tested and works on Linux, macOS, and Windows.
- Python 3.10+ — no external packages required (only the standard library)
- Pygame — optional, for the animated Mars GUI display
Install Pygame (optional)
pip install pygame
The autograder, self-check, and all tests run without Pygame. The GUI is purely for visualization and exploration.
Verify your setup
python mars_rover.py -m
If Pygame is installed, you'll see an animated Mars surface. Use arrow keys to control the rover. Without Pygame, text mode activates automatically.
python mars_rover.py -t # Force text mode (no Pygame needed)
Learning Objectives
- Implement Value Iteration for offline MDP planning with batch Bellman updates.
- Understand how discount factor, noise, and living reward shape optimal policies.
- Implement tabular Q-Learning for model-free reinforcement learning from experience.
- Balance exploration vs. exploitation using epsilon-greedy action selection.
- Verify that the same Q-Learner generalizes from grid worlds to a continuing-task domain (the Crawler robot).
- See firsthand why tabular methods struggle to scale, motivating function approximation as a future topic.
Project Structure
mars_rover_rl/
├── mars_rover.py # Entry point — run this to start
├── autograder.py # Official autograder (75 pts)
├── check.py # Self-check (3 maps, no grade)
├── project_params.py # Project metadata
├── util.py # Data structures: Counter, sampling
│
├── env/ # Environment package
│ ├── mdp.py # Abstract MDP interface
│ ├── environment.py # Abstract Environment interface
│ ├── mars_grid.py # MarsGrid MDP, layout loader
│ └── crawler.py # Crawler robot domain
│
├── agents/ # Agent base classes
│ └── learning_agents.py # ValueEstimationAgent, ReinforcementAgent
│
├── students/ # ✏ YOUR CODE — only edit these
│ ├── value_iteration_agents.py # ✏ Q1: Value Iteration
│ ├── qlearning_agents.py # ✏ Q3, Q4, Q5: Q-Learning
│ └── analysis.py # ✏ Q2: Policy parameters
│
├── display/ # Visualization (optional)
│ ├── gui_display.py # Pygame animated Mars display
│ ├── text_display.py # ASCII terminal display
│ └── crawler_gui.py # Pygame crawler display
│
├── grading/ # Test framework
│ ├── grading.py # Grade tracker
│ ├── test_parser.py # .test/.solution file parser
│ ├── test_classes.py # Question/TestCase base classes
│ └── mars_test_classes.py # Project-specific test logic
│
├── layouts/ # 9 grid layout files (.lay)
│
└── test_cases/ # 19 test cases (q1–q5)
└── q1/ ... q5/ # CONFIG + .test + .solution per question
Files you will edit
You only need to modify three files inside the students/ directory.
All places are marked with # *** YOUR CODE HERE *** comments.
Do not modify any other files.
| File | What you implement |
|---|---|
| students/value_iteration_agents.py | Value Iteration agent (Q1) |
| students/qlearning_agents.py | Q-Learning agent + ε-greedy exploration (Q3, Q4, Q5) |
| students/analysis.py | Policy parameter choices for Canyon Grid (Q2) |
Exact methods to implement
students/value_iteration_agents.py — 3 methods
| Method | Q | Description |
|---|---|---|
| run_value_iteration() | Q1 | Main VI loop — batch Bellman updates for k iterations |
| compute_q_value_from_values(state, action) | Q1 | Q(s,a) = Σ T(s,a,s')·[R(s,a,s') + γ·V(s')] |
| compute_action_from_values(state) | Q1 | Best action from current values (argmax over Q) |
students/qlearning_agents.py — 5 methods
| Class / Method | Q | Description |
|---|---|---|
| QLearningAgent.get_q_value(state, action) | Q3 | Return Q-value from Counter (default 0) |
| QLearningAgent.compute_value_from_q_values(state) | Q3 | V(s) = max_a Q(s,a); return 0 if terminal |
| QLearningAgent.compute_action_from_q_values(state) | Q3 | Best action — break ties randomly |
| QLearningAgent.get_action(state) | Q4 | ε-greedy: random with prob ε, else best |
| QLearningAgent.update(state, action, next_state, reward) | Q3 | Q-learning update rule |
students/analysis.py — 5 functions
| Function | Q | Description |
|---|---|---|
| question2a() | Q2 | Return (discount, noise, living_reward) → close exit, risk cliff |
| question2b() | Q2 | Close exit, avoid cliff |
| question2c() | Q2 | Distant depot, risk cliff |
| question2d() | Q2 | Distant depot, avoid cliff |
| question2e() | Q2 | Never terminate (wander forever) |
Getting Started
Step 1 — Run the grid world manually
python mars_rover.py -m
Use arrow keys to move the rover. Press Enter/Space to extract at a sample site or crater. This helps you understand the grid mechanics, especially how solar interference makes you slip sideways.
Step 2 — Try different layouts
python mars_rover.py -m -g base_camp_grid
python mars_rover.py -m -g canyon_grid
python mars_rover.py -m -g maze_grid
python mars_rover.py --list-layouts # See all 9 available grids
Step 3 — Understand the Mars Grid MDP
Every grid cell is one of:
S = Start | ## = Wall | +1 = Sample | -1 = Crater | Empty = Open terrain
Movement noise (solar interference): When the rover chooses to go north:
- 80% chance it actually goes north (intended)
- 10% chance it slips west (perpendicular)
- 10% chance it slips east (perpendicular)
If the rover would hit a wall, it stays in place. At terminal cells (samples/craters), the only action is extract, which ends the episode.
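To make the slip model concrete, here is an illustrative sketch of the distribution described above (the project's real transition model lives in env/mars_grid.py; this helper is not part of the starter code):

```python
def slip_distribution(intended, noise=0.2):
    """Illustrative only. With probability 1 - noise the rover moves as
    intended; the remaining probability is split evenly between the two
    perpendicular directions."""
    perpendicular = {
        'north': ('west', 'east'), 'south': ('west', 'east'),
        'east': ('north', 'south'), 'west': ('north', 'south'),
    }
    left, right = perpendicular[intended]
    return {intended: 1 - noise, left: noise / 2, right: noise / 2}

print(slip_distribution('north'))  # {'north': 0.8, 'west': 0.1, 'east': 0.1}
```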
Step 4 — Understand the key parameters
python mars_rover.py -a value --discount 0.9 --noise 0.2 --living-reward 0.0
- Discount (γ) — how much future rewards matter (0 = myopic, 1 = far-sighted)
- Noise — probability of slipping sideways (0 = deterministic, 0.2 = default)
- Living reward — reward per non-terminal step (negative = hurry, positive = wander)
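For intuition (numbers illustrative): with γ = 0.9, a +10 reward five steps away is worth 0.9^5 · 10 ≈ 5.9 from the current state, while a +1 reward one step away is worth 0.9 · 1 = 0.9, so a far-sighted rover still prefers the distant depot. With γ = 0.3, the depot shrinks to 0.3^5 · 10 ≈ 0.02 and the close exit (worth 0.3) wins. This trade-off is exactly what Q2 asks you to engineer.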
Homework Tasks (Q1–Q5)
Q1 — Value Iteration
Implement ValueIterationAgent — an offline planner that computes optimal values using the Bellman equation:
V_{k+1}(s) = max_a Σ_{s'} T(s,a,s') · [R(s,a,s') + γ · V_k(s')]
Implement three methods in students/value_iteration_agents.py:
- run_value_iteration() — the main loop (batch updates!)
- compute_q_value_from_values(state, action) — Q from V
- compute_action_from_values(state) — best action from V
Use batch value iteration: create a new Counter for V_{k+1} each iteration. Compute ALL new values from the old values, then replace. Do not update in place.
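Here is a minimal sketch of the batch loop. It assumes self.mdp exposes get_states() and get_possible_actions(state) and that the constructor stores self.iterations; check env/mdp.py and the starter code for the real names.

```python
from util import Counter

def run_value_iteration(self):
    # Batch sweep: read only from the old table, write only to the new one.
    for _ in range(self.iterations):
        new_values = Counter()
        for state in self.mdp.get_states():
            actions = self.mdp.get_possible_actions(state)
            if actions:  # terminal states have no actions and keep value 0
                new_values[state] = max(
                    self.compute_q_value_from_values(state, action)
                    for action in actions
                )
        self.values = new_values  # replace only after the full sweep
```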
python mars_rover.py -a value -i 100 -k 10 # Run VI, 100 sweeps, 10 test episodes
python mars_rover.py -a value -i 5 -g base_camp_grid # Watch 5 iterations converge
python autograder.py -q q1 # Grade Q1
Q2 — Policy Analysis (Canyon Grid)
The Canyon Grid has a close exit (+1), a distant depot (+10), and a cliff of craters (-10 each) along the bottom.
Choose (discount, noise, living_reward) to produce each behavior. Edit students/analysis.py:
| Function | Desired behavior |
|---|---|
question2a | Prefer close exit (+1), risking the cliff |
question2b | Prefer close exit (+1), avoiding the cliff |
question2c | Prefer distant depot (+10), risking the cliff |
question2d | Prefer distant depot (+10), avoiding the cliff |
question2e | Avoid all exits and craters (never terminate) |
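Each function returns a (discount, noise, living_reward) tuple. The values below are only placeholders showing the expected shape, not an answer:

```python
def question2a():
    # Placeholder numbers -- you must find the tuple that actually
    # produces the "close exit, risking the cliff" behavior.
    discount = 0.9
    noise = 0.2
    living_reward = 0.0
    return discount, noise, living_reward
```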
python mars_rover.py -g canyon_grid -a value --discount 0.9 --noise 0.2 --living-reward 0.0
python autograder.py -q q2
Q3 — Q-Learning
Implement a model-free agent that learns from experience. The Q-learning update rule:
Q(s,a) ← Q(s,a) + α · [r + γ · max_{a'} Q(s',a') - Q(s,a)]
Implement in students/qlearning_agents.py:
- get_q_value(state, action) — lookup from Counter
- compute_value_from_q_values(state) — max Q over actions
- compute_action_from_q_values(state) — best action (break ties randomly!)
- update(state, action, next_state, reward) — Q-learning update
Actions the rover has never tried have Q-value = 0 (Counter default). If all tried actions have negative Q-values, an untried action (Q=0) is the best choice. Don't skip unseen actions!
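A sketch of the two core pieces, assuming self.q_values is a util.Counter keyed by (state, action) and that alpha and discount are stored on self (adapt to the actual attribute names in the starter code):

```python
def get_q_value(self, state, action):
    # Counter returns 0 for unseen keys, which gives untried actions Q = 0.
    return self.q_values[(state, action)]

def update(self, state, action, next_state, reward):
    # One-step Q-learning: nudge Q(s,a) toward the bootstrapped sample.
    sample = reward + self.discount * self.compute_value_from_q_values(next_state)
    old_q = self.get_q_value(state, action)
    self.q_values[(state, action)] = old_q + self.alpha * (sample - old_q)
```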
python mars_rover.py -a q -k 5 -m # Manual Q-learning (5 episodes)
python mars_rover.py -a q -k 100 # Train 100 episodes
python autograder.py -q q3
Q4 — Epsilon-Greedy Exploration
Implement get_action(state) — the ε-greedy strategy:
- With probability ε: take a random legal action (explore)
- Otherwise: take the best action from Q-values (exploit)
Use util.flip_coin(self.epsilon) and random.choice(legal_actions).
A random action may happen to be the best action — that's fine. Don't filter out the optimal action during exploration.
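A minimal sketch, assuming a get_legal_actions(state) helper that returns an empty list at terminal states (see agents/learning_agents.py for the actual API):

```python
import random
import util

def get_action(self, state):
    legal_actions = self.get_legal_actions(state)
    if not legal_actions:
        return None  # terminal state: nothing to do
    if util.flip_coin(self.epsilon):
        return random.choice(legal_actions)           # explore
    return self.compute_action_from_q_values(state)   # exploit
```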
After implementing Q4, the Mars Crawler robot should also work without further code changes:
python -m env.crawler # Run the crawler — should learn to crawl forward
python autograder.py -q q4
Q5 — Q-Learning Deployment + Crawler
No new code — this question stress-tests your Q3 + Q4 implementation. Deploy your Q-learner on small_grid: train for 2000 episodes, then test for 100. The autograder requires a win rate of ≥ 80% and also verifies that the same Q-learner trains the Crawler robot to move forward.
python mars_rover.py -a q -k 2010 -g small_grid -e 0.05 --alpha 0.2 --discount 0.8 -q
python autograder.py -q q5
On medium_grid or larger, tabular Q-learning struggles to converge — every cell is
a separate state with separate Q-values, so the rover has no way to generalize that
"craters are bad everywhere." This motivates function approximation, an
out-of-scope follow-up topic.
Algorithm Reference
Value Iteration vs. Q-Learning
| Property | Value Iteration (Q1) | Q-Learning (Q3) |
|---|---|---|
| Has model? | Yes — knows T(s,a,s') and R | No — learns from experience |
| Planning | Offline (before acting) | Online (while acting) |
| Updates | Sweeps over ALL states | Updates ONE (s,a) per step |
| Convergence | After enough iterations | After enough episodes |
Common mistakes to avoid
- In-place VI updates — create a new Counter per iteration. Reading and writing self.values in the same sweep gives wrong results.
- Not breaking ties randomly (Q3) — if two actions have the same Q, use random.choice() (see the sketch after this list). Without this, the rover gets stuck.
- Forgetting unseen actions have Q=0 — the Counter returns 0 for missing keys. If all tried actions have Q < 0, an untried action is optimal.
- Accessing Q-values directly — always read through self.get_q_value(), never self.q_values[...]. The accessor is the single public read path.
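For the tie-breaking point, a minimal sketch (again assuming a get_legal_actions helper; check the starter code for the real name):

```python
import random

def compute_action_from_q_values(self, state):
    # Break ties uniformly at random among all actions with the maximal Q-value.
    legal_actions = self.get_legal_actions(state)
    if not legal_actions:
        return None
    best_q = max(self.get_q_value(state, a) for a in legal_actions)
    best_actions = [a for a in legal_actions if self.get_q_value(state, a) == best_q]
    return random.choice(best_actions)
```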
Testing Your Implementation
Self-Check (no grade)
python check.py # Run all checks on 3 maps
python check.py -q q1 # Check only Q1
The self-check tests your code on small_grid, base_camp_grid, and canyon_grid.
It shows ✅/❌ per check with diagnostic messages. No points or grades — purely for debugging.
Official Autograder (75 points)
python autograder.py # Grade all questions
python autograder.py -q q1 # Grade one question
python autograder.py --no-graphics # Skip any display
The autograder tests across the 9 grid layouts with pre-computed reference solutions. Values must match within 0.01 tolerance. You can run it as often as you like to track your progress — running it does not submit anything.
What the autograder checks
| Question | What it verifies |
|---|---|
| Q1 | V(s), Q(s,a), and π(s) at iterations 0, 1, 2, and 100 on multiple grids |
| Q2 | Policy at key cells matches expected direction for each parameter tuple |
| Q3 | Q-values after fixed training episodes match reference (with seeded randomness) |
| Q4 | Best-action frequency falls in expected range for given ε; pure exploit at ε=0 |
| Q5 | Win rate ≥ 80% on small_grid; crawler robot moves forward |
Useful Command-Line Options
# Layout selection
python mars_rover.py -g canyon_grid
python mars_rover.py -g expedition_grid
python mars_rover.py --list-layouts
# Agent selection
python mars_rover.py -a value -i 100 # Value Iteration, 100 sweeps
python mars_rover.py -a q -k 500 # Q-Learning, 500 episodes
python mars_rover.py -a random # Random agent
# Parameters
python mars_rover.py --discount 0.9 --noise 0.2 --living-reward -0.04
python mars_rover.py -e 0.3 --alpha 0.5 # Q-learning: ε and α
# Display
python mars_rover.py -t # Text mode (no Pygame)
python mars_rover.py -m # Manual control (arrow keys)
python mars_rover.py -q # Quiet (no transition output)
# GUI keyboard shortcuts
# V = show values Q = show Q-values P = show policy ESC = quit
Troubleshooting
"Method not implemented" / NotImplementedError
You haven't filled in a # *** YOUR CODE HERE *** section. Replace the raise_not_defined() call with your implementation.
V(start) is 0 after Value Iteration
You're using in-place updates. Create a new Counter() for V_{k+1} each iteration, then replace self.values at the end of each sweep.
Q-Learning rover gets stuck
Check two things: (1) Are you breaking ties randomly in compute_action_from_q_values?
(2) Are untried actions (Q=0) being considered? If all tried actions have negative Q, an untried action is best.
Q5 fails but Q3/Q4 pass
Q5 uses different learning parameters (ε=0.05, α=0.2, γ=0.8). Your Q-learning should work with any parameter values, not just the defaults.
Crawler doesn't move
Your Q-learning agent is too specific to the grid world. Make sure get_action, update, etc. don't assume anything about the state format — it's (arm_index, hand_index) for the crawler, not (x, y).
No graphics / Pygame errors
python mars_rover.py -t # Text-only mode
python mars_rover.py -q # Quiet mode (no display)
The autograder and check.py never require Pygame.
Points Breakdown
| Question | Topic | Points |
|---|---|---|
| Q1 | Value Iteration | 20 |
| Q2 | Policy Analysis (Canyon Grid parameters) | 20 |
| Q3 | Q-Learning | 20 |
| Q4 | Epsilon-Greedy Exploration | 8 |
| Q5 | Q-Learning Deployment + Crawler | 7 |
| Total | | 75 |
Academic Integrity
- Do not distribute or publish solutions.
- Do not copy solutions from the internet.
- You may discuss general approaches with classmates, but write your own code.