# Teacher Agent Development System

A self-contained teacher agent system for developing and testing meta-RL curriculum learning algorithms independently of the real student and task-generator components.

## Overview

This system provides:
- **Mock Student Agent**: Realistic student with learning + forgetting (Ebbinghaus curve)
- **Mock Task Generator**: Simple task generator with multiple topics and difficulties
- **Teacher Agent**: UCB (Upper Confidence Bound) bandit algorithm for curriculum sequencing
- **Training Loop**: Complete training system with evaluation
- **Visualization**: Plotting utilities for analysis

## Installation

```bash
pip install -r requirements.txt
```

## Quick Start

### 1. Run Tests

```bash
python test_teacher.py
```

This verifies:
- Student learns with practice
- Student forgets over time
- Teacher explores actions
- Teacher exploits good actions

### 2. Train Teacher Agent

```bash
python train_teacher.py
```

Expected output:
```
======================================================================
TEACHER AGENT TRAINING
======================================================================
Iterations: 500
Evaluation tasks: 15
Action space: 30 actions
======================================================================
Iteration   0 | Student Acc: 0.267 | Avg Reward: 0.850 | Action: his-ea-N
Iteration  50 | Student Acc: 0.453 | Avg Reward: 1.120 | Action: sci-me-R
...
Iteration 500 | Student Acc: 0.812 | Avg Reward: 0.780 | Action: lit-ha-N
```
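
Action labels abbreviate topic-difficulty-mode: for example, `his-ea-N` is history / easy / new material, and `sci-me-R` is science / medium / review.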

### 3. Generate Visualizations

```python
from train_teacher import train_teacher
from visualize import plot_learning_curves, plot_curriculum_heatmap, plot_action_distributions

# Train teacher
history, teacher, student = train_teacher(num_iterations=500)

# Generate plots
plot_learning_curves(history)
plot_curriculum_heatmap(history)
plot_action_distributions(teacher)
```

### 4. Compare with Baselines

```python
from train_teacher import train_teacher, train_baseline_random, train_baseline_fixed
from visualize import plot_comparison

# Train all strategies
history_teacher, _, _ = train_teacher(num_iterations=500, verbose=False)
history_random = train_baseline_random(num_iterations=500)
history_fixed = train_baseline_fixed(num_iterations=500)

# Compare
plot_comparison({
    'teacher': history_teacher,
    'random': history_random,
    'fixed': history_fixed
})
```

## Architecture

### Components

1. **interfaces.py**: Shared data structures (Task, StudentState, TeacherAction) and ABC interfaces
2. **mock_student.py**: Student agent with learning (improves with practice) and forgetting (Ebbinghaus curve)
3. **mock_task_generator.py**: Simple task generator with 5 topics × 3 difficulties
4. **teacher_agent.py**: UCB bandit algorithm for selecting curriculum actions
5. **train_teacher.py**: Main training loop connecting all components
6. **test_teacher.py**: Unit tests for all components
7. **visualize.py**: Plotting utilities for analysis

### Action Space

Teacher selects from **30 actions**:
- 5 topics: history, science, literature, geography, current_events
- 3 difficulties: easy, medium, hard
- 2 options: new material or review
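
As a sketch, the 30 arms are just the Cartesian product of these three axes (names here are illustrative; the real action encoding lives in `interfaces.py` and `teacher_agent.py`):

```python
from itertools import product

TOPICS = ["history", "science", "literature", "geography", "current_events"]
DIFFICULTIES = ["easy", "medium", "hard"]
MODES = ["new", "review"]

# 5 topics x 3 difficulties x 2 modes = 30 discrete bandit arms
ACTIONS = [(t, d, m) for t, d, m in product(TOPICS, DIFFICULTIES, MODES)]
assert len(ACTIONS) == 30
```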

### Student Model

- **Learning**: Skill improves with practice: `new_skill = old_skill + learning_rate * difficulty_factor * (1 - old_skill)`
- **Forgetting**: Retention decays over time: `retention = exp(-forgetting_rate * time_since_practice)`
- **Effective Skill**: `effective_skill = base_skill * retention`
- **Accuracy**: `accuracy = 0.25 + 0.75 * effective_skill` (25% is random guessing on 4-choice MCQ)
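
Put together, the model reduces to a few lines. This is a sketch of the formulas above, not necessarily the exact code in `mock_student.py`:

```python
import math

def practice(skill: float, learning_rate: float, difficulty_factor: float) -> float:
    # Diminishing returns: gains shrink as skill approaches 1.0
    return skill + learning_rate * difficulty_factor * (1 - skill)

def retention(forgetting_rate: float, time_since_practice: float) -> float:
    # Ebbinghaus-style exponential decay
    return math.exp(-forgetting_rate * time_since_practice)

def accuracy(base_skill: float, forgetting_rate: float, time_since_practice: float) -> float:
    effective_skill = base_skill * retention(forgetting_rate, time_since_practice)
    return 0.25 + 0.75 * effective_skill  # 25% floor = random guessing on 4-choice MCQ
```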

### Teacher Algorithm

**UCB (Upper Confidence Bound)**:
```
UCB(a) = estimated_reward(a) + exploration_bonus × sqrt(log(total_pulls) / pulls(a))
```

- Balances exploration (trying under-sampled actions) against exploitation (repeating known-good actions)
- The exploration bonus scales the confidence term (higher = more exploration)
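
A minimal sketch of the selection step under this rule (illustrative, not the exact code in `teacher_agent.py`; `counts` and `mean_rewards` are assumed per-arm statistics, and the cold-start branch is what prevents the division by zero noted under Troubleshooting):

```python
import math

def select_action(counts: list[int], mean_rewards: list[float],
                  exploration_bonus: float = 2.0) -> int:
    """Pick the arm with the highest UCB score; untried arms go first."""
    total_pulls = sum(counts)
    best, best_score = 0, float("-inf")
    for a, (n, mean) in enumerate(zip(counts, mean_rewards)):
        if n == 0:
            return a  # cold start: try every arm once before any UCB math
        score = mean + exploration_bonus * math.sqrt(math.log(total_pulls) / n)
        if score > best_score:
            best, best_score = a, score
    return best
```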

### Reward Function

```
reward = improvement + difficulty_bonus + review_bonus + review_penalty

where:
- improvement = accuracy_after - accuracy_before
- difficulty_bonus = easy:0.5, medium:1.0, hard:2.0
- review_bonus = 1.0 if review and improvement > 0
- review_penalty = -0.5 if review and accuracy > 0.9 (wasted review)
```
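
Transcribed into Python, the reward looks roughly like this (a sketch: the `> 0.9` check is assumed here to use pre-task accuracy, and function and argument names are illustrative):

```python
DIFFICULTY_BONUS = {"easy": 0.5, "medium": 1.0, "hard": 2.0}

def compute_reward(acc_before: float, acc_after: float,
                   difficulty: str, is_review: bool) -> float:
    improvement = acc_after - acc_before
    reward = improvement + DIFFICULTY_BONUS[difficulty]
    if is_review and improvement > 0:
        reward += 1.0   # review_bonus: the review actually helped
    if is_review and acc_before > 0.9:
        reward -= 0.5   # review_penalty: reviewing an already-mastered topic
    return reward
```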

## Expected Behavior

### Early Iterations (0-100)
- Teacher explores all topics/difficulties
- Tries mostly easy tasks (build foundation)
- High exploration, low exploitation

### Mid Iterations (100-300)
- Starts increasing difficulty
- Discovers which topics student struggles with
- Begins strategic reviewing

### Late Iterations (300-500)
- Mostly medium/hard tasks (student is skilled)
- Reviews topics just before forgetting threshold
- High exploitation of known-good curriculum

### Emergent Behaviors
- Teacher gives harder tasks as student improves
- Teacher reviews topics ~30-50 iterations after practice (optimal timing)
- Teacher specializes in topics student finds difficult

## Success Criteria

After training, you should see:
- ✅ Student reaches >70% accuracy by iteration 500
- ✅ Teacher discovers: easy tasks first → harder tasks later
- ✅ Teacher learns to review before forgetting
- ✅ Teacher reward stabilizes (not just random)

## File Structure

```
teacher_agent_dev/
├── interfaces.py           # Shared data structures and ABC interfaces
├── mock_student.py         # Mock student with learning + forgetting
├── mock_task_generator.py  # Simple task generator
├── teacher_agent.py        # MAIN: UCB bandit teacher algorithm
├── train_teacher.py        # Training loop
├── test_teacher.py         # Unit tests
├── visualize.py            # Plotting utilities
├── requirements.txt        # Dependencies
└── README.md               # This file
```

## Customization

### Adjust Student Learning
```python
student = MockStudentAgent(
    learning_rate=0.15,    # How fast student learns (higher = faster)
    forgetting_rate=0.05   # How fast student forgets (higher = faster)
)
```

### Adjust Teacher Exploration
```python
teacher = TeacherAgent(
    exploration_bonus=2.0  # Higher = more exploration, Lower = more exploitation
)
```

### Add More Topics/Difficulties
Edit `mock_task_generator.py` to add more templates or modify `teacher_agent.py` to adjust action space.
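
For example, a new topic might be added along these lines (hypothetical: the actual template structure in `mock_task_generator.py` may differ):

```python
# Hypothetical template table; match the real structure in mock_task_generator.py.
TASK_TEMPLATES = {
    "history": ["Who led {event}?", "When did {event} occur?"],
    "mathematics": ["What is {a} + {b}?"],  # new topic
}
# Remember to extend the teacher's action space too,
# or the new topic will never be scheduled.
```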

## Troubleshooting

**Issue**: Student doesn't learn
- **Solution**: Increase `learning_rate` in MockStudentAgent

**Issue**: Teacher doesn't explore
- **Solution**: Increase `exploration_bonus` in TeacherAgent

**Issue**: Forgetting too fast/slow
- **Solution**: Adjust `forgetting_rate` in MockStudentAgent

**Issue**: Division by zero errors
- **Solution**: Should not occur; UCB handles the cold start automatically by selecting every untried action first, so `pulls(a)` is never zero when the confidence term is computed

## Next Steps

1. **Replace mock components**: When teammates finish real student/task generator, swap out mock components
2. **Tune hyperparameters**: Adjust learning_rate, forgetting_rate, exploration_bonus
3. **Experiment with algorithms**: Try different bandit algorithms, such as Thompson Sampling or ε-greedy (see the sketch after this list)
4. **Add features**: More sophisticated reward functions, state representations, etc.
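
For instance, an ε-greedy selector is a near drop-in replacement for the UCB rule (a sketch using the same per-arm statistics as the UCB example above):

```python
import random

def select_action_epsilon_greedy(counts: list[int], mean_rewards: list[float],
                                 epsilon: float = 0.1) -> int:
    # With probability epsilon, explore a uniformly random arm;
    # otherwise exploit the arm with the best mean reward so far.
    if random.random() < epsilon or sum(counts) == 0:
        return random.randrange(len(mean_rewards))
    return max(range(len(mean_rewards)), key=lambda a: mean_rewards[a])
```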

## License

MIT