refactor: separate optimizer algorithms from training strategies #50
roomote-v0[bot] wants to merge 1 commit into main from
Conversation
Split the coupled optimizer+training-loop design into two distinct concerns:

1. Optimizer (`optimizer.py`): pure parameter-update algorithms (SGD, Adam, RMSprop, etc.) that only know how to update parameters given gradients. These can now be reused across different training strategies.
2. Trainer (`trainers.py`): training strategies that handle the training loop (batching, epochs, forward/backward passes) and delegate parameter updates to an Optimizer.

New classes:

- `SGD`, `MomentumSGD`, `NesterovMomentumSGD`, `RMSpropOptimizer`, `AdamOptimizer`, `SignSGDOptimizer` (pure optimizers)
- `SupervisedTrainer`: standard batch training for feedforward models
- `RecurrentTrainer`: supports gradient accumulation across batches, enabling stable RNN training

Backward-compatible wrappers (`GradientDescent`, `Adam`, `RMSprop`, etc.) preserve the existing API so all guides and tests continue working.
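As a rough illustration of the pure-optimizer contract described above: the dict-based parameter representation and exact constructor arguments below are assumptions for the sketch, not the PR's actual code.

```python
# Sketch of a "pure" optimizer: it only updates parameters from gradients,
# with no knowledge of batching, epochs, or the training loop.
class SGD:
    def __init__(self, lr=0.1):
        self.lr = lr

    def initialize(self, parameters):
        pass  # stateless: nothing to set up

    def step(self, parameters, gradients, epoch, iteration):
        # Apply one gradient-descent update per parameter.
        for name, grad in gradients.items():
            parameters[name] -= self.lr * grad

params = {"w": 1.0, "b": 0.5}
SGD(lr=0.1).step(params, {"w": 2.0, "b": 1.0}, epoch=0, iteration=0)
print(params)  # → {'w': 0.8, 'b': 0.4}
```

Because the optimizer carries no loop logic, the same instance can be handed to any trainer strategy.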
Found 2 issues to address:
Mention @roomote in a comment to request specific changes to this pull request or fix all unresolved issues.
def optimize_batch(self, model: Model, δEδps: ParameterSet, epoch: int, iteration: int):
    self.optimizer.step(model.get_parameters(), δEδps, epoch, iteration)
optimize_batch() is dead code on all backward-compat wrappers. SupervisedTrainer.train() calls self.optimizer.step() directly and never dispatches through optimize_batch(). These methods are unreachable -- anyone subclassing BatchedGradientOptimizer and overriding optimize_batch() (which was the old extension point) would silently have their override ignored.
Consider either routing the training loop through optimize_batch() so the old override contract is honoured, or removing these methods entirely to avoid confusion.
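The first option (routing the loop through `optimize_batch()` so subclass overrides are reached again) might look like this simplified sketch; the classes here are stand-ins for the PR's `SupervisedTrainer`/optimizer pair, not its actual code.

```python
# Stand-in optimizer: plain SGD over a dict of parameters.
class Optimizer:
    def step(self, params, grads, epoch, iteration):
        for k, g in grads.items():
            params[k] -= 0.1 * g

class Trainer:
    def __init__(self, optimizer):
        self.optimizer = optimizer

    def optimize_batch(self, params, grads, epoch, iteration):
        # The old extension point: train() dispatches through here,
        # so subclass overrides are honoured.
        self.optimizer.step(params, grads, epoch, iteration)

    def train(self, params, grads_per_batch):
        for i, grads in enumerate(grads_per_batch):
            self.optimize_batch(params, grads, epoch=0, iteration=i)

class LoggingTrainer(Trainer):
    """A subclass using the old override contract."""
    def __init__(self, optimizer):
        super().__init__(optimizer)
        self.calls = 0

    def optimize_batch(self, params, grads, epoch, iteration):
        self.calls += 1  # override is now actually reached
        super().optimize_batch(params, grads, epoch, iteration)

t = LoggingTrainer(Optimizer())
p = {"w": 1.0}
t.train(p, [{"w": 1.0}, {"w": 1.0}])
print(t.calls)  # → 2
```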
Fix it with Roo Code or mention @roomote and request a fix.
batches = n // self.batch_size
history = []
model.set_phase(Phase.Training)
self.optimizer.initialize(model.get_parameters())
self.optimizer.initialize() is called unconditionally at the start of every train() invocation. For stateful optimizers (MomentumSGD, AdamOptimizer, RMSpropOptimizer), this zeroes out momentum buffers and moment estimates each time. The old code used a self.first flag set once in __init__, so calling optimize() multiple times on the same optimizer instance preserved accumulated state. With this change, any code that calls train()/optimize() more than once (e.g., warm-starting, curriculum learning, or resuming training) silently loses all optimizer state.
A straightforward fix: track an initialized flag on the optimizer and skip re-initialization if it has already been called, or let the caller decide via a reset_state parameter.
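The flag-based fix suggested here could be sketched as follows; `MomentumSGD` is one of the PR's classes, but the body below is a simplified illustration, not the PR's implementation.

```python
# Sketch: guard initialize() with a flag so repeated train() calls
# preserve accumulated optimizer state (momentum, moment estimates).
class MomentumSGD:
    def __init__(self, lr=0.1, beta=0.9):
        self.lr, self.beta = lr, beta
        self.initialized = False
        self.velocity = None

    def initialize(self, params, reset_state=False):
        if self.initialized and not reset_state:
            return  # warm start: keep momentum from previous training runs
        self.velocity = {k: 0.0 for k in params}
        self.initialized = True

    def step(self, params, grads, epoch, iteration):
        for k, g in grads.items():
            self.velocity[k] = self.beta * self.velocity[k] + g
            params[k] -= self.lr * self.velocity[k]

opt = MomentumSGD()
p = {"w": 0.0}
opt.initialize(p)
opt.step(p, {"w": 1.0}, 0, 0)
opt.initialize(p)              # second train() call: no-op, state survives
opt.step(p, {"w": 1.0}, 0, 1)
print(round(opt.velocity["w"], 2))  # momentum accumulated: 0.9*1.0 + 1.0 = 1.9
```

With the guard in place, callers who genuinely want a fresh start can pass `reset_state=True`.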
Summary
This PR refactors the coupled optimizer+training-loop design into two distinct concerns, making it possible to train RNN models (and in the future, GANs) with gradient accumulation, while keeping full backward compatibility with the existing API.
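The gradient-accumulation idea mentioned here can be sketched minimally; class and method names below are illustrative stand-ins for the PR's `RecurrentTrainer`, not its actual API.

```python
# Sketch: accumulate gradients over N batches, then apply a single
# optimizer step with the summed gradients.
class SGD:
    def __init__(self, lr=0.1):
        self.lr = lr
    def step(self, params, grads, epoch, iteration):
        for k, g in grads.items():
            params[k] -= self.lr * g

class AccumulatingTrainer:
    def __init__(self, optimizer, accumulate_steps=2):
        self.optimizer = optimizer
        self.accumulate_steps = accumulate_steps
        self._acc = {}    # summed gradients so far
        self._count = 0   # batches seen since the last update

    def process_batch(self, params, grads, epoch, iteration):
        for k, g in grads.items():
            self._acc[k] = self._acc.get(k, 0.0) + g
        self._count += 1
        if self._count == self.accumulate_steps:
            self.optimizer.step(params, self._acc, epoch, iteration)
            self._acc, self._count = {}, 0

trainer = AccumulatingTrainer(SGD(lr=0.1), accumulate_steps=2)
p = {"w": 1.0}
trainer.process_batch(p, {"w": 1.0}, 0, 0)  # accumulated, no update yet
trainer.process_batch(p, {"w": 3.0}, 0, 1)  # one step with summed grad 4.0
print(p["w"])  # → 0.6
```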
Problem
The existing design combines optimization algorithms (SGD, Adam, etc.) with the training loop (batching, epochs) into a single `BatchedGradientOptimizer` class hierarchy. Gradients are computed once per batch and immediately applied -- there is no way to accumulate gradients across batches, which is needed for stable RNN training.

Solution
1. Pure Optimizer algorithms (`optimizer.py`)

New classes that only know how to update parameters given gradients:

- `SGD` - simple gradient descent
- `MomentumSGD` - momentum-based SGD
- `NesterovMomentumSGD` - Nesterov accelerated gradient
- `RMSpropOptimizer` - RMSprop adaptive learning rate
- `AdamOptimizer` - Adam (adaptive moments)
- `SignSGDOptimizer` - sign-based SGD

Each implements a simple
`step(parameters, gradients, epoch, iteration)` interface.

2. Trainer strategies (`trainers.py`)

New classes that handle the training loop and delegate parameter updates to an Optimizer:
- `SupervisedTrainer` - standard batch training for feedforward models (MLPs, CNNs)
- `RecurrentTrainer` - supports gradient accumulation across multiple batches before updating, enabling stable RNN training

3. Backward compatibility
Wrapper classes (`GradientDescent`, `Adam`, `RMSprop`, etc.) preserve the existing API.

Testing
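One way to exercise the backward-compatible facade is a smoke check like the sketch below; `GradientDescent`, `SupervisedTrainer`, and `SGD` are names from the PR, but their bodies and constructor arguments here are simplified stand-ins.

```python
# Stand-in internals for the new optimizer/trainer split.
class SGD:
    def __init__(self, lr=0.01):
        self.lr = lr
    def step(self, params, grads, epoch, iteration):
        for k, g in grads.items():
            params[k] -= self.lr * g

class SupervisedTrainer:
    def __init__(self, optimizer):
        self.optimizer = optimizer
    def train(self, params, grads_per_batch):
        for i, grads in enumerate(grads_per_batch):
            self.optimizer.step(params, grads, 0, i)

class GradientDescent:
    """Backward-compatible facade: old constructor and optimize(),
    delegating to the new trainer + optimizer pair internally."""
    def __init__(self, learning_rate=0.01):
        self._trainer = SupervisedTrainer(SGD(lr=learning_rate))
    def optimize(self, params, grads_per_batch):
        self._trainer.train(params, grads_per_batch)

# Old-style call site keeps working unchanged.
p = {"w": 1.0}
GradientDescent(learning_rate=0.1).optimize(p, [{"w": 1.0}])
print(p["w"])  # → 0.9
```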