Conversation
|
Hi @eschibli, First of all, thanks for opening this PR! For the linting, it will make your life much easier if you follow these instructions, or you can also run it manually. |
Codecov Report ✅ All modified and coverable lines are covered by tests. Additional details and impacted files:

@@ Coverage Diff @@
## master #2555 +/- ##
==========================================
- Coverage 95.69% 95.63% -0.07%
==========================================
Files 156 156
Lines 16926 16940 +14
==========================================
+ Hits 16197 16200 +3
- Misses 729 740 +11
|
Thanks @madtoinou. I was not able to get Gradle running on my machine and didn't realize ruff was that easy to set up, so sorry for spamming your test pipeline. I don't believe the failing mac build is a result of my changes, so it should be good for review now. |
|
Hi @eschibli, thanks for the PR. Yes, the failing mac tests are unrelated to your PR, we're working on it :). |
|
Understood, Dennis. |
madtoinou
left a comment
It looks great, thank you for this nice PR @eschibli! Some minor comments about the order of the operations/projections to make the flow more intuitive.
Could you also extend the TSMixer notebook to include a section where the difference in performance with `project_first_layer=True/False` and future covariates can be visualized?
    # In the original paper this was not implemented for future covariates,
    # but rather than ignoring them or raising an error we remap them to the input time dimension.
    # Suboptimal but may be useful in some cases.
    elif self.future_cov_dim:
To make it a bit more intuitive, I would move this code below, inside the `if self.future_cov_dim` and change the condition to `if not self.project_first_layer` in order to group the operations on each kind of features:

- "target": project to the output time dimension in the first layer if `project_first_layer=True`, otherwise we stay in the input time dimension
- "target": do the `feature_mixing_hist` (not changed)
- "fut_cov": project the future covariates to the input time dimension if `project_first_layer=False` (the logic you added)
- concatenate the future covariates to the target features (not changed)
- static covariates (not changed)
- "target": projection to the output time dimension if it did not occur earlier
- "target": application of `fc_out`, critical for probabilistic forecasts
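The grouping sketched in the list above can be illustrated with a minimal NumPy shape-flow example (not darts code; `project_first_layer` and the zero-weight projection matrices are placeholders, and the mixing layers are omitted):

```python
import numpy as np

B, L, T, H = 4, 12, 6, 8        # batch, input chunk, output chunk, hidden size
project_first_layer = False      # hypothetical flag from this PR

x = np.zeros((B, L, H))          # "target" features, input time dimension
fut = np.zeros((B, T, 3))        # future covariates, output time dimension

# "target": project to the output time dimension first if requested
if project_first_layer:
    x = np.einsum("blh,lt->bth", x, np.zeros((L, T)))

# "fut_cov": otherwise project the future covariates to the input time dim
if not project_first_layer:
    fut = np.einsum("btc,tl->blc", fut, np.zeros((T, L)))

# concatenate future covariates onto the target features (time dims now agree)
x = np.concatenate([x, fut], axis=-1)

# "target": project to the output time dimension if it did not occur earlier
if not project_first_layer:
    x = np.einsum("blh,lt->bth", x, np.zeros((L, T)))

print(x.shape)  # both branches end at (4, 6, 11)
```

Either branch ends in the output time dimension with the concatenated features, which is what makes the final static-covariate mixing and `fc_out` steps identical in both modes.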
This implementation is out-of-date with the current version of the PR, and I think the current version makes more sense.
I set it up this way originally because the TSMixer performance test was failing if I didn't generate the layers in exactly the same order. In this PR I have relaxed the tolerance as it wasn't stable.
|
@madtoinou are you happy with these changes? |
|
@madtoinou @dennisbader sorry to nag but could you please clarify if you are unhappy with this addition or if more changes are needed to approve this? |
|
@madtoinou could you give an update on this one? :) |
|
Hi @eschibli, Sorry for the delay, I started reviewing but it took me longer than expected. These changes are welcome, I have the impression that they will bring considerable performance gains, but I want to make sure that the current behavior is still accessible and reproducible. |
|
Sorry to nag but is there anything I could do to get this merged? It should reproduce current behavior when |
|
Hi @eschibli, and really sorry that this got a bit lost. We'll review it by next Friday (most likely on Friday itself). |
dennisbader
left a comment
Thanks a lot for the PR and the updates, @eschibli, it looks really good 🚀
We're almost there :)
I added a couple of minor suggestions and one that we should discuss a bit more (the one where `encoder_to_decoder` is now applied at a different step compared to the original implementation).
Sorry again for the delay on this one 🙏
    "normalize_before": normalize_before,
}

# Projects from the input time dimension to the output time dimension
`self.fc_hist` is not used anymore (I guess replaced by the new `encoder_to_decoder`?). We should remove one of the two.
num_encoder_blocks
    The number of mixer blocks in the encoder.
num_decoder_blocks
    The number of mixer blocks in the decoder.

These are not available, and the description of `project_after_n_blocks` is missing.

Suggested change:
-num_encoder_blocks
-    The number of mixer blocks in the encoder.
-num_decoder_blocks
-    The number of mixer blocks in the decoder.
+num_blocks
+    The number of mixer blocks in the model. The number includes the first block and all subsequent blocks.
# Raise exception for nonsensical number of encoder and decoder blocks
if (
    num_encoder_blocks < 0
    or num_decoder_blocks < 0
    or (num_encoder_blocks + num_decoder_blocks != self.num_blocks)
):
    raise_log(
        ValueError(
            f"Invalid number of encoder and decoder blocks. "
            f"project_after_n_blocks must be between 0 and {self.num_blocks} inclusive."
        ),
    )
We should move this sanity check to `__init__`, simply check that `0 <= project_after_n_blocks <= num_blocks`, and rephrase the message to:
f"`project_after_n_blocks` must be between 0 and {num_blocks} inclusive."
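A minimal sketch of that check as it could look at construction time (a hypothetical standalone function; darts' actual `raise_log` wrapper and the model's `__init__` signature are omitted):

```python
def check_project_after_n_blocks(project_after_n_blocks: int, num_blocks: int) -> None:
    # validate once in __init__ rather than on every forward pass
    if not (0 <= project_after_n_blocks <= num_blocks):
        raise ValueError(
            f"`project_after_n_blocks` must be between 0 and {num_blocks} inclusive."
        )

check_project_after_n_blocks(2, 4)      # valid: no exception
try:
    check_project_after_n_blocks(5, 4)  # invalid: raises ValueError
except ValueError as err:
    print(err)
```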
# (B, C, S) -> (B, 1, C * S)
x_static_hist = x_static.reshape(x_static.shape[0], 1, -1)
# repeat to match lookback time dim: (B, 1, C * S) -> (B, L, C * S)
x_static_hist = x_static_hist.repeat(1, self.input_chunk_length, 1)

# (B, C, S) -> (B, 1, C * S)
x_static_future = x_static.reshape(x_static.shape[0], 1, -1)
# repeat to match horizon time dim: (B, 1, C * S) -> (B, T, C * S)
x_static_future = x_static_future.repeat(1, self.output_chunk_length, 1)
This can be simplified. Also, it would be nice to only create `x_static_hist` / `x_static_future` if they are really required (e.g. future only if the encoder is not used).
Any type of operation that can be avoided has a positive impact on the model throughput :)
Suggested change:
-# (B, C, S) -> (B, 1, C * S)
-x_static_hist = x_static.reshape(x_static.shape[0], 1, -1)
-# repeat to match lookback time dim: (B, 1, C * S) -> (B, L, C * S)
-x_static_hist = x_static_hist.repeat(1, self.input_chunk_length, 1)
-# (B, C, S) -> (B, 1, C * S)
-x_static_future = x_static.reshape(x_static.shape[0], 1, -1)
-# repeat to match horizon time dim: (B, 1, C * S) -> (B, T, C * S)
-x_static_future = x_static_future.repeat(1, self.output_chunk_length, 1)
+# (B, C, S) -> (B, 1, C * S)
+x_static = x_static.reshape(x_static.shape[0], 1, -1)
+# repeat to match lookback time dim: (B, 1, C * S) -> (B, L, C * S)
+x_static_hist = x_static.repeat(1, self.input_chunk_length, 1)
+# repeat to match horizon time dim: (B, 1, C * S) -> (B, T, C * S)
+x_static_future = x_static.repeat(1, self.output_chunk_length, 1)
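The effect of reshaping once and only materializing what is needed can be illustrated with NumPy (illustrative sizes; `use_encoder` is a hypothetical stand-in for "the encoder path actually needs the history copy"):

```python
import numpy as np

B, C, S = 2, 3, 4      # batch, static covariate dim, components
L, T = 12, 6           # input / output chunk lengths
use_encoder = True     # hypothetical: skip x_static_hist when the encoder is unused

x_static = np.zeros((B, C, S))
# reshape only once: (B, C, S) -> (B, 1, C * S)
x_static = x_static.reshape(B, 1, -1)

# materialize each repeated copy only when it is actually required
x_static_hist = np.repeat(x_static, L, axis=1) if use_encoder else None
x_static_future = np.repeat(x_static, T, axis=1)

print(x_static_hist.shape, x_static_future.shape)  # (2, 12, 12) (2, 6, 12)
```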
# Project time dimension (B, L, H_S) -> (B, T, H_S)
x = x.transpose(1, 2)
x = self.encoder_to_decoder(x)  # Linear map
Hmm.. The `encoder_to_decoder` (previously called `fc_hist`) is now applied after `feature_mixing_hist`.
This means `feature_mixing_hist` is only applied on the input chunk (L), instead of on the output chunk (T). From Figure 4 in the paper, the temporal projection for historic features is also applied before the `feature_mixing_hist`.
Meaning, users will probably not be able to reproduce results from earlier Darts versions.
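For reference, the transpose-then-linear time projection itself does the following (NumPy sketch; a random matrix stands in for the `encoder_to_decoder` weights). Whether this runs before or after `feature_mixing_hist` determines whether the mixing sees L or T time steps, which is the reproducibility concern above:

```python
import numpy as np

B, L, T, H = 2, 12, 6, 8
x = np.random.rand(B, L, H)          # hidden features over the input chunk
W = np.random.rand(L, T)             # stand-in for the encoder_to_decoder weights

# (B, L, H) -> (B, H, L): move time to the last axis
# (B, H, L) @ (L, T) -> (B, H, T): linear map over the time axis
# (B, H, T) -> (B, T, H): move time back
x_proj = np.swapaxes(np.swapaxes(x, 1, 2) @ W, 1, 2)
print(x_proj.shape)  # (2, 6, 8)
```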
for mixing_layer in self.conditional_mixer:
    # conditional mixer layers with static covariates (B, T, 2 * H_S), (B, T, C * S) -> (B, T, H_S)
    x = mixing_layer(x, x_static=x_static)
# Apply decoder mixer layers
Let's add the shape transformation.
x_static = x_static.reshape(x_static.shape[0], 1, -1)
# repeat to match horizon (B, 1, C * S) -> (B, T, C * S)
x_static = x_static.repeat(1, self.output_chunk_length, 1)
# Apply encoder mixer layers
Can you add the shape transformation (e.g. (B, ...) -> (B, ...))?
Implements #2510
Summary
Adds the option to project to the output temporal space at the end of TS-Mixer, rather than at the beginning. This is how most of the results in the original google-research paper were achieved (i.e., the architecture in Fig. 1 of the paper). It may allow higher performance in cases where past covariates are important, by enabling a more direct series of residual connections along the input time dimension.
I allowed support for future covariates by instead projecting them into the lookback temporal space, but this probably won't perform well in cases where they are more important than the historical targets and past covariates.
Other Information
The original paper and source code do not clarify whether the final temporal projection should go before or after the final feature projection, as they hardcoded `hidden_size` to `output_dim` and therefore did not need a final feature projection. I erred on the side of putting the temporal projection first, as otherwise the common `output_dim == 1` could lead to unexpected, catastrophic compression before the temporal projection step.
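The compression concern can be made concrete with a small NumPy sketch (illustrative sizes; random matrices stand in for the learned projections):

```python
import numpy as np

B, L, T, H = 2, 12, 6, 8
output_dim = 1                        # common single-target case
x = np.random.rand(B, L, H)

# feature projection first: the temporal projection would only ever see
# a single feature per time step -- most information is already gone
feature_first = x @ np.random.rand(H, output_dim)               # (2, 12, 1)

# temporal projection first: all H features survive until the final step
temporal_first = np.swapaxes(np.swapaxes(x, 1, 2) @ np.random.rand(L, T), 1, 2)
print(feature_first.shape, temporal_first.shape)  # (2, 12, 1) (2, 6, 8)
```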