View in #questions-forum on Slack

@Sayantan_Auddy:Hi everyone. I have a couple of questions related to the use of EWC and VCL with ResNet-based network architectures when a squared error loss is used:

- For computing the Fisher Information for EWC or online EWC in a scenario where an MSE loss (or Huber loss) is used, is there anything special that one needs to consider?
- Should the Batch Normalization parameters be treated on par with the weights and biases (which we try to protect), or are the BN parameters ignored for continual learning?
- How do VCL and EWC perform for models with deeper ResNet-based architectures?

If anyone has suggestions based on their own experience, or general advice about any of these questions, I would really appreciate your input. Thanks in advance.

@Martin_Mundt:Hey, I realized no one had answered this, so I’ll try to give it my best, even though I’m not sure I have a good answer.

This is really just a wild guess, but since EWC/SI etc. are essentially about estimating parameter importance and then regularizing with an additional loss term on those parameters, I don’t see why there would be significant changes when you move from, e.g., classification to regression.
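To make this concrete, here’s a minimal numpy sketch of the kind of parameter-space penalty being described; the function and variable names (`ewc_penalty`, `fisher_diag`, `theta_star`) are just illustrative, not anyone’s reference implementation:

```python
import numpy as np

def ewc_penalty(theta, theta_star, fisher_diag, lam=1.0):
    """Quadratic EWC-style penalty: (lam/2) * sum_i F_i * (theta_i - theta*_i)^2.

    theta       -- current parameters (flat array)
    theta_star  -- parameters remembered from the previous task
    fisher_diag -- diagonal Fisher estimate, i.e. per-parameter importance
    """
    return 0.5 * lam * np.sum(fisher_diag * (theta - theta_star) ** 2)

# This penalty is simply added to whatever task loss is used (cross-entropy,
# MSE, Huber, ...), which is why switching the task loss does not change the
# form of the regularizer itself.
```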

Batch norm accumulates statistics of the dataset, so in that sense its parameters are as prone to interference/forgetting as any others. Whether the forgetting is catastrophic may depend on the setting, though. BN usually just tracks running averages of whatever it has seen, so while the statistics do get overwritten, this would be gradual rather than catastrophic. But I’m not very sure, because BN isn’t well understood yet.
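The “gradual overwriting” intuition can be illustrated with a toy simulation of BN’s exponential-moving-average update (the momentum value 0.1 below is an assumption, mirroring a common default; the numbers are made up):

```python
import numpy as np

def update_running_mean(running_mean, batch_mean, momentum=0.1):
    # Exponential moving average, as common BN implementations use
    # for their tracked (non-learnable) statistics.
    return (1 - momentum) * running_mean + momentum * batch_mean

rm = 0.0
# "Task A": batches centered at 0.0 -> the running mean settles at 0.
for _ in range(100):
    rm = update_running_mean(rm, 0.0)

# "Task B": batches centered at 5.0 -> the statistic drifts toward 5.
for _ in range(5):
    rm = update_running_mean(rm, 5.0)

# After 5 task-B batches the running mean has only partially moved
# (about 2.05, not 5.0): the old statistic is overwritten gradually.
```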

I think this question is more involved than you may initially realize, because variational continual learning (and other similar methods) makes use of Bayesian neural networks. In contrast to your average CNN, those have a full distribution over the weights, from which they sample at all times. This makes it inherently more complicated (but perhaps also partially unnecessary) to “scale” this to deep ResNets.
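For anyone following along, here’s a toy numpy sketch of what “a full distribution over the weights, sampled at all times” means for a single linear layer under a mean-field Gaussian posterior; all names and shapes are illustrative, and the softplus/reparameterization choices are just one common setup:

```python
import numpy as np

rng = np.random.default_rng(0)

# Mean-field Gaussian posterior over a linear layer's weights: each weight
# gets its own mean and (softplus-parameterized) std, so the layer stores
# twice the parameters of its deterministic counterpart.
in_dim, out_dim = 4, 3
w_mu = rng.normal(size=(in_dim, out_dim))
w_rho = np.full((in_dim, out_dim), -3.0)  # pre-softplus std parameter

def sample_forward(x):
    std = np.log1p(np.exp(w_rho))      # softplus keeps the std positive
    eps = rng.normal(size=w_mu.shape)  # reparameterization trick
    w = w_mu + std * eps               # fresh weight sample on every call
    return x @ w

x = rng.normal(size=(2, in_dim))
y1, y2 = sample_forward(x), sample_forward(x)
# Two forward passes on the same input give different outputs, because the
# weights themselves are resampled each time.
```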

@Sayantan_Auddy:Hi @Martin_Mundt, thanks for responding.

- For EWC, the additional loss term involves the computation of the Fisher Information matrix, which equals the negative expected Hessian of the log-likelihood [see here]. So yes, I agree that there should not be any changes (at least theoretically) when moving from classification to regression; I just wanted to be sure.
- The BN parameters I was referring to are the learnable scaling parameters γ and β [formula], not the running averages of the mean and variance. Since these parameters are far fewer than the weights/biases, I was curious whether they should be treated on par with the other parameters for CL. I will try this out on a small dataset and see what happens.
- I am aware that VCL uses Bayesian layers whose parameters are sampled from a distribution, but I’m not sure I understand why this makes it unnecessary to scale to deeper architectures. As far as I know, sampling the parameters instead of using point estimates has a regularizing effect, similar to Dropout, and allows for direct estimation of the prediction uncertainty. I realize that a Bayesian neural network (using mean-field VI) has twice the number of parameters of a regular network with the same architecture, and this can make training difficult. But I have not come across examples where techniques similar to VCL have been applied to tasks that require deeper/more involved network architectures. My question was an attempt to understand whether it is advisable to directly convert a complex network architecture to its Bayesian counterpart and then use VCL for continual learning.
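On the first bullet, the connection between MSE and the Fisher can be made concrete: with a Gaussian likelihood of fixed unit variance, the MSE loss is (up to constants) the negative log-likelihood, so a diagonal Fisher can be estimated from squared per-sample gradients of the log-likelihood. Below is a toy numpy sketch for a linear model, using the empirical Fisher (a common approximation that plugs in the observed targets instead of sampling from the model); all names and data are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy linear regression y = x @ w with unit-variance Gaussian likelihood:
# log p(y | x, w) = -0.5 * (y - x @ w)^2 + const, i.e. minimizing MSE is
# (up to constants) maximizing this log-likelihood.
n, d = 200, 3
x = rng.normal(size=(n, d))
w_true = np.array([1.0, -2.0, 0.5])
y = x @ w_true + 0.1 * rng.normal(size=n)

w = w_true.copy()  # pretend training on this task has converged

def empirical_fisher_diag(x, y, w):
    # Per-sample gradient of log p(y_i | x_i, w) wrt w: (y_i - x_i @ w) * x_i.
    residual = y - x @ w
    grads = residual[:, None] * x
    # Diagonal empirical Fisher: average of element-wise squared gradients.
    return np.mean(grads ** 2, axis=0)

F = empirical_fisher_diag(x, y, w)
# F holds one non-negative importance value per parameter, which is what a
# Fisher-weighted penalty would consume -- nothing about this changes
# structurally when moving from classification to regression.
```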

Thanks again for taking the time to answer. I would be happy to continue this conversation if you or anyone else has additional input.

@Martin_Mundt:Hey @Sayantan_Auddy, thanks for initiating the discussion. I think it’s cool to brainstorm about these aspects.

- Oh right, I often turn these off myself, so I forgot about them. I suppose you are right: these learnable parameters define an affine transformation, which would in principle be subject to catastrophic forgetting. I’m curious to see what you observe in your investigation. Thanks for pointing this out!
- Yes, you are right, my initial response wasn’t exactly precise here; I should have clarified what I meant. I think you’ve already captured the essence of what I was trying to say, however: “realize that BNN (using MF VI) has twice the number of parameters … can make the training process difficult”. What I was trying to get at is that I haven’t seen “more complex” continual learning applications with full BNNs yet. And when I say BNNs, as you said, I mean that the model weights of each layer are the random variables being marginalised (not just the latent variable z as in e.g. VAEs, or a Bernoulli-distribution-like approximation such as Monte-Carlo Dropout). I’ve only seen one recent paper (I can’t remember what it was called … I’ll try to think of it again) where this was successfully done without training issues for very deep, complex architectures such as ResNet (at the same final accuracy as a non-Bayesian CNN). I presently suspect that this is what holds back their use in “complex” continual learning applications. That statement might not be fully up to date, though, so if you know of literature that actually trains a full BNN, with full distributions from which values are sampled, in complicated deep nets, I’d appreciate a link.