Unexpected hidden activation dimensions in convolutional neural network

John Greenhall on 14 Apr 2021
Answered: Hrishikesh Borate on 20 Apr 2021
I am attempting to build a multi-layer convolutional neural network, with multiple conv layers (and pooling, dropout, activation layers in between). However, I am a bit confused about the sizes of the weights and the activations from each conv layer.
For simplicity, let's assume each conv layer consists of M filters of size m x m. I define each conv layer using convolution2dLayer([m,m],M,'Padding','Same').
The first layer takes in a single image and outputs M images (a 4D array with last dimension M). The first layer also has weights of dimension m x m x 1 x M. This is all as I would expect.
The subsequent layers are where I am getting confused. I expect the 2nd conv layer to take in M images, and apply M filters of size m x m (weight dimension m x m x 1 x M), resulting in an output with M^2 images, as we apply all M filters to each of the M inputs. Instead, the weights have dimensions m x m x M x M, and there are only M output images (according to the "activations" function).
The later conv layers are the same as the 2nd layer, where the weights are size m x m x M x M, and there are only M output images from each layer.
Am I missing something?

Answers (1)

Hrishikesh Borate on 20 Apr 2021
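In a convolution layer, the depth of each filter equals the depth of the input, that is, the number of input channels. Each filter spans all input channels and sums over them to produce a single output channel, so a layer with M filters produces M output channels regardless of how many channels it receives. Hence, the dimensions of the weights in a convolution layer are:
(filter height) x (filter width) x (number of input channels) x (number of filters)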
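To make the channel bookkeeping concrete, here is a minimal sketch in base MATLAB that applies a bank of filters by hand. It is not how convolution2dLayer is implemented internally; the sizes (inH, inW, C, K, m) and variable names are hypothetical values chosen for illustration. It uses filter2, which computes a correlation with zero padding, to mimic 'same' padding for an odd filter size.

```matlab
% Hypothetical sizes, chosen only for illustration.
inH = 8; inW = 8;   % input height and width
C = 3;              % number of input channels (input depth)
K = 4;              % number of filters in the layer
m = 3;              % filter height and width

X   = rand(inH, inW, C);    % input activations
Wts = rand(m, m, C, K);     % weights: m x m x C x K
b   = zeros(1, K);          % one bias per filter

Y = zeros(inH, inW, K);     % one output map per filter, not C*K maps
for k = 1:K
    acc = zeros(inH, inW);
    for c = 1:C
        % Each filter has one m x m slice per input channel;
        % the per-channel results are summed into a single map.
        acc = acc + filter2(Wts(:,:,c,k), X(:,:,c), 'same');
    end
    Y(:,:,k) = acc + b(k);
end

disp(size(Y))   % 8 8 4
```

The inner loop is where the M^2 intuition breaks down: the C per-channel results are added together rather than kept as separate outputs, so the output depth equals the number of filters K.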
For example, suppose the input to the network is a single-channel image and each convolution layer is defined as:
convolution2dLayer([m,m], M, 'Padding', 'same');
Assuming the network contains only convolution layers, the weights of the first convolution layer have dimensions m x m x 1 x M (because the input depth is 1), and the output of this layer has dimensions (input image height) x (input image width) x (number of filters = M). These output activations are the input to the second convolution layer, so the weights of the second convolution layer have the following dimensions:
(filter height = m) x (filter width = m) x (input depth = M) x (number of filters = M)
Similarly, the dimension of weights in subsequent convolution layers will be m x m x M x M.
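One way to check these sizes yourself is sketched below. It assumes Deep Learning Toolbox is installed; the 28 x 28 single-channel input, m = 3, and M = 16 are hypothetical values, and the relu layers are there only to separate the convolution layers. Building a dlnetwork from the layer graph initializes the weights so their sizes can be inspected without training.

```matlab
m = 3;    % hypothetical filter size
M = 16;   % hypothetical number of filters per layer

layers = [
    imageInputLayer([28 28 1], 'Normalization', 'none')  % assumed single-channel input
    convolution2dLayer([m m], M, 'Padding', 'same')
    reluLayer
    convolution2dLayer([m m], M, 'Padding', 'same')
    reluLayer
    convolution2dLayer([m m], M, 'Padding', 'same')];

% Constructing the dlnetwork initializes the learnable parameters.
net = dlnetwork(layerGraph(layers));

convIdx = [2 4 6];   % positions of the convolution layers in the array above
for i = convIdx
    sz = size(net.Layers(i).Weights);
    fprintf('Layer %d weights: %s\n', i, mat2str(sz));
end
```

Per the explanation above, the first convolution layer should report m x m x 1 x M and the later ones m x m x M x M.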
For more information, refer to the convolution2dLayer documentation.

