How to implement PyTorch's Linear layer in Matlab?

Hello,
How can I implement PyTorch's Linear layer in Matlab?
The problem is that Linear does not flatten its inputs whereas Matlab's fullyConnectedLayer does, so the two are not equivalent.
Thx,
J

Answers (4)

Matt J
Matt J on 11 Feb 2023
Edited: Matt J on 11 Feb 2023
One possibility might be to express the linear layer as a cascade of fullyConnectedLayer followed by a functionLayer. The functionLayer can reshape the flattened input back to the form you want,
layer = functionLayer(@(X)reshape(X,[h,w,c]));

9 Comments

Thx.
First, your suggestion wouldn't work because the output has too few elements, i.e. n_out x 1, whereas Linear outputs n_out x n_ch (disregarding the batch dimension).
Second, the problem is that the semantics differ (not only the shapes):
The number of weights in PyTorch is n_in * n_out, where n_in is the size of the last input dimension and n_out is the size of the output. Every slice (page) of the input is multiplied by this matrix, so different slices do not affect each other.
The number of weights in Matlab is (n_c * n_s * n_u) * n_out, where n_c, n_s and n_u are the sizes of the channel, spatial and unnamed dimensions, respectively, and n_out is the size of the output. The flattened input is multiplied by this matrix, so the different channels do affect each other.
It's possible to hack the weights by setting the cross-channel ones to zero, but then learning would probably have to be hacked as well, because the cross-channel entries might drift away from zero during training.
PS
  1. There's also the pagemtimes operation available for dlarrays, but the input must be stripped of its labels and the output relabeled afterwards. Not sure how this would impact automatic differentiation.
  2. Just checked: PyTorch uses matmul (batched matrix multiply) for Linear when it cannot use a standard matrix multiplication.
  3. Matlab's matmul implementation in the ONNX importer just loops over the third-through-last dimensions, doing matrix multiplications.
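The semantic difference can be sketched in NumPy (hypothetical sizes; this is an illustration of the two weight layouts, not either framework's actual code):

```python
import numpy as np

rng = np.random.default_rng(0)
n_ch, n_in, n_out = 2, 3, 4
X = rng.standard_normal((n_ch, n_in))   # one sample: n_ch pages of length n_in

# PyTorch-style Linear: one (n_out x n_in) matrix applied to the last dimension;
# pages (channels) never mix.
W_linear = rng.standard_normal((n_out, n_in))        # n_in * n_out weights
Y_linear = X @ W_linear.T                            # shape (n_ch, n_out)

# MATLAB fullyConnectedLayer-style: flatten everything, one big matrix;
# every output element mixes all channels.
W_fc = rng.standard_normal((n_out, n_ch * n_in))     # (n_ch*n_in) * n_out weights
Y_fc = W_fc @ X.reshape(-1)                          # shape (n_out,)

print(Y_linear.shape, Y_fc.shape)  # (2, 4) (4,)
```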
J
So you're saying you want a separate linear transform applied independently to each of the channels? Then you could use a groupedConvolution2dLayer, where the filtersize is the size of a complete channel,
layer = groupedConvolution2dLayer(filterSize,numFiltersperGroup,'channel-wise')
and numFiltersperGroup is chosen depending on how many outputs your linear transforms are supposed to have.
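A NumPy sketch (hypothetical sizes) of what a channel-wise grouped transform computes: each channel gets its own matrix, and channels stay separate.

```python
import numpy as np

rng = np.random.default_rng(1)
H, W, C, n_out = 4, 1, 3, 5           # treat each channel as one (H*W)-vector
X = rng.standard_normal((H * W, C))   # one column per channel

# Channel-wise grouped transform: a *different* (n_out x H*W) matrix per channel,
# which is what a channel-wise grouped convolution with full-channel filters amounts to.
A = rng.standard_normal((C, n_out, H * W))
Y = np.stack([A[c] @ X[:, c] for c in range(C)], axis=1)   # (n_out, C)

print(Y.shape)  # (5, 3)
```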
Thx.
Unfortunately, this won't work (at least not as is). There are two issues:
  1. The 2d-convolution performs element-wise multiplication of the kernel with the input and sums all the intermediate results together, which is not what matrix multiplication does.
  2. The kernel would need to be duplicated per channel, and then the issue of the copies diverging during training might still bite.
Still, it's possible to perform matrix multiplication using 1d-convolution. See my own answer below.
The 2d-convolution performs element-wise multiplication of the kernel with the input and sums all the intermediate results together which is not what matrix multiplication does.
Yes, it is. That's exactly what matrix multiplication does.
The kernel would need to be duplicated per channel and then the issue of divergence during training still might bite.
You'll need to elaborate on that. Are you trying to transform each channel with the same matrix or avoid it? The scheme I outlined for you applies a different matrix to each channel, but if you want to apply the same matrix, you can reshape the input X (assume dimensions are HxWxC) so that the channel dimension becomes spatial,
X=reshape(X,[H*W,C])
and then apply a conv2dLayer with an (H*W)x1xN filter with no padding. It has the exact same effect as multiplying each column (i.e., channel) of X by an Nx(H*W) matrix.
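That equivalence can be checked numerically. A NumPy sketch (hypothetical sizes) of why a "valid" (H*W)x1 correlation down each column of an (H*W)xC image is exactly a matrix product:

```python
import numpy as np

rng = np.random.default_rng(2)
H, W, C, N = 2, 3, 4, 5
X = rng.standard_normal((H, W, C)).reshape(H * W, C)   # channel -> its own column
F = rng.standard_normal((N, H * W))                    # N filters of length H*W

# A 'valid' (H*W)x1 correlation collapses each column to one scalar per filter:
# y[n, c] = sum_k F[n, k] * X[k, c], i.e. the matrix product F @ X,
# with no mixing between columns (channels).
Y_conv = np.array([[F[n] @ X[:, c] for c in range(C)] for n in range(N)])
Y_mat = F @ X

assert np.allclose(Y_conv, Y_mat)
print(Y_mat.shape)  # (5, 4)
```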
The 2d-convolution performs element-wise multiplication of the kernel with the input and sums all the intermediate results together which is not what matrix multiplication does.
Yes, it is. That's exactly what matrix multiplication does.
Yes, but this is not what is required in this case. We need the result of multiplying two matrices, which is another matrix. See my answer below on how this can be done with 1d-convolution.
Matt J
Matt J on 13 Feb 2023
Edited: Matt J on 13 Feb 2023
We need the result of multiplying two matrices, which is another matrix.
Yes, that is what my suggestion gives you. There is no limit in what I propose to the shape or size of the input and output.
More generally, there is no linear transform that can't be implemented using conv layers in combination with reshape() and permute() functionLayers. The only thing that is lacking is a clear understanding of where you want the transformation data to be re-used, if at all. My current understanding is that you want it to be re-used channel-wise. In other words, all channels are to be subject to a common linear transform.
Are you trying to transform each channel with the same matrix or avoid it? The scheme I outlined for you applies a different matrix to each channel, but if you want to apply the same matrix, you can reshape the input X (assume dimensions are HxWxC) so that the channel dimension becomes spatial,
X=reshape(X,[H*W,C])
and then apply a conv2dLayer with an (H*W)x1xN filter with no padding. It has the exact same effect as multiplying each column (i.e., channel) of X by an Nx(H*W) matrix.
Yes - same matrix, but I need the outputs separate for each channel. This solution sums all channels together. The solution with grouped convolution has, as you said, different filters for each channel.
PS I guess N above is the number of outputs.
Matt J
Matt J on 13 Feb 2023
Edited: Matt J on 13 Feb 2023
This solution sums all channels together.
No, it won't. (Keep in mind that this is the third solution I've proposed as information about your aims has come out.) After the reshaping, each channel is contained in its own column of X, and because the filter you apply to X is (H*W)x1xN, there is no way for the filter to combine elements from different columns.
Got it. This looks nice.


It is possible to perform matrix multiplication using convolution, as described in "Fast algorithms for matrix multiplication using pseudo-number-theoretic transforms" (behind a paywall):
  1. Convert the matrix A to a sequence.
  2. Convert the matrix B to a sparse sequence.
  3. Perform 1d convolution between the two sequences to obtain a product sequence.
  4. Extract the entries of matrix C from that sequence.
Unfortunately, the paper provides the equations only for the square-matrix case; I worked out the general case. The critical equations give, for each of the A-, B- and C-sequences, the index mapping and the polynomial degree.
Also, pay attention to the fact that Matlab expects a polynomial's coefficients in descending order.
This is how you do convolution with dlconv s.t. it matches conv:
a=(1:4)';
b=(5:10)';
dla=dlarray(a,'SCB');
weights=flipud(b); % dlconv uses reverse order of conv's weights
% filterSize-by-numChannels-by-numFilters array, where
% filterSize is the size of the 1-D filters,
% numChannels is the number of channels of the input data,
% and numFilters is the number of filters.
bias = 0;
dlc=dlconv(dla,weights,bias,'Padding',length(weights)-1);
c = extractdata(dlc);
assert(all(abs(c-conv(a,b)) < 1e-14),'conv is different from dlconv');
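The flipud is needed because dlconv computes cross-correlation while conv computes true convolution; the same relationship in NumPy terms (a sketch):

```python
import numpy as np

a = np.arange(1, 5, dtype=float)      # (1:4)'
b = np.arange(5, 11, dtype=float)     # (5:10)'

# True convolution...
c_conv = np.convolve(a, b)
# ...equals full cross-correlation with the kernel reversed, which is why the
# MATLAB snippet above feeds flipud(b) to dlconv.
c_corr = np.correlate(a, b[::-1], mode='full')

assert np.allclose(c_conv, c_corr)
```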
Hope this helps.
Another possible way to interpret your question is that you are trying to apply pagemtimes to the input X with a non-learnable matrix A, where the different channels of X are the pages. That can also be done with a functionLayer, as illustrated below both with normal arrays and with dlarrays,
A=rand(4,3); %non-learnable matrix A
xdata=rand(3,3,2); %input layer data with 2 channels
multLayer=functionLayer(@(X) dlarray( pagemtimes(A,stripdims(X)) ,dims(X)) );
X=dlarray(xdata,'SSC');
Y=multLayer.predict(X)
Y =
  4(S) × 3(S) × 2(C) dlarray

(:,:,1) =

    0.8480    0.9729    0.9338
    1.1000    1.6463    1.5592
    0.9130    1.2452    1.1881
    1.1243    1.3362    1.2971

(:,:,2) =

    0.8228    0.5187    1.1387
    1.1783    0.7549    1.5675
    0.9390    0.5816    1.2925
    1.0862    0.6101    1.6128
%%Verify agreement with normal pagemtimes
ydata=pagemtimes(A,xdata)
ydata =
ydata(:,:,1) =

    0.8480    0.9729    0.9338
    1.1000    1.6463    1.5592
    0.9130    1.2452    1.1881
    1.1243    1.3362    1.2971

ydata(:,:,2) =

    0.8228    0.5187    1.1387
    1.1783    0.7549    1.5675
    0.9390    0.5816    1.2925
    1.0862    0.6101    1.6128
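In NumPy terms, pagemtimes(A,X) with channels as pages is an einsum (a sketch of the semantics, not MATLAB code):

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.random((4, 3))         # non-learnable matrix
X = rng.random((3, 3, 2))      # HxWxC input, channels as pages

# MATLAB pagemtimes(A, X): Y(:,:,k) = A * X(:,:,k) for every page k.
Y = np.einsum('ij,jwc->iwc', A, X)

for k in range(X.shape[2]):
    assert np.allclose(Y[:, :, k], A @ X[:, :, k])
print(Y.shape)  # (4, 3, 2)
```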

3 Comments

Except the matrix should be learnable :(. Basically, that's what Linear does.
The modification for the case where A is learnable is as below. I am using a pre-declared A here only so that I can demonstrate and test the response. In a real scenario, you wouldn't supply weights to the convolution2dLayer.
X=dlarray(rand(3,3,2),'SSC'); A=rand(4,3);
[h,w,c]=size(X);
L1=functionLayer( @(z) z(:,:) );
Lconv=convolution2dLayer([h,1],4,'Weights',permute(A,[2,3,4,1]));
L2=functionLayer(@(z)recoverShape(z,w,c) ,'Formattable',1);
net=dlnetwork([L1,Lconv,L2],X);
Yfinal=net.predict(X)
Yfinal =
  4(S) × 3(S) × 2(C) single dlarray

(:,:,1) =

    1.1104    0.9300    0.6060
    1.2600    0.9268    0.8284
    0.9742    0.8960    0.7068
    1.5047    1.0413    0.8057

(:,:,2) =

    0.4512    0.3565    0.2455
    0.6190    0.5052    0.5721
    0.2262    0.4928    0.3368
    0.8938    0.4187    0.5370
And as before, we can compare to the result of a plain-vanilla pagemtimes operation and see that it gives the same result:
Ycheck=pagemtimes(A, extractdata(X))
Ycheck =
Ycheck(:,:,1) =

    1.1104    0.9300    0.6060
    1.2600    0.9268    0.8284
    0.9742    0.8960    0.7068
    1.5047    1.0413    0.8057

Ycheck(:,:,2) =

    0.4512    0.3565    0.2455
    0.6190    0.5052    0.5721
    0.2262    0.4928    0.3368
    0.8938    0.4187    0.5370
function out=recoverShape(z,w,c)
z=permute( stripdims(z), [3,2,1]);
out=dlarray(reshape(z,[],w,c),'SSC');
end
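The algebra behind the flatten → multiply → recover pipeline can be sanity-checked in NumPy (a sketch of the math only, not of the dlnetwork API; order='F' mimics MATLAB's column-major reshapes):

```python
import numpy as np

rng = np.random.default_rng(4)
h, w, c = 3, 3, 2
A = rng.random((4, h))
X = rng.random((h, w, c))

# Step 1: fold the channel pages side by side into an h x (w*c) matrix
# (the role of the z(:,:) functionLayer).
Z = X.reshape((h, w * c), order='F')
# Step 2: one matrix multiply (the role of the convolution layer's weights).
Y = A @ Z
# Step 3: unfold back to 4 x w x c (the role of recoverShape).
Y = Y.reshape((4, w, c), order='F')

# Same result as applying A page-by-page, i.e. pagemtimes(A, X).
for k in range(c):
    assert np.allclose(Y[:, :, k], A @ X[:, :, k])
```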
Very nice! Just need to add the batch dimension.
I'd suggest putting this in a separate answer so that I can accept it.
PS Too bad it's not available in Matlab as a built-in.


Another approach is to write your own custom layer for channel-wise matrix multiplication. I have attached a possible version of this,
X=rand(3,3,2);
L=pagemtimesLayer(4); %Custom layer - premultiplies channels by 4-row learnable matrix A
L=initialize(L, X);
Ypred=L.predict(X)
Ypred =
  4(S) × 3(S) × 2(C) dlarray

(:,:,1) =

    0.6102    0.3216    0.8590
    0.8080    0.5732    1.3988
    0.2763    0.1120    0.2556
    0.5463    0.8053    1.2450

(:,:,2) =

    0.6860    0.9692    0.6784
    1.1580    1.5767    1.1105
    0.1999    0.2773    0.1199
    1.1686    1.3306    0.5205
Ycheck=pagemtimes(L.A,X) %Check agreement with a direct call to pagemtimes()
Ycheck =
Ycheck(:,:,1) =

    0.6102    0.3216    0.8590
    0.8080    0.5732    1.3988
    0.2763    0.1120    0.2556
    0.5463    0.8053    1.2450

Ycheck(:,:,2) =

    0.6860    0.9692    0.6784
    1.1580    1.5767    1.1105
    0.1999    0.2773    0.1199
    1.1686    1.3306    0.5205

8 Comments

Thx!
PS BTW, I think that PyTorch's Linear is much cleaner:
y = x*A' + b, where x is (*, n_in), * means any number of dimensions including none, y is (*, n_out) where all but the last dimension are the same shape as x, A is a learnable n_out x n_in matrix and b is a learnable vector of size n_out.
With this, a user can decide how to shape her input to match the task at hand.
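For reference, PyTorch's Linear contract, y = x*A' + b applied to the last dimension, is easy to mimic in NumPy (a sketch; n_in/n_out as in the discussion above):

```python
import numpy as np

rng = np.random.default_rng(5)
n_in, n_out = 3, 4
A = rng.random((n_out, n_in))   # learnable matrix
b = rng.random(n_out)           # learnable bias

# y = x A^T + b over the last dimension of x, whatever the leading shape is.
def linear(x):
    return x @ A.T + b

for shape in [(n_in,), (7, n_in), (2, 5, n_in)]:
    x = rng.random(shape)
    y = linear(x)
    assert y.shape == shape[:-1] + (n_out,)
```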
Most of that you can do by using tensorprod instead of pagemtimes. I leave it to you to tweak the design to your liking.
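tensorprod's NumPy analogue is tensordot; a sketch (hypothetical shapes) of a Linear-like contraction over a chosen dimension, similar to what tensorprod(A,X,2,1) would compute in MATLAB:

```python
import numpy as np

rng = np.random.default_rng(6)
n_in, n_out = 3, 4
A = rng.random((n_out, n_in))
X = rng.random((n_in, 5, 2))    # contract over the first dimension of X

# tensordot contracts A's second dimension with X's first:
# Z(i, w, c) = sum_j A(i, j) * X(j, w, c).
Z = np.tensordot(A, X, axes=([1], [0]))

assert Z.shape == (n_out, 5, 2)
assert np.allclose(Z[:, 0, 0], A @ X[:, 0, 0])
```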
Except it doesn't support dlarrays :(
That is unfortunate. However, adding support for dlarrays wouldn't be terribly onerous. Instead of relying on automatic differentiation, you would have to add a backward() method to the class,
function Z = predict(layer, X)
    if isa(X,'dlarray')
        X = extractdata(X);
    end
    Z = tensorprod(layer.A, X, layer.innerdim) + layer.b;
end
function [dLdX,dLdA,dLdb] = backward(layer,X,Z,dLdZ)
    dZdX = layer.A;
    dZdA = X; % X should really be transposed here in some tensorial sense,
              % or the tensorprod dimensions below should be adjusted.
    dLdX = tensorprod(dLdZ, dZdX, ___);
    dLdA = tensorprod(dLdZ, dZdA, ___);
    dLdb = dLdZ;
    ....
end
which should be easy for a simple linear operation. See also,
For the case where b is a scalar, shouldn't dLdb = sum(dLdZ,"all")? (That's actually a shortcut for dLdb = tensorprod(ones(size(Z)),dLdZ,"all");)
If your bias is a scalar, uniformly applied to all Z(i), then yes. But if you have a different bias for every Z(i) then, extrapolating from the example below, dZdb should be an identity operator, meaning that dLdb=dLdZ*dZdb=dLdZ. Granted, I haven't tested any of what I'm outlining.
syms x b [4,1]; syms A [4,4]
Z = A*x + b;
dZdb = jacobian(Z,b)
dZdb =

    [1, 0, 0, 0]
    [0, 1, 0, 0]
    [0, 0, 1, 0]
    [0, 0, 0, 1]
Presumably X (and Z) has a batch dimension, so dLdZ should at least be summed over that dimension. Summing over the other dimensions depends on the shape of the bias: if it's a scalar, all other dimensions should be summed over; if it doesn't depend on the channel dimension, the channel dimension should also be summed over; etc.
That sounds right.
Although, part of me questions whether it was the best design for TMW to make the user responsible for summing over the batched input in the backward() method, since that dimension should always be handled the same way.
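The bias-summing rules discussed above can be checked with a small finite-difference sketch in NumPy (hypothetical shapes):

```python
import numpy as np

rng = np.random.default_rng(7)
n_out, n_batch = 4, 6
dLdZ = rng.random((n_out, n_batch))

# A bias vector broadcast over the batch dimension: its gradient sums dLdZ over
# that dimension. A scalar bias broadcast over everything sums over all dimensions.
dLdb_vector = dLdZ.sum(axis=1)      # b of size n_out
dLdb_scalar = dLdZ.sum()            # scalar b

# Finite-difference check for the vector case with L = sum(Z), Z = A@X + b:
A = rng.random((n_out, 3)); X = rng.random((3, n_batch)); b = rng.random(n_out)
L = lambda b_: (A @ X + b_[:, None]).sum()
eps = 1e-6
g = np.array([(L(b + eps * np.eye(n_out)[i]) - L(b)) / eps for i in range(n_out)])
# With L = sum(Z), dLdZ is all ones, so each dLdb entry should equal n_batch.
assert np.allclose(g, n_batch * np.ones(n_out), atol=1e-3)
```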


Release: R2022b
Asked: 11 Feb 2023
Edited: 15 Feb 2023
