how to mask specific bits in a signed fixed point number?

Question

Priscilla Allwin on 11 Apr 2022

https://ch.mathworks.com/matlabcentral/answers/1693865-how-to-mask-specific-bits-in-a-signed-fixed-point-number

Commented: Priscilla Allwin on 14 Apr 2022

I have been trying to emulate a simple multiplier with fixed-point inputs and output. I would like to mask the last 4 bits of the inputs and test the output. I tried using the bitand() function, but it only accepts integer values. What can I do in the case of fixed-point decimal values?

For example:

a = -2.345 and b = 0.2755 (with 16-bit fixed-point quantization)

c = a * b; (output also quantized to 16 bits)

I want to mask the last 4 bits of a and b and observe the output c. What function should I use?

Thanks!

Answer 1

Andy Bartlett on 11 Apr 2022

https://ch.mathworks.com/matlabcentral/answers/1693865-how-to-mask-specific-bits-in-a-signed-fixed-point-number#answer_940505

Edited: Andy Bartlett on 11 Apr 2022

Full-precision multiply

A full-precision multiplication of the 16 bit inputs can be done like so

format compact
format long
fp = fipref;
fp.NumericTypeDisplay = 'short';
% Set a and b using signed 16 bits 
%    assuming a and b are constants
%       set best-precision scaling based on the values
%
isSigned = 1;
nBits = 16;
a = fi(-2.3451220703125,isSigned,nBits)
disp(a.bin)
b = fi( 0.2763544921875,isSigned,nBits)
disp(b.bin)
% Full precision multiply
yFullPrecProduct = a .* b
disp(yFullPrecProduct.bin)

which outputs

a = 
  -2.345092773437500
      numerictype(1,16,13)
1011010011110101
b = 
   0.276351928710938
      numerictype(1,16,16)
0100011010111111
yFullPrecProduct = 
  -0.648070910945535
      numerictype(1,32,29)
11101011010000110000000011001011 

Notice that the output of 16 bits times 16 bits is 32 bits.
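
For readers who think in terms of stored integers, the same full-precision product can be sketched in plain C. This is only an illustration, not what MATLAB does internally. The helper names mul_full_precision and stored_to_real are hypothetical; the fraction lengths 13 and 16 are taken from the numerictype displays above.

```c
#include <stdint.h>

/* Hypothetical helper: full-precision product of two signed 16-bit
 * stored integers. The 32-bit result cannot overflow, and its fraction
 * length is the sum of the input fraction lengths (13 + 16 = 29 here). */
static int32_t mul_full_precision(int16_t a, int16_t b)
{
    return (int32_t)a * (int32_t)b;
}

/* Hypothetical helper: convert a stored integer to its real-world value,
 * value = stored * 2^-fraction_length. */
static double stored_to_real(int32_t stored, int fraction_length)
{
    return (double)stored / (double)(1LL << fraction_length);
}
```

Reading the bit patterns above as two's complement, a has stored integer -19211 and b has stored integer 18111. mul_full_precision(-19211, 18111) returns -347930421, which is the 32-bit pattern 0xEB4300CB shown above, and stored_to_real(-347930421, 29) is roughly -0.648071.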

Reduced precision output from multiply

Reducing the size of a fixed-point multiplication's output raises a critical question: which bits to keep in the reduced-precision output and which bits to discard.

Depending on the answer, there will also be questions about how to handle overflow, rounding, or both. If the full-precision output is signed, you may also need to decide whether the reduced-precision output should remain signed or change to unsigned.

Here is an example of keeping the most significant 16 bits

ntc = numerictype(yFullPrecProduct);
nBitsY1 = 16;
nPrecisionBitsToDrop = 16;
fmSatFloor = fimath('RoundingMethod', 'Floor', ...
    'OverflowAction', 'Saturate');
nty1 = numerictype( ...
    ntc.SignednessBool,...
    nBitsY1, ...
    ntc.FractionLength - nPrecisionBitsToDrop);
y1 = fi(yFullPrecProduct,nty1,fmSatFloor);
y1 = removefimath(y1)
disp(y1.bin)

which outputs

y1 = 
  -0.648071289062500
      numerictype(1,16,13)
1110101101000011

Here is an example keeping the least significant 16 bits

ntc = numerictype(yFullPrecProduct);
nBitsY2 = 16;
nPrecisionBitsToDrop2 = 0;
fmSatFloor = fimath('RoundingMethod', 'Floor', ...
    'OverflowAction', 'Saturate');
fmWrapFloor = fimath('RoundingMethod', 'Floor', ...
    'OverflowAction', 'Wrap');
nty2 = numerictype( ...
    ntc.SignednessBool,...
    nBitsY2, ...
    ntc.FractionLength - nPrecisionBitsToDrop2);
y2sat = fi(yFullPrecProduct,nty2,fmSatFloor);
y2sat = removefimath(y2sat)
disp(y2sat.bin)
y2wrap = fi(yFullPrecProduct,nty2,fmWrapFloor);
y2wrap = removefimath(y2wrap)
disp(y2wrap.bin)

which outputs

y2sat = 
    -6.103515625000000e-05
      numerictype(1,16,29)
1000000000000000
y2wrap = 
     3.781169652938843e-07
      numerictype(1,16,29)
0000000011001011

Notice that two different outputs were computed.

One handles overflow by saturating. In this case, it saturated to the most negative representable value of the final output type.

The other handles overflow by wrapping, which means discarding the dropped most significant bits and always keeping the less significant bits verbatim.
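
The difference between the two overflow actions can be sketched in C on the 32-bit stored integer. Both helper names below are hypothetical, and the code is only a rough model of the Wrap and Saturate behavior described above for the case where no precision bits are dropped.

```c
#include <stdint.h>

/* Hypothetical helper: keep the low 16 bits verbatim (wrap on overflow). */
static int16_t low16_wrap(int32_t p)
{
    uint32_t low = (uint32_t)p & 0xFFFFu;   /* low 16 bits as unsigned */
    /* Reinterpret as two's complement without relying on
     * implementation-defined narrowing conversions. */
    return (low >= 0x8000u) ? (int16_t)((int32_t)low - 65536)
                            : (int16_t)low;
}

/* Hypothetical helper: clamp to the int16 range (saturate on overflow). */
static int16_t low16_saturate(int32_t p)
{
    if (p > INT16_MAX) return INT16_MAX;
    if (p < INT16_MIN) return INT16_MIN;
    return (int16_t)p;
}
```

For the product's stored integer -347930421, low16_wrap returns 203 (pattern 0x00CB) and low16_saturate returns -32768 (pattern 0x8000). Scaled by 2^-29, those are about 3.78e-07 and -6.10e-05, in line with y2wrap and y2sat.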

Masking bits

Masking bits to force certain bits to be zero and/or certain bits to be ones can be done in C, MATLAB, and Simulink using bit-wise AND and bit-wise OR. Functions or Simulink blocks for bit set and bit clear can also be used.

In MATLAB, the functions bitand and bitor are available. When using these with fixed-point fi objects, both arguments must have identical types, so that requires a little bit of care.

This function provides an example of using bitand to force the n least significant bits of the input to be zero.

function y = bitClearLSB(u,nBits)
    %bitClearLSB clear the n least significant bits of input
    %
    % Usage:
    %   y = bitClearLSB(u,nBits)
    % Inputs
    %   u      is any fixed-point or integer variable
    %   nBits  a non-negative integer value (defaults to 1)    
    %
    % Copyright 2022 The MathWorks, Inc.
    %#codegen
    if nargin < 2
        nBits = 1;
    end
    assert(...
        numel(nBits)==1 || isequal(size(nBits),size(u)),...
        'nBits must be scalar or same size as u.')
    assert(...
        all((nBits >= 0) & (nBits == floor(nBits)) & isfinite(nBits)),...
        'nBits must be a non-negative integer value.')
    % Built-in integers will be handled using equivalent fi object
    %
    u1 = castIntToFi(u);
    assert(isfi(u1) && isfixed(u1), 'u must be integer or fixed-point.')
    ntu1 = numerictype(u1);
    % Create raw bit mask with all ones in bit positions to keep as is
    % and all zeros in bit positions to clear
    % Example for word length of 4 bits
    %    nBits   rawBitMask
    %      0     1111
    %      1     1110
    %      2     1100
    %      3     1000
    %      4     0000
    %
    wl = ntu1.WordLength;
    
    ntRawBits = numerictype(0,wl,0);
    rawBitMask = repmat( upperbound(ntRawBits), size(nBits) );
    rawBitMask(:) = bitsll(rawBitMask,nBits);
    % bitand for fi requires both types to be identical
    %   including fimath properties
    % so reinterpret bitMask
    %   then set fimath
    %
    bitMask = reinterpretcast(rawBitMask,ntu1);
    bitMask = setfimath(bitMask,fimath(u1));
    
    y1 = bitand(u1,bitMask);
    % if built-in integer cast back to that type
    %
    y = cast(y1,'like',u);
end

Here is an example of applying that to a variable.

format compact
format long
fp = fipref;
fp.NumericTypeDisplay = 'short';
% Set a and b using signed 16 bits 
%    assuming a and b are constants
%       set best-precision scaling based on the values
%
isSigned = 1;
nBits = 16;
b = fi( 0.2763544921875,isSigned,nBits)
disp(b.bin)
% Clear 4 LSBs of b
%
nBitsClear = uint8(4);
b1 = bitClearLSB(b,nBitsClear)
disp(b1.bin)

which outputs

b = 
   0.276351928710938
      numerictype(1,16,16)
0100011010111111
b1 = 
   0.276123046875000
      numerictype(1,16,16)
        RoundingMethod: Nearest
        OverflowAction: Saturate
           ProductMode: FullPrecision
               SumMode: FullPrecision
0100011010110000

The generated C code for the bit clearing operation will be simple like the following

void myFunc(int16_T a, unsigned char nBitsClear, int16_T *y1)
{
  int16_T tmp_bit_mask;
  /* All ones shifted left leaves zeros in the nBitsClear lowest bits */
  tmp_bit_mask = 65535 << nBitsClear;
  *y1 = a & tmp_bit_mask;
}

Hopefully, this example gives you enough of an idea to craft whatever bit masking operation you are seeking.

Combining that with the multiplication examples above should allow you to figure out a solution to your overall problem.
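
As one possible way to put the pieces together, here is a minimal C sketch that clears the low bits of both stored integers and then forms the full-precision product. The names clear_lsbs16 and masked_product are hypothetical, and the sketch assumes the inputs are the 16-bit stored integers of a and b.

```c
#include <stdint.h>

/* Hypothetical helper: clear the n least significant bits (0 <= n <= 15)
 * of a signed 16-bit stored integer. int16_t is two's complement by
 * definition; the mask arithmetic also assumes a two's complement int. */
static int16_t clear_lsbs16(int16_t x, unsigned n)
{
    int mask = ~((1 << n) - 1);   /* e.g. n = 4 gives ...11110000 */
    return (int16_t)(x & mask);   /* result stays within the int16 range */
}

/* Hypothetical helper: mask both inputs, then form the full-precision
 * 32-bit product (fraction lengths add, as in the unmasked case). */
static int32_t masked_product(int16_t a, int16_t b, unsigned n)
{
    return (int32_t)clear_lsbs16(a, n) * (int32_t)clear_lsbs16(b, n);
}
```

With the stored integers of the example values, clear_lsbs16(-19211, 4) is -19216 and clear_lsbs16(18111, 4) is 18096, so masked_product(-19211, 18111, 4) is -347732736. Scaled by 2^-29 that is about -0.6477, compared with about -0.6481 for the unmasked product.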

Consider casting

Since your high level goal involved multiplication, bit masking might not be the simplest way to achieve your goal. If your goal is to get rid of a certain number of most significant bits or least significant bits, you might want to consider using casting.

Consider the example given above of keeping the most significant 16 bits of a variable (that happened to be a multiplication product). That dropped the least significant 16 bits of the input. Mathematically, that is equivalent to keeping the output at 32 bits but using masking such that the least significant 16 bits are all zeros.
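
That equivalence can be checked with a small C sketch on stored integers. The helper names downcast_top16 and clear_low16 are hypothetical; the downcast uses floor rounding, as in the earlier example.

```c
#include <stdint.h>

/* Hypothetical helper: floor-rounded downcast that keeps the top 16 bits. */
static int16_t downcast_top16(int32_t p)
{
    int32_t q = p / 65536;          /* truncates toward zero */
    if (p % 65536 < 0) q -= 1;      /* adjust to floor for negatives */
    return (int16_t)q;
}

/* Hypothetical helper: keep all 32 bits but force the low 16 to zero. */
static int32_t clear_low16(int32_t p)
{
    uint32_t masked = (uint32_t)p & 0xFFFF0000u;
    /* Convert back to signed without implementation-defined narrowing. */
    return (masked >= 0x80000000u)
        ? (int32_t)(masked - 0x80000000u) + INT32_MIN
        : (int32_t)masked;
}
```

For the product's stored integer -347930421, downcast_top16 gives -5309 and clear_low16 gives -347930624, which is exactly -5309 * 65536. The two results represent the same real-world value; only the word length and fraction length differ.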

Downcasting to 16 bits can be easier to think about and model than doing the bit masking. A big benefit is that subsequent operations can be more efficient. For example, bit masking then doing a 32 bit by 32 bit multiplication producing a 64 bit ideal product is less efficient than downcasting to 16 bits, then doing a 16 bit by 16 bit multiplication that produces a 32 bit ideal product.

Andy Bartlett on 12 Apr 2022

Here is why downcasting to smaller types before the multiplication is better than multiplying in a bigger type where the least significant bits have been set to zero.

Microcontroller

Suppose you are targeting an ARM microcontroller for deployment of your embedded design.

A multiply instruction with a 32 bit output takes 1 or 2 clock cycles.

In contrast, a multiply instruction with a 64 bit output takes 3 to 7 clock cycles.

So the 64 bit multiply is at best 50% slower. Depending on the specific instruction needed it could be 100% to 600% slower.

ARM reference

FPGA

Alternatively, suppose you are targeting an FPGA with a DSP48E math slice.

The DSP48E can do a full precision multiplication up to 25 bits by 18 bits.

So a 16 bit by 16 bit multiplication can fit in just one DSP48E slice and get the full speed advantages of that hardened and optimized circuit.

In contrast, a 32 bit by 32 bit multiplication would not fit in one DSP48E. More FPGA resources would be needed to partition and coordinate the math. The pieces would need to be coordinated across multiple clock cycles, making the multiplication slower too.

Look at page 64 of this AMD Xilinx reference.

ASIC

For an ASIC, the number of transistors needed for an n-bit by n-bit multiply is on the order of n^2. So the 32 bit by 32 bit multiply would require about 4 times as many transistors as the 16 bit by 16 bit multiply.

These digital circuits are really analog underneath. The clock cycle needs to be slow enough for these analog circuits to stabilize to their "digital" high or low voltages before starting the next calculation. The more complicated the combinational circuit, the longer the settling time. Think of an addition with carry having to propagate bit-level carries across a 32 bit output vs a 64 bit output. So a slower clock rate is one way to handle the bigger multiply. An alternative is to break the calculation into pieces and pipeline them. The smaller pieces will have a faster settling time, so the clock can be faster. But the latency will be longer because the full multiply operation must wait multiple clock cycles for the calculation to be fully done.

Priscilla Allwin on 14 Apr 2022

Got it! Thank you.
