vadnetPostprocess

Postprocess frame-based VAD probabilities

Since R2023a

    Description

    roi = vadnetPostprocess(audioIn,fs,netOut) postprocesses the speech probabilities output by a voice activity detection (VAD) network and returns indices corresponding to the beginning and end of speech within the audio signal.

    roi = vadnetPostprocess(___,Name=Value) specifies options using one or more name-value arguments. For example, vadnetPostprocess(audioIn,fs,netOut,MergeThreshold=0.5) merges speech regions that are separated by 0.5 seconds or less.

    [roi,probs] = vadnetPostprocess(___) also returns the probability of voice activity per sample in the input audio signal.

    vadnetPostprocess(___) with no output arguments plots the input signal and the detected speech regions.

    Examples

    Detect Speech Regions in Audio Signal

    Read in an audio signal containing speech and music and listen to the sound.

    [audioIn,fs] = audioread("MusicAndSpeech-16-mono-14secs.ogg");
    sound(audioIn,fs)

    Use vadnetPreprocess to preprocess the audio by computing a mel spectrogram.

    features = vadnetPreprocess(audioIn,fs);

    Call audioPretrainedNetwork to obtain a pretrained VAD neural network.

    net = audioPretrainedNetwork("vadnet");

    Pass the preprocessed audio through the network to obtain the probability of speech in each frame.

    probs = predict(net,features);

    Use vadnetPostprocess to postprocess the network output and determine the boundaries of the speech regions in the signal.

    roi = vadnetPostprocess(audioIn,fs,probs)
    roi = 2×2
    
               1       63120
           83600      150000
    
    

    Plot the audio with the detected speech regions.

    vadnetPostprocess(audioIn,fs,probs)

    Figure: Detected Speech. The plot shows the audio signal (Amplitude versus Time (s)) with the detected speech regions highlighted.

    Merge Detected Speech Regions

    Read in an audio signal containing speech and music and listen to the sound.

    [audioIn,fs] = audioread("MusicAndSpeech-16-mono-14secs.ogg");
    sound(audioIn,fs)

    Preprocess the audio and pass it through the pretrained VADNet model.

    features = vadnetPreprocess(audioIn,fs);
    net = audioPretrainedNetwork("vadnet");
    probs = predict(net,features);

    Call vadnetPostprocess with the merge threshold set to 1 to merge detected speech regions that are separated by 1 second or less.

    vadnetPostprocess(audioIn,fs,probs,MergeThreshold=1)

    Figure: Detected Speech. The plot shows the audio signal (Amplitude versus Time (s)) with the merged speech regions highlighted.

    Get Probability of Speech per Sample

    Read in an audio signal containing speech and music.

    [audioIn,fs] = audioread("MusicAndSpeech-16-mono-14secs.ogg");

    Preprocess the audio and pass it through the pretrained VADNet model.

    features = vadnetPreprocess(audioIn,fs);
    net = audioPretrainedNetwork("vadnet");
    out = predict(net,features);

    Call vadnetPostprocess with a second output argument to get the probability of speech for each sample of the signal.

    [roi,probs] = vadnetPostprocess(audioIn,fs,out);

    Plot the audio signal along with the voice activity probability.

    t = (0:length(audioIn)-1)/fs;
    plot(t,audioIn,t,probs,"r")
    legend("Audio signal","Probability of speech",Location="best")
    xlabel("Time (s)")
    title("Voice Activity Probability")

    Figure: Voice Activity Probability. The plot shows the audio signal and the per-sample probability of speech versus Time (s).

    Input Arguments

    Audio input signal, specified as a column vector (single channel).

    Data Types: single | double

    Sample rate in Hz, specified as a positive scalar.

    Data Types: single | double

    VAD network output, specified as a vector representing the probabilities of speech in each audio frame.

    Data Types: single | double
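
    A typical way to obtain netOut, following the workflow in the examples above, is to preprocess the audio with vadnetPreprocess, load the pretrained network with audioPretrainedNetwork, and call predict:

    features = vadnetPreprocess(audioIn,fs);
    net = audioPretrainedNetwork("vadnet");
    netOut = predict(net,features);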

    Name-Value Arguments

    Specify optional pairs of arguments as Name1=Value1,...,NameN=ValueN, where Name is the argument name and Value is the corresponding value. Name-value arguments must appear after other arguments, but the order of the pairs does not matter.

    Before R2021a, use commas to separate each name and value, and enclose Name in quotes.

    Example: vadnetPostprocess(audioIn,fs,netOut,ApplyEnergyVAD=true)

    Merge threshold in seconds, specified as a nonnegative scalar. The function merges speech regions that are separated by a duration less than or equal to the specified threshold. Set the threshold to Inf to avoid merging any detected regions.

    Data Types: single | double | int8 | int16 | int32 | int64 | uint8 | uint16 | uint32 | uint64
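
    For example, reusing audioIn, fs, and probs from the earlier examples, this sketch compares disabling merging with merging across short gaps (the 0.25-second value is an arbitrary illustration):

    % Keep every detected region separate by disabling merging.
    roiSeparate = vadnetPostprocess(audioIn,fs,probs,MergeThreshold=Inf);

    % Merge regions separated by 0.25 seconds of nonspeech or less.
    roiMerged = vadnetPostprocess(audioIn,fs,probs,MergeThreshold=0.25);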

    Length threshold in seconds, specified as a nonnegative scalar. The function does not return speech regions that have a duration less than or equal to the specified threshold.

    Data Types: single | double | int8 | int16 | int32 | int64 | uint8 | uint16 | uint32 | uint64
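
    As an illustration, assuming this threshold is exposed as a name-value argument called LengthThreshold (an assumed name, not confirmed by this excerpt), a call that discards regions lasting 0.1 seconds or less might look like this:

    % LengthThreshold is an assumed argument name; verify it before use.
    roi = vadnetPostprocess(audioIn,fs,probs,LengthThreshold=0.1);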

    Probability threshold to start a speech segment, specified as a scalar in the range [0, 1].

    Data Types: single | double

    Probability threshold to end a speech segment, specified as a scalar in the range [0, 1].

    Data Types: single | double
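
    As a sketch, assuming these thresholds are exposed as name-value arguments called ActivationThreshold and DeactivationThreshold (assumed names, not confirmed by this excerpt), you could require stronger evidence to start a segment than to sustain it:

    % ActivationThreshold and DeactivationThreshold are assumed argument names.
    roi = vadnetPostprocess(audioIn,fs,probs, ...
        ActivationThreshold=0.7,DeactivationThreshold=0.3);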

    Apply energy-based VAD to the speech regions detected by the neural network, specified as true or false.

    Data Types: logical
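
    For example, to refine the network-detected regions with the energy-based detector, reusing the variables from the earlier examples:

    roi = vadnetPostprocess(audioIn,fs,probs,ApplyEnergyVAD=true);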

    Output Arguments

    Speech regions, returned as an N-by-2 matrix of indices into the input signal, where N is the number of individual speech regions detected. The first column contains the index of the start of a speech region, and the second column contains the index of the end of a region.
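
    For example, reusing audioIn, fs, and roi from the first example, you can extract and play back the first detected speech region:

    % Index into the signal using the start and end samples of the first region.
    firstRegion = audioIn(roi(1,1):roi(1,2));
    sound(firstRegion,fs)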

    Probability of speech per sample of the input audio signal, returned as a column vector with the same size as the input signal.
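
    For example, you can turn the per-sample probabilities into a logical speech mask (the 0.5 threshold is an arbitrary illustration):

    % Mark samples with a speech probability of at least 0.5.
    speechMask = probs >= 0.5;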

    Algorithms

    The vadnetPostprocess function postprocesses the VAD network output using the following steps.

    1. Apply activation and deactivation thresholds to the frame posterior probabilities to determine candidate speech regions (see the sketch after this list).

    2. Optionally, apply energy-based VAD to refine the detected speech regions.

    3. Merge speech regions that are close to each other according to the merge threshold.

    4. Remove speech regions that are shorter than or equal to the length threshold.
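
    The following sketch illustrates the thresholding in step 1, reusing net and features from the earlier examples. The threshold values and logic are simplified for illustration and do not reproduce the exact implementation in vadnetPostprocess.

    frameProbs = predict(net,features);  % frame-based speech probabilities
    activation = 0.6;                    % probability required to start a segment
    deactivation = 0.4;                  % probability below which a segment ends

    isSpeech = false(size(frameProbs));
    inSegment = false;
    for ii = 1:numel(frameProbs)
        if ~inSegment && frameProbs(ii) >= activation
            inSegment = true;            % start a candidate speech region
        elseif inSegment && frameProbs(ii) < deactivation
            inSegment = false;           % end the candidate speech region
        end
        isSpeech(ii) = inSegment;
    end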

    Extended Capabilities

    C/C++ Code Generation
    Generate C and C++ code using MATLAB® Coder™.

    Version History

    Introduced in R2023a
