crepePreprocess

Preprocess audio for CREPE deep learning network

Syntax

frames = crepePreprocess(audioIn,fs)

frames = crepePreprocess(audioIn,fs,'OverlapPercentage',OP)

[frames,loc] = crepePreprocess(___)

Description

frames = crepePreprocess(audioIn,fs) generates frames from audioIn that can be fed to the CREPE pretrained deep learning network.

frames = crepePreprocess(audioIn,fs,'OverlapPercentage',OP) specifies the overlap percentage between consecutive audio frames.

For example, frames = crepePreprocess(audioIn,fs,'OverlapPercentage',75) applies a 75% overlap between consecutive frames used to generate the processed frames.
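
To make the effect of the overlap percentage concrete, the following sketch converts a few overlap values into the step between the starts of consecutive frames. It assumes the 1024-sample frame length documented for the frames output and assumes that frames are taken at a 16 kHz sample rate; the variable names are illustrative and are not part of the function's interface. How crepePreprocess rounds a fractional hop is not stated here, so the values are left unrounded.

```matlab
frameLength = 1024;                 % samples per frame (from the frames output size)
fsNet = 16e3;                       % ASSUMED sample rate at which frames are taken (Hz)
overlapPercentage = [0 75 85];      % example values; 85 is the documented default

% Hop (step between frame starts) shrinks as the overlap grows.
hopSamples = frameLength*(1 - overlapPercentage/100);   % about [1024 256 153.6]
hopSeconds = hopSamples/fsNet;                          % about [0.064 0.016 0.0096]
```

A larger overlap therefore yields more frames, and more pitch estimates per second, at the cost of more network evaluations.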

[frames,loc] = crepePreprocess(___) returns the time values, loc, associated with each frame.

Examples

Estimate Pitch Using CREPE Network

The CREPE network requires you to preprocess your audio signals to generate buffered, overlapped, and normalized audio frames that can be used as input to the network. This example walks through audio preprocessing using crepePreprocess and audio postprocessing with pitch estimation using crepePostprocess. The pitchnn function performs these steps for you.

Read in an audio signal for pitch estimation. Visualize and listen to the audio. There are nine vocal utterances in the audio clip.

[audioIn,fs] = audioread('SingingAMajor-16-mono-18secs.ogg');
soundsc(audioIn,fs)
T = 1/fs;
t = 0:T:(length(audioIn)*T) - T;
plot(t,audioIn);
grid on
axis tight
xlabel('Time (s)')
ylabel('Amplitude')
title('Singing in A Major')

Use crepePreprocess to partition the audio into frames of 1024 samples with the default 85% overlap between consecutive frames. The function places the frames along the fourth dimension.

[frames,loc] = crepePreprocess(audioIn,fs);
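
As a rough illustration of what "buffered, overlapped, and normalized" means, the sketch below builds a comparable array using only base MATLAB. It is not the implementation of crepePreprocess: the test tone, the rounding of the hop, the absence of padding, and the zero-mean, unit-standard-deviation normalization are all assumptions made for the example, and framesSketch is a hypothetical name.

```matlab
fsNet = 16e3;                                   % ASSUMED sample rate (Hz)
x = sin(2*pi*220*(0:fsNet-1).'/fsNet);          % 1 s test tone as a column vector
frameLength = 1024;
overlapPercentage = 85;
hop = round(frameLength*(1 - overlapPercentage/100));   % ASSUMED rounding of the hop

% Buffer the signal into overlapping columns (no padding in this sketch).
numFrames = floor((numel(x) - frameLength)/hop) + 1;
startIdx = (0:numFrames-1)*hop;                 % zero-based start of each frame
idx = (1:frameLength).' + startIdx;             % frameLength-by-numFrames index matrix
framesSketch = x(idx);

% Normalize each frame (column) to zero mean and unit standard deviation.
framesSketch = (framesSketch - mean(framesSketch,1))./std(framesSketch,0,1);

% Arrange as 1024-by-1-by-1-by-N, matching the documented output layout.
framesSketch = reshape(framesSketch,frameLength,1,1,numFrames);
```

Placing the frame index on the fourth dimension lets each frame be treated as one observation in a batch passed to the network.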

Create a CREPE network with ModelCapacity set to tiny.

netTiny = audioPretrainedNetwork("crepe",ModelCapacity="tiny");

Predict the network activations.

activationsTiny = predict(netTiny,frames);

Use crepePostprocess to produce the fundamental frequency pitch estimation in Hz. Disable confidence thresholding by setting ConfidenceThreshold to 0.

f0Tiny = crepePostprocess(activationsTiny,ConfidenceThreshold=0);

Visualize the pitch estimation over time.

plot(loc,f0Tiny)
grid on
axis tight
xlabel('Time (s)')
ylabel('Pitch Estimation (Hz)')
title('CREPE Network Frequency Estimate - Thresholding Disabled')

With confidence thresholding disabled, crepePostprocess provides a pitch estimate for every frame. Increase the ConfidenceThreshold to 0.8.

f0Tiny = crepePostprocess(activationsTiny,ConfidenceThreshold=0.8);
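
The idea behind confidence thresholding can be sketched with made-up numbers. The example below assumes that estimates whose confidence falls below the threshold are replaced by NaN, which is why they drop out of a plot; the f0 and confidence values are invented for illustration and do not come from the network.

```matlab
% Hypothetical per-frame pitch estimates (Hz) and confidences in [0,1].
f0 = [218 219 440 217 95];
confidence = [0.95 0.90 0.30 0.85 0.10];
threshold = 0.8;

% Mask low-confidence frames (ASSUMED behavior: below-threshold values become NaN).
f0Masked = f0;
f0Masked(confidence < threshold) = NaN;

% Fraction of frames that survive the threshold.
fractionKept = mean(~isnan(f0Masked));
```

Because plot skips NaN values, the surviving estimates appear as separate line segments rather than one continuous trace.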

Visualize the pitch estimation over time.

plot(loc,f0Tiny,LineWidth=3)
grid on
axis tight
xlabel('Time (s)')
ylabel('Pitch Estimation (Hz)')
title('CREPE Network Frequency Estimate - Thresholding Enabled')

Create a new CREPE network with ModelCapacity set to full.

netFull = audioPretrainedNetwork("crepe",ModelCapacity="full");

Predict the network activations and convert them to pitch estimates using a ConfidenceThreshold of 0.8.

activationsFull = predict(netFull,frames);
f0Full = crepePostprocess(activationsFull,ConfidenceThreshold=0.8);

Visualize the pitch estimation. There are nine primary pitch estimation groupings, each corresponding to one of the nine vocal utterances.

plot(loc,f0Full,LineWidth=3)
grid on
xlabel('Time (s)')
ylabel('Pitch Estimation (Hz)')
title('CREPE Network Frequency Estimate - Full')

Find the time elements corresponding to the last vocal utterance.

roundedLocVec = round(loc,2);
lastUtteranceBegin = find(roundedLocVec == 16);
lastUtteranceEnd = find(roundedLocVec == 18);

For simplicity, take the most frequently occurring pitch estimate within the utterance group as the fundamental frequency estimate for that timespan. Generate a pure tone with a frequency matching the pitch estimate for the last vocal utterance.

lastUtteranceEstimation = mode(f0Full(lastUtteranceBegin:lastUtteranceEnd))

The value of lastUtteranceEstimation, 217.3 Hz, corresponds to the note A3. Overlay the synthesized tone on the last vocal utterance to audibly compare the two.

lastVocalUtterance = audioIn(fs*16:fs*18);
newTime = 0:T:2;
compareTone = cos(2*pi*lastUtteranceEstimation*newTime).';

soundsc(lastVocalUtterance + compareTone,fs);
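
The mapping from 217.3 Hz to the note A3 mentioned above can be checked with the standard equal-temperament relation, taking A4 = 440 Hz as MIDI note 69. This is a small sketch; the variable names are illustrative.

```matlab
f = 217.3;                                   % pitch estimate from the example (Hz)

% Fractional MIDI note number relative to A4 = 440 Hz (MIDI 69).
midiNumber = 69 + 12*log2(f/440);
nearestNote = round(midiNumber);

% Convert the nearest MIDI number to a note name and octave.
names = ["C","C#","D","D#","E","F","F#","G","G#","A","A#","B"];
noteName = names(mod(nearestNote,12) + 1) + string(floor(nearestNote/12) - 1);

% Deviation of the estimate from the nearest note, in cents.
centsOff = 100*(midiNumber - nearestNote);
```

The estimate lies slightly below the nominal 220 Hz of A3, so centsOff is negative.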

Call spectrogram to more closely inspect the frequency content of the singing. Use a frame size of 250 samples and an overlap of 225 samples or 90%. Use 4096 DFT points for the transform. The spectrogram reveals that the vocal recording is actually a set of complex harmonic tones composed of multiple frequencies.

spectrogram(audioIn,250,225,4096,fs,'yaxis')

Input Arguments

`audioIn` — Input signal
column vector | matrix

Input signal, specified as a column vector or matrix. If you specify a matrix, crepePreprocess treats the columns of the matrix as individual audio channels.

Data Types: single | double

`fs` — Sample rate (Hz)
positive scalar

Sample rate of the input signal in Hz, specified as a positive scalar.

Data Types: single | double

`OP` — Overlap percentage between consecutive audio frames
`85` (default) | nonnegative scalar in the range [0,100)

Percentage overlap between consecutive audio frames, specified as the comma-separated pair consisting of 'OverlapPercentage' and a scalar in the range [0,100).

Data Types: single | double

Output Arguments

`frames` — Audio frames that can be fed to CREPE pretrained network
`1024`-by-`1`-by-`1`-by-N array

Processed audio frames, returned as a 1024-by-1-by-1-by-N array, where N is the number of generated frames.

Note

For multichannel inputs, generated frames are stacked along the 4th dimension according to channel. For example, if audioIn is a stereo signal, the number of generated frames for each channel is actually N/2. The first N/2 frames correspond to channel 1 and the subsequent N/2 frames correspond to channel 2.

Data Types: single | double
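
The channel stacking described in the note can be undone with a reshape. The sketch below uses a random placeholder array in place of real output; numChannels and framesPerChannel are illustrative names, and the 5 frames per channel is an arbitrary choice.

```matlab
numChannels = 2;            % stereo input in this illustration
framesPerChannel = 5;       % arbitrary; equals N/2 for a stereo signal

% Placeholder with the documented 1024-by-1-by-1-by-N layout.
frames = rand(1024,1,1,numChannels*framesPerChannel);

% Split the fourth dimension into (frame, channel). Because all frames of
% channel 1 come first, a column-major reshape separates the channels.
perChannel = reshape(frames,1024,framesPerChannel,numChannels);

channel1 = perChannel(:,:,1);   % 1024-by-5 frames from channel 1
channel2 = perChannel(:,:,2);   % 1024-by-5 frames from channel 2
```

The same indexing applies to anything computed per frame, such as network activations or pitch estimates.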

`loc` — Time values
`1`-by-N vector

Time values associated with each frame, returned as a 1-by-N vector, where N is the number of generated frames. The time values correspond to the most recent samples used to compute the frames.

Data Types: single | double
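
One way to read "the most recent samples" is that each time value marks the end of its frame. The sketch below computes such timestamps under several assumptions that this page does not confirm: a 16 kHz sample rate, a hop of 154 samples (the default 85% overlap, rounded), and no padding. The values returned by crepePreprocess may differ.

```matlab
fsNet = 16e3;               % ASSUMED sample rate (Hz)
frameLength = 1024;
hop = 154;                  % ASSUMED hop for the default 85% overlap
numFrames = 4;              % small number of frames for illustration

% Index of the last (most recent) sample in each frame.
lastSample = (0:numFrames-1)*hop + frameLength;

% Time values in seconds, one per frame, as a 1-by-N row vector.
locSketch = lastSample/fsNet;
```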

References

[1] Kim, Jong Wook, Justin Salamon, Peter Li, and Juan Pablo Bello. “Crepe: A Convolutional Representation for Pitch Estimation.” In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 161–65. Calgary, AB: IEEE, 2018. https://doi.org/10.1109/ICASSP.2018.8461329.

Extended Capabilities

C/C++ Code Generation
Generate C and C++ code using MATLAB® Coder™.

GPU Arrays
Accelerate code by running on a graphics processing unit (GPU) using Parallel Computing Toolbox™.

This function fully supports GPU arrays. For more information, see Run MATLAB Functions on a GPU (Parallel Computing Toolbox).

Version History

Introduced in R2021a
