Pitch Tracking Using Multiple Pitch Estimations and HMM

Open Live Script

This example shows how to perform pitch tracking using multiple pitch estimations, octave and median smoothing, and a hidden Markov model (HMM).

Introduction

Pitch detection is a fundamental building block in speech processing, speech coding, and music information retrieval (MIR). In speech and speaker recognition, pitch is used as a feature in a machine learning system. For call centers, pitch is used to indicate the emotional state and gender of customers. In speech therapy, pitch is used to indicate and analyze pathologies and diagnose physical defects. In MIR, pitch is used to categorize music, for query-by-humming systems, and as a primary feature in song identification systems.

Pitch detection for clean speech is mostly considered a solved problem. Pitch detection with noise and multi-pitch tracking remain difficult problems. There are many algorithms that have been extensively reported on in the literature with known trade-offs between computational cost and robustness to different types of noise.

Usually, a pitch detection algorithm (PDA) estimates the pitch for a given time instant. The pitch estimate is then validated or corrected within a pitch tracking system. Pitch tracking systems enforce continuity of pitch estimates over time.

This example provides an example function, HelperPitchTracker, which implements a pitch tracking system. The example walks through the algorithm implemented by the HelperPitchTracker function.

Problem Summary

Load an audio file and corresponding reference pitch for the audio file. The reference pitch is reported every 10 ms and was determined as an average of several third-party algorithms on the clean speech file. Regions without voiced speech are represented as nan.

[x,fs] = audioread("Counting-16-44p1-mono-15secs.wav");
load TruePitch.mat truePitch

Use the pitch function to estimate the pitch of the audio over time.

[f0,locs] = pitch(x,fs);

Two metrics are commonly reported when defining pitch error: gross pitch error (GPE) and voicing decision error (VDE). Because the pitch algorithms in this example do not provide a voicing decision, only GPE is reported. In this example, GPE is calculated as the percent of pitch estimates outside $\pm 10$ of the reference pitch over the span of the voiced segments.

Calculate the GPE for regions of speech and plot the results. Listen to the clean audio signal.

isVoiced = ~isnan(truePitch);
f0(~isVoiced) = nan;

p = 0.1;
GPE = mean(abs(f0(isVoiced)-truePitch(isVoiced)) > truePitch(isVoiced).*p).*100;

t = (0:length(x)-1)/fs;
t0 = (locs-1)/fs;
sound(x,fs)

figure(1)
tiledlayout(2,1)
nexttile
plot(t,x)
ylabel("Amplitude")
title("Pitch Estimation of Clean Signal")

nexttile
plot(t0,[truePitch,f0])
legend("Reference","Estimate",Location="northwest")
ylabel("F0 (Hz)")
xlabel("Time (s)")
title("GPE = " + round(GPE,2) + " (%)")

Figure contains 2 axes objects. Axes object 1 with title Pitch Estimation of Clean Signal, ylabel Amplitude contains an object of type line. Axes object 2 with title GPE = 2.15 (%), xlabel Time (s), ylabel F0 (Hz) contains 2 objects of type line. These objects represent Reference, Estimate.

Mix the speech signal with noise at $- 5$ dB SNR.

Use the pitch function on the noisy audio to estimate the pitch over time. Calculate the GPE for regions of voiced speech and plot the results. Listen to the noisy audio signal.

desiredSNR = -5;
x = mixSNR(x,rand(size(x)),desiredSNR);

[f0,locs] = pitch(x,fs);
f0(~isVoiced) = nan;
GPE = mean(abs(f0(isVoiced) - truePitch(isVoiced)) > truePitch(isVoiced).*p).*100;

sound(x,fs)

figure(2)
tiledlayout(2,1)

nexttile
plot(t,x)
ylabel("Amplitude")
title("Pitch Estimation of Noisy Signal")

nexttile
plot(t0,[truePitch,f0])
legend("Reference","Estimate",Location="northwest")
ylabel("F0 (Hz)")
xlabel("Time (s)")
title("GPE = " + GPE + " (%)")

Figure contains 2 axes objects. Axes object 1 with title Pitch Estimation of Noisy Signal, ylabel Amplitude contains an object of type line. Axes object 2 with title GPE = 12.4464 (%), xlabel Time (s), ylabel F0 (Hz) contains 2 objects of type line. These objects represent Reference, Estimate.

This example shows how to improve the pitch estimation of noisy speech signals using multiple pitch candidate generation, octave-smoothing, median-smoothing, and an HMM.

The algorithm described in this example is implemented in the example function HelperPitchTracker. To learn about the HelperPitchTracker function, enter help HelperPitchTracker at the command line.

help HelperPitchTracker

 HelperPitchTracker Track the fundamental frequency of audio signal
    f0 = HelperPitchTracker(audioIn,fs) returns an estimate of the
    fundamental frequency contour for the audio input. Columns of the
    input are treated as individual channels. The HelperPitchTracker
    function uses multiple pitch detection algorithms to generate pitch
    candidates, and uses octave smoothing and a Hidden Markov Model to
    return an estimate of the fundamental frequency.
 
    f0 = HelperPitchTracker(...,'HopLength',HOPLENGTH) specifies the number
    of samples in each hop. The pitch estimate is updated every hop.
    Specify HOPLENGTH as a scalar integer. If unspecified, HOPLENGTH
    defaults to round(0.01*fs).
 
    f0 = HelperPitchTracker(...,'OctaveSmoothing',TF) specifies whether or
    not to apply octave smoothing. Specify as true or false. If
    unspecified, TF defaults to true.
 
    f0 = HelperPitchTracker(...,'EmissionMatrix',EMISSIONMATRIX) specifies
    the emission matrix used for the HMM during the forward pass. The
    default emission matrix was trained on the Pitch Tracking Database from
    Graz University of Technology. The database consists of 4720 speech
    segments with corresponding pitch trajectories derived from
    laryngograph signals. The emission matrix corresponds to the
    probability that a speaker leaves one pitch state to another, in the
    range [50, 400] Hz. Specify the emission matrix such that rows
    correspond to the current state, columns correspond to the possible
    future state, and the values of the matrix correspond to the
    probability of moving from the current state to the future state. If
    you specify your own emission matrix, specify its corresponding
    EMISSIONMATRIXRANGE. EMISSIONMATRIX must be a real N-by-N matrix of
    integers.
 
    f0 = HelperPitchTracker(...,'EmissionMatrixRange',EMISSIONMATRIXRANGE)
    specifies how the EMISSIONMATRIX corresponds to Hz. If unspecified,
    EMISSIONMATRIXRANGE defaults to 50:400.
 
    [f0,loc] = HelperPitchTracker(...) returns the locations associated
    with each pitch decision. The locations correspond to the ceiling of
    the center of the analysis frames.
 
    [f0,loc,hr] = HelperPitchTracker(...) returns the harmonic ratio
    associated with each pitch decision.
 
  See also pitch, voiceActivityDetector

Description of Pitch Tracking System

The graphic provides an overview of the pitch tracking system implemented in the example function. The following code walks through the internal workings of the HelperPitchTracker example function.

1. Generate Multiple Pitch Candidates

In the first stage of the pitch tracking system, you generate multiple pitch candidates using multiple pitch detection algorithms. The primary pitch candidates, which are generally more accurate, are generated using algorithms based on the Summation of Residual Harmonics (SRH) [2] algorithm and the Pitch Estimation Filter with Amplitude Compression (PEFAC) [3] algorithm.

Buffer the noisy input signal into overlapped frames, and then use audio.internal.pitch.SRH to generate 5 pitch candidates for each hop. Also return the relative confidence of each pitch candidate. Plot the results.

RANGE = [50,400];
HOPLENGTH = round(fs.*0.01);

% Buffer into required sizes
xBuff_SRH = buffer(x,round(0.025*fs),round(0.02*fs),"nodelay");

% Define pitch parameters
params_SRH = struct(Method="SRH", ...
    Range=RANGE, ...
    WindowLength=round(fs*0.06), ...
    OverlapLength=round(fs*0.06-HOPLENGTH), ...
    SampleRate=fs, ...
    NumChannels=size(x,2), ...
    SamplesPerChannel=size(x,1));
multiCandidate_params_SRH = struct(NumCandidates=5,MinPeakDistance=1);

% Get pitch estimate and confidence
[f0_SRH,conf_SRH] = audio.internal.pitch.SRH(xBuff_SRH,x,fs, ...
                                             params_SRH, ...
                                             multiCandidate_params_SRH);

figure(3)
tiledlayout(2,1)
nexttile
plot(t0,f0_SRH)
ylabel("F0 Candidates (Hz)")
title("Multiple Candidates from SRH Pitch Estimation")
nexttile
plot(t0,conf_SRH)
ylabel("Relative Confidence")
xlabel("Time (s)")

Figure contains 2 axes objects. Axes object 1 with title Multiple Candidates from SRH Pitch Estimation, ylabel F0 Candidates (Hz) contains 5 objects of type line. Axes object 2 with xlabel Time (s), ylabel Relative Confidence contains 5 objects of type line.

Generate an additional set of primary pitch candidates and associated confidence using the PEF algorithm. Generate backup candidates and associated confidences using the normalized correlation function (NCF) algorithm and cepstrum pitch determination (CEP) algorithm. Log only the most confident estimate from the backup candidates.

xBuff_PEF = buffer(x,round(0.06*fs),round(0.05*fs),"nodelay");
params_PEF = struct(Method="PEF", ...
    Range=RANGE, ...
    WindowLength=round(fs*0.06), ...
    OverlapLength=round(fs*0.06-HOPLENGTH), ...
    SampleRate=fs, ...
    NumChannels=size(x,2), ...
    SamplesPerChannel=size(x,1));
multiCandidate_params_PEF = struct(NumCandidates=5,MinPeakDistance=5);
[f0_PEF,conf_PEF] = audio.internal.pitch.PEF(xBuff_PEF,fs, ...
                                             params_PEF, ...
                                             multiCandidate_params_PEF);

xBuff_NCF = buffer(x,round(0.04*fs),round(0.03*fs),"nodelay");
xBuff_NCF = xBuff_NCF(:,2:end-1);
params_NCF = struct(Method="NCF", ...
    Range=RANGE, ...
    WindowLength=round(fs*0.04), ...
    OverlapLength=round(fs*0.04-HOPLENGTH), ...
    SampleRate=fs, ...
    NumChannels=size(x,2), ...
    SamplesPerChannel=size(x,1));
multiCandidate_params_NCF = struct(NumCandidates=5,MinPeakDistance=1);
f0_NCF = audio.internal.pitch.NCF(xBuff_NCF,fs, ...
                                  params_NCF, ...
                                  multiCandidate_params_NCF);

xBuff_CEP = buffer(x,round(0.04*fs),round(0.03*fs),"nodelay");
xBuff_CEP = xBuff_CEP(:,2:end-1);
params_CEP = struct(Method="CEP", ...
    Range=RANGE, ...
    WindowLength=round(fs*0.04), ...
    OverlapLength=round(fs*0.04-HOPLENGTH), ...
    SampleRate=fs, ...
    NumChannels=size(x,2), ...
    SamplesPerChannel=size(x,1));
multiCandidate_params_CEP = struct(NumCandidates=5,MinPeakDistance=1);
f0_CEP = audio.internal.pitch.CEP(xBuff_CEP,fs, ...
                                  params_CEP, ...
                                  multiCandidate_params_CEP);

BackupCandidates = [f0_NCF(:,1),f0_CEP(:,1)];

2. Determine Long-Term Median

The long-term median of the pitch candidates is used to reduce the number of pitch candidates. To calculate the long-term median pitch, first calculate the harmonic ratio. Pitch estimates are only valid in regions of voiced speech, where the harmonic ratio is high.

hr = harmonicRatio(xBuff_PEF,fs, ...
                  Window=hamming(size(xBuff_NCF,1),"periodic"), ...
                  OverlapLength=0);

figure(4)
tiledlayout(2,1)
nexttile
plot(t,x)
ylabel("Amplitude")
nexttile
plot(t0,hr)
ylabel("Harmonic Ratio")
xlabel("Time (s)")

Figure contains 2 axes objects. Axes object 1 with ylabel Amplitude contains an object of type line. Axes object 2 with xlabel Time (s), ylabel Harmonic Ratio contains an object of type line.

Use the harmonic ratio to threshold out regions that do not include voiced speech in the long-term median calculation. After determining the long-term median, calculate lower and upper bounds for pitch candidates. In this example, the lower and upper bounds were determined empirically as 2/3 and 4/3 the median pitch. Candidates outside of these bounds are penalized in the following stage.

idxToKeep = logical(movmedian(hr>((3/4)*max(hr)),3));
longTermMedian = median([f0_PEF(idxToKeep,1);f0_SRH(idxToKeep,1)]);
lower = max((2/3)*longTermMedian,RANGE(1));
upper = min((4/3)*longTermMedian,RANGE(2));

figure(5)
tiledlayout(1,1)
nexttile
plot(t0,[f0_PEF,f0_SRH])
hold on
plot(t0,longTermMedian.*ones(size(f0_PEF,1)),"r:",LineWidth=3)
plot(t0,upper.*ones(size(f0_PEF,1)),"r",LineWidth=2)
plot(t0,lower.*ones(size(f0_PEF,1)),"r",LineWidth=2)
hold off
xlabel("Time (s)")
ylabel("Frequency (Hz)")
title("Long Term Median")

Figure contains an axes object. The axes object with title Long Term Median, xlabel Time (s), ylabel Frequency (Hz) contains 4657 objects of type line.

3. Candidate Reduction

By default, candidates returned by the pitch detection algorithm are sorted in descending order of confidence. Decrease the confidence of any primary candidate outside the lower and upper bounds. Decrease the confidence by a factor of 10. Re-sort the candidates for both the PEF and SRH algorithms so they are in descending order of confidence. Concatenate the candidates, keeping only the two most confident candidates from each algorithm.

Plot the reduced candidates.

invalid = f0_PEF>lower | f0_PEF>upper;
conf_PEF(invalid) = conf_PEF(invalid)/10;
[conf_PEF,order] = sort(conf_PEF,2,"descend");
for i = 1:size(f0_PEF,1)
    f0_PEF(i,:) = f0_PEF(i,order(i,:));
end

invalid = f0_SRH>lower | f0_SRH>upper;
conf_SRH(invalid) = conf_SRH(invalid)/10;
[conf_SRH,order] = sort(conf_SRH,2,"descend");
for i = 1:size(f0_SRH,1)
    f0_SRH(i,:) = f0_SRH(i,order(i,:));
end

candidates = [f0_PEF(:,1:2),f0_SRH(:,1:2)];
confidence = [conf_PEF(:,1:2),conf_SRH(:,1:2)];

figure(6)
plot(t0,candidates)
xlabel("Time (s)")
ylabel("Frequency (Hz)")
title("Reduced Candidates")

Figure contains an axes object. The axes object with title Reduced Candidates, xlabel Time (s), ylabel Frequency (Hz) contains 4 objects of type line.

4. Make Distinctive

If two or more candidates are within a given 5 Hz span, set the candidates to their mean and sum their confidence.

span = 5;
confidenceFactor = 1;
for r = 1:size(candidates,1)
    for c = 1:size(candidates,2)
        tf = abs(candidates(r,c)-candidates(r,:)) < span;
        candidates(r,c) = mean(candidates(r,tf));
        confidence(r,c) = sum(confidence(r,tf))*confidenceFactor;
    end
end
candidates = max(min(candidates,400),50);

5. Forward Iteration of HMM with Octave Smoothing

Now that the candidates have been reduced, you can feed them into an HMM to enforce continuity constraints. Pitch contours are generally continuous for speech signals when analyzed in 10 ms hops. The probability of a pitch moving from one state to another across time is referred to as the emission probability. Emission probabilities can be encapsulated into a matrix which describes the probability of going from any pitch value in a set range to any other in a set range. The emission matrix used in this example was created using the Graz database. [1]

Load the emission matrix and associated range. Plot the probability density function (PDF) of a pitch in 150 Hz state.

load EmissionMatrix.mat emissionMatrix emissionMatrixRange

currentState = 150;

figure(7)
plot(emissionMatrixRange(1):emissionMatrixRange(2),emissionMatrix(currentState - emissionMatrixRange(1)-1,:))
title("Emission PDF for " + currentState + " Hz")
xlabel("Next Pitch Value (Hz)")
ylabel("Probability")

Figure contains an axes object. The axes object with title Emission PDF for 150 Hz, xlabel Next Pitch Value (Hz), ylabel Probability contains an object of type line.

The HMM used in this example combines the emission probabilities, which enforce continuity, and the relative confidence of the pitch. At each hop, the emission probabilities are combined with the relative confidence to create a confidence matrix. A best choice for each path is determined as the max of the confidence matrix. The HMM used in this example also assumes that only one path can be assigned to a given state (an assumption of the Viterbi algorithm).

In addition to the HMM, this example monitors for octave jumps relative to the short-term median of the pitch paths. If an octave jump is detected, then the backup pitch candidates are added as options for the HMM.

% Preallocation
numPaths = 4;
numHops = size(candidates,1);
logbook = zeros(numHops,3,numPaths);
suspectHops = zeros(numHops,1);

% Forward iteration with octave-smoothing
for hopNumber = 1:numHops
    nowCandidates = candidates(hopNumber,:);
    nowConfidence = confidence(hopNumber,:);
    
    % Remove octave jumps
    if hopNumber > 100
        numCandidates = numel(nowCandidates);
        
        % Weighted short term median
        medianWindowLength = 50;
        aTemp = logbook(max(hopNumber-min(hopNumber,medianWindowLength),1):hopNumber-1,1,:);
        shortTermMedian = median(aTemp(:));
        previousM = mean([longTermMedian,shortTermMedian]);
        lowerTight = max((4/3)*previousM,emissionMatrixRange(1));
        upperTight = min((2/3)*previousM,emissionMatrixRange(2));
        numCandidateOutside = sum([nowCandidates < lowerTight, nowCandidates > upperTight]);
        
        % If at least 1 candidate is outside the octave centered on the
        % short-term median, add the backup pitch candidates that were
        % generated by the normalized correlation function and cepstrum
        % pitch determination algorithms as potential candidates.
        if numCandidateOutside > 0
            % Apply the backup pitch estimators
            nowCandidates = [nowCandidates,BackupCandidates(hopNumber,:)];%#ok<AGROW>
            nowConfidence = [nowConfidence,repmat(mean(nowConfidence),1,2)];%#ok<AGROW>
            
            % Make distinctive
            span = 10;
            confidenceFactor = 1.2;
            for r = 1:size(nowCandidates,1)
                for c = 1:size(nowCandidates,2)
                    tf = abs(nowCandidates(r,c)-nowCandidates(r,:)) < span;
                    nowCandidates(r,c) = mean(nowCandidates(r,tf));
                    nowConfidence(r,c) = sum(nowConfidence(r,tf))*confidenceFactor;
                end
            end
        end
    end
    
    % Create confidence matrix
    confidenceMatrix = zeros(numel(nowCandidates),size(logbook,3));
    for pageIdx = 1:size(logbook,3)
        if hopNumber ~= 1
            pastPitch = floor(logbook(hopNumber-1,1,pageIdx)) - emissionMatrixRange(1) + 1;
        else
            pastPitch = nan;
        end
        for candidateNumber = 1:numel(nowCandidates)
            if hopNumber ~= 1
                % Get the current pitch and convert to an index in the range
                currentPitch = floor(nowCandidates(candidateNumber)) - emissionMatrixRange(1) + 1;
                confidenceMatrix(candidateNumber,pageIdx) = ...
                    emissionMatrix(currentPitch,pastPitch)*logbook(hopNumber-1,2,pageIdx)*nowConfidence(candidateNumber);
            else
                confidenceMatrix(candidateNumber,pageIdx) = 1;
            end
        end
    end
    
    % Assign an estimate for each path
    for pageIdx = 1:size(logbook,3)
        % Determine most confident transition from past to current pitch
        [~,idx] = max(confidenceMatrix(:));
        
        % Convert confidence matrix index to pitch and logbook page
        [chosenPitch,pastPitchIdx] = ind2sub([numel(nowCandidates),size(logbook,3)],idx);
        
        logbook(hopNumber,:,pageIdx) = ...
            [nowCandidates(chosenPitch), ...
            confidenceMatrix(chosenPitch,pastPitchIdx), ...
            pastPitchIdx];
        
        % Remove the chosen current pitch from the confidence matrix
        confidenceMatrix(chosenPitch,:) = NaN;
    end
    % Normalize confidence
    logbook(hopNumber,2,:) = logbook(hopNumber,2,:)/sum(logbook(hopNumber,2,:));
end

6. Traceback of HMM

Once the forward iteration of the HMM is complete, the final pitch contour is chosen as the most confident path. Walk backward through the log book to determine the pitch contour output by the HMM. Calculate the GPE and plot the new pitch contour and the known contour.

numHops = size(logbook,1);
keepLooking = true;
index = numHops + 1;

while keepLooking
    index = index - 1;
    if abs(max(logbook(index,2,:))-min(logbook(index,2,:)))~=0
        keepLooking = false;
    end
end

[~,bestPathIdx] = max(logbook(index,2,:));
bestIndices = zeros(numHops,1);
bestIndices(index) = bestPathIdx;

for ii = index:-1:2
    bestIndices(ii-1) = logbook(ii,3,bestIndices(ii));
end

bestIndices(bestIndices==0) = 1;
f0 = zeros(numHops,1);
for ii = (numHops):-1:2
    f0(ii) = logbook(ii,1,bestIndices(ii));
end

f0toPlot = f0;
f0toPlot(~isVoiced) = NaN;
GPE = mean( abs(f0toPlot(isVoiced) - truePitch(isVoiced)) > truePitch(isVoiced).*p).*100;
figure(8)
plot(t0,[truePitch,f0toPlot])
legend("Reference","Estimate")
ylabel("F0 (Hz)")
xlabel("Time (s)")
title("GPE = " + round(GPE,2) + " (%)")

Figure contains an axes object. The axes object with title GPE = 1.72 (%), xlabel Time (s), ylabel F0 (Hz) contains 2 objects of type line. These objects represent Reference, Estimate.

7. Moving Median Filter

As a final post-processing step, apply a moving median filter with a window length of three hops. Calculate the final GPE and plot the final pitch contour and the known contour.

f0 = movmedian(f0,3);
f0(~isVoiced) = NaN;

GPE = mean(abs(f0(isVoiced) - truePitch(isVoiced)) > truePitch(isVoiced).*p).*100;
figure(9)
plot(t0,[truePitch,f0])
legend("Reference","Estimate")
ylabel("F0 (Hz)")
xlabel("Time (s)")
title("GPE = " + round(GPE,2) + " (%)")

Figure contains an axes object. The axes object with title GPE = 0.86 (%), xlabel Time (s), ylabel F0 (Hz) contains 2 objects of type line. These objects represent Reference, Estimate.

Performance Evaluation

The HelperPitchTracker function uses an HMM to apply continuity constraints to pitch contours. The emission matrix of the HMM can be set directly. It is best to train the emission matrix on sound sources similar to the ones you want to track.

This example uses the Pitch Tracking Database from Graz University of Technology (PTDB-TUG) [4]. The data set consists of 20 English native speakers reading 2342 phonetically rich sentences from the TIMIT corpus. Download and extract the data set.

downloadFolder = matlab.internal.examples.downloadSupportFile("audio","ptdb-tug.zip");
dataFolder = tempdir;
unzip(downloadFolder,dataFolder)
dataset = fullfile(dataFolder,"ptdb-tug");

Create an audio datastore that points to the microphone recordings in the database. Set the label associated with each file to the location of the associated known pitch file. The dataset contains recordings of 10 female and 10 male speakers. Use subset to isolate the 10th female and male speakers. Train an emission matrix based on the reference pitch contours for both male and female speakers 1 through 9.

ads = audioDatastore([fullfile(dataset,"SPEECH DATA","FEMALE","MIC"),fullfile(dataset,"SPEECH DATA","MALE","MIC")], ...
                     IncludeSubfolders=true, ...
                     FileExtensions=".wav");
wavFileNames = ads.Files;
ads.Labels = replace(wavFileNames,["MIC","mic","wav"],["REF","ref","f0"]);

idxToRemove = contains(ads.Files,["F10","M10"]);
ads1 = subset(ads,idxToRemove);
ads9 = subset(ads,~idxToRemove);

Shuffle the audio datastores.

ads1 = shuffle(ads1);
ads9 = shuffle(ads9);

The emission matrix describes the probability of going from one pitch state to another. In the following step, you create an emission matrix based on speakers 1 through 9 for both male and female. The database stores reference pitch values, short-term energy, and additional information in the text files with files extension f0. The getReferencePitch function reads in the pitch values if the short-term energy is above a threshold. The threshold was determined empirically in listening tests. The HelperUpdateEmissionMatrix creates a 2-dimensional histogram based on the current pitch state and the next pitch state. After the histogram is created, it is normalized to create an emission matrix.

emissionMatrixRange = [50,400];
emissionMatrix = [];

for i = 1:numel(ads9.Files)
    x = getReferencePitch(ads9.Labels{i});
    emissionMatrix = HelperUpdateEmissionMatrix(x,emissionMatrixRange,emissionMatrix);
end
emissionMatrix = emissionMatrix + sqrt(eps);
emissionMatrix = emissionMatrix./norm(emissionMatrix);

Define different types of background noise: white, ambiance, engine, jet, and street. Resample them to 16 kHz to help speed up testing the database.

Define the SNR to test, in dB, as 10, 5, 0, -5, and -10.

noiseType = ["white","ambiance","engine","jet","street"];
numNoiseToTest = numel(noiseType);

desiredFs = 16e3;

whiteNoiseMaker = dsp.ColoredNoise(Color="white",SamplesPerFrame=40000,RandomStream="mt19937ar with seed",BoundedOutput=true);
noise{1} = whiteNoiseMaker();
[ambiance,ambianceFs] = audioread("Ambiance-16-44p1-mono-12secs.wav");
noise{2} = resample(ambiance,desiredFs,ambianceFs);
[engine,engineFs] = audioread("Engine-16-44p1-stereo-20sec.wav");
noise{3} = resample(engine,desiredFs,engineFs);
[jet,jetFs] = audioread("JetAirplane-16-11p025-mono-16secs.wav");
noise{4} = resample(jet,desiredFs,jetFs);
[street,streetFs] = audioread("MainStreetOne-16-16-mono-12secs.wav");
noise{5} = resample(street,desiredFs,streetFs);

snrToTest = [10,5,0,-5,-10];
numSNRtoTest = numel(snrToTest);

Run the pitch detection algorithm for each SNR and noise type for each file. Calculate the average GPE across speech files. This example compares performance with the popular pitch tracking algorithm: Sawtooth Waveform Inspired Pitch Estimator (SWIPE). A MATLAB® implementation of the algorithm can be found at [5]. To run this example without comparing to other algorithms, set compare to false. The following comparison takes around 15 minutes.

compare = true;
numFilesToTest = 20;
p = 0.1;
GPE_pitchTracker = zeros(numSNRtoTest,numNoiseToTest,numFilesToTest);
if compare
    GPE_swipe = GPE_pitchTracker;
end
for i = 1:numFilesToTest
    [cleanSpeech,info] = read(ads1);
    cleanSpeech = resample(cleanSpeech,desiredFs,info.SampleRate);
    
    truePitch = getReferencePitch(info.Label{:});
    isVoiced = truePitch~=0;
    truePitchInVoicedRegions = truePitch(isVoiced);
    
    for j = 1:numSNRtoTest
        for k = 1:numNoiseToTest
            noisySpeech = mixSNR(cleanSpeech,noise{k},snrToTest(j));
            f0 = HelperPitchTracker(noisySpeech,desiredFs,EmissionMatrix=emissionMatrix,EmissionMatrixRange=emissionMatrixRange);
            f0 = [0;f0]; % manual alignment with database.
            GPE_pitchTracker(j,k,i) = mean(abs(f0(isVoiced) - truePitchInVoicedRegions) > truePitchInVoicedRegions.*p).*100;
            
            if compare
                f0 = swipep(noisySpeech,desiredFs,[50,400],0.01);
                f0 = f0(3:end); % manual alignment with database.
                GPE_swipe(j,k,i) = mean(abs(f0(isVoiced) - truePitchInVoicedRegions) > truePitchInVoicedRegions.*p).*100;
            end
        end
    end
end
GPE_pitchTracker = mean(GPE_pitchTracker,3);

if compare
    GPE_swipe = mean(GPE_swipe,3);
end

Plot the gross pitch error for each noise type.

for ii = 1:numel(noise)
    figure
    plot(snrToTest,GPE_pitchTracker(:,ii),"b")
    hold on
    if compare
        plot(snrToTest,GPE_swipe(:,ii),"g")
    end
    plot(snrToTest,GPE_pitchTracker(:,ii),"bo")
    if compare
        plot(snrToTest,GPE_swipe(:,ii),"gv")
    end
    title(noiseType(ii))
    xlabel("SNR (dB)")
    ylabel("Gross Pitch Error (p = " + round(p,2) + " )")
    if compare
        legend("HelperPitchTracker","SWIPE")
    else
        legend("HelperPitchTracker")
    end
    grid on
    hold off
end

Figure contains an axes object. The axes object with title white, xlabel SNR (dB), ylabel Gross Pitch Error (p = 0.1 ) contains 4 objects of type line. One or more of the lines displays its values using only markers These objects represent HelperPitchTracker, SWIPE.

Figure contains an axes object. The axes object with title ambiance, xlabel SNR (dB), ylabel Gross Pitch Error (p = 0.1 ) contains 4 objects of type line. One or more of the lines displays its values using only markers These objects represent HelperPitchTracker, SWIPE.

Figure contains an axes object. The axes object with title engine, xlabel SNR (dB), ylabel Gross Pitch Error (p = 0.1 ) contains 4 objects of type line. One or more of the lines displays its values using only markers These objects represent HelperPitchTracker, SWIPE.

Figure contains an axes object. The axes object with title jet, xlabel SNR (dB), ylabel Gross Pitch Error (p = 0.1 ) contains 4 objects of type line. One or more of the lines displays its values using only markers These objects represent HelperPitchTracker, SWIPE.

Figure contains an axes object. The axes object with title street, xlabel SNR (dB), ylabel Gross Pitch Error (p = 0.1 ) contains 4 objects of type line. One or more of the lines displays its values using only markers These objects represent HelperPitchTracker, SWIPE.

Conclusion

You can use HelperPitchTracker as a baseline for evaluating GPE performance of your pitch tracking system, or adapt this example to your application.

References

[1] G. Pirker, M. Wohlmayr, S. Petrik, and F. Pernkopf, "A Pitch Tracking Corpus with Evaluation on Multipitch Tracking Scenario", Interspeech, pp. 1509-1512, 2011.

[2] Drugman, Thomas, and Abeer Alwan. "Joint Robust Voicing Detection and Pitch Estimation Based on Residual Harmonics." Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH. 2011, pp. 1973-1976.

[3] Gonzalez, Sira, and Mike Brookes. "A Pitch Estimation Filter robust to high levels of noise (PEFAC)." 19th European Signal Processing Conference. Barcelona, 2011, pp. 451-455.

[4] Signal Processing and Speech Communication Laboratory. Accessed September 26, 2018. https://www.spsc.tugraz.at/databases-and-tools/ptdb-tug-pitch-tracking-database-from-graz-university-of-technology.html.

[5] "Arturo Camacho." Accessed September 26, 2018. https://www.cise.ufl.edu/~acamacho/english/.

[6] "Fxpefac." Description of Fxpefac. Accessed September 26, 2018. http://www.ee.ic.ac.uk/hp/staff/dmb/voicebox/voicebox.html.