Coding a Phase Vocoder for Tempo Changing and Pitch Shifting in Audio Applications

One of my mathematics lecturers once invited us to solve a problem that has fascinated me ever since. If you create a digital sound clip by recording your own voice, how can you transform it to mimic the celebrated Norwegian weather forecaster Vidar Theisen?


For those who have never heard of dear Mr. Theisen, or who were not even born before he passed away, you can watch one of his daily NRK weather forecasts below.

It is pretty obvious that Mr. Theisen had a very distinct way of talking: a slow tempo accompanied by a monotone pitch. Nobody, at least at that time, would be in doubt about who you were imitating if you said something slow and flat-voiced about the weather at Spitsbergen.

The lecturer also happened to be my adviser, and his name is Prof. Hans Munthe-Kaas. Hans is a very good man, and he has opened many doors for students who have knocked on his. He is on my list of personal relationships I could not have been without.

To get the wheels rolling in his students' heads, Mr. Munthe-Kaas outlined several incorrect strategies you could try to begin with: to slow things down by a factor of two, try repeating the digital samples one by one so that the net duration is doubled. Or, even better, try interpolating between them?

It turns out that both strategies actually work in terms of doubling the duration, but they have the inevitable side effect of also shifting down the pitch of the audio signal. It is analogous to slowing down the rotational speed of a record player: it obviously decreases the tempo of the sound, but the tone will also be darker. If you happen to own a record player, you are probably nodding in agreement at this point. In other words, we approach Mr. Theisen in terms of tempo, but not in terms of tone. For pure sinusoidal waves, what we want is to increase the number of periods rather than the length of each period.
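
A small experiment makes the side effect concrete. The sketch below is only an illustration: the 8 kHz sample rate and the 440 Hz test tone are values assumed for the demo. It repeats every sample of a pure sinusoid once and reads the dominant frequency off the FFT peak before and after.

```matlab
fs = 8000;                       % sample rate in Hz (assumed for this demo)
f0 = 440;                        % test tone in Hz (assumed for this demo)
t  = (0:fs-1) / fs;              % one second of time stamps
x  = sin(2*pi*f0*t);             % pure sinusoid

% Naive slow-down: repeat every sample once (x1 x1 x2 x2 ...).
y = reshape([x; x], 1, []);

% Dominant frequency = position of the largest FFT bin in the lower half
% of the spectrum, converted to Hz for playback at the same rate fs.
X = abs(fft(x));
[~, kx] = max(X(1:numel(x)/2));
fx = (kx - 1) * fs / numel(x);

Y = abs(fft(y));
[~, ky] = max(Y(1:numel(y)/2));
fy = (ky - 1) * fs / numel(y);

fprintf('original:  %g Hz over %g s\n', fx, numel(x)/fs);
fprintf('stretched: %g Hz over %g s\n', fy, numel(y)/fs);
```

The duration doubles, but the peak moves from 440 Hz down to 220 Hz, one octave lower: every period became twice as long instead of there being twice as many periods.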


At first sight, the tempo and pitch of a tune are inseparable. If you change the tempo, the pitch will also change. The figure on the right illustrates the concept: modifications can seemingly only be done along the green diagonal line.

Until the mid-1960s, this was accepted as the truth. Then suddenly a paper disturbed the community; two bright guys named Flanagan and Golden solved the problem in the context of telecommunication. Thanks to their work, we can separate the two parameters. Controlling the Theisen tempo independently of the tone is now easy. And the invention’s name? The Phase Vocoder.

According to the theory, and some years of trial and error on my part, if what you want is to change the tempo of a signal by a factor Pt/Qt, and shift the pitch by a factor Pp/Qp, all you have to do is the following:

  1. Transform the signal over to the time-frequency plane using the Short-Time Fourier Transform (STFT).
  2. Linearly interpolate the magnitude of the STFT by a factor Pt/Qt*Qp/Pp across the time dimension.
  3. Go over the phase of the STFT and make sure that its partial derivative with respect to time (the phase advance from one frame to the next) is preserved.
  4. Transform the modified STFT back into the time domain.
  5. Re-sample the resulting waveform by a factor Qp/Pp to get the final waveform.
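
To see why the factor in step 2 combines both ratios, it helps to run the numbers once. The following is a minimal sketch of the bookkeeping only, not of the signal processing, taking Pt/Qt = 2/5 and Pp/Qp = 5/6 as example values. The variable names `stretch`, `resamp`, `duration` and `pitch` are introduced here just for illustration.

```matlab
Pt = 2; Qt = 5;               % tempo factor Pt/Qt = 0.4 (example value)
Pp = 5; Qp = 6;               % pitch factor Pp/Qp (example value)

r = Pt/Qt * Qp/Pp;            % step 2: step taken through the STFT frames, 0.48

stretch  = 1 / r;             % steps 2-4: duration change at unchanged pitch
resamp   = Qp / Pp;           % step 5: change in the number of samples
duration = stretch * resamp;  % net duration change, equals Qt/Pt
pitch    = 1 / resamp;        % net pitch change, equals Pp/Qp

% Each output frame is a blend of two neighbouring input frames.
pos    = 0:r:2;               % positions of the first few output frames
left   = floor(pos);          % index of the input frame to the left
weight = pos - left;          % interpolation weight towards the right frame

fprintf('duration x %.3f, pitch x %.3f\n', duration, pitch);
disp([pos; left; weight]);
```

With these numbers the phase vocoder stage makes the clip about 2.08 times longer without touching the pitch, and the resampling stage stretches it by a further 1.2 while lowering the pitch to 5/6 of the original. The net duration factor is 2.5, which is exactly Qt/Pt, so the tempo no longer depends on the pitch setting.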

We definitely need some test data to try this out. I have recorded my own voice saying the following in Norwegian:

“Som vi ser har det blitt kuldegrader over omtrent hele landet”.

Translated to English this corresponds to something like: “As you can see, we now have temperatures below zero degrees across the whole country”. You can listen to the original sound clip in the player below.

This MATLAB code implements the Theisen Transform following the step-by-step Phase Vocoder recipe above:

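function y = theisen(x)
% THEISEN  Slow down and pitch-shift a signal with a phase vocoder.
%   x is expected to be a row vector of samples.
% Geir K. Nilsen, 2006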

Ns = length(x);
Nw = 1024;  % Window width
hop = Nw/4; % Hop size (consecutive frames overlap by 75%)
Pt = 2; 
Qt = 5;
Pp = 5; 
Qp = 6;
r = Pt/Qt * Qp / Pp; % Tempo-pitch shift factor

win = hann(Nw, 'periodic');

Nf = floor((Ns + hop - Nw) / hop);

fframe = zeros(Nw,Nf);
pframe = zeros(Nw,Nf);

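% Step 1: STFT
c = 1;
for i = 0:hop:((Nf-1)*hop)
    % The factor 2/3 compensates for windowing twice (here and in step 4)
    % with a Hann window at 75% overlap.
    fframe(:,c) = 2/3*fft(x(1+i:i+Nw).*win');
    c = c + 1;
end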

% Step 2 & 3: Linear interpolation & phase preservation
phase = angle(fframe(:,1));
c = 1;
x = [fframe zeros(size(fframe, 1),1)]; % Extra zero frame keeps floor(i)+2 in range
for i = 0:r:Nf-1
    x1 = x(:,floor(i)+1);              % Frame to the left of position i
    x2 = x(:,floor(i)+2);              % Frame to the right of position i
    scale = i - floor(i);              % Fractional distance from x1 to x2
    mag = (1-scale)*abs(x1) + scale*(abs(x2));
    pframe(:,c) = mag .* exp(1j*phase);
    c = c + 1;
    phase_adv = angle(x2) - angle(x1);
    % Accumulate the phase
    phase = phase + phase_adv;
end

% Step 4: synthesize frames to get back waveform. Known as the Inverse
% Short Time Fourier Transform.
c = 1;
Nf = size(pframe,2);
Ns = Nf*hop - hop + Nw;
y = zeros(1,Ns);

for i = 0:hop:((Nf-1)*hop)
    frame = real(ifft(pframe(:,c)));           % Back to the time domain
    y(1+i:i+Nw) = y(1+i:i+Nw) + frame'.*win';  % Window and overlap-add
    c = c + 1;
end

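% Step 5: finally resample the waveform to adjust according
% to the pitch shift factor. resample(y, p, q) changes the number of
% samples by p/q, so the factor Qp/Pp from the recipe is p = Qp, q = Pp.
y = resample(y, Qp, Pp);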
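
The factor 2/3 in step 1 can look arbitrary, so here is a quick way to check where it comes from. Each frame is multiplied by the Hann window twice, once before the FFT and once during overlap-add, so what matters is the sum of the squared, overlapping windows. The sketch below writes the periodic Hann window out by hand so that no toolbox is needed, and the frame count of 16 is an arbitrary choice for the demo.

```matlab
Nw  = 1024;                          % window width, as in the listing
hop = Nw/4;                          % hop size, as in the listing
n   = (0:Nw-1)';
win = 0.5 * (1 - cos(2*pi*n/Nw));    % periodic Hann window, written out

Nf  = 16;                            % number of frames (arbitrary for the demo)
ola = zeros(Nf*hop - hop + Nw, 1);   % accumulator for the overlap-add sum

for c = 1:Nf
    idx = (c-1)*hop + (1:Nw)';       % samples covered by frame c
    ola(idx) = ola(idx) + win.^2;    % window applied twice -> squared
end

% Away from the edges every sample is covered by exactly four frames.
steady = ola(Nw : end-Nw+1);
fprintf('overlap-add gain: min %.6f, max %.6f\n', min(steady), max(steady));
```

Away from the first and last frames the sum is a constant 3/2, so scaling by 2/3 brings an unmodified signal back at its original level.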

When applied to the original recording, the result is quite impressive. Listen to the modified sound clip below and judge for yourself.

For a practical C# implementation, see my GitHub repository.