Signalsmith Stretch: How Modern Pitch-Shifting Actually Works

How does a modern time-stretch algorithm slow down music without changing pitch, or shift its key without changing speed? This article breaks down the Signalsmith Stretch pitch-shifting algorithm: what it does, how it differs from older approaches, and why it sounds better.

For a long time, PracticeSession used a time-domain algorithm called WSOLA for time-stretching and pitch-shifting. It was acceptable for practice scenarios, especially at moderate speeds, but I was never fully satisfied with it. Artifacts crept in at lower speeds, and I knew better alternatives existed. I just assumed they wouldn't be available as open-source, real-time libraries.

Then I added stem separation and multi-track mixing to PracticeSession, and everything broke. When I tried to time-stretch and pitch-shift multiple tracks simultaneously, the music turned into a jumbled, unusable mess. Whether it was computational overload or something deeper about running multiple WSOLA instances in parallel, I don't know. But it forced me to look for alternatives.

I found Signalsmith Stretch, a spectral algorithm created by Geraint Luff of Signalsmith Audio. When I first heard it process multiple tracks at once, all staying in perfect sync with no perceptible quality loss compared to a single track, I was floored. And the overall quality, multi-track aside, was just so much better. Even at extreme slow-downs like 0.1×, the sound had a flowing smoothness that WSOLA couldn't match. There was no turning back.

This article explores what Signalsmith Stretch does and why it sounds so different. It's the companion piece to our WSOLA deep-dive: where that article explains the time-domain approach, this one covers the spectral approach that replaced it.

The Limits of Waveform Matching

WSOLA chops audio into overlapping chunks and repositions them in time. To avoid artifacts at the overlaps, it searches for the best alignment: the position where the outgoing chunk's waveform most closely continues from the incoming one. For monophonic signals (a single voice, a solo instrument), the waveform has a clear repeating pattern, and this works well.

But WSOLA treats audio as a single waveform. It has no concept of the individual frequencies within the sound. When multiple notes ring simultaneously, like a piano chord or a full band mix, the combined waveform doesn't repeat in any simple way. The cross-correlation search can't find clean matches, and you hear warbling and beating as chunks interfere unpredictably.

To handle polyphonic material, we need an entirely different approach: one that separates the audio into its frequency components and processes each one independently.

Seeing Sound: The Spectrogram

Take a short segment of audio and apply a Fourier transform, computed with the Fast Fourier Transform (FFT). The FFT decomposes the segment into its constituent frequencies, telling you which ones are present and how strong each is. Do this for overlapping segments across the full audio, and you get a spectrogram: a 2D picture with time on the horizontal axis, frequency on the vertical axis, and brightness showing energy.
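To make that concrete, here's a minimal spectrogram in Python/numpy. The window size and hop are illustrative values, not anything specific to the libraries discussed here:

```python
import numpy as np

def spectrogram(audio, window_size=1024, hop=256):
    """Magnitude spectrogram: one column per overlapping windowed FFT."""
    window = np.hanning(window_size)
    columns = []
    for start in range(0, len(audio) - window_size, hop):
        segment = audio[start:start + window_size] * window
        columns.append(np.abs(np.fft.rfft(segment)))  # energy per frequency bin
    return np.array(columns).T  # rows = frequency, columns = time

# A 440 Hz tone shows up as one bright horizontal row.
sr = 44100
t = np.arange(sr) / sr
spec = spectrogram(np.sin(2 * np.pi * 440 * t))
print(spec.shape)  # (513, number_of_frames)
```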

A spectrogram makes visible what your ear already knows. Sustained notes appear as horizontal lines. Drum hits appear as vertical lines (energy across all frequencies at one instant). Individual harmonics of a chord show up as clearly separated lines, even when the raw waveform looks like an incomprehensible squiggle. Where WSOLA sees one complex waveform it can't match, the spectrogram reveals the individual components that can each be handled on their own terms.

The formal name for this process is the Short-Time Fourier Transform (STFT), and the overlap-add framework it uses is the same one we covered in the WSOLA article. The segments are windowed (tapered at the edges for a cleaner spectrum), processed, then overlapped and summed back together. What changes is what happens between windowing and summing: instead of just repositioning the audio chunk, we transform it to the frequency domain, modify it, and transform it back.
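Here's a bare-bones sketch of that framework: window, transform, (modify), inverse transform, overlap-add. With the spectral modification left out, it just reconstructs the input, which is a useful sanity check:

```python
import numpy as np

def stft_process(audio, n=1024, hop=256):
    """Windowed overlap-add skeleton. Real algorithms modify the
    spectrum between the forward and inverse FFT; with no
    modification, this reconstructs the input (up to edge effects)."""
    window = np.hanning(n)
    out = np.zeros(len(audio))
    norm = np.zeros(len(audio))  # accumulated window energy, for normalization
    for start in range(0, len(audio) - n, hop):
        spectrum = np.fft.rfft(audio[start:start + n] * window)
        # ... spectral modifications go here ...
        out[start:start + n] += np.fft.irfft(spectrum) * window
        norm[start:start + n] += window ** 2
    return out / np.maximum(norm, 1e-8)
```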

The time-frequency tradeoff

There's a fundamental constraint. The FFT analyzes a fixed-length window of audio, and the length of that window forces a tradeoff:

  - A longer window resolves closely spaced frequencies, but blurs exactly when things happen within it.
  - A shorter window pins down timing, but can't tell nearby frequencies apart.

This is a mathematical fact, closely related to Heisenberg's uncertainty principle. You cannot know both "exactly what frequency" and "exactly when" simultaneously. Every spectral algorithm must pick a compromise.

| Window size (at 44.1 kHz) | Duration | Frequency resolution | Best for |
|---|---|---|---|
| 256 samples | ~6 ms | ~172 Hz | Percussive material |
| 1024 samples | ~23 ms | ~43 Hz | General purpose |
| 2048 samples | ~46 ms | ~21.5 Hz | Pitched/harmonic content |
| 4096 samples | ~93 ms | ~10.7 Hz | Dense, slowly-varying spectra |

At 1024 samples (~23 ms), you get ~43 Hz frequency resolution. Fine for telling apart notes a semitone or more apart in the mid-range, but a drum hit can't be pinpointed to better than 23 ms. Most time-stretching algorithms use windows in the 1024–4096 range.
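The table's numbers fall straight out of two one-line formulas: window duration = samples / rate, and bin spacing = rate / samples.

```python
sr = 44100
for n in (256, 1024, 2048, 4096):
    print(f"{n} samples: {1000 * n / sr:.0f} ms, {sr / n:.1f} Hz per bin")
# 256 samples: 6 ms, 172.3 Hz per bin
# 1024 samples: 23 ms, 43.1 Hz per bin
# 2048 samples: 46 ms, 21.5 Hz per bin
# 4096 samples: 93 ms, 10.8 Hz per bin
```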

The Phase Vocoder: Preserving Tone, Losing Time

The phase vocoder is the classic spectral time-stretcher, first described by Flanagan and Golden in 1966 and refined over decades of research.

The FFT gives us two values for each frequency bin:

  - Magnitude: how much energy is at that frequency.
  - Phase: where that frequency is in its cycle at that frame.

To time-stretch, keep the magnitudes (what the audio sounds like) and redistribute the frames in time (when each spectrum appears). For a 2× stretch, space output frames twice as far apart. Simple enough.
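In code, "space the frames twice as far apart" is just a different synthesis hop. A toy sketch, with an arbitrary analysis hop:

```python
import numpy as np

analysis_hop = 256                            # spacing of frames read from the input
stretch = 2.0
synthesis_hop = int(analysis_hop * stretch)   # spacing of frames in the output

# Output positions of the first few frames: twice as far apart.
print(np.arange(4) * analysis_hop)    # [  0  256  512  768]
print(np.arange(4) * synthesis_hop)   # [   0  512 1024 1536]
```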

The hard part is phase.

Why phase matters

Phase encodes timing information within each spectral frame. A drum hit at the beginning of a frame and one at the end have identical magnitudes but different phases. Get the phase wrong, and a sharp transient turns into a diffuse blob.

But the problem goes deeper. The relationships between phases of nearby frequency bins encode the temporal structure of the sound. A drum hit has a specific pattern of phase relationships across all frequencies: they're all aligned so that energy constructively interferes at one moment. Disrupt these relationships, and the energy spreads out. The sound becomes "echoey" or "phasey."

Two kinds of phase coherence

Much of the research in this field has focused on two types of phase relationships. Understanding the distinction is key to understanding both why the phase vocoder sounds the way it does, and how Signalsmith Stretch improves on it.

Horizontal coherence (across time): Each frequency bin's phase should evolve smoothly from one frame to the next. If a 440 Hz tone is playing, its phase should advance by the right amount between frames. The standard phase vocoder preserves this well. Think of it as: "each frequency continues smoothly through time."

Vertical coherence (across frequency): Phase relationships between adjacent bins within a single frame should be consistent. When a single sound event (a note, a transient) appears in multiple bins, their phases have a specific relationship that encodes when it happens. Think of it as: "all the frequencies making up one sound event are in sync."

The standard phase vocoder preserves horizontal coherence but destroys vertical coherence. Sustained notes come through, but transients become diffuse and everything acquires a "reverberant" quality. Laroche and Dolson, in their 1999 paper, described it as double vision for your ears: the audio equivalent of misaligned printing plates, where everything is slightly smeared and out of register.
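Here's what the horizontal prediction looks like, sketched for a single bin. This is the textbook phase vocoder update (variable names are mine; real implementations vectorize it across all bins):

```python
import numpy as np

def wrap(phase):
    """Wrap a phase difference into [-pi, pi)."""
    return np.mod(phase + np.pi, 2 * np.pi) - np.pi

def horizontal_prediction(prev_out_phase, prev_in_phase, in_phase,
                          b, n_fft, analysis_hop, synthesis_hop):
    """Standard phase vocoder propagation for bin b: estimate the
    bin's true frequency from its measured phase advance, then
    advance the output phase at that frequency across the (longer
    or shorter) synthesis hop."""
    bin_freq = 2 * np.pi * b / n_fft                  # radians per sample
    expected = bin_freq * analysis_hop                # nominal phase advance
    deviation = wrap(in_phase - prev_in_phase - expected)
    true_freq = bin_freq + deviation / analysis_hop   # refined frequency estimate
    return prev_out_phase + true_freq * synthesis_hop
```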

Phase locking helped. Transients didn't.

Puckette (1995) and Laroche & Dolson (1999) developed phase locking: detect spectral peaks (the dominant notes), then force all nearby bins to maintain their original phase relationships. The difference is striking. Without phase locking, the output has the phase vocoder's "characteristic reverberant quality." With it, "the resynthesized sound had the same presence as the original."
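A sketch of the "identity phase locking" variant from Laroche & Dolson: only peak bins get the propagated phase, and every other bin keeps its original phase offset relative to the nearest peak. The peak picking here is deliberately crude:

```python
import numpy as np

def identity_phase_lock(out_phase, in_phase, mags):
    """Force each bin's phase to track its nearest spectral peak,
    preserving the input's phase relationships around that peak."""
    peaks = [b for b in range(1, len(mags) - 1)
             if mags[b] > mags[b - 1] and mags[b] > mags[b + 1]]
    if not peaks:
        return out_phase
    locked = np.empty_like(out_phase)
    for b in range(len(mags)):
        p = min(peaks, key=lambda p_: abs(p_ - b))  # nearest peak bin
        # Rotate this bin by exactly as much as its governing peak.
        locked[b] = out_phase[p] + (in_phase[b] - in_phase[p])
    return locked
```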

But transients remained stubborn. The analysis window is 20–90 ms long; a drum hit lasts a few milliseconds. The energy gets spread across the window, and adjacent frames each reproduce the transient at slightly different times, creating pre-echo and post-echo. Where the original had a single sharp attack, the output has a blurred, anticipatory buildup.

Researchers developed several workarounds: phase resets at detected transients (Duxbury et al., 2002; Röbel, 2003), non-linear time maps that avoid stretching across transients, and harmonic-percussive separation using median filtering. All add complexity. All have tradeoffs. Transients in spectral processing remain an active research problem.

The Spectrogram's Double Life

Before getting to Signalsmith Stretch, there's one more concept to cover, and it's the insight that ties everything together.

Look at a spectrogram. Each column is the spectrum of one analysis frame: a snapshot of which frequencies are present at that moment. Read it column-by-column, and you're doing spectral processing. Each column is an FFT, and you process one spectrum at a time.

Now turn your attention sideways. Each row is the time history of a single frequency bin: how the energy and phase at that specific frequency evolve over time. Each row is a sub-band, a very narrow bandpass filter isolating one frequency region.

Geraint Luff, in his ADC22 talk, makes the point explicitly: these are two perspectives on exactly the same data. The FFT calculations are identical. Only the interpretation changes.

This matters because it reframes what the phase vocoder is actually doing. Viewed row-by-row, each frequency bin is a narrow-band signal. If the band is narrow enough to contain only one frequency at a time, you can estimate that frequency from how fast the phase changes, stretch its amplitude and frequency envelopes, and resynthesize. That's sub-band resynthesis, and it's exactly what the phase vocoder does, just described from a different angle.
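Reading one row, the frequency estimate is the same math as the horizontal prediction shown earlier, just expressed in Hz. A sketch with assumed parameters:

```python
import numpy as np

def row_frequency(phase_prev, phase_curr, b, n_fft, hop, sr):
    """Estimate the frequency inside bin b (one spectrogram row) from
    how fast its phase rotates between consecutive frames."""
    center = b * sr / n_fft                        # bin center frequency, Hz
    expected = 2 * np.pi * center * hop / sr       # expected rotation per hop
    deviation = np.mod(phase_curr - phase_prev - expected + np.pi,
                       2 * np.pi) - np.pi
    return center + deviation * sr / (2 * np.pi * hop)

# A tone slightly above a bin's center rotates the phase slightly
# faster than the center predicts, revealing the true frequency.
```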

The problem also becomes more obvious from this vantage point. The phase vocoder processes each row independently. Each sub-band is stretched on its own, with no reference to what the neighboring bands are doing. The phases drift apart across bands. And that drift is phasiness: the sub-bands are no longer in sync.

Keep this dual perspective in mind. Signalsmith Stretch exploits it directly.

Signalsmith Stretch: The Hybrid

Luff positions his algorithm as a "fourth method," distinct from time-domain OLA, standard phase vocoders, and sinusoidal modeling. He builds up to it in the ADC22 talk by first showing the phase vocoder and its opposite, then combining them.

The phase vocoder's opposite

The standard phase vocoder makes horizontal phase predictions: it looks at each frequency bin's phase in the previous time frame and predicts what it should be in the next. Tonal continuity is preserved, but timing information (vertical coherence) is lost.

What happens if you predict phase the other way? Instead of looking back in time, look down in frequency: predict each bin's phase from the adjacent bin one step below. The result is a vertical phase prediction that exactly preserves the relative phase between adjacent frequencies.

On its own, the result is terrible for tonal content. Sustained notes become a detuned mess. But the drums are crystal clear. All timing information is perfectly preserved, because vertical phase coherence is what encodes when events happen.

Luff calls this the "evil twin": it preserves timing but mangles tonality. The exact mirror image of the phase vocoder, which preserves tonality but mangles timing.

Two methods, each failing in the opposite way. The natural question: what if you combine them?

Blending horizontal and vertical

Signalsmith Stretch does exactly that. For each frequency bin in each output frame:

  1. Read the energy from the appropriate time-and-frequency-mapped position in the input spectrum.
  2. Make a horizontal prediction (phase vocoder style): use the same bin's phase from the previous time frame, for tonal continuity.
  3. Make a vertical prediction (evil twin style): use the adjacent bin's phase from a lower frequency, for timing precision.
  4. Blend the two with a weighted average. The weights come from the energy of each prediction source.

The weighting is what makes it work. Strong tonal components, which have consistent energy over time, naturally favor the horizontal prediction. Transients, which have energy spread across frequencies at one moment, naturally favor the vertical. No explicit "detect transients" step is needed. The physics of the signal drives the blend.
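Here's a toy sketch of the blend, using the complex-number trick explained in the next subsection. This is not the library's actual code: the multi-offset blending, frequency mapping, and windowing machinery are all left out.

```python
import numpy as np

def blended_frame(prev_out, curr_in, advance):
    """Toy blend of horizontal and vertical phase predictions.
    prev_out: previous output frame (complex spectrum).
    curr_in:  current input frame (complex spectrum).
    advance:  per-bin expected phase rotation over one synthesis hop."""
    out = np.zeros_like(curr_in)
    for b in range(len(curr_in)):
        # Horizontal: this bin's previous output phase, rotated forward.
        # Its magnitude (the tonal component's energy) is its weight.
        horiz = prev_out[b] * np.exp(1j * advance[b])
        # Vertical: the bin below, carrying the input's phase step
        # between adjacent bins (which encodes event timing).
        vert = 0
        if b > 0:
            step = np.angle(curr_in[b]) - np.angle(curr_in[b - 1])
            vert = np.abs(curr_in[b - 1]) * np.exp(1j * (np.angle(out[b - 1]) + step))
        # Complex addition is the energy-weighted average: whichever
        # prediction is stronger dominates the angle of the sum.
        total = horiz + vert
        phase = np.angle(total) if np.abs(total) > 0 else 0.0
        out[b] = np.abs(curr_in[b]) * np.exp(1j * phase)
    return out
```

The key property is in the last comment: there is no explicit transient detector anywhere, just the relative magnitudes of the two predictions.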

Phase blending with complex numbers

How do you average two phase values?

Phases are circular. They wrap around at 360°. If one prediction says 10° and another says 350°, the arithmetic average gives 180°, which is completely wrong. The correct answer is 0° (or equivalently, 360°).

The fix is to work with complex numbers instead of raw phase angles. Each phase prediction becomes a complex number on the unit circle (a + bi), where the angle encodes phase and the magnitude encodes confidence.

Complex number averaging handles wrapping correctly by default. And the magnitude of each complex prediction acts as a natural weight: a strong tonal component produces a high-magnitude horizontal prediction that dominates the average. A transient produces a high-magnitude vertical prediction that dominates instead.

Choosing the right representation makes the hard problem dissolve. No branching, no special cases, no wrap-around handling. Just multiply and add complex numbers.
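The 10°/350° example from above, in a few lines of numpy:

```python
import numpy as np

a, b = np.deg2rad(10), np.deg2rad(350)
print(np.rad2deg((a + b) / 2))                                # 180.0 -- wrong
print(np.rad2deg(np.angle(np.exp(1j * a) + np.exp(1j * b))))  # ~0.0 -- right
```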

Luff takes this further. Rather than just two predictions (one horizontal, one vertical), Signalsmith Stretch blends predictions from multiple frequency offsets: bins at distances of 1, 2, 4, 8, and 16. This captures phase relationships at different scales, improving coherence for both narrowly spaced harmonics and broader spectral features.

Other Design Choices

The hybrid phase prediction is the core of Signalsmith Stretch, but several other design choices set it apart.

Simultaneous pitch and time processing

Most pitch-shifting libraries, including Rubber Band and SoundTouch, treat pitch-shifting as a two-step process: time-stretch, then resample to shift the pitch (or vice versa). Signalsmith Stretch applies frequency mapping and time mapping together during the spectral processing step. At larger pitch shifts (±6 semitones and beyond), chaining two operations compounds artifacts; doing both at once avoids this.

Non-linear frequency mapping

When pitch-shifting, the naive approach scales all frequencies uniformly. But uniform scaling leaks spectral energy into wrong bins, producing metallic artifacts (frequency-domain aliasing).
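Sketched per frame, the naive map makes the problem easy to see: bin positions get rounded, and energy piles into the wrong bins.

```python
import numpy as np

def naive_shift_frame(spectrum, ratio):
    """Uniform frequency scaling: move every bin to round(bin * ratio).
    The rounding quantizes partials into wrong bins, one source of
    the metallic artifacts described above."""
    out = np.zeros_like(spectrum)
    for b, value in enumerate(spectrum):
        target = int(round(b * ratio))
        if target < len(out):
            out[target] += value
    return out
```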

Signalsmith Stretch creates a non-linear frequency map that is locally 1:1 around each spectral peak. Strong harmonics simply move to their new target frequencies, and their spectral neighborhoods come along undistorted. Between peaks, where there's less energy and the ear is less sensitive, the mapping absorbs the necessary compression or expansion. Harmonics shift cleanly; aliasing artifacts drop substantially.

Tonality limit

When you pitch-shift audio up, high frequencies get pushed even higher. But high-frequency content like breath noise, cymbal shimmer, and consonant textures doesn't have a fixed "pitch" that should shift. Moving it higher makes the audio sound thin and unnatural.

Signalsmith Stretch has a configurable tonality limit: a corner frequency above which the pitch shift gradually fades to zero. Below it, harmonics shift normally. Above it, content stays roughly in place. High-frequency textures keep their natural character while the perceptually important lower harmonics shift as expected.
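A toy interpretation of the idea (the library's actual curve and parameters may differ):

```python
import numpy as np

def per_band_ratio(freq_hz, semitones, corner_hz):
    """Pitch ratio that fades toward 1.0 (no shift) above corner_hz.
    Below the corner, content shifts by the full amount; well above
    it, content stays roughly where it is."""
    full = 2 ** (semitones / 12)                    # e.g. +4 semitones -> ~1.26
    fade = 1 / (1 + (np.asarray(freq_hz, dtype=float) / corner_hz) ** 2)
    return full ** fade                             # fade 1 -> full shift, fade 0 -> none

print(per_band_ratio(440, 4, 8000))    # ~1.26: full shift
print(per_band_ratio(16000, 4, 8000))  # ~1.05: mostly unshifted
```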

Stereo handling

For stereo audio, Signalsmith Stretch uses the loudest channel for phase prediction, then copies inter-channel phase differences from input to output. Panning, spatial width, and phase-based effects are preserved without a separate coherence algorithm.
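A sketch of that strategy, assuming for the sake of the example that the left channel is the loudest:

```python
import numpy as np

def stereo_output_phases(in_left, in_right, out_left_phase):
    """Derive phase for the loudest channel only (here: left), then
    re-apply the input's inter-channel phase differences so panning
    and stereo width survive the processing."""
    diff = np.angle(in_right) - np.angle(in_left)  # per-bin L/R phase offset
    return out_left_phase, out_left_phase + diff
```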

The Landscape

A quick comparison with the other major libraries:

| Library | Approach | License | Pitch shifting | Sweet spot |
|---|---|---|---|---|
| Elastique (zplane) | Proprietary spectral | Commercial SDK | Resample + stretch | Industry gold standard; used by Ableton, Cubase, FL Studio, Reaper, Pro Tools |
| Rubber Band | Phase vocoder (R3: multi-band) | GPL v2+ / Commercial | Resample + stretch | Best GPL option; used by Audacity, Ardour |
| Signalsmith Stretch | Hybrid spectral (multi-directional phase) | MIT | Simultaneous spectral | Best MIT-licensed option; 0.75×–1.5× time, multi-octave pitch |
| SoundTouch | WSOLA + resampling | LGPL v2.1 | Resample + WSOLA | Very fast; adequate for modest changes on pop/rock |
| Paulstretch | Phase-randomized spectral | GPL | N/A | Extreme stretching only (4×+); creates ambient textures |

Licensing matters. Elastique is widely considered the best-sounding option, but it's a paid commercial SDK. Rubber Band is GPL-licensed, meaning you either open-source your project or buy a commercial license. Signalsmith Stretch is MIT-licensed: fully permissive, usable in any project. For indie and open-source projects, that's a significant practical advantage.

On quality, community consensus puts Signalsmith Stretch alongside Rubber Band's newer R3 engine, both approaching commercial-grade results, though the comparison depends on the material and parameters. Signalsmith Stretch may have an edge for pitch-shifting (simultaneous spectral approach), while Rubber Band R3's multi-band architecture may handle certain transient-heavy material better.

Honest Limitations

No algorithm is perfect. Signalsmith Stretch is transparent about its current weaknesses.

Transient handling at extreme stretches

For time stretches beyond 2×, the library uses what Luff himself calls "a hack": vertical phase scaling is capped at 2×, and phases are slightly randomized for longer stretches. This prevents rhythmic-echo artifacts but produces what Luff describes as a "juddery smudge." A proper fix (a non-linear time map, analogous to the non-linear frequency map) is listed as a planned improvement.

Best operating range

Time stretching sounds best between 0.75× and 1.5×. Pitch shifting handles multiple octaves with increasing artifacts at the extremes. Most competing algorithms have similar sweet spots; Signalsmith Stretch tends to degrade gradually rather than hitting sudden quality cliffs.

No built-in transient detection

Unlike Rubber Band R3, which includes explicit transient detection and per-frequency transience classification, Signalsmith Stretch relies entirely on energy-weighted phase blending for implicit transient handling. Simpler, and avoids false-positive problems, but there's no mechanism for special treatment like phase resets or local time-map adjustments.

What This Means for Practice

Does any of this matter when you're trying to learn a guitar solo?

At moderate slow-down ratios (70%–90% speed), most modern algorithms sound acceptable. The differences show up in specific situations: extreme slow-downs, dense polyphonic material like full-band mixes, larger pitch shifts, and multi-track playback where every stem has to stay in sync.

A better algorithm won't make you practice more or learn faster on its own. But it removes friction: the audio artifacts that make slowed-down music sound "wrong." You stop noticing the tool and focus on the music.

Hear the Difference

PracticeSession uses Signalsmith Stretch for high-quality time-stretching and pitch-shifting. Slow down, transpose, loop, and practice any song with clarity.

Download Free Trial

Further Resources

For those who want to go deeper, the references below collect the key papers and talks, from the original 1966 phase vocoder through Luff's ADC22 talk and design writeup.

References

  1. Flanagan, J.L., & Golden, R.M. (1966). Phase Vocoder. Bell System Technical Journal, 45(9), 1493–1509.
  2. Portnoff, M.R. (1976). Implementation of the digital phase vocoder using the fast Fourier transform. IEEE Transactions on Acoustics, Speech, and Signal Processing, 24(3), 243–248.
  3. Dolson, M. (1986). The Phase Vocoder: A Tutorial. Computer Music Journal, 10(4), 14–27.
  4. Verhelst, W., & Roelands, M. (1993). An overlap-add technique based on waveform similarity (WSOLA) for high quality time-scale modification of speech. IEEE ICASSP, Vol. 2, pp. 554–557.
  5. Puckette, M. (1995). Phase-locked vocoder. IEEE ASSP Workshop on Applications of Signal Processing to Audio and Acoustics.
  6. Laroche, J., & Dolson, M. (1999). Improved Phase Vocoder Time-Scale Modification of Audio. IEEE Transactions on Speech and Audio Processing, 7(3), 323–332.
  7. Duxbury, C., Davies, M., & Sandler, M. (2002). Improved Time-Scaling of Musical Audio using Phase Locking at Transients. AES Convention 112.
  8. Röbel, A. (2003). A New Approach to Transient Processing in the Phase Vocoder. International Computer Music Conference (ICMC).
  9. Driedger, J., & Müller, M. (2016). A Review of Time-Scale Modification of Music Signals. Applied Sciences, 6(2), 57.
  10. Luff, G. (2022). Four Ways to Write a Pitch-Shifter. Audio Developer Conference (ADC22). Video.
  11. Luff, G. (2023). The Design of Signalsmith Stretch. Blog post.