
WSOLA Explained: How Audio Gets Slowed Down Without Changing Pitch

When you slow down a song in music practice software, the pitch stays the same. How? The answer is an algorithm called WSOLA (Waveform Similarity Overlap-Add). Although it's the backbone of most time-stretching software, clear explanations of how WSOLA actually works are surprisingly hard to find.

This article aims to fix that. We'll build up an intuitive understanding of WSOLA through interactive visualizations, starting with the fundamental problem it solves.

The Problem: Naive Resampling Changes Pitch

Digital audio is a sequence of numbers called samples, each representing the waveform's amplitude at a specific instant. At a standard sample rate of 44,100 Hz, one second of audio contains 44,100 samples.

The simplest way to slow down audio is to spread those samples apart, playing them more slowly. But this creates an obvious problem: the pitch drops.

Why? Sound is vibration. A musical note like A4 vibrates at 440 cycles per second (440 Hz). If you stretch the audio to twice its length, those 440 cycles now take 2 seconds to play. That's only 220 cycles per second, which is A3, a full octave lower.
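
If you want to see this numerically, here's a quick sketch (using numpy, with names and details of my own choosing; an illustration, not production code). It stretches a 440 Hz sine to twice its length by naive linear-interpolation resampling and estimates the resulting frequency:

```python
import numpy as np

sr = 44100                        # sample rate in Hz
t = np.arange(sr) / sr            # one second of timestamps
a4 = np.sin(2 * np.pi * 440 * t)  # A4: 440 cycles in one second

# Naive 2x slowdown: resample so the same 440 cycles span two seconds
stretched = np.interp(np.arange(2 * sr) / 2, np.arange(sr), a4)

def freq(x, sr):
    # Rough frequency estimate: count zero crossings per second, halve
    crossings = np.sum(np.diff(np.sign(x)) != 0)
    return crossings / 2 / (len(x) / sr)

print(freq(a4, sr))         # ~440 Hz (A4)
print(freq(stretched, sr))  # ~220 Hz (A3, an octave lower)
```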

Use the visualization below to see and hear this effect:

This is why simply adjusting playback speed in a basic media player makes music sound like chipmunks when sped up, or like a dragging record when slowed down. To change tempo without changing pitch, we need a smarter approach.

The Solution: Overlap-Add (OLA)

Instead of stretching the samples themselves, what if we rearrange chunks of audio? This is the key insight behind Overlap-Add algorithms.

We extract overlapping "frames" from the original audio, then place them at new positions in the output. The original samples are preserved; they're just reorganized. Since we're not changing the samples themselves, the frequency content (and thus the pitch) stays the same.

In the output, frames are placed with exactly 50% overlap: each frame overlaps half of the previous frame and half of the next. This overlap allows adjacent frames to be blended together smoothly. We'll look at how this blending works shortly, but first: how do we control the stretch factor?

How Stretching and Compression Work

The key insight: the output layout is fixed. We select frames of audio from the input, then rearrange them so that consecutive frames overlap by exactly 50% in the output. This fixed overlap is what allows smooth blending (more on that shortly).

So what determines how much we stretch or compress the audio? The spacing between frames in the input.

Play with the visualization below to see this in action:

Key Observations

To slow down: Input frames are placed closer together (more overlap). When spread to 50% overlap in the output, the audio takes longer to play.

To speed up: Input frames are placed further apart (less overlap). At extreme speeds, gaps appear in the input, and some audio is never read at all.

At normal speed: Input and output spacing are equal. Frames overlap 50% in both.

This reveals something important: time-stretching doesn't resample the audio. The original samples are preserved; they're just rearranged. That's why the pitch doesn't change.

Try the "Fixed window" display mode to see this more clearly. In this mode, the time scale stays constant as you adjust the stretch factor. Notice that the OUTPUT frames don't move; they always cover the same duration. What changes is how far apart the INPUT frames are spaced, which determines how much of the original audio those output frames represent. When you slow down, input frames are selected closer together, so the same number of output frames covers a shorter segment of the original. When you speed up, input frames spread apart, covering more of the original in the same output duration.

See the math

The relationship between input spacing, output spacing, and stretch factor is:

Stretch Factor (α) = Output Spacing / Input Spacing

In technical literature, these are called the synthesis hop (Hₛ, output spacing) and analysis hop (Hₐ, input spacing):

α = Hₛ / Hₐ

Since the output spacing is fixed at half the window size (Hₛ = N/2 for 50% overlap), we derive the input spacing for any desired stretch factor:

Hₐ = Hₛ / α

For example, with a window of 2048 samples and 50% overlap (Hₛ = 1024):

  • 2× slower (α = 2): Hₐ = 1024 / 2 = 512 samples (75% input overlap)
  • Normal speed (α = 1): Hₐ = 1024 / 1 = 1024 samples (50% input overlap)
  • 2× faster (α = 0.5): Hₐ = 1024 / 0.5 = 2048 samples (0% input overlap)
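
The same arithmetic in code, as a quick illustrative sketch (variable names are my own):

```python
N = 2048               # window size in samples
Hs = N // 2            # synthesis hop: fixed at 50% output overlap

for alpha in (2.0, 1.0, 0.5):  # 2x slower, normal speed, 2x faster
    Ha = round(Hs / alpha)     # analysis hop (input spacing)
    input_overlap = 100 * (N - Ha) / N
    print(f"alpha={alpha}: Ha={Ha} samples, input overlap={input_overlap:.0f}%")

# alpha=2.0: Ha=512 samples, input overlap=75%
# alpha=1.0: Ha=1024 samples, input overlap=50%
# alpha=0.5: Ha=2048 samples, input overlap=0%
```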

How the Blending Works: Hann Windows

Now that we understand where frames come from and where they go, let's look at how they're blended together in the overlap regions.

Each frame is multiplied by a Hann window, a bell-shaped curve that's full strength in the middle and fades to zero at the edges. When two Hann-windowed frames overlap by exactly 50%, their weights sum to exactly 1.0 at every point:

This is the COLA property (Constant Overlap-Add), and it ensures consistent volume throughout the output. Where one window is fading out, the next is fading in by exactly the complementary amount. This is why the output spacing must be fixed at exactly half the window size: it's the precise value that makes overlapping Hann windows sum to unity.
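
If you'd like to check this numerically, here's a small numpy sketch (my own illustration) that overlap-adds periodic Hann windows at a hop of N/2 and confirms the interior of the result sums to exactly 1:

```python
import numpy as np

N = 2048
hop = N // 2
# Periodic Hann window: w[n] = 0.5 * (1 - cos(2*pi*n / N))
w = 0.5 * (1 - np.cos(2 * np.pi * np.arange(N) / N))

total = np.zeros(8 * hop + N)
for m in range(8):                  # overlap-add 8 windows at hop N/2
    total[m * hop : m * hop + N] += w

interior = total[N:-N]              # skip the fade-in/out at the edges
print(np.allclose(interior, 1.0))  # True: constant overlap-add
```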

See the math

Extracting a windowed frame

To extract frame number m from the input:

xₘ(r) = x(r + m·Hₐ) · w(r)

In plain terms: "Go to position m × input spacing in the original audio. Extract N samples centered there. Multiply each sample by the Hann window value at that position within the frame."

  • x — the input audio signal
  • m — which frame we're extracting (0, 1, 2, ...)
  • Hₐ — input spacing (also called the analysis hop)
  • r — position within the frame, relative to its center
  • w(r) — the Hann window value at position r (1.0 at center, tapering to 0 at edges)

Reconstructing the output

The output is the sum of all windowed frames, each placed at its designated position:

y(r) = Σₘ xₘ(r − m·Hₛ)

In plain terms: "To find the output value at any position r, add up the contributions from every frame that overlaps that position."

With 50% overlap, exactly two frames contribute at any point: one fading out, one fading in.
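
Putting the extraction and reconstruction formulas together, a bare-bones OLA time-stretcher looks roughly like the sketch below (my own simplification, assuming a periodic Hann window and the fixed 50% output overlap described above; edge handling is omitted):

```python
import numpy as np

def ola_stretch(x, alpha, N=2048):
    """Basic OLA: windowed frames read at hop Ha, overlap-added at hop Hs."""
    Hs = N // 2                      # synthesis hop (fixed 50% overlap)
    Ha = max(1, round(Hs / alpha))   # analysis hop from the stretch factor
    w = 0.5 * (1 - np.cos(2 * np.pi * np.arange(N) / N))  # periodic Hann

    n_frames = (len(x) - N) // Ha + 1
    y = np.zeros((n_frames - 1) * Hs + N)
    for m in range(n_frames):
        frame = x[m * Ha : m * Ha + N] * w  # extract and window frame m
        y[m * Hs : m * Hs + N] += frame     # place it at the fixed output hop
    return y
```

This naive version exhibits exactly the phase problem discussed in the next section.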

The COLA property

For the crossfade to produce constant volume, the overlapping window weights must sum to exactly 1 at every position:

Σₙ w(r − n·Hₛ) = 1    for all r, when Hₛ = N/2

In plain terms: "At any output position, add up the window weights from all overlapping frames. The sum must equal exactly 1."

The Hann window satisfies this property when frames are spaced at half the window size. This isn't arbitrary; it's the mathematical reason why the output spacing is fixed.

Smooth crossfading guarantees smooth amplitude. But what happens to the actual waveform in the overlap region?

The Problem with Basic OLA: Phase Misalignment

Here's the catch: when we stretch or compress audio, we break the continuity of the signal. In the original audio, each frame flows seamlessly into the next. But when we rearrange frames to new positions, that continuity is lost. The waveform in Frame 1 no longer picks up where Frame 0 left off. It's shifted in phase.

The visualization below shows this clearly using a simple sine wave. Watch the overlap region (highlighted in red) as you adjust the stretch factor:

In the overlap region, two segments of the sine wave are being blended together. But because they come from different positions in the original audio, they don't line up. When summed together, peaks can partially cancel troughs, creating amplitude variations and a hollow, "phasey" sound.

The audio demo alternates between a pure tone (single wave) and the overlap region (two misaligned waves blended). Listen for the difference, especially when the phase shift approaches 180°, where cancellation is most severe.

The amount of phase shift depends on the stretch factor. At certain values, the mismatch approaches 180° (complete phase inversion), causing maximum cancellation. This is the fundamental limitation of basic OLA.

See the math

Why the mismatch occurs

After placing Frame 0 in the output, the next audio we hear should be the natural continuation: whatever comes next in the original signal. If Frame 0 was read from position P in the input, the natural continuation starts at:

P + Hₛ    (one output spacing later)

But Frame 1's actual position in the input is:

Hₐ    (one input spacing from the start)

The mismatch between these two positions is:

Hₛ − Hₐ

This is non-zero whenever we're stretching or compressing (since stretching means Hₐ < Hₛ, and compression means Hₐ > Hₛ). The mismatch causes the waveform discontinuity we see in the overlap region.

Phase shift in terms of frequency

For a sine wave of frequency f, the phase shift (in degrees) caused by a sample offset of d samples is:

Phase shift = (d / sampleRate) × f × 360°

This explains why different frequencies experience different phase shifts, and why complex audio (with many frequencies) sounds "phasey" rather than having a single clean cancellation point.
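
As a worked example, here's that formula as a tiny helper (a hypothetical function of my own naming) applied to the 2× stretch case, where the offset between frames is Hₛ − Hₐ = 1024 − 512 = 512 samples:

```python
def phase_shift_deg(offset_samples, freq_hz, sample_rate=44100):
    # Offset in seconds, times frequency = cycles; times 360 = degrees
    return (offset_samples / sample_rate) * freq_hz * 360 % 360

print(phase_shift_deg(512, 440))  # ~39 degrees for a 440 Hz tone
print(phase_shift_deg(512, 880))  # ~78 degrees: each frequency shifts differently
```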

How This Applies to Real Audio

We're using a slow 1-second alternation here for clarity, but in actual time-stretched audio, these transitions happen ~43 times per second (with output spacing of 1024 samples at 44.1kHz). The overlap regions fly by so fast that you don't hear discrete "phases." Instead, the cumulative effect of all those tiny misalignments creates a continuous hollow, metallic quality.

Real music is far more complex than a single sine wave. A guitar chord, a vocal, or a full mix contains hundreds of frequency components simultaneously. Each frequency experiences a different effective phase shift in the overlap region. Some frequencies might align well by chance, while others cancel badly. The result is an unpredictable, "underwater" coloration that sounds unnatural, even though the pitch is technically correct.

How WSOLA Fixes This

With basic OLA, you might get lucky: some stretch factors happen to produce good waveform alignment by coincidence. But you don't want to be limited to only those "magic" values. You want to stretch or compress to any degree while maintaining quality.

That's where the "WS" in WSOLA comes in: Waveform Similarity. Instead of reading each frame from a fixed position, WSOLA allows some flexibility. It searches within a tolerance range around the nominal position to find where the waveform best continues from the previous frame.

Try it yourself. Use the slider below to adjust Frame 1's position within the tolerance range. Watch how the phase alignment in the overlap region changes, and how you can find a position where the waveforms match up:

What You're Seeing

In the INPUT, the faded dashed box shows Frame 1's nominal position (where basic OLA would read it), while the solid box shows its adjusted position after you move the slider. The shaded tolerance zone shows how far you're allowed to search.

In the OUTPUT, watch the overlap region. When the phase shift approaches 0°, the blue and orange waveforms align, and the overlap turns green and displays "ALIGNED!" When misaligned, you see the same cancellation problem as in basic OLA.

The Automated Version

WSOLA doesn't require manual adjustment. It automatically finds the best position using cross-correlation. For each frame, the algorithm tests all positions within the tolerance range and picks the one where the waveform most closely matches the expected continuation from the previous frame.

This is the "Waveform Similarity" search: find the adjustment that maximizes the similarity between what we expect to hear next and what we actually read from the input.

See the math

The tolerance search happens in four steps for each frame:

Step 1: Define the "natural continuation"

After placing frame m (which was read from position m·Hₐ + Δₘ in the input), what audio would naturally follow? It's whatever comes one output spacing later:

x̃ₘ(r) = x(r + m·Hₐ + Δₘ + Hₛ)

This is the audio that frame m+1 needs to blend smoothly with.

Step 2: Define the search region

Around frame m+1's nominal position, we extract an extended chunk that spans the full tolerance range:

Search region: (m+1)·Hₐ ± Δmax

This region contains all the candidate positions we might read frame m+1 from.

Step 3: Find the optimal shift

We test every possible shift δ within the tolerance range and compute how well each candidate aligns with the natural continuation:

Δₘ₊₁ = argmaxδ Σᵣ x̃ₘ(r) · x(r + (m+1)·Hₐ + δ)

In plain terms: "Slide the candidate frame across the natural continuation. At each position, multiply corresponding samples and sum them (cross-correlation). The shift with the highest sum wins."

  • Δₘ₊₁ — the winning shift for frame m+1
  • argmax — "which δ maximizes this?"
  • Σᵣ — sum over the overlap region

Step 4: Extract the adjusted frame

Read frame m+1 from the optimized position:

Adjusted position = (m+1)·Hₐ + Δₘ₊₁

Apply the Hann window and place it at its fixed output position. The waveform now aligns smoothly with the previous frame.
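
Here's how the four steps look in code, as a minimal sketch (my own simplification of the algorithm: it correlates over the whole frame rather than only the overlap region, and skips edge cases):

```python
import numpy as np

def wsola_stretch(x, alpha, N=2048, tol=512):
    Hs = N // 2                      # synthesis hop: fixed 50% output overlap
    Ha = max(1, round(Hs / alpha))   # analysis hop from the stretch factor
    w = 0.5 * (1 - np.cos(2 * np.pi * np.arange(N) / N))  # periodic Hann

    n_frames = (len(x) - N - tol - Hs) // Ha
    y = np.zeros((n_frames - 1) * Hs + N)
    pos = 0                          # adjusted read position of the current frame
    for m in range(n_frames):
        if m > 0:
            # Step 1: natural continuation of the previous frame
            target = x[pos + Hs : pos + Hs + N]
            # Step 2: search region around the nominal position, +/- tol
            lo = max(m * Ha - tol, 0)
            region = x[lo : m * Ha + tol + N]
            # Step 3: cross-correlate every candidate shift, keep the best
            corr = np.correlate(region, target, mode="valid")
            pos = lo + int(np.argmax(corr))
        # Step 4: extract the adjusted frame, window it, place at the fixed hop
        y[m * Hs : m * Hs + N] += x[pos : pos + N] * w
    return y
```

Compared to the basic OLA loop, the only addition is the correlation search before each extraction; the windowing and fixed output placement are unchanged.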

Typical tolerance values

The tolerance range (Δmax) is typically 10-25ms, or around 441-1102 samples at 44.1kHz. Larger tolerances find better matches but risk "temporal drift," pulling audio from so far away that events shift noticeably from their original timing.

Reality Check: Complex Signals

With our simple sine wave, you can find positions where the waveforms align perfectly. Depending on the frequency and tolerance range, there may be multiple such positions (one per cycle). Real music isn't so clean. A guitar chord contains dozens of frequencies; a full mix contains thousands. Each frequency would need a different shift to align perfectly, but we can only choose one position for the entire frame.

So WSOLA finds a compromise: the position that produces the best overall match. The dominant frequencies (often the fundamental pitch and lower harmonics) tend to drive the correlation, so those align well. Higher frequencies and transients may still have some misalignment, but the result is much better than basic OLA's fixed positions.

Where WSOLA Breaks Down: Transient Artifacts

WSOLA handles sustained sounds well: vocals, strings, pads. But it has an Achilles' heel: transients. Drum hits, percussive attacks, and other short, sharp sounds don't behave like periodic waveforms, and WSOLA's frame-based approach causes characteristic artifacts.

The visualization below shows four transients (A, B, C, D) as vertical spikes. Watch what happens to them as you change the stretch factor:

Transient Doubling (Stretching)

When you slow down, frames are spaced closer together in the input. This means they overlap heavily, and the same transient gets captured by multiple frames. Each frame places its copy of the transient at a different position in the output, causing audible stuttering or "flamming."

The more extreme the stretch, the closer together the input frames, the more frames capture each transient, and the worse the doubling. Try increasing the stretch factor to 2×, 3×, or 4× to see transients labeled A1, A2, A3... as they multiply.

Transient Loss (Compression)

When you speed up significantly (beyond 2× speed), input frames are spaced so far apart that gaps appear between them: regions of the original audio that no frame ever reads. Any transient falling in these gaps is lost entirely.

Try reducing the stretch factor below 0.5× (2× speed). You'll see red-shaded gaps appear in the input, and transients that fall in those gaps simply vanish from the output.

Why WSOLA Can't Fix This

The cross-correlation search looks for repeating patterns; that's how it finds optimal alignment. But transients are, by definition, aperiodic. There's no repeating structure to lock onto. The algorithm has no mechanism to detect "this is a drum hit, handle it specially."

This is why extreme time-stretching (beyond ~2×) on percussive material sounds problematic, and why more advanced algorithms use explicit transient detection to identify and protect these events, at the cost of additional complexity.

See the math

How many frames capture a transient?

A transient is captured by any frame whose window overlaps it. Since each frame spans N samples (the window size), and frames are spaced Hₐ samples apart in the input, the number of frames that capture a given transient is approximately:

Frames capturing transient ≈ N / Hₐ

With the tolerance search, frames can shift by ±Δmax, potentially extending this:

Maximum frames ≈ (N + 2·Δmax) / Hₐ

Example: 2× stretch

With window N = 2048, input spacing Hₐ = 512 (for 2× stretch), and tolerance Δmax = 512:

  • Without tolerance: 2048 / 512 = 4 frames
  • With tolerance: (2048 + 1024) / 512 = 6 frames

Each of these frames places the transient at a different output position, spaced ~23ms apart (1024 samples at 44.1kHz). The result: 4-6 distinct "echoes" of the same drum hit.

When gaps appear (compression)

Gaps occur when input spacing exceeds window size:

Gap size = Hₐ − N    (when Hₐ > N)

At 4× speed (α = 0.25), with Hₛ = 1024: Hₐ = 1024 / 0.25 = 4096 samples. Gap = 4096 − 2048 = 2048 samples (46ms of audio skipped between each frame).
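
These back-of-the-envelope numbers are easy to reproduce in a few lines (an illustrative sketch; the constants match the examples above):

```python
N, Hs, d_max = 2048, 1024, 512   # window, synthesis hop, tolerance (samples)
sr = 44100

# 2x stretch: how many frames capture one transient?
Ha = round(Hs / 2.0)                     # 512
print(N // Ha, (N + 2 * d_max) // Ha)    # 4 without tolerance, 6 with

# 4x speed-up: how big is the gap between frames?
Ha = round(Hs / 0.25)                    # 4096
gap = Ha - N                             # input samples no frame ever reads
print(gap, 1000 * gap / sr)              # 2048 samples, ~46 ms skipped per frame
```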

In Practice: Good Enough Is Good Enough

These artifacts are acceptable for most real-world use cases. In a music practice app, the goal isn't audiophile-grade fidelity; it's to help you learn. Slowing down a fast passage to 70% or 80% speed lets you hear the notes clearly, work out fingerings, and build muscle memory. The slight artifacts at these moderate stretch factors are barely noticeable, a small price to pay for the practice benefit.

It's only at extreme ratios (slowing to 25% or speeding up to 4×) that artifacts become obvious. And in those cases, you're usually not trying to appreciate the music anyway; you're isolating a tricky passage or skimming through a track. WSOLA remains effective across the range where it matters most.

Wrapping Up

That was a lot of diagrams and waveforms. If you're feeling a bit overwhelmed, that's normal. Internalizing an algorithm like this takes time and review. But hopefully, having walked through each piece in detail, you now see how they fit together.

Here's the journey we took:

  1. The problem: Simply stretching or compressing audio samples changes the pitch. Unacceptable for musicians who want to slow down a song while keeping it in tune.
  2. The insight: Instead of stretching the samples themselves, we chunk the audio into overlapping frames and rearrange them. Spread frames apart to slow down, bring them closer to speed up. Each chunk stays intact, so pitch is preserved.
  3. The first complication: Rearranging frames creates discontinuities where they overlap. We solve this with Hann window crossfading, smoothly blending the overlapping regions so there are no audible clicks.
  4. The second complication: Crossfading fixes amplitude, but not phase. When two waveforms are blended out of phase, they partially cancel each other, creating a hollow sound. This is basic OLA's limitation.
  5. WSOLA's innovation: Instead of reading each frame from a fixed position, allow some flexibility. Search within a tolerance range to find where the waveform best continues from the previous frame. This "Waveform Similarity" search is what makes WSOLA sound natural.
  6. The tradeoff: WSOLA handles sustained sounds well but struggles with transients, which lack the periodic structure the algorithm relies on. At extreme stretch factors, you'll hear artifacts, but for typical practice ranges, the quality is excellent.

The next time you slow down a tricky guitar solo or speed through a long practice track, you'll know exactly what's happening under the hood. And if you hear a slight flutter on a drum hit at 50% speed—well, now you know why.

Try Time-Stretching Yourself

Practice Session uses high-quality time-stretching to help you learn music at any tempo. Slow down fast passages, loop sections, and practice at your own pace.

Download Free Trial

References

  1. Verhelst, W., & Roelands, M. (1993). An overlap-add technique based on waveform similarity (WSOLA) for high quality time-scale modification of speech. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Vol. 2, pp. 554-557.
  2. Driedger, J., & Müller, M. (2016). A Review of Time-Scale Modification of Music Signals. Applied Sciences, 6(2), 57.