How AI Stem Separation Works

Dan Murtagh · Mixing Engineer & Audio Educator

A finished song is one tangled wave of sound. AI can pull the vocal, drums and bass back out of it — here’s how, in plain English, and why some parts come out cleaner than others.

The problem: one file, many sources

When a song is mixed and mastered, every instrument is summed into a single waveform. There’s no “vocal track” hiding inside the MP3 — it’s all blended together. Separating it back out is like being handed a smoothie and asked to return the strawberries. For decades that was basically impossible. AI changed it.
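To make the smoothie problem concrete, here is a minimal sketch of what "summed into a single waveform" means. The sample rate, tone frequencies and helper name are assumptions chosen for illustration, not anything a real mix uses. The point is that two completely different sets of sources can add up to the same mix, so the file alone doesn't say which one you started with.

```python
import math

SAMPLE_RATE = 8000  # assumed; kept low so the example stays tiny


def tone(freq_hz, n_samples, amplitude=1.0):
    """Return a sine wave as a plain list of samples (hypothetical helper)."""
    return [
        amplitude * math.sin(2 * math.pi * freq_hz * n / SAMPLE_RATE)
        for n in range(n_samples)
    ]


n = 8
bass = tone(100, n, amplitude=0.8)   # stand-in for a bass part
vocal = tone(440, n, amplitude=0.5)  # stand-in for a vocal part

# Mixing is sample-by-sample addition: one number per moment, no labels.
mix = [b + v for b, v in zip(bass, vocal)]

# A very different pair of "sources" produces exactly the same mix.
other_a = [m / 2 for m in mix]
other_b = [m / 2 for m in mix]
same_mix = [a + b for a, b in zip(other_a, other_b)]

print(all(abs(x - y) < 1e-12 for x, y in zip(mix, same_mix)))
```

Because the sum throws away who contributed what, a separator has to bring in outside knowledge about what voices, drums and basses usually sound like. That is the part the model learns.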

Step 1: turn sound into a picture

The model first converts the audio into a spectrogram: a picture with time across the bottom, frequency (low notes to high) up the side, and brightness showing how loud each frequency is at each moment. Suddenly the song is an image, and patterns become visible: a voice, a kick drum and a bassline all leave different visual fingerprints.
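Here is a rough sketch of how that picture can be built, using a short-time Fourier transform written with nothing but the standard library. The frame size, hop size, sample rate and test tones are all assumptions for illustration; production tools use optimised FFT libraries and much longer frames.

```python
import cmath
import math

SAMPLE_RATE = 8000  # assumed
FRAME = 64          # samples per slice; each bin is SAMPLE_RATE / FRAME = 125 Hz wide
HOP = 32            # step between slices (50% overlap)


def hann(n_points):
    """Periodic Hann window, which fades each slice in and out."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / n_points) for i in range(n_points)]


def dft_magnitudes(frame):
    """Loudness of each frequency bin from 0 Hz up to half the sample rate."""
    n = len(frame)
    mags = []
    for k in range(n // 2 + 1):
        total = sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        mags.append(abs(total))
    return mags


def spectrogram(signal):
    """Return rows[time][frequency_bin] of magnitudes."""
    window = hann(FRAME)
    rows = []
    for start in range(0, len(signal) - FRAME + 1, HOP):
        frame = [signal[start + i] * window[i] for i in range(FRAME)]
        rows.append(dft_magnitudes(frame))
    return rows


# Two tones: a loud one at 250 Hz (bin 2) and a quieter one at 1000 Hz (bin 8).
signal = [
    math.sin(2 * math.pi * 250 * n / SAMPLE_RATE)
    + 0.5 * math.sin(2 * math.pi * 1000 * n / SAMPLE_RATE)
    for n in range(256)
]

spec = spectrogram(signal)
first = spec[0]
loudest_bin = max(range(len(first)), key=first.__getitem__)
print(len(spec), "time slices,", len(first), "frequency bins, loudest bin:", loudest_bin)
```

Each row is one column of the picture, and the two tones show up as two separate bright spots. Real instruments leave far messier shapes, but the idea is the same.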

Step 2: predict a mask for each part

Trained on huge libraries of music where the separate parts are known, the model learns what each source looks like in a spectrogram. For your song it predicts a mask for each stem — essentially a filter that says “keep these pixels, drop those.” Apply the vocal mask and you keep the voice; apply the drum mask and you keep the kit.

That’s the whole trick: not “deleting” instruments, but predicting which parts of the sound belong to each one and rebuilding them as separate files.
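A mask is easier to picture with numbers. The sketch below is not how a trained model works internally: it cheats by building the mask from sources it already knows, which is roughly the kind of target a model can be trained against. The tiny spectrograms are invented values, the function names are hypothetical, and it assumes loudness simply adds up in the mix, which ignores phase.

```python
# Toy magnitude spectrograms, indexed [time][frequency_bin].
# All values are invented for illustration.
vocal = [
    [0.0, 0.2, 3.0, 1.0],
    [0.0, 0.1, 2.0, 4.0],
]
drums = [
    [2.0, 0.2, 1.0, 0.0],
    [0.0, 0.3, 0.0, 0.0],
]


def ratio_mask(target, other, eps=1e-9):
    """Per-pixel share of the mix that belongs to `target`, between 0 and 1."""
    return [
        [t / (t + o + eps) for t, o in zip(t_row, o_row)]
        for t_row, o_row in zip(target, other)
    ]


def apply_mask(mask, mixture):
    """Multiply every pixel of the mixture by its mask value."""
    return [
        [m * x for m, x in zip(m_row, x_row)]
        for m_row, x_row in zip(mask, mixture)
    ]


# Simplifying assumption: magnitudes add when the sources are mixed.
mixture = [[v + d for v, d in zip(v_row, d_row)] for v_row, d_row in zip(vocal, drums)]

vocal_mask = ratio_mask(vocal, drums)
drum_mask = ratio_mask(drums, vocal)
vocal_estimate = apply_mask(vocal_mask, mixture)

for row in vocal_mask:
    print([round(value, 2) for value in row])
```

Values near 1 mean "this pixel is almost all voice", values near 0 mean "almost none", and 0.5 is the model hedging on a pixel both sources share. A real separator has to predict those numbers from the mixture alone.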

Step 3: turn the picture back into sound

Each masked spectrogram is converted back to audio. Out come your stems — vocals, drums, bass, and on a 6-stem split, guitar and piano too. The whole thing runs in seconds on a GPU. StemConsole uses a state-of-the-art AI separation engine for this, then opens the results in a live mixer so you can hear them immediately.
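The round trip can be sketched on a single slice of audio: transform, mask, transform back. Everything here is an assumption for illustration: the sample rate, the two test tones, and especially the hand-written mask, which a real separator would predict rather than hard-code. Real tools also work on many overlapping slices and stitch them together, which this sketch skips.

```python
import cmath
import math

N = 64              # samples in the slice
SAMPLE_RATE = 8000  # assumed; each bin is 125 Hz wide


def dft(samples):
    """Sound to complex spectrum (keeps both loudness and phase)."""
    n = len(samples)
    return [
        sum(samples[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def idft(spectrum):
    """Complex spectrum back to sound."""
    n = len(spectrum)
    return [
        sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)).real / n
        for t in range(n)
    ]


bass = [math.sin(2 * math.pi * 250 * t / SAMPLE_RATE) for t in range(N)]
lead = [0.5 * math.sin(2 * math.pi * 1000 * t / SAMPLE_RATE) for t in range(N)]
mix = [b + l for b, l in zip(bass, lead)]

spectrum = dft(mix)

# Hand-written mask: keep everything below 500 Hz (bin 4), drop the rest.
# min(k, N - k) treats the mirrored upper half of the spectrum the same way.
CUTOFF_BIN = 4
bass_mask = [1.0 if min(k, N - k) < CUTOFF_BIN else 0.0 for k in range(N)]

bass_estimate = idft([m * s for m, s in zip(bass_mask, spectrum)])
lead_estimate = idft([(1.0 - m) * s for m, s in zip(bass_mask, spectrum)])

max_error = max(abs(e - b) for e, b in zip(bass_estimate, bass))
print("largest bass error:", max_error)
```

The error here is essentially zero because the two tones sit in different bins and the mask is exactly right. On a real song neither of those is true, which is where artefacts come from.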

Why some instruments separate cleaner than others

It comes down to overlap. Bass sits in the low end and drums are mostly sharp, short transients, so each leaves a distinctive pattern that the model rarely confuses with anything else; they come out clean. Guitar, piano and vocals all crowd the mid-range and share harmonics, so on dense mixes you’ll hear faint artefacts where the model had to guess. (More on the building blocks in what is a stem in music.)
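The effect of overlap can be shown with a small experiment. The sketch below gives the separator a best-case "keep or drop" mask built from the known sources, then compares two situations: sources in different frequency bins, and sources sharing one bin. The tones, the slice length and the function names are assumptions for illustration; real instruments overlap only partly, so real results fall somewhere in between.

```python
import cmath
import math

N = 64  # samples in one slice


def dft(samples):
    n = len(samples)
    return [
        sum(samples[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def idft(spectrum):
    n = len(spectrum)
    return [
        sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)).real / n
        for t in range(n)
    ]


def tone(bin_index, amplitude, phase=0.0):
    """A sine that lands exactly on one frequency bin (hypothetical helper)."""
    return [amplitude * math.sin(2 * math.pi * bin_index * t / N + phase) for t in range(N)]


def separation_error(source_a, source_b):
    """Split the mix with a best-case binary mask; return RMS error on source_a."""
    spec_a, spec_b = dft(source_a), dft(source_b)
    spec_mix = [a + b for a, b in zip(spec_a, spec_b)]
    # Each bin goes entirely to whichever source is louder there.
    mask_a = [1.0 if abs(a) >= abs(b) else 0.0 for a, b in zip(spec_a, spec_b)]
    estimate = idft([m * s for m, s in zip(mask_a, spec_mix)])
    return math.sqrt(sum((e - a) ** 2 for e, a in zip(estimate, source_a)) / N)


bass = tone(2, 1.0)
distinct = separation_error(bass, tone(8, 0.5))            # other source in its own bin
crowded = separation_error(bass, tone(2, 0.5, phase=1.0))  # other source in the same bin

print("distinct ranges:", distinct)
print("shared range:   ", crowded)
```

With separate bins the error is effectively zero. With a shared bin, the quieter source is swallowed whole by the louder one's stem, and no mask over that bin can undo it. That leftover is what you hear as bleed or artefacts.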

Where it still struggles

Heavy reverb, distortion, live recordings and very dense arrangements are the hard cases — there’s simply less clean information to recover. Any tool that promises flawless results on every track is overselling. The practical answer is to audition before you commit: the result is usually great on modern productions and good enough on most others.

Frequently asked questions

How does AI separate vocals from a song?

The model converts the audio into a spectrogram (a picture of frequency over time), then predicts a 'mask' that keeps the parts belonging to each source — the voice, the drums, the bass — and removes the rest. Applying the mask and converting back to audio gives you each stem.

Why are some instruments harder to isolate?

Instruments that overlap in frequency are harder to separate. Bass and drums sit in fairly distinct ranges, so they come out cleanly; guitar, piano and vocals share a lot of the mid-range, so dense mixes leave more artefacts.

Is AI stem separation as good as the original studio stems?

No — original session stems are perfect by definition. AI reconstructs stems from a finished mix, so there can be faint artefacts. On most modern productions the result is clean enough to perform, practise and remix with.

Dan Murtagh is a mixing engineer and audio educator, and the builder of StemConsole. He has spent years separating, mixing and teaching music — StemConsole is the stem tool he wanted to use himself.
