You have probably heard of the song [Equation] by Aphex Twin. It is known for having a scary face hidden in its spectrum.

So I asked myself, how can I do this by myself? It is possible using a bit of matlab and basic knowledge in signal processing.

I choose a short mp3 track to modify: Cartoon (Interlude) by French House Artist SabastiAn.

This is the image we’ll try to hide:

Skull
Skull

Here is the matlab code and an explanation for it.

clear;

fname = '12 Cartoon (Interlude).mp3';
imgname = 'img1.png';
outname = 'skull.wav';

% duration/pos. of projected image (seconds)
duration = 15;
offset = 2;

window_fact = 0;
%apply_func = @(S, Img) S .* (-(Img > 0)*5 + 1);
apply_func = @(S, Img) S ./ (1 + Img * 2);

%Hz
Fmin = 7000;
Fmax = 12000;

% vertical resolution of projected image
% bigger value makes stripe-pattern, sound
% gets chirping
hres = 200;

% read audio
[y,Fs] = audioread(fname);
track_len = size(y,1);

% chunk size
chunk_s = floor(hres / (Fmax-Fmin) * Fs/2) * 2;

% channels
y1 = y(:,1);
y2 = y(:,2);

pad_len = chunk_s - mod(track_len, chunk_s);
track_len = track_len + pad_len;
y1 = [y1; zeros(pad_len, 1)];
y2 = [y2; zeros(pad_len, 1)];

y1 = reshape(y1, chunk_s, track_len/chunk_s);
y2 = reshape(y2, chunk_s, track_len/chunk_s);

% we have real data, so only take first half of fft
get_half = @(Y) flipud(Y(1:chunk_s/2+1,:));
get_whole = @(Y) [flipud(Y); conj(Y(2:end-1,:))];
Y1 = get_half(fft(y1));
Y2 = get_half(fft(y2));

%%%% find top frequency for later filtering
freq = fliplr(log(sum(abs(Y1) + abs(Y2),2))');
[m, index] = max(smooth(freq < min(freq) * 1.1, chunk_s/10) > 0.5);
filter_freq = floor(index / size(Y1,1) * Fs / 2);

%%%% load the image, put in right shape %%%%

I = imread(imgname);
if ndims(I) == 3
    I = rgb2gray(I);
end
I = imcomplement(I);
% I = imcomplement(I);
I = imresize(I, [hres duration*Fs/chunk_s]);
I = double(I);

% append zeros left and right

I = [zeros(hres, floor(offset*Fs/chunk_s)) I];
I = [I zeros(hres, size(Y1,2)-size(I,2))];

% append zeros top and bottom
I = [zeros(floor(chunk_s/2 - (chunk_s*Fmax/Fs)), size(Y1,2)); I];
I = [I; zeros((size(Y1,1) - size(I,1)), size(Y1,2))];

%%%% append window to filter %%%%
I = ifft(get_whole(I));
I = reshape(I, 1, track_len)';

window = hanning(size(I,1));
I = I .* (1 - window_fact + window_fact * window);

I = reshape(I', chunk_s, track_len/chunk_s);
I = get_half(fft(I));

%%%% append image %%%%
% plot before
colormap(hsv);
ax1 = subplot(2,1,1);
image(sqrt(abs(Y1)) * 15);
ax1.XTick = linspace(0, size(Y1,2), track_len/Fs);
ax1.XTickLabel = 0:track_len/Fs;
max_freq = floor(Fs / 2 / 1000) * 1000;
ax1.YTickLabel = linspace(max_freq, 0, max_freq/2000 + 1);
ax1.YTick = fliplr(linspace(max_freq, 0, max_freq/2000 + 1)/(Fs/2)*chunk_s/2);

Y1 = apply_func(Y1, I);
Y2 = apply_func(Y2, I);

% plot after
ax2 = subplot(2,1,2);
image(sqrt(abs(Y1)) * 15);
ax2.XTick = ax1.XTick;
ax2.XTickLabel = ax1.XTickLabel;
ax2.YTick = ax1.YTick;
ax2.YTickLabel = ax1.YTickLabel;

%%%% reversion %%%%
y1 = ifft(get_whole(Y1));
y2 = ifft(get_whole(Y2));

y1 = reshape(y1, 1, track_len)';
y2 = reshape(y2, 1, track_len)';

%%%% low-pass filter %%%%
df = designfilt('lowpassfir','FilterOrder',200,'CutoffFrequency',filter_freq/(Fs/2));
D = mean(grpdelay(df));

y1 = filter(df,[y1; zeros(D,1)]);
y2 = filter(df,[y2; zeros(D,1)]);
y1 = y1(D+1:end);
y2 = y2(D+1:end);

audiowrite(outname, [y1, y2], Fs);

First, there are many parameters to set:

  • names of the files to read in and write out
  • how long the part of the song with modified spectrum should be
  • where in the track to place the image
  • between which frequencies to place the image
  • spectrum resolution to work with

Basically, we then read in the song and apply a FFT on it. We than throw away the first half of the FFT because our input data was real, and thus it is just a reflection of the second half. Then we find the maximum frequency occurring in the track now, so we can use a lowpass filter later to get rid of artifacts.

The comes the interesting step: we load the image, and add it to the FFT after scaling. After that, we just reverse all the transformations we did before, and apply a lowpass filter.

This is the resulting image (I used Spek to view the spectrum):

Skull in spectrum
Skull in spectrum

If you listen closely in the modified track, you can hear the chirping and missing of some frequencies in the modified version. This is why I choose a french house track: There are many high frequencies so it won’t be that bad when some go missing.