Machine Learning for Signal Processing

(ENGR-E 511; CSCI-B 590)

Homework 3

P1: Instantaneous Source Separation [4 points]

1. As you might have noticed from my long hair, I’ve got a rock spirit. However, for this homework I dabbled in composing a piece of jazz music. The song’s title is boring: Homework 3.

2. From x_ica_1.wav to x_ica_20.wav are 20 recordings of my song, Homework 3. Each recording has N time-domain samples. In this music there are K musical sources (K is unknown) played at the same time. In other words, as I wanted to disguise the number of sources, I created unnecessarily many recordings of this simple music. This can be seen as a situation where the sources were mixed up by multiplying a 20 × K mixing matrix A with the K sources, creating the 20-channel mixture:

[x1(t), x2(t), . . . , x20(t)]⊤ = A [s1(t), s2(t), . . . , sK(t)]⊤.  (1)

3. But, as you’ve learned how to do source separation using ICA, you should be able to separate them out into K clean instrument sources.

4. First, you don’t like the fact that there are too many recordings for this separation problem, because you have a feeling that the number of sources is a lot smaller than 20. So, you decided to do a dimensionality reduction first, before you actually go ahead and do ICA. For this, you choose to perform PCA with the whitening option. Apply your PCA algorithm to your data matrix X, a 20 × N matrix. Don’t forget to whiten the data. Make a decision as to how many dimensions to keep, which will correspond to your K. Hint: take a very close look at your eigenvalues, because there is a rather small element that is useful for separation.
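The PCA-with-whitening step can be sketched as follows. This is a minimal NumPy sketch, not the required implementation; the function name and return values are my own choices:

```python
import numpy as np

def pca_whiten(X, K):
    """Whiten the d x N data matrix X and keep the top-K components.

    Returns Z (K x N) with identity covariance, plus all eigenvalues
    so you can inspect them when choosing K.
    """
    Xc = X - X.mean(axis=1, keepdims=True)            # zero-mean each channel
    C = Xc @ Xc.T / Xc.shape[1]                       # d x d sample covariance
    evals, evecs = np.linalg.eigh(C)                  # ascending eigenvalues
    order = np.argsort(evals)[::-1]                   # sort descending
    evals, evecs = evals[order], evecs[:, order]
    V = evecs[:, :K]                                  # top-K eigenvectors
    Z = np.diag(1.0 / np.sqrt(evals[:K])) @ V.T @ Xc  # whitened K x N data
    return Z, evals
```

Plotting the sorted eigenvalues is a convenient way to decide where to cut.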

5. On your whitened/dimension-reduced data matrix Z ∈ R^{K×N}, apply ICA. At every iteration of the ICA algorithm, use these as your update rules:

∆W ← (NI − g(Y)f(Y)⊤) W
W ← W + ρ∆W
Y ← WZ

where

W : the ICA unmixing matrix you’re estimating
Y : the K × N source matrix you’re estimating
Z : whitened/dim-reduced version of your input (using PCA)
g(x) : tanh(x)
f(x) : x^3
ρ : learning rate
N : number of samples
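The iteration above can be sketched as follows. A minimal NumPy sketch; the learning rate and iteration count are illustrative assumptions (note that ∆W scales with N, so ρ must be small):

```python
import numpy as np

def ica(Z, rho=1e-6, n_iter=2000, seed=0):
    """ICA on whitened K x N data Z using the update rules above:
    g(x) = tanh(x), f(x) = x**3."""
    K, N = Z.shape
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((K, K)) * 0.1
    for _ in range(n_iter):
        Y = W @ Z
        dW = (N * np.eye(K) - np.tanh(Y) @ (Y ** 3).T) @ W
        W = W + rho * dW
    return W @ Z, W
```

Tracking the norm of ∆W per iteration gives you the convergence graph the problem asks for.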

6. Enjoy your music. Submit the K separated sources (by embedding an audio player in your .HTML version), your source code, and the convergence graph.

7. Implementation notes: Depending on the choice of the learning rate, the convergence of the ICA algorithm varies, but I always see convergence within 5 to 90 seconds on my iMac.

P2: Single-channel Source Separation [4 points]

1. You can use a variety of error functions for the NMF algorithm. An interesting one comes from viewing the matrices as probability distributions, though unnormalized ones. I actually prefer this one to the one I introduced in L6 S35. From this, you can come up with this error function:

E(X‖WH) = Σ_{i,j} ( X_{i,j} log (X_{i,j} / X̂_{i,j}) − X_{i,j} + X̂_{i,j} ),  (2)

where X̂ = WH. I wouldn’t bother you with a potential question that might ask you to derive the update rules (because we will see them later this semester). Instead, I’ll just give them away here:

W ← W ⊙ ( (X / (WH)) H⊤ ) / ( 1_{F×T} H⊤ ),  (3)
H ← H ⊙ ( W⊤ (X / (WH)) ) / ( W⊤ 1_{F×T} ).  (4)

The divisions are element-wise, and 1_{F×T} is a matrix full of ones, whose size is F × T, where F and T are the numbers of rows and columns of your input matrix X, respectively. Please note that, once again, we have multiplicative update rules. Therefore, once you initialize your parameter matrices W and H with nonnegative random values, they keep their nonnegativity as long as your input matrix X is nonnegative as well.
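The update rules (3) and (4) can be sketched as follows. A minimal NumPy sketch under my own naming choices; the small EPS in every denominator follows the divide-by-zero advice given later in this problem:

```python
import numpy as np

EPS = 1e-20  # guard against division by zero in the element-wise divisions

def nmf_kl(X, R, n_iter=200, seed=0):
    """NMF minimizing the error (2) with multiplicative rules (3)-(4).
    X: nonnegative F x T matrix; R: number of basis vectors."""
    F, T = X.shape
    rng = np.random.default_rng(seed)
    W = rng.uniform(size=(F, R))          # nonnegative random init
    H = rng.uniform(size=(R, T))
    ones = np.ones((F, T))                # the 1_{F x T} matrix
    for _ in range(n_iter):
        W *= ((X / (W @ H + EPS)) @ H.T) / (ones @ H.T + EPS)
        H *= (W.T @ (X / (W @ H + EPS))) / (W.T @ ones + EPS)
    return W, H
```

Because the updates are multiplicative, W and H stay nonnegative for nonnegative X, as the text notes.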

2. trs.wav is a speech signal of a speaker. Load it and convert it into the time-frequency domain by using STFT with a frame size of 1024 samples and 50% overlap. Use Hann windows. This will give a 513 × 990 complex-valued matrix (you could get slightly more or fewer columns than mine depending on your STFT setup).
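A naive STFT with these settings can be sketched as follows (a minimal NumPy sketch; libraries such as SciPy also provide an STFT, whose framing conventions may give a slightly different number of columns):

```python
import numpy as np

def stft(x, frame=1024, hop=512):
    """Naive STFT: Hann-windowed frames, 50% overlap, one-sided FFT.
    Returns a (frame//2 + 1) x n_frames complex matrix (513 rows here)."""
    win = np.hanning(frame)
    n_frames = 1 + (len(x) - frame) // hop
    frames = np.stack([x[t*hop : t*hop + frame] * win for t in range(n_frames)])
    return np.fft.rfft(frames, axis=1).T
```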

3. Take the magnitudes of this matrix. Let’s call these magnitudes S, which is nothing but a matrix of nonnegative elements. Learn an NMF model out of this, such that S ≈ W_S H_S, using the update rules (3) and (4). You know, W_S is a set of basis vectors. If you choose to learn this NMF model with 30 basis vectors, then W_S ∈ R_+^{513×30}, where R_+ is the set of nonnegative real numbers. You’re going to use W_S for your separation.

4. Repeat this process for your other training signal, trn.wav, which is a training signal for your noise. From this, get W_N.

5. x_nmf.wav is a noisy speech signal made of the same speaker’s different speech and the same type of noise you saw. By using a third NMF model, we’re going to denoise this one. Load this signal and convert it into a spectrogram X ∈ C^{513×131}. Let’s call its magnitude spectrogram Y = |X| ∈ R_+^{513×131}. Your third NMF will learn this approximation:

Y ≈ [W_S W_N] H.  (5)

What this means is that for this third NMF model, instead of learning new basis vectors, you reuse the ones you trained from the previous two models as your basis vectors for testing: W = [W_S W_N]. As you are very sure that the basis vectors for your test signal should be the same as the ones you trained from each of the sources, you initialize your W matrix with the trained ones and don’t even update it during this third NMF. Instead, you learn a whole new H ∈ R_+^{60×131} that tells you the activation of the basis vectors for every given time frame. Implementation is simple: skip the update for W; update H by using W = [W_S W_N] and Y; repeat. Note: you’re doing a lot of element-wise division, so be careful not to divide by zero. To prevent this, I tend to add a very, very small value (e.g., 10^{−20}) to every denominator element in the update rule.
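The H-only variant can be sketched as follows (a minimal NumPy sketch with my own names; only rule (4) is applied, and W stays fixed at [W_S W_N]):

```python
import numpy as np

EPS = 1e-20  # the "very, very small value" added to every denominator

def nmf_fit_H(Y, W, n_iter=200, seed=0):
    """Third NMF for testing: keep W = [W_S W_N] fixed, update only H
    with rule (4). Y: nonnegative F x T magnitude spectrogram."""
    R, T = W.shape[1], Y.shape[1]
    rng = np.random.default_rng(seed)
    H = rng.uniform(size=(R, T))          # only H is learned
    ones = np.ones(Y.shape)
    for _ in range(n_iter):
        H *= (W.T @ (Y / (W @ H + EPS))) / (W.T @ ones + EPS)
    return H
```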

6. Because W_S H_{(1:30,:)} can be seen as the speech source estimate, you can also create a masking matrix out of it:

M̄ = W_S H_{(1:30,:)} / ( W_S H_{(1:30,:)} + W_N H_{(31:60,:)} ) = W_S H_{(1:30,:)} / ( [W_S W_N] H ),  (6)

where the division is element-wise.

Use this masking matrix to recover your speech source, i.e. Ŝ = M̄ ⊙ X. Note that you are multiplying a nonnegative masking matrix M̄ with a complex matrix X, so the result is also complex-valued. Invert it back to the time domain. Listen to it to check whether the noise is suppressed. Submit the audio result and source code.
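The masking step in (6) and the inversion back to the time domain can be sketched as follows. This sketch assumes the analysis STFT applied a Hann window once per frame with 50% overlap, so plain overlap-add approximately reconstructs the signal (function names are illustrative):

```python
import numpy as np

def istft(X, frame=1024, hop=512):
    """Inverse STFT by overlap-add. Assumes the analysis applied a Hann
    window once per frame; with 50% overlap those windows sum to
    (approximately) one, so no synthesis window is needed."""
    n_frames = X.shape[1]
    x = np.zeros(hop * (n_frames - 1) + frame)
    for t in range(n_frames):
        x[t*hop : t*hop + frame] += np.fft.irfft(X[:, t], n=frame)
    return x

# Speech recovery with the mask from (6), all element-wise:
#   M = (W_S @ H[:30]) / (W @ H + 1e-20)
#   s_hat = istft(M * X_test)   # X_test is the complex spectrogram
```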


P3: Motor Imagery [4 points]

1. eeg.mat has the training and testing samples and their labels. Use them to replicate my classification experiments in Module 5 (not the entire lecture, but from S3 to S8 and S37). But, instead of naïve Bayes classification, do a kNN classification. Report your classification accuracies for various choices of the number of PCs and the number of neighbors (I think a table of accuracies would be great). You don’t have to submit all the intermediate plots. What I need are the accuracy values and your source code.
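The kNN part can be sketched as follows (a minimal NumPy sketch assuming Euclidean distance on the PCA-projected features and majority voting; the function name and interface are my own choices):

```python
import numpy as np

def knn_classify(train_X, train_y, test_X, k):
    """Plain kNN: Euclidean distance, majority vote among the k nearest
    training samples. train_X: n_train x d; test_X: n_test x d;
    train_y: integer labels."""
    d2 = ((test_X[:, None, :] - train_X[None, :, :]) ** 2).sum(axis=2)
    nn = np.argsort(d2, axis=1)[:, :k]          # indices of k nearest
    votes = train_y[nn]                         # their labels
    return np.array([np.bincount(v).argmax() for v in votes])
```

Looping this over (number of PCs, k) pairs produces the accuracy table the problem asks for.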

P4: Neural Network for Source Separation [4 points]

1. When you were attending IUB, you took a course taught by Prof. K. Since you really liked his lectures, you decided to record them without the professor’s permission. You felt awkward, but you did it anyway because you really wanted to review his lectures later.

2. Although you meant to review the lecture every time, it turned out that you never listened to it. After you graduated, you realized that a lot of the concepts you were facing at work were actually covered by Prof. K’s class. So, you decided to revisit the lectures and study the materials once again using the recordings.

3. You should have reviewed your recordings earlier. It turned out that there was a fellow student who used to sit next to you and always ate chips in the middle of the class, right beside your microphone. So, Prof. K’s beautiful deep voice was contaminated by the annoying chip-eating noise. So, you decided to build a simple NN-based speech denoiser that takes a noisy speech spectrum (speech plus chip-eating noise) and produces a cleaned-up speech spectrum.

4. NN_trs.wav and NN_trn.wav are the speech and noise signals you are going to use for training the network. Load them; let’s call the variables s and n. Add them up, and let’s call this noisy signal x. Each of them must be a 403,255-dimensional column vector.

5. Transform the three vectors using STFT (frame size 1024, hop size 512, Hann windowing). Then, you end up with three complex-valued matrices, S, N, and X, each of which has about 800 spectra, and each spectrum has 513 Fourier coefficients. |X| is your input matrix, and each of its column vectors is one input sample.

6. Define an Ideal Binary Mask (IBM) M by comparing S and N:

M_{f,t} = 1 if |S_{f,t}| > |N_{f,t}|, and 0 otherwise,

whose column vectors are the target samples.
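The IBM definition above is a one-liner in NumPy (the function name is my own):

```python
import numpy as np

def ideal_binary_mask(S, N):
    """IBM: 1 where the speech magnitude dominates the noise magnitude,
    0 otherwise. S, N: complex spectrogram matrices of the same shape."""
    return (np.abs(S) > np.abs(N)).astype(float)
```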

7. Train a neural network with whatever structure you’d like. A baseline could be a shallow neural network with a single hidden layer of 50 hidden units. For the hidden layer, you can use tanh (or whatever activation function you prefer, e.g. rectified linear units). But, for the output layer, you have to apply a logistic function to each of your 513 output units rather than any other activation function, because you want your network output to range between 0 and 1 (remember, you’re predicting a binary mask!). Your baseline shallow tanh network should work to some degree, and once the performance is above our criterion, you’ll get a full score.
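Since autodiff packages are off limits (see the note at the end of this problem), here is a minimal hand-written backprop sketch for the baseline tanh/logistic network. It is a sketch under my own assumptions, not the required solution: it uses a squared-error loss between the logistic outputs and the mask (constant factors are folded into the learning rate), full-batch gradient descent, and illustrative hyperparameters:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def train_mask_net(X, M, n_hidden=50, lr=1e-3, n_epochs=1000, seed=0):
    """Shallow tanh network trained with hand-written backprop to predict
    a binary mask. X: d_in x T input magnitudes; M: d_out x T target mask."""
    d_in, T = X.shape
    d_out = M.shape[0]
    rng = np.random.default_rng(seed)
    W1 = rng.standard_normal((n_hidden, d_in)) * 0.01
    b1 = np.zeros((n_hidden, 1))
    W2 = rng.standard_normal((d_out, n_hidden)) * 0.01
    b2 = np.zeros((d_out, 1))
    for _ in range(n_epochs):
        # forward pass
        H = np.tanh(W1 @ X + b1)            # hidden activations
        Y = sigmoid(W2 @ H + b2)            # predicted mask in (0, 1)
        # backward pass (squared error, averaged over the T samples)
        dA2 = (Y - M) * Y * (1 - Y) / T     # gradient through the logistic
        dW2 = dA2 @ H.T
        db2 = dA2.sum(axis=1, keepdims=True)
        dA1 = (W2.T @ dA2) * (1 - H ** 2)   # gradient through tanh
        dW1 = dA1 @ X.T
        db1 = dA1.sum(axis=1, keepdims=True)
        W1 -= lr * dW1; b1 -= lr * db1      # gradient descent step
        W2 -= lr * dW2; b2 -= lr * db2
    return lambda Xt: sigmoid(W2 @ np.tanh(W1 @ Xt + b1) + b2)
```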


8. NN_tex.wav and NN_tes.wav are the noisy test signal and its corresponding ground-truth clean speech. Load them and apply STFT as before. Feed the magnitude spectra of the test mixture |X_test| to your network and predict its mask M_test (with values between 0 and 1). Then, you can recover the (complex-valued) speech spectrogram of the test signal as X_test ⊙ M_test.

9. Recover the time-domain speech signal by applying an inverse STFT to X_test ⊙ M_test. Let’s call this cleaned-up test speech signal ŝ. From NN_tes.wav, you can load the ground-truth clean test speech signal s. Report their Signal-to-Noise Ratio (SNR):

SNR = 10 log10 ( s⊤s / ( (s − ŝ)⊤(s − ŝ) ) ).  (7)
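Equation (7) in code (a minimal sketch; in practice you may need to trim or pad ŝ to the length of s before calling it, since STFT/inverse-STFT round-tripping can change the length slightly):

```python
import numpy as np

def snr_db(s, s_hat):
    """SNR in dB per equation (7). s, s_hat: 1-D arrays of equal length."""
    e = s - s_hat
    return 10.0 * np.log10((s @ s) / (e @ e))
```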

10. Note: My shallow network implementation converges within 5,000 epochs, which never takes more than 5 minutes on my laptop CPU. Don’t bother learning GPU computing for this problem. Your network should give you at least 6 dB SNR.

11. Note: DO NOT use TensorFlow, PyTorch, or any other package that computes gradients for you. You need to come up with your own backpropagation algorithm.

