Machine Learning for Signal Processing
(ENGR-E 511; CSCI-B 590)
Homework 3
P1: Instantaneous Source Separation [4 points]
1. As you might have noticed from my long hair, I've got a rock spirit. However, for this
homework I dabbled in composing a piece of jazz music. The song's title is boring: Homework 3.
2. From x_ica_1.wav to x_ica_20.wav are 20 recordings of my song, Homework 3. Each recording
has N time-domain samples. In this music there are K musical sources, with K unknown,
played at the same time. In other words, as I wanted to disguise the number of sources, I
created unnecessarily many recordings of this simple music. This can be seen as a situation
where the sources were mixed by multiplying a 20 × K mixing matrix A with the K sources,
creating the 20-channel mixture:


X = AS,    (1)

where X ∈ R^{20×N} is the matrix of recordings and S ∈ R^{K×N} holds the K sources.
3. But, as you’ve learned how to do source separation using ICA, you should be able to separate
them out into K clean instrument sources.
4. First, you don’t like the fact that there are too many recordings for this separation problem,
because you have a feeling that the number of sources is a lot smaller than 20. So, you decided
to do a dimension reduction first, before you actually go ahead and do ICA. For this, you
choose to perform PCA with the whitening option. Apply your PCA algorithm on your data
matrix X, a 20 × N matrix. Don’t forget to whiten the data. Make a decision as to how
many dimensions to keep, which will correspond to your K. Hint: take a very close look at
your eigenvalues, because there is a rather small element that is useful for separation.
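The PCA-with-whitening step can be sketched in numpy as follows. This is a minimal sketch: the helper name pca_whiten and the 4-channel toy mixture are my own stand-ins for your 20 × N matrix X.

```python
import numpy as np

def pca_whiten(X, K):
    """Whiten the C x N data matrix X and keep the top K components.

    Returns Z (K x N) with identity covariance, plus all eigenvalues
    so you can inspect them when choosing K.
    """
    Xc = X - X.mean(axis=1, keepdims=True)      # center each channel
    C = Xc @ Xc.T / Xc.shape[1]                 # channel covariance
    eigvals, eigvecs = np.linalg.eigh(C)        # eigh returns ascending order
    order = np.argsort(eigvals)[::-1]           # sort descending instead
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # project onto the top-K eigenvectors and rescale to unit variance
    Z = np.diag(1.0 / np.sqrt(eigvals[:K])) @ eigvecs[:, :K].T @ Xc
    return Z, eigvals

# toy sanity check: 4 channels mixed from 2 sources
rng = np.random.default_rng(0)
S = rng.standard_normal((2, 1000))
X = rng.standard_normal((4, 2)) @ S
Z, eigvals = pca_whiten(X, 2)
```

After whitening, Z @ Z.T / N should be (numerically) the identity, which is what the ICA step assumes.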
5. On your whitened/dimension-reduced data matrix Z ∈ R^{K×N}, apply ICA. At every iteration
of the ICA algorithm, use these as your update rules:
∆W ← (NI − g(Y)f(Y)^⊤)W
W ← W + ρ∆W
Y ← WZ
W : The ICA unmixing matrix you’re estimating
Y : The K × N source matrix you’re estimating
Z : Whitened/dim reduced version of your input (using PCA)
g(x) : tanh(x)
f(x) : x
ρ : learning rate
N : number of samples
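The update rules above translate almost line-for-line into numpy. A sketch, assuming Z is the whitened K × N matrix from the PCA step; the learning rate, iteration count, and the toy Laplacian sources below are placeholders you'd tune and replace.

```python
import numpy as np

def ica(Z, rho=1e-5, n_iter=2000, seed=0):
    K, N = Z.shape
    W = np.random.default_rng(seed).standard_normal((K, K)) * 0.1
    errs = []                                          # for the convergence graph
    for _ in range(n_iter):
        Y = W @ Z                                      # current source estimate
        dW = (N * np.eye(K) - np.tanh(Y) @ Y.T) @ W    # g = tanh, f = identity
        W += rho * dW
        errs.append(np.linalg.norm(dW))
    return W @ Z, W, errs

# toy demo: two Laplacian (super-Gaussian) sources mixed by a rotation
rng = np.random.default_rng(1)
S = rng.laplace(size=(2, 1000))
theta = 0.7
A = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
Y, W, errs = ica(A @ S)
```

Plotting errs over iterations gives you the convergence graph; it should decay toward zero as ∆W vanishes.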
6. Enjoy your music. Submit the K separated sources (by embedding an audio player in your .HTML
version), your source code, and the convergence graph.
7. Implementation notes: Depending on the choice of the learning rate, the convergence of the
ICA algorithm varies, but I always see convergence within 5 to 90 seconds on my iMac.
P2: Single-channel Source Separation [4 points]
1. You can use a variety of error functions for the NMF algorithm. An interesting one treats
the matrices as probability distributions, though unnormalized ones. I actually prefer this one
to the one I introduced in L6 S35. From this, you can come up with this error function:

E(X||WH) = Σ_{i,j} ( X_{i,j} log(X_{i,j} / X̂_{i,j}) − X_{i,j} + X̂_{i,j} ),    (2)

where X̂ = WH. I won't bother you with a potential question that might ask you to
derive the update rules (because we will see it later this semester). Instead, I'll just give them
away here:
W ← W ⊙ ( (X/(WH)) H^⊤ ) / ( 1_{F×T} H^⊤ ),    (3)

H ← H ⊙ ( W^⊤ (X/(WH)) ) / ( W^⊤ 1_{F×T} ),    (4)

where all divisions are element-wise.

Note that 1_{F×T} is a matrix full of ones, whose size is F × T, where F and T are the number
of rows and columns of your input matrix X, respectively. Please note that, once again, we
have multiplicative update rules. Therefore, once you initialize your parameter matrices W
and H with nonnegative random values, they keep their nonnegativity as long as your
input matrix X is nonnegative as well.
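The multiplicative updates (3)-(4) can be sketched as below; this assumes a nonnegative F × T input, and the small eps guarding the element-wise divisions is my own addition.

```python
import numpy as np

def nmf_kl(X, K, n_iter=200, seed=0, eps=1e-20):
    """Multiplicative updates (3)-(4) for the KL-style error (2)."""
    F, T = X.shape
    rng = np.random.default_rng(seed)
    W = rng.random((F, K)) + eps          # nonnegative random init
    H = rng.random((K, T)) + eps
    ones = np.ones((F, T))                # the 1_{F x T} matrix
    for _ in range(n_iter):
        W *= ((X / (W @ H + eps)) @ H.T) / (ones @ H.T + eps)
        H *= (W.T @ (X / (W @ H + eps))) / (W.T @ ones + eps)
    return W, H

# toy run on a random nonnegative matrix
X = np.random.default_rng(2).random((30, 50))
W, H = nmf_kl(X, 5)
```

Because the updates only multiply nonnegative quantities, W and H stay nonnegative throughout, as the text notes.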
2. trs.wav is a speech signal of a speaker. Load it and convert it into the time-frequency domain
using an STFT with a frame size of 1024 samples and 50% overlap. Use Hann windows. This
will give a 513 × 990 complex-valued matrix (you could get slightly more or fewer columns than
mine depending on your STFT setup).
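A hand-rolled STFT is one way to get this matrix (scipy.signal.stft or librosa would do the same job); the function name my_stft and the random stand-in signal are my own.

```python
import numpy as np

def my_stft(x, frame=1024, hop=512):
    """STFT with Hann windows and 50% overlap: returns a complex
    (frame//2 + 1) x n_frames matrix, i.e. 513 rows for frame=1024."""
    win = np.hanning(frame)
    n_frames = (len(x) - frame) // hop + 1
    frames = np.stack([x[i * hop : i * hop + frame] * win
                       for i in range(n_frames)], axis=1)
    return np.fft.rfft(frames, axis=0)

x = np.random.default_rng(0).standard_normal(50000)  # stand-in for trs.wav
X = my_stft(x)
```

The exact number of columns depends on how you handle padding at the edges, which is why your frame count may differ slightly from mine.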
3. Take the magnitudes of this matrix. Let's call these magnitudes S, which is nothing but a
matrix of nonnegative elements. Learn an NMF model out of this, such that S ≈ W_S H_S,
using the update rules (3) and (4). You know, W_S is a set of basis vectors. If you choose
to learn this NMF model with 30 basis vectors, then W_S ∈ R_+^{513×30}, where R_+ is the set of
nonnegative real numbers. You're going to use W_S for your separation.
4. Repeat this process for your other training signal trn.wav. Learn another NMF model from
trn.wav, which is another training signal for your noise. From this get WN .
5. x_nmf.wav is a noisy speech signal made of the same speaker's different speech and the same
type of noise you saw. Using a third NMF model, we're going to denoise this one.
Load this signal and convert it into a spectrogram X ∈ C^{513×131}. Let's call its magnitude
spectrogram Y = |X| ∈ R_+^{513×131}. Your third NMF will learn this approximation:
Y ≈ [W_S W_N]H.    (5)
What this means is that for this third NMF model, instead of learning new basis vectors, you
reuse the ones you trained in the previous two models as your basis vectors for testing:
W = [W_S W_N]. As you are very sure that the basis vectors for your test signal should be
the same as the ones you trained from each of the sources, you initialize your W matrix with
the trained ones and don't even update it during this third NMF. Instead, you learn a whole
new H ∈ R_+^{60×131} that tells you the activation of the basis vectors for every given time frame.
Implementation is simple: skip the update for W, update H by using W = [W_S W_N] and
Y, and repeat. Note: you're doing a lot of element-wise division, so be careful not to divide by
zero. To prevent this, I tend to add a very, very small value (e.g., 10^{-20}) to every denominator
element in the update rule.
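Freezing W reduces the third NMF to a single update loop over H. A sketch, with shapes assuming 30 speech plus 30 noise bases and a 513 × 131 magnitude spectrogram; the name fit_activations and the random stand-in matrices are my own.

```python
import numpy as np

def fit_activations(Y, W, n_iter=200, eps=1e-20):
    """Update only H (rule (4)) with the concatenated bases W frozen."""
    F, T = Y.shape
    H = np.random.default_rng(0).random((W.shape[1], T)) + eps
    ones = np.ones((F, T))
    for _ in range(n_iter):
        H *= (W.T @ (Y / (W @ H + eps))) / (W.T @ ones + eps)
    return H

rng = np.random.default_rng(3)
WS, WN = rng.random((513, 30)), rng.random((513, 30))  # trained bases
Y = rng.random((513, 131))                             # test magnitudes
H = fit_activations(Y, np.hstack([WS, WN]))
```

The eps added to every denominator is the guard against division by zero mentioned in the text.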
6. Because W_S H_(1:30,:) can be seen as the speech source estimate, you can also create a masking
matrix out of it:

M̄ = W_S H_(1:30,:) / ( W_S H_(1:30,:) + W_N H_(31:60,:) ),    (6)

where the division is element-wise.
Use this masking matrix to recover your speech source, i.e., Ŝ = M̄ ⊙ X. Note that you are
element-wise multiplying a nonnegative masking matrix M̄ with a complex matrix X, so the result is also
complex-valued. Invert it back to the time domain. Listen to it to see if the noise is suppressed.
Submit the audio result and source code.
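The mask in (6) and the masking step are a few lines of numpy; this sketch uses random stand-ins for W_S, W_N, H, and X, and a small eps (my addition) to avoid dividing by zero.

```python
import numpy as np

def speech_mask(WS, WN, H, eps=1e-20):
    """Soft mask (6): speech estimate over speech-plus-noise estimate."""
    speech = WS @ H[:30, :]
    noise = WN @ H[30:, :]
    return speech / (speech + noise + eps)

rng = np.random.default_rng(4)
WS, WN = rng.random((513, 30)), rng.random((513, 30))
H = rng.random((60, 131))
M = speech_mask(WS, WN, H)

# apply the nonnegative mask to the complex spectrogram element-wise
X = rng.standard_normal((513, 131)) + 1j * rng.standard_normal((513, 131))
S_hat = M * X
```

S_hat is complex-valued, so an inverse STFT takes it straight back to the time domain.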
P3: Motor Imagery [4 points]
1. eeg.mat has the training and testing samples and their labels. Use them to replicate my
classification experiments in Module 5 (not the entire lecture, but S3 to S8 and S37).
But, instead of naïve Bayes classification, do a kNN classification. Report your classification
accuracies for various choices of the number of PCs and the number of neighbors (I think a
table of accuracies would be great). You don't have to submit all the intermediate plots. What
I need are the accuracy values and your source code.
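A minimal kNN classifier is easy to hand-roll; in this sketch the names Xtr/ytr/Xte are my own placeholders for your PCA-projected training/test features and labels, and the two well-separated toy clusters just sanity-check the vote.

```python
import numpy as np

def knn_predict(Xtr, ytr, Xte, k=5):
    """Classify each row of Xte by majority vote among its k nearest
    training rows (Euclidean distance)."""
    d = ((Xte[:, None, :] - Xtr[None, :, :]) ** 2).sum(axis=2)  # squared distances
    nearest = np.argsort(d, axis=1)[:, :k]                      # k closest per test row
    votes = ytr[nearest]
    return np.array([np.bincount(v).argmax() for v in votes])

# toy sanity check: two tight clusters around 0 and 5
rng = np.random.default_rng(5)
Xtr = np.vstack([rng.normal(0, 0.1, (20, 3)), rng.normal(5, 0.1, (20, 3))])
ytr = np.array([0] * 20 + [1] * 20)
Xte = np.vstack([rng.normal(0, 0.1, (5, 3)), rng.normal(5, 0.1, (5, 3))])
pred = knn_predict(Xtr, ytr, Xte, k=3)
```

Sweeping k and the number of PCs in two nested loops gives you the accuracy table the problem asks for.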
P4: Neural Network for Source Separation [4 points]
1. When you were attending IUB, you took a course taught by Prof. K. Since you really liked his
lectures, you decided to record them without the professor’s permission. You felt awkward,
but you did it anyway because you really wanted to review his lectures later.
2. Although you meant to review the lecture every time, it turned out that you never listened
to it. After you graduated, you realized that a lot of concepts you face at work were actually
covered by Prof. K’s class. So, you decided to revisit the lectures and study the materials
once again using the recordings.
3. You should have reviewed your recordings earlier. It turned out that there was a fellow student
who used to sit next to you and always ate chips in the middle of the class right beside your
microphone. So, Prof. K’s beautiful deep voice was contaminated by the annoying chip-eating
noise. So, you decided to build a simple NN-based speech denoiser that takes a noisy speech
spectrum (speech plus chip-eating noise) and then produces a cleaned-up speech spectrum.
4. NN_trs.wav and NN_trn.wav are the speech and noise signals you are going to use for training
the network. Load them; let's call the variables s and n. Add them up, and call this noisy
signal x. Each of them must be a 403,255-dimensional column vector.
5. Transform the three vectors using the STFT (frame size 1024, hop size 512, Hann windowing).
Then you'll have three complex-valued matrices, S, N, and X, each of which has
about 800 spectra. Each spectrum has 513 Fourier coefficients. |X| is your input
matrix, and each of its column vectors is one input sample.
6. Define an Ideal Binary Mask (IBM) M by comparing S and N:

M_{f,t} = 1 if |S_{f,t}| > |N_{f,t}|, and 0 otherwise,

whose column vectors are the target samples.
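The IBM definition above is one line of numpy; the random complex matrices here are stand-ins for the actual spectrograms S and N.

```python
import numpy as np

# stand-ins for the complex spectrograms of the clean speech and the noise
rng = np.random.default_rng(6)
S = rng.standard_normal((513, 800)) + 1j * rng.standard_normal((513, 800))
N = rng.standard_normal((513, 800)) + 1j * rng.standard_normal((513, 800))

# 1 wherever speech magnitude dominates, 0 otherwise
M = (np.abs(S) > np.abs(N)).astype(float)
```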
7. Train a neural network with whatever structure you’d like. A baseline could be a shallow
neural network with a single hidden layer, which has 50 hidden units. For the hidden layer,
you can use tanh (or whatever activation function you prefer, e.g. rectified linear units).
But, for the output layer, you have to apply a logistic function to each of your 513 output
units rather than any other activation functions, because you want your network output to be
ranged between 0 and 1 (remember, you’re predicting a binary mask!). Your baseline shallow
tanh network should work to some degree, and once the performance is above our criterion,
you’ll get a full score.
8. NN_tex.wav and NN_tes.wav are the noisy test signal and its corresponding ground-truth
clean speech. Load them and apply the STFT as before. Feed the magnitude spectra of the test
mixture |X_test| to your network and predict their masks M_test (ranged between 0 and 1).
Then you can recover the (complex-valued) speech spectrogram of the test signal in this way:
X_test ⊙ M_test.
9. Recover the time-domain speech signal by applying an inverse STFT to X_test ⊙ M_test. Let's
call this cleaned-up test speech signal ŝ. From NN_tes.wav, you can load the ground-truth
clean test speech signal s. Report their Signal-to-Noise Ratio (SNR):

SNR = 10 log_10 ( s^⊤s / (s − ŝ)^⊤(s − ŝ) ).    (7)
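Equation (7) as code, assuming s and the estimate are equal-length time-domain vectors (trim to the shorter length if your inverse STFT gives a few extra samples); the sinusoid-plus-noise example is synthetic.

```python
import numpy as np

def snr_db(s, s_hat):
    """SNR in dB per equation (7): signal energy over error energy."""
    e = s - s_hat
    return 10 * np.log10((s @ s) / (e @ e))

# synthetic example: a clean sinusoid and a lightly corrupted copy
s = np.sin(np.linspace(0, 100, 10000))
noisy = s + 0.01 * np.random.default_rng(7).standard_normal(10000)
```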
10. Note: My shallow network implementation converges in 5,000 epochs, which never takes more
than 5 minutes on my laptop CPU. Don't bother learning GPU computing for this problem.
Your network should give you at least 6 dB SNR.
11. Note: DO NOT use TensorFlow, PyTorch, or any other package that calculates gradients for
you. You need to come up with your own backpropagation algorithm.
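The baseline 513 → 50 (tanh) → 513 (logistic) network with hand-derived gradients can be sketched as below. This is only a sketch: it uses cross-entropy loss, plain full-batch gradient descent, a placeholder learning rate, and random toy matrices standing in for |X| and the IBM M.

```python
import numpy as np

rng = np.random.default_rng(8)
D_in, D_hid, D_out = 513, 50, 513
W1 = rng.standard_normal((D_hid, D_in)) * 0.01   # input -> hidden weights
b1 = np.zeros((D_hid, 1))
W2 = rng.standard_normal((D_out, D_hid)) * 0.01  # hidden -> output weights
b2 = np.zeros((D_out, 1))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# toy stand-ins: columns of |X| as inputs, columns of the IBM as targets
X = rng.random((D_in, 64))
M = (rng.random((D_out, 64)) > 0.5).astype(float)

lr, losses = 0.05, []
for _ in range(300):
    # forward pass
    H = np.tanh(W1 @ X + b1)
    Y = sigmoid(W2 @ H + b2)
    losses.append(-np.mean(M * np.log(Y + 1e-12) + (1 - M) * np.log(1 - Y + 1e-12)))
    # backward pass: cross-entropy + logistic output gives dL/dZ2 = Y - M
    dZ2 = (Y - M) / X.shape[1]
    dW2, db2 = dZ2 @ H.T, dZ2.sum(axis=1, keepdims=True)
    dZ1 = (W2.T @ dZ2) * (1.0 - H ** 2)          # tanh derivative
    dW1, db1 = dZ1 @ X.T, dZ1.sum(axis=1, keepdims=True)
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2
```

Pairing the logistic output with cross-entropy keeps the output-layer gradient as the simple difference Y − M, which is a handy check when debugging your own backprop.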

