Machine Learning for Signal Processing

(ENGR-E 511; CSCI-B 590)

Homework 2

Instructions

• Submission format: Jupyter Notebook + HTML

– Your notebook should be a comprehensive report, not just a code snippet. Mark-ups are mandatory to answer the homework questions. Use LaTeX equations in the markup where you are asked to.

– Google Colab is the best place to start if this is your first time using an IPython notebook. No need to use GPUs.

– Download your notebook as an .html version and submit it as well, so that the AIs can check out the plots and audio. Here is how to convert to html in Google Colab.

– This means you need to embed an audio player in the notebook if you’re asked to submit an audio file.

• Avoid using toolboxes.

P1: White Noise [4 points]

1. Have you ever wondered what “white” noise means? The term actually comes from light. When light is the sum of all visible frequencies, it looks white to human eyes. If you pass that light through a prism, you suddenly see the rainbow colors, the so-called “spectrum.” Yes, the prism does an analog version of the Fourier transform.

2. So, even though we don’t see the sound we listen to, if the signal consists of too many sinusoids with different frequencies, it sounds “white.” I know, I know, it doesn’t make sense.

3. You may also want to note that the sample distribution of a white noise signal looks like a Gaussian distribution, which is not news to us because we all know the central limit theorem.

4. x.wav is a speech signal contaminated by white noise. As I haven’t taught you how to properly do speech enhancement yet, you’re not supposed to know a machine learning-based solution to this problem (don’t worry, I’ll cover it soon). Instead, you did learn how to do the STFT, so I want you to at least manually erase the white noise from this signal to recover the clean speech source. For some reason, we know that the white noise added to the signal doesn’t change its volume over time. So, what we’re going to do is listen to the sound and eyeball the spectrogram to find the frames that contain only white noise. Then, we will build a simple noise model, with which we will suppress the noise in the other speech-plus-noise frames.

(Note: don’t forget to turn off resampling with the option sr=None if you use librosa.load, so the audio keeps its native sampling rate.)


5. First off, create a DFT matrix F using the equation shown in M02-L01-S11 and S12. You’ll of course create an N × N complex matrix, but if you look at its real and imaginary parts separately, you’ll see something like the ones in M02-L01-S14 (the ones in the slide are 20 × 20, i.e. N = 20). For this problem, let’s fix N = 1024.
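A minimal sketch of this step, assuming the slide’s formula is the standard DFT definition F[k, n] = exp(−2πi·kn/N):

```python
import numpy as np

N = 1024  # frame size fixed by the problem

# DFT matrix from the definition: F[k, n] = exp(-2*pi*i*k*n / N)
idx = np.arange(N)
F = np.exp(-2j * np.pi * np.outer(idx, idx) / N)

# Inspect the real and imaginary parts separately, e.g. with
# plt.imshow(np.real(F)) and plt.imshow(np.imag(F)), to reproduce
# the patterns shown on M02-L01-S14
```

Note that F @ conj(F).T equals N·I with this normalization, which is why the inverse transform later divides by N.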

6. Prepare your data matrix X. Extract the first frame of N samples from the input signal and apply a Hann window¹. That means, from the definition of the Hann window, you create a window of size N and element-wise multiply the window with your N audio samples. Place the result as the first column vector of the data matrix X. Move forward by N/2 samples, extract another frame of N samples, and apply the window. This goes into the second column vector of X. Do the same for your third frame (which should start from the (N + 1)-th sample), and so on. Since you moved by only half of the frame size, your frames overlap each other by 50%.

(Note: this time it’s okay to use a toolbox to calculate Hann windows.)
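The framing procedure above can be sketched as follows. The random array x is only a stand-in for the loaded x.wav samples; in practice you would use librosa.load('x.wav', sr=None):

```python
import numpy as np
from scipy.signal import get_window

N = 1024
hop = N // 2  # move by N/2 samples, i.e. 50% overlap

# Stand-in for the loaded x.wav samples
x = np.random.randn(5 * N)

win = get_window('hann', N)  # the note allows a toolbox for the window itself
n_frames = (len(x) - N) // hop + 1

# Each windowed frame becomes one column of the data matrix X
X = np.stack([win * x[t * hop : t * hop + N] for t in range(n_frames)], axis=1)
```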

7. Apply the DFT matrix to your data matrix, i.e. Y = FX. This is your spectrogram with complex values. See what it looks like (by taking magnitudes and plotting). For example, you can use imshow in matplotlib.
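One way this step might look, with toy stand-ins for the F and X built in the previous steps:

```python
import numpy as np
import matplotlib
matplotlib.use('Agg')  # headless backend so the figure can be saved without a display
import matplotlib.pyplot as plt

N = 1024
# Toy stand-ins for the DFT matrix F and windowed data matrix X
idx = np.arange(N)
F = np.exp(-2j * np.pi * np.outer(idx, idx) / N)
X = np.random.randn(N, 9)

Y = F @ X  # complex-valued spectrogram: one column per frame

# Plot the magnitude spectrogram
plt.imshow(np.abs(Y), origin='lower', aspect='auto', cmap='gray_r')
plt.xlabel('frame index')
plt.ylabel('DFT bin')
plt.savefig('spectrogram.png')
```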

8. In this spectrogram, identify frames that contain only noise². For example, the ones at the end of the signal would be a good choice. Take the sample mean of the chosen column vectors (the original magnitudes, not the exponentiated ones), e.g. M = (1 / |C_noise|) Σ_{i ∈ C_noise} |Y_{:,i}|, where C_noise is the set of chosen frames and |C_noise| is the number of frames. This is your noise model.
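The noise model is just the column mean over the chosen frames. A sketch with a toy magnitude spectrogram and an assumed (hypothetical) choice of noise-only frames:

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy magnitude spectrogram standing in for |Y| (1024 bins x 20 frames)
Ymag = rng.random((1024, 20))

# Suppose listening and eyeballing showed the last five frames are noise-only
C_noise = np.arange(15, 20)

# Noise model M: mean of the chosen magnitude columns
M = Ymag[:, C_noise].mean(axis=1)
```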

9. Subtract M from all the magnitude spectra |Y|. This will give you residual magnitudes with suppressed noise. Be careful with negative values: you don’t want them in your “magnitude” spectra. One quick method to remove them is to turn them into zeros. Get the original phase from the input spectrogram, i.e. Y/|Y| (element-wise division), and multiply each phase value by the corresponding cleaned-up magnitude to recover the complex-valued spectra of the estimated clean speech.
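The subtraction, clipping, and phase recombination can be sketched like this (Y and M are toy stand-ins for the real spectrogram and noise model):

```python
import numpy as np

rng = np.random.default_rng(1)
# Toy complex spectrogram Y and noise model M standing in for the real ones
Y = rng.standard_normal((1024, 20)) + 1j * rng.standard_normal((1024, 20))
M = 0.5 * np.ones(1024)

mag = np.abs(Y)
phase = Y / mag  # unit-magnitude complex phase terms (element-wise division)

# Subtract the noise model from every column; clip negatives to zero
clean_mag = np.maximum(mag - M[:, None], 0.0)

# Recombine the cleaned magnitudes with the original phases
Y_clean = clean_mag * phase
```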

10. Multiply by the inverse DFT matrix, which you can also create using the equation in S12. Let’s call it F*. Since it’s the inverse transform, F*F ≈ I (you can check this, although the off-diagonal elements might be very small numbers rather than exactly zero). Multiply this matrix with your noise-suppressed spectrogram to get back the recovered version of your data matrix, X̂. In theory this should give you a real-valued matrix, but you’ll still see some imaginary parts with very small values. Ignore them by just taking the real part. Reverse the procedure in 1.6 to get the time-domain signal. Basically, it must be a procedure that transposes every column vector of X̂ and overlap-and-adds the right half of the t-th frame with the left half of the (t + 1)-th frame, and so on. Listen to the signal to check whether the white noise is suppressed.
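A sketch of the inverse transform and overlap-and-add, assuming the standard DFT normalization, so that F* = conj(F)/N; the spectrogram here is built from known toy frames so the round-trip can be checked:

```python
import numpy as np

N = 1024
hop = N // 2
idx = np.arange(N)
F = np.exp(-2j * np.pi * np.outer(idx, idx) / N)
F_inv = np.conj(F) / N  # inverse DFT matrix: F_inv @ F is (approximately) I

# Toy cleaned spectrogram built from known frames, standing in for the real one
frames = np.random.randn(N, 9)
Y_clean = F @ frames

# Back to time-domain frames; drop the tiny imaginary residue
X_hat = np.real(F_inv @ Y_clean)

# Overlap-and-add the 50%-overlapping columns into one signal
out = np.zeros(hop * (X_hat.shape[1] - 1) + N)
for t in range(X_hat.shape[1]):
    out[t * hop : t * hop + N] += X_hat[:, t]
```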

¹ https://en.wikipedia.org/wiki/Hann_function
https://docs.scipy.org/doc/scipy/reference/generated/scipy.signal.get_window.html

² Depending on the plotting function you use, it’s possible that you can’t really “see” the white noise, because your white noise is not loud enough. To better visualize the spectrogram, you can exaggerate the small magnitudes while suppressing the large ones. For example, you can visualize |Y|^0.5 instead of |Y|, where the exponentiation is element-wise. Don’t worry about this visualization issue if you can already see the white-noise-only frames in your spectrogram.


11. Submit your code and the denoised audio file. Do NOT use any STFT functions you can find in toolboxes.

P2: DCT and PCA [4 points]

1. s.wav is a recording of Prof. K’s voice. Load it. Randomly select 8 consecutive samples out of the 5,000,000 samples. This is the first column vector of your data matrix X. Repeat this procedure 10 times. Then, the size of X is 8 × 10.

2. Calculate the covariance matrix from this, whose size must be 8 × 8. Do an eigendecomposition and extract the 8 eigenvectors, each of which has 8 dimensions. Yes, you just did PCA. Plot your W⊤ matrix and compare it to the DCT matrix shown in M02-L01-S21. Similar? Submit your plot and code.
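The PCA steps above can be sketched as follows; the random signal s is only a small stand-in for the loaded s.wav samples:

```python
import numpy as np

rng = np.random.default_rng(0)
s = rng.standard_normal(100_000)  # small stand-in for the s.wav samples

# Data matrix: each column is 8 consecutive samples from a random start index
n_cols = 10
starts = rng.integers(0, len(s) - 8, size=n_cols)
X = np.stack([s[i : i + 8] for i in starts], axis=1)  # 8 x 10

# Covariance matrix (8 x 8) and its eigendecomposition
C = np.cov(X)
eigvals, W = np.linalg.eigh(C)

# Sort eigenvectors by descending eigenvalue; the rows of W.T are the PCA basis
order = np.argsort(eigvals)[::-1]
W = W[:, order]
# Plot W.T (e.g. plt.imshow(W.T)) and compare it with the DCT matrix
```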

3. Create another data matrix with 100 samples, i.e. X ∈ ℝ^{8×100}. Do PCA on this one. How about 1,000 samples? Can you see that your PCA is getting better with larger datasets? Why do you think your PCA is getting better? Try to explain in comparison with the DCT matrix.

4. You just saw that PCA might be able to replace the pre-fixed DCT basis vectors. But, as you can see in your matrices, they are not the same. Discuss the pros and cons of PCA and DCT in your report.

P3: Stereo matching [3 points]

1. If you have multiple cameras capturing the same scene from different positions, you can recover the depth of the objects. That’s how we humans can perceive the distance of a visual object (we have two eyes). See Figure 1 for an example. Our brains presumably work hard to estimate the depths of the objects in the visual scene. In this problem, we mimic that process (without exactly knowing how the brain works).

2. im0.ppm (left) and im8.ppm (right) are pictures taken from two different camera positions³. If you load the images, each will be a three-dimensional array of 381 × 430 × 3, whose third dimension holds the three color channels (RGB). Let’s call them X^L and X^R. For the (i, j)-th pixel in the right image, X^R_{(i,j,:)}, which is a 3-d vector of RGB intensities, we can scan the i-th row of the left image and find the most similar pixel (using a metric of your choice). For example, I searched from X^L_{(i,j,:)} to X^L_{(i,j+39,:)} to see which of those 40 pixels is the closest, and recorded the index-distance of the closest pixel. Let’s say that X^L_{(i,j+19,:)} is the most similar one to X^R_{(i,j,:)}. Then, the index-distance is 19. I record this index-distance (to the closest pixel in the left image) for all pixels in my right image to create a matrix called the “disparity map,” D, whose (i, j)-th element gives the index-distance between the (i, j)-th pixel of the right image and its closest pixel in the left image. For an object in the right image (e.g. the tree), if its pixels are associated with an object in the left image but are shifted far away, that means the object is close to the cameras, and vice versa.

3. Calculate the disparity map D from im0.ppm and im8.ppm, which will be a matrix of 381 × 390 (since we search within only 40 pixels). Vectorize the disparity matrix and draw a histogram. How many clusters do you see?
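The row-wise search can be sketched like this, using small random arrays as stand-ins for the two images and squared RGB distance as the (freely choosable) metric:

```python
import numpy as np

rng = np.random.default_rng(0)
# Small toy images standing in for im0.ppm (left) and im8.ppm (right);
# the real ones are 381 x 430 x 3
H, W, max_disp = 40, 60, 40
XL = rng.random((H, W, 3))
XR = rng.random((H, W, 3))

# For every right-image pixel, scan up to max_disp pixels to the right in the
# same row of the left image and record the index-distance of the closest match
D = np.zeros((H, W - max_disp), dtype=int)
for i in range(H):
    for j in range(W - max_disp):
        candidates = XL[i, j : j + max_disp, :]                 # (max_disp, 3)
        dist = np.sum((candidates - XR[i, j, :]) ** 2, axis=1)  # squared RGB distance
        D[i, j] = int(np.argmin(dist))

# Vectorize D (e.g. D.ravel()) and draw a histogram to count the clusters
```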

³ http://vision.middlebury.edu/stereo/data/


[Figure 1 panels: Left image, Right image]

Figure 1: The tree is closer than the mountain. So, from the left camera, the tree is located on the right-hand side, while the right camera captures it on the left-hand side. On the contrary, the mountain in the back does not have this disparity.

4. Submit your histogram and answer in the report. Submit the code that created the disparity map, too.

P4: GMM and k-means clustering for stereo matching [5 points]

1. Write up your own k-means clustering code and cluster the disparity values in D. Each value will belong to (only) one of the clusters. The number of clusters corresponds to the number of depth levels. For example, in Figure 1 there are only two depths, so k = 2. If you replace the disparity values with the cluster means, you can recover a depth map with k levels. Plot your depth map (the disparity map with each value replaced by its cluster’s mean disparity, as in the image quantization examples) in gray scale: pixels of the frontal objects should be bright, while the ones in the back get darker. Submit your plot along with your k-means clustering code.
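A minimal 1-D k-means sketch for the disparity values; the deterministic initialization over the value range is an assumption of this sketch, not a requirement of the assignment:

```python
import numpy as np

def kmeans_1d(values, k, n_iter=50):
    """Plain k-means on scalar values; returns (means, labels)."""
    # Deterministic init: spread the initial means over the value range
    means = np.linspace(values.min(), values.max(), k)
    for _ in range(n_iter):
        # Assignment step: each value goes to its nearest mean
        labels = np.argmin(np.abs(values[:, None] - means[None, :]), axis=1)
        # Update step: recompute each mean (keep the old one if a cluster empties)
        for c in range(k):
            if np.any(labels == c):
                means[c] = values[labels == c].mean()
    return means, labels

# Toy disparity values with two clear depth levels
vals = np.concatenate([np.full(50, 5.0), np.full(50, 30.0)])
means, labels = kmeans_1d(vals, k=2)
depth = means[labels]  # each disparity replaced by its cluster mean
```

On the real data, vals would be D.ravel(), and depth reshaped back to D.shape gives the quantized depth map to plot.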

2. Write up your own GMM clustering code and cluster the disparity values in D. The posterior probabilities will give you the (soft) membership of each value in each of the clusters. Recover the depth map using the means you obtained through GMM, and submit the plot along with your code.
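A sketch of EM for a 1-D Gaussian mixture over the disparity values. The initialization and the small variance floor (1e-6) are assumptions of this sketch:

```python
import numpy as np

def gmm_1d(x, k, n_iter=100):
    """EM for a 1-D Gaussian mixture; returns (means, variances, weights, posteriors)."""
    mu = np.linspace(x.min(), x.max(), k)       # spread initial means over the range
    var = np.full(k, x.var() / k + 1e-6)
    pi = np.full(k, 1.0 / k)
    for _ in range(n_iter):
        # E-step: posterior (soft membership) of each point for each component
        lik = pi * np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
        post = lik / lik.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means, and variances from the posteriors
        Nk = post.sum(axis=0)
        pi = Nk / len(x)
        mu = (post * x[:, None]).sum(axis=0) / Nk
        var = (post * (x[:, None] - mu) ** 2).sum(axis=0) / Nk + 1e-6
    return mu, var, pi, post

rng = np.random.default_rng(0)
# Toy disparities drawn from two depth levels
x = np.concatenate([rng.normal(5, 0.5, 200), rng.normal(30, 0.5, 200)])
mu, var, pi, post = gmm_1d(x, k=2)
depth = mu[np.argmax(post, axis=1)]  # hard-assign each value to its max-posterior mean
```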

