# CSE176 Introduction to Machine Learning Homework set #2 SOLVED

\$25.00

Category:

## Description

Exercise 1: Euclidean distance classifier (10 points). A Euclidean distance classifier represents
each class k = 1, . . . , K by a prototype vector µk ∈ R
D and classifies a pattern x ∈ R
D as the class of its
closest prototype: k
∗ = arg mink=1,…,K kx − µkk. Prove that a Gaussian classifier with shared isotropic
covariances (i.e., of the form Σk = σ
2
I for k = 1, . . . , K, where σ > 0) and equal class priors (i.e.,
p(C1) = · · · = p(CK) = 1
K
) is equivalent to a Euclidean distance classifier. Prove the class discriminant
functions g1(x), . . . , gK(x) are linear and give the expression that defines them.
Exercise 2: bias and variance of an estimator (20 points). Assume we have a sample X =
{x1, . . . , xN } ⊂ R of N iid (independent identically distributed) scalar random variables, each of which
is drawn from a Gaussian distribution N (µ, σ2
) with µ ∈ R and σ > 0. We want to estimate the mean
µ of this Gaussian by computing a statistic of the sample X . Consider the following four different
statistics of the sample:
1. φ1(X ) = 7.
2. φ2(X ) = x1.
3. φ3(X ) = 1
N
PN
n=1 xn.
4. φ4(X ) = x1x2.
For each statistic φ, compute:
• (2 points) Its bias bµ(φ) = EX {φ(X )} − µ.
• (2 points) Its variance var {φ} = EX {(φ(X ) − EX {φ(X )})
2}.
• (1 point) Its mean square error e(φ, µ) = EX {(φ(X ) − µ)
2}.
Based on that, answer the following for each estimator (statistic): is it unbiased? is it consistent?
Hint: expectations wrt the distribution of the N-point sample X are like this one:
EX {φ(X )} =
Z
φ(x1, . . . , xN ) p(x1, . . . , xN ) dx1 . . . dxN
iid=
Z
φ(x1, . . . , xN ) p(x1). . . p(xN ) dx1 . . . dxN .
Exercise 3: PCA and LDA (30 points). Consider 2D data points coming from a mixture of two
Gaussians with equal proportions, different means, and equal, diagonal covariances (where µ, σ1, σ2 > 0):
x ∈ R
2
: p(x) = π1 p(x|1) + π2 p(x|2) p(x|1) ∼ N (µ1
, Σ1), p(x|2) ∼ N (µ2
, Σ2),
π1 = π2 =
1
2
, µ1 = 0, µ2 =

µ
0

, Σ1 = Σ2 =

σ
2
1
0
0 σ
2
2

.
1. (5 points) Compute the mean µ and covariance Σ of the mixture distribution p(x).
Hint: let p(x) = PK
k=1 πk p(x|k) for x ∈ RD be a mixture of K densities, where π1, . . . , πK ∈ [0, 1] and PK
k=1 πk = 1 are the
component proportions (prior probabilities) and p(x|k), for k = 1, . . . , K, the component densities (e.g. Gaussian, but not necessarily).
Let µk = Ep(x|k) {x} and Σk = Ep(x|k)

(x − µk
)(x − µk
)
T

be the mean and covariance of component density k, for k = 1, . . . , K.
Then, the mean and covariance of the mixture are (you should be able to prove this statement):
µ = Ep(x) {x} =
XK
k=1
πkµk Σ = Ep(x)
n
(x − µ)(x − µ)
T
o
=
XK
k=1
πk

Σk + µkµ
T
k

− µµT
.
2. (5 points) Compute the eigenvalues λ1 ≥ λ2 ≥ 0 and corresponding eigenvectors u1, u2 ∈ R
2 of
Σ. Can we have λ2 > 0?
3. (2 points) Find the PCA projection to dimension 1.
4. (5 points) Compute the within-class and between-class scatter matrices SW , SB of p.
5. (6 points) Compute the eigenvalues ν1 ≥ ν2 ≥ 0 and corresponding eigenvectors v1, v2 ∈ R
2 of
S
−1
W SB. Can we have ν2 > 0?
6. (2 points) Compute the LDA projection.
7. (5 points) When does PCA find the same projection as LDA? Give a condition and explain it.
Exercise 4: variations of k-means clustering (30 points). Consider the k-means error function:
E({µk}
K
k=1,Z) = X
N
n=1
X
K
k=1
znkkxn − µkk
2
s.t. Z ∈ {0, 1}
NK, Z 1 = 1
over the centroids µ1
, . . . , µK and cluster assignments ZN×K, given training points x1, . . . , xN ∈ R
D.
• Variation 1: in k-means, the centroids can take any value in R
D: µk ∈ R
D ∀k = 1, . . . , K. Now
we want the centroids to take values from among the training points only: µk ∈ {x1, . . . , xN }
∀k = 1, . . . , K.
1. (8 points) Design a clustering algorithm that minimizes the k-means error function but
respecting the above constraint. Your algorithm should converge to a local optimum of the
error function. Give the steps of the algorithm explicitly.
2. (2 points) Can you imagine when this algorithm would be useful, or preferable to k-means?
• Variation 2: in k-means, we seek K clusters, each characterized
by a centroid µk
. Imagine we seek instead K lines (or hyperplanes,
in general), each characterized by a weight vector wk ∈ R
D and
bias wk0 ∈ R, given a supervised dataset {(xn, yn)}
N
n=1 (see figure).
Data points assigned to line k should have minimum least-squares
error P
n∈line k
(yn − wT
k xn − wk0)
2
.
x
y
1. (8 points) Give an error function that allows us to learn the lines’ parameters {wk, wk0}
K
k=1.
2. (12 points) Give an iterative algorithm that minimizes that error function.
Exercise 5: mean-shift algorithm (10 points). Consider a Gaussian kernel density estimate
p(x) = X
N
n=1
p(x|n)p(n) = 1
N(2πσ2
)
D/2
X
N
n=1
e
− 1
2 k
x−xn
σ k
2
x ∈ R
D.
Derive the mean-shift algorithm, which iterates the following expression:
x ←
X
N
n=1
p(n|x)xn where p(n|x) = p(x|n)p(n)
p(x)
=
exp

1
2
k(x − xn)/σk
2

PN
n′=1 exp

1
2
k(x − xn′)/σk
2

until convergence to a maximum of p (or, in general, a stationary point of p, satisfying ∇p(x) = 0).
Hint: take the gradient of p wrt x, equate it to zero and rearrange the resulting expression.
Bonus exercise: nonparametric regression (20 points). Consider the Gaussian kernel smoother
g(x) = X
N
n=1
K

k(x − xn)/hk

PN
n′=1 K

k(x − xn′)/hk
yn where K


x − xn
σ


∝ exp

1
2
k(x − xn)/σk
2

estimated on a training set {(xn, yn)}
N
n=1 ⊂ R
Dx × R
Dy
.
1. (7 points) What is g(x) if the training set has only one point (N = 1)? Explain.
Sketch the solution in 1D (i.e., when both xn, yn ∈ R).
Compare with using a least-squares linear regression.
2. (13 points) Prove that, with N = 2 points, we can write g(x) = α(x) y1 + (1 − α(x)) y2 where
α(x) can be written using the logistic function. Give the detailed expression for α(x).
Sketch the solution in 1D.
Compare with using a least-squares linear regres