## Description

Exercise 1: Euclidean distance classifier (10 points). A Euclidean distance classifier represents each class $k = 1, \dots, K$ by a prototype vector $\boldsymbol{\mu}_k \in \mathbb{R}^D$ and classifies a pattern $\mathbf{x} \in \mathbb{R}^D$ as the class of its closest prototype: $k^* = \arg\min_{k=1,\dots,K} \|\mathbf{x} - \boldsymbol{\mu}_k\|$. Prove that a Gaussian classifier with shared isotropic covariances (i.e., of the form $\Sigma_k = \sigma^2 I$ for $k = 1, \dots, K$, where $\sigma > 0$) and equal class priors (i.e., $p(C_1) = \cdots = p(C_K) = \frac{1}{K}$) is equivalent to a Euclidean distance classifier. Prove that the class discriminant functions $g_1(\mathbf{x}), \dots, g_K(\mathbf{x})$ are linear and give the expression that defines them.
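For intuition, here is a minimal NumPy sketch of the nearest-prototype rule above (the prototypes and patterns are made up for illustration; the exercise itself asks for a proof, not code):

```python
import numpy as np

def euclidean_classify(X, prototypes):
    """Assign each row of X (N x D) to its nearest prototype (K x D).

    Returns 0-based class indices k* = argmin_k ||x - mu_k||.
    """
    # Squared Euclidean distances between every pattern and every prototype.
    d2 = ((X[:, None, :] - prototypes[None, :, :]) ** 2).sum(axis=2)  # N x K
    return d2.argmin(axis=1)

# Illustrative usage with two made-up prototypes in R^2.
mu = np.array([[0.0, 0.0], [3.0, 3.0]])
X = np.array([[0.5, -0.2], [2.8, 3.1]])
print(euclidean_classify(X, mu))  # -> [0 1]
```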

Exercise 2: bias and variance of an estimator (20 points). Assume we have a sample $\mathcal{X} = \{x_1, \dots, x_N\} \subset \mathbb{R}$ of $N$ iid (independent identically distributed) scalar random variables, each of which is drawn from a Gaussian distribution $\mathcal{N}(\mu, \sigma^2)$ with $\mu \in \mathbb{R}$ and $\sigma > 0$. We want to estimate the mean $\mu$ of this Gaussian by computing a statistic of the sample $\mathcal{X}$. Consider the following four different statistics of the sample:

1. $\phi_1(\mathcal{X}) = 7$.
2. $\phi_2(\mathcal{X}) = x_1$.
3. $\phi_3(\mathcal{X}) = \frac{1}{N} \sum_{n=1}^{N} x_n$.
4. $\phi_4(\mathcal{X}) = x_1 x_2$.

For each statistic $\phi$, compute:

- (2 points) Its bias $b_\mu(\phi) = E_{\mathcal{X}}\{\phi(\mathcal{X})\} - \mu$.
- (2 points) Its variance $\operatorname{var}\{\phi\} = E_{\mathcal{X}}\{(\phi(\mathcal{X}) - E_{\mathcal{X}}\{\phi(\mathcal{X})\})^2\}$.
- (1 point) Its mean square error $e(\phi, \mu) = E_{\mathcal{X}}\{(\phi(\mathcal{X}) - \mu)^2\}$.

Based on that, answer the following for each estimator (statistic): is it unbiased? Is it consistent?

Hint: expectations wrt the distribution of the $N$-point sample $\mathcal{X}$ are like this one:

$$E_{\mathcal{X}}\{\phi(\mathcal{X})\} = \int \phi(x_1, \dots, x_N)\, p(x_1, \dots, x_N)\, dx_1 \cdots dx_N \overset{\text{iid}}{=} \int \phi(x_1, \dots, x_N)\, p(x_1) \cdots p(x_N)\, dx_1 \cdots dx_N.$$
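If you want to sanity-check your analytic answers, a small Monte Carlo sketch like the following (with arbitrary values chosen here for $\mu$, $\sigma$ and $N$) estimates the bias, variance and MSE of each statistic by simulation:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, N, trials = 2.0, 1.5, 10, 200_000   # arbitrary illustrative values

# Draw `trials` independent samples of size N from N(mu, sigma^2).
X = rng.normal(mu, sigma, size=(trials, N))

stats = {
    "phi1 = 7":           np.full(trials, 7.0),
    "phi2 = x1":          X[:, 0],
    "phi3 = sample mean": X.mean(axis=1),
    "phi4 = x1 * x2":     X[:, 0] * X[:, 1],
}

for name, phi in stats.items():
    bias = phi.mean() - mu
    var = phi.var()
    mse = ((phi - mu) ** 2).mean()   # equals bias^2 + var, up to sampling noise
    print(f"{name:20s} bias={bias:+.3f}  var={var:.3f}  mse={mse:.3f}")
```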

Exercise 3: PCA and LDA (30 points). Consider 2D data points coming from a mixture of two Gaussians with equal proportions, different means, and equal, diagonal covariances (where $\mu, \sigma_1, \sigma_2 > 0$):

$$\mathbf{x} \in \mathbb{R}^2: \quad p(\mathbf{x}) = \pi_1\, p(\mathbf{x}|1) + \pi_2\, p(\mathbf{x}|2), \qquad p(\mathbf{x}|1) \sim \mathcal{N}(\boldsymbol{\mu}_1, \Sigma_1), \quad p(\mathbf{x}|2) \sim \mathcal{N}(\boldsymbol{\mu}_2, \Sigma_2),$$

$$\pi_1 = \pi_2 = \frac{1}{2}, \qquad \boldsymbol{\mu}_1 = \mathbf{0}, \quad \boldsymbol{\mu}_2 = \begin{pmatrix} \mu \\ 0 \end{pmatrix}, \qquad \Sigma_1 = \Sigma_2 = \begin{pmatrix} \sigma_1^2 & 0 \\ 0 & \sigma_2^2 \end{pmatrix}.$$

1. (5 points) Compute the mean $\boldsymbol{\mu}$ and covariance $\Sigma$ of the mixture distribution $p(\mathbf{x})$.

   Hint: let $p(\mathbf{x}) = \sum_{k=1}^{K} \pi_k\, p(\mathbf{x}|k)$ for $\mathbf{x} \in \mathbb{R}^D$ be a mixture of $K$ densities, where $\pi_1, \dots, \pi_K \in [0, 1]$ with $\sum_{k=1}^{K} \pi_k = 1$ are the component proportions (prior probabilities) and $p(\mathbf{x}|k)$, for $k = 1, \dots, K$, the component densities (e.g. Gaussian, but not necessarily). Let $\boldsymbol{\mu}_k = E_{p(\mathbf{x}|k)}\{\mathbf{x}\}$ and $\Sigma_k = E_{p(\mathbf{x}|k)}\{(\mathbf{x} - \boldsymbol{\mu}_k)(\mathbf{x} - \boldsymbol{\mu}_k)^T\}$ be the mean and covariance of component density $k$, for $k = 1, \dots, K$. Then, the mean and covariance of the mixture are (you should be able to prove this statement):

   $$\boldsymbol{\mu} = E_{p(\mathbf{x})}\{\mathbf{x}\} = \sum_{k=1}^{K} \pi_k \boldsymbol{\mu}_k, \qquad \Sigma = E_{p(\mathbf{x})}\left\{(\mathbf{x} - \boldsymbol{\mu})(\mathbf{x} - \boldsymbol{\mu})^T\right\} = \sum_{k=1}^{K} \pi_k \left( \Sigma_k + \boldsymbol{\mu}_k \boldsymbol{\mu}_k^T \right) - \boldsymbol{\mu}\boldsymbol{\mu}^T.$$

2. (5 points) Compute the eigenvalues $\lambda_1 \geq \lambda_2 \geq 0$ and corresponding eigenvectors $\mathbf{u}_1, \mathbf{u}_2 \in \mathbb{R}^2$ of $\Sigma$. Can we have $\lambda_2 > 0$?

3. (2 points) Find the PCA projection to dimension 1.

4. (5 points) Compute the within-class and between-class scatter matrices $S_W$, $S_B$ of $p$.

5. (6 points) Compute the eigenvalues $\nu_1 \geq \nu_2 \geq 0$ and corresponding eigenvectors $\mathbf{v}_1, \mathbf{v}_2 \in \mathbb{R}^2$ of $S_W^{-1} S_B$. Can we have $\nu_2 > 0$?

6. (2 points) Compute the LDA projection.

7. (5 points) When does PCA find the same projection as LDA? Give a condition and explain it.
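Purely as a numerical check (the exercise asks for closed-form answers in terms of $\mu$, $\sigma_1$, $\sigma_2$), the sketch below plugs in arbitrary values and computes the mixture moments, the PCA directions, and the LDA directions. Note the scatter matrices here use one common population-level definition, $S_W = \sum_k \pi_k \Sigma_k$ and $S_B = \sum_k \pi_k (\boldsymbol{\mu}_k - \boldsymbol{\mu})(\boldsymbol{\mu}_k - \boldsymbol{\mu})^T$, which may differ by a constant factor from the definition used in your course:

```python
import numpy as np

m, s1, s2 = 4.0, 1.0, 2.0                        # arbitrary mu, sigma_1, sigma_2
pi = np.array([0.5, 0.5])                        # pi_1 = pi_2 = 1/2
mus = np.array([[0.0, 0.0], [m, 0.0]])           # mu_1 = 0, mu_2 = (mu, 0)^T
Sig = np.diag([s1**2, s2**2])                    # Sigma_1 = Sigma_2

# Mixture mean and covariance, using the formulas from the hint.
mu_mix = pi @ mus
Sigma_mix = sum(p * (Sig + np.outer(mk, mk)) for p, mk in zip(pi, mus)) \
            - np.outer(mu_mix, mu_mix)

# PCA: eigendecomposition of the mixture covariance (eigh -> ascending eigenvalues).
lam, U = np.linalg.eigh(Sigma_mix)
print("PCA direction u1:", U[:, np.argmax(lam)])

# LDA: scatter matrices (population-level definition, see note above),
# then eigenvectors of SW^{-1} SB.
SW = sum(p * Sig for p in pi)
SB = sum(p * np.outer(mk - mu_mix, mk - mu_mix) for p, mk in zip(pi, mus))
nu, V = np.linalg.eig(np.linalg.inv(SW) @ SB)
print("LDA direction v1:", V[:, np.argmax(nu.real)].real)
```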

Exercise 4: variations of k-means clustering (30 points). Consider the k-means error function:

$$E\left(\{\boldsymbol{\mu}_k\}_{k=1}^{K}, Z\right) = \sum_{n=1}^{N} \sum_{k=1}^{K} z_{nk}\, \|\mathbf{x}_n - \boldsymbol{\mu}_k\|^2 \qquad \text{s.t.} \quad Z \in \{0, 1\}^{N \times K}, \quad Z \mathbf{1} = \mathbf{1}$$

over the centroids $\boldsymbol{\mu}_1, \dots, \boldsymbol{\mu}_K$ and cluster assignments $Z_{N \times K}$, given training points $\mathbf{x}_1, \dots, \mathbf{x}_N \in \mathbb{R}^D$.
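A tiny sketch (names illustrative) of evaluating this error function for given centroids and a 1-of-K assignment matrix $Z$, which may help when checking the algorithms you design below:

```python
import numpy as np

def kmeans_error(X, centroids, Z):
    """E({mu_k}, Z) = sum_n sum_k z_nk ||x_n - mu_k||^2.

    X: N x D data, centroids: K x D, Z: N x K binary with exactly one 1 per row.
    """
    # Squared distances from every point to every centroid, masked by Z.
    d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)  # N x K
    return float((Z * d2).sum())
```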

- Variation 1: in k-means, the centroids can take any value in $\mathbb{R}^D$: $\boldsymbol{\mu}_k \in \mathbb{R}^D$ $\forall k = 1, \dots, K$. Now we want the centroids to take values from among the training points only: $\boldsymbol{\mu}_k \in \{\mathbf{x}_1, \dots, \mathbf{x}_N\}$ $\forall k = 1, \dots, K$.

  1. (8 points) Design a clustering algorithm that minimizes the k-means error function while respecting the above constraint. Your algorithm should converge to a local optimum of the error function. Give the steps of the algorithm explicitly.
  2. (2 points) Can you imagine when this algorithm would be useful, or preferable to k-means?

- Variation 2: in k-means, we seek $K$ clusters, each characterized by a centroid $\boldsymbol{\mu}_k$. Imagine we seek instead $K$ lines (or hyperplanes, in general), each characterized by a weight vector $\mathbf{w}_k \in \mathbb{R}^D$ and bias $w_{k0} \in \mathbb{R}$, given a supervised dataset $\{(\mathbf{x}_n, y_n)\}_{n=1}^{N}$ (see figure: data points and fitted lines in the $x$-$y$ plane). Data points assigned to line $k$ should have minimum least-squares error $\sum_{n \in \text{line } k} (y_n - \mathbf{w}_k^T \mathbf{x}_n - w_{k0})^2$.

  1. (8 points) Give an error function that allows us to learn the lines' parameters $\{\mathbf{w}_k, w_{k0}\}_{k=1}^{K}$.
  2. (12 points) Give an iterative algorithm that minimizes that error function.

Exercise 5: mean-shift algorithm (10 points). Consider a Gaussian kernel density estimate

$$p(\mathbf{x}) = \sum_{n=1}^{N} p(\mathbf{x}|n)\, p(n) = \frac{1}{N (2\pi\sigma^2)^{D/2}} \sum_{n=1}^{N} e^{-\frac{1}{2} \left\| \frac{\mathbf{x} - \mathbf{x}_n}{\sigma} \right\|^2}, \qquad \mathbf{x} \in \mathbb{R}^D.$$

Derive the mean-shift algorithm, which iterates the following expression:

$$\mathbf{x} \leftarrow \sum_{n=1}^{N} p(n|\mathbf{x})\, \mathbf{x}_n \qquad \text{where} \quad p(n|\mathbf{x}) = \frac{p(\mathbf{x}|n)\, p(n)}{p(\mathbf{x})} = \frac{\exp\left( -\frac{1}{2} \|(\mathbf{x} - \mathbf{x}_n)/\sigma\|^2 \right)}{\sum_{n'=1}^{N} \exp\left( -\frac{1}{2} \|(\mathbf{x} - \mathbf{x}_{n'})/\sigma\|^2 \right)}$$

until convergence to a maximum of $p$ (or, in general, a stationary point of $p$, satisfying $\nabla p(\mathbf{x}) = \mathbf{0}$).

Hint: take the gradient of $p$ wrt $\mathbf{x}$, equate it to zero and rearrange the resulting expression.
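A sketch of the stated iteration (not the derivation the exercise asks for), assuming a data matrix `X` of shape N x D, a bandwidth `sigma`, and a starting point `x`:

```python
import numpy as np

def mean_shift(x, X, sigma, tol=1e-6, max_iter=1000):
    """Iterate x <- sum_n p(n|x) x_n until it stops moving (a mode of the KDE)."""
    for _ in range(max_iter):
        d2 = ((x - X) ** 2).sum(axis=1) / sigma**2   # ||(x - x_n)/sigma||^2 for all n
        w = np.exp(-0.5 * (d2 - d2.min()))           # shift by min: cancels in p(n|x)
        p = w / w.sum()                              # posterior probabilities p(n|x)
        x_new = p @ X                                # weighted average of the data points
        if np.linalg.norm(x_new - x) < tol:
            break
        x = x_new
    return x_new
```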

Bonus exercise: nonparametric regression (20 points). Consider the Gaussian kernel smoother

$$g(\mathbf{x}) = \sum_{n=1}^{N} \frac{K\left( \|(\mathbf{x} - \mathbf{x}_n)/h\| \right)}{\sum_{n'=1}^{N} K\left( \|(\mathbf{x} - \mathbf{x}_{n'})/h\| \right)}\, y_n \qquad \text{where} \quad K\left( \frac{\mathbf{x} - \mathbf{x}_n}{\sigma} \right) \propto \exp\left( -\frac{1}{2} \|(\mathbf{x} - \mathbf{x}_n)/\sigma\|^2 \right),$$

estimated on a training set $\{(\mathbf{x}_n, y_n)\}_{n=1}^{N} \subset \mathbb{R}^{D_x} \times \mathbb{R}^{D_y}$.

1. (7 points) What is $g(\mathbf{x})$ if the training set has only one point ($N = 1$)? Explain. Sketch the solution in 1D (i.e., when both $x_n, y_n \in \mathbb{R}$). Compare with using a least-squares linear regression.

2. (13 points) Prove that, with $N = 2$ points, we can write $g(\mathbf{x}) = \alpha(\mathbf{x})\, y_1 + (1 - \alpha(\mathbf{x}))\, y_2$, where $\alpha(\mathbf{x})$ can be written using the logistic function. Give the detailed expression for $\alpha(\mathbf{x})$. Sketch the solution in 1D. Compare with using a least-squares linear regression.
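A minimal 1D sketch of the smoother above (this is the Nadaraya-Watson estimator with a Gaussian kernel; the bandwidth and data values are made up):

```python
import numpy as np

def kernel_smoother(x, X, Y, h):
    """g(x): Gaussian-kernel-weighted average of the training outputs Y.

    x: query points (M,), X: training inputs (N,), Y: training outputs (N,).
    """
    # Kernel weights K(||(x - x_n)/h||) for every query/training pair.
    W = np.exp(-0.5 * ((x[:, None] - X[None, :]) / h) ** 2)  # M x N
    return (W * Y[None, :]).sum(axis=1) / W.sum(axis=1)

# Illustrative usage with a tiny made-up training set.
X = np.array([-1.0, 0.0, 2.0]); Y = np.array([0.5, 1.0, -0.3])
print(kernel_smoother(np.array([-0.5, 1.0]), X, Y, h=0.7))
```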