## Description

1. The probability density function of the normal distribution is defined as

$$f(x) = \frac{1}{Z} \exp\left(-\frac{1}{2}(x-\mu)^\top \Sigma^{-1} (x-\mu)\right),$$

where

$$Z = \int_{x \in \mathbb{R}^d} \exp\left(-\frac{1}{2}(x-\mu)^\top \Sigma^{-1} (x-\mu)\right) dx = (2\pi)^{d/2} |\Sigma|^{1/2},$$

and $|\Sigma|$ is the determinant of the covariance matrix. Let us assume that the covariance matrix $\Sigma$ is a diagonal matrix, as below:

$$\Sigma = \begin{pmatrix} \sigma_1^2 & 0 & \cdots & 0 \\ 0 & \sigma_2^2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \sigma_d^2 \end{pmatrix}.$$

The probability density function simplifies to

$$f(x) = \prod_{i=1}^{d} \frac{1}{\sqrt{2\pi}\,\sigma_i} \exp\left(-\frac{1}{2} \frac{(x_i - \mu_i)^2}{\sigma_i^2}\right).$$

Show that this is indeed true.
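Before attempting the proof, a quick numeric sanity check can build confidence in the claim. The sketch below (NumPy assumed; arbitrary test values chosen here) evaluates the full multivariate formula and the claimed product form at the same point and confirms they agree. This checks, but does not prove, the identity.

```python
import numpy as np

# Numeric sanity check: for a diagonal covariance, the multivariate normal
# density should equal the product of univariate normal densities.
rng = np.random.default_rng(0)
d = 3
mu = rng.normal(size=d)
sigma = rng.uniform(0.5, 2.0, size=d)   # standard deviations sigma_i
Sigma = np.diag(sigma**2)               # diagonal covariance matrix

x = rng.normal(size=d)

# Full multivariate formula: (1/Z) exp(-1/2 (x-mu)^T Sigma^{-1} (x-mu))
Z = (2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(Sigma))
diff = x - mu
f_full = np.exp(-0.5 * diff @ np.linalg.inv(Sigma) @ diff) / Z

# Product of univariate normal densities
f_prod = np.prod(np.exp(-0.5 * (x - mu) ** 2 / sigma**2)
                 / (np.sqrt(2 * np.pi) * sigma))

print(np.isclose(f_full, f_prod))  # True
```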


2.

(a) Show that the following equation, called Bayes' rule, is true:

$$p(Y \mid X) = \frac{p(X \mid Y)\, p(Y)}{p(X)}.$$

(b) We learned the definition of expectation:

$$E[X] = \sum_{x \in \Omega} x\, p(x).$$

Assuming that $X$ and $Y$ are discrete random variables, show that

$$E[X + Y] = E[X] + E[Y].$$
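A numeric check of linearity of expectation (again, not a substitute for the proof): the sketch below defines an arbitrary joint distribution over a small support and compares $E[X+Y]$ computed directly against $E[X] + E[Y]$.

```python
import numpy as np

# Numeric check of E[X + Y] = E[X] + E[Y] on a discrete joint distribution.
xs = np.array([0.0, 1.0])        # support of X
ys = np.array([-1.0, 2.0, 5.0])  # support of Y
p = np.array([[0.1, 0.2, 0.1],
              [0.3, 0.2, 0.1]])  # p(X=xs[i], Y=ys[j]); sums to 1

E_X = (xs[:, None] * p).sum()                      # E[X] from the joint
E_Y = (ys[None, :] * p).sum()                      # E[Y] from the joint
E_sum = ((xs[:, None] + ys[None, :]) * p).sum()    # E[X + Y] directly

print(np.isclose(E_sum, E_X + E_Y))  # True
```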

(c) Further assume that $c \in \mathbb{R}$ is a scalar, not a random variable, and show that $E[cX] = cE[X]$.

(d) We learned the definition of variance:

$$\mathrm{Var}(X) = \sum_{x \in \Omega} (x - E[X])^2\, p(x).$$

Assuming that $X$ is a discrete random variable, show that $\mathrm{Var}(X) = E[X^2] - (E[X])^2$.


3. An optimal linear regression machine (without any regularization term) that minimizes the empirical cost function on a training set $D_{\mathrm{tra}} = \{(x_1, y_1^*), \ldots, (x_N, y_N^*)\}$ can be found directly, without any gradient-based optimization algorithm. Assuming that the distance function is defined as

$$D(M^*(x), M, x) = \frac{1}{2} \left\| M^*(x) - M(x) \right\|_2^2 = \frac{1}{2} \sum_{k=1}^{q} (y_k^* - y_k)^2,$$

derive the optimal weight matrix $W$. (Hint: Moore–Penrose pseudoinverse)
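A minimal numeric sketch of the closed-form solution the hint points to, assuming a purely linear model $y = Wx$ with no bias term (an assumption on our part, since the problem leaves the model form to the derivation): stacking the inputs as rows of $X$ and the targets as rows of $Y$, the least-squares optimum is $W^\top = X^+ Y$, with $X^+$ the Moore–Penrose pseudoinverse.

```python
import numpy as np

# Closed-form least-squares fit via the Moore-Penrose pseudoinverse,
# checked on noiseless synthetic data where the true W is known.
rng = np.random.default_rng(0)
N, d, q = 50, 4, 2
X = rng.normal(size=(N, d))        # N training inputs, one per row
W_true = rng.normal(size=(q, d))   # ground-truth weights for the check
Y = X @ W_true.T                   # noiseless targets, one row per example

W = (np.linalg.pinv(X) @ Y).T      # closed-form optimal weight matrix

print(np.allclose(W, W_true))  # True
```

With noiseless targets and full-rank $X$, the pseudoinverse recovers the true weights exactly; with noisy targets it still gives the minimizer of the squared error.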


4. Suppose that we have a data distribution $Y = f(X) + \varepsilon$, where $X$ is a random vector, $\varepsilon$ is an independent random variable with zero mean and fixed but unknown variance $\sigma^2$, and $f$ is an unknown deterministic function that maps a vector to a scalar. Now, we wish to approximate $f(x)$ with our own model $\hat{f}(x; \Theta)$ with some learnable parameters $\Theta$.

(a) Show that, considering all possible $\hat{f}$ and $\Theta$, the minimum of the L2 loss $E_X[(Y - \hat{f}(X; \Theta))^2]$ is achieved when $\hat{f}(x; \Theta) = f(x)$ for all $x$.

(Hint: find the minimum of the L2 loss for a single example first.)

(b) If we train the same model with varying initializations and examples drawn from the underlying data distribution, we may end up with different $\Theta$. So we can also consider $\Theta$ to be a random variable if we fix $\hat{f}$.

Show that, for a single unseen input vector $x_0$ and a fixed $\hat{f}$, the expected squared error between the ground truth $y_0 = f(x_0) + \varepsilon$ and the prediction $\hat{f}(x_0; \Theta)$ can be decomposed as

$$E[(y_0 - \hat{f}(x_0; \Theta))^2] = \left(E[f(x_0) - \hat{f}(x_0; \Theta)]\right)^2 + \mathrm{Var}[\hat{f}(x_0; \Theta)] + \sigma^2.$$

(Side note: this is usually known as the bias–variance decomposition, closely related to the bias–variance tradeoff, and to other concepts such as underfitting and overfitting.)
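The decomposition can be illustrated by Monte Carlo simulation. The sketch below makes assumptions not given in the problem: $f(x) = \sum_i x_i$, noise with $\sigma = 0.5$, and a toy "trained model" $\hat{f}(x_0; \Theta) = \Theta \cdot f(x_0)$ whose scalar parameter $\Theta$ varies randomly across training runs. It estimates both sides of the decomposition and checks that they agree up to sampling error.

```python
import numpy as np

# Monte Carlo illustration of the bias-variance decomposition at a single x0.
rng = np.random.default_rng(0)
sigma = 0.5
x0 = np.array([1.0, 2.0])
f_x0 = x0.sum()                                  # ground-truth f(x0)

n = 200_000
theta = rng.normal(loc=1.1, scale=0.2, size=n)   # Theta across training runs
preds = theta * f_x0                             # f_hat(x0; Theta)
y0 = f_x0 + rng.normal(scale=sigma, size=n)      # noisy ground truth draws

lhs = np.mean((y0 - preds) ** 2)                 # expected squared error
bias2 = np.mean(f_x0 - preds) ** 2               # squared bias
var = np.var(preds)                              # variance of the prediction
rhs = bias2 + var + sigma**2

print(abs(lhs - rhs) < 0.05)  # True (up to Monte Carlo error)
```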