CS 6140 Assignment 4

1 Perceptron and Dual Perceptron
1.1 Data and Pre-processing
• PerceptronData: a binary classification dataset with four features; the classes are linearly separable.
• Two Spirals: a dataset with two features that is not linearly separable (Figure 1).
1.2 Implementation
1. Implement the perceptron algorithm and test it on the PerceptronData dataset using ten-fold cross-validation.
2. Implement the dual perceptron algorithm and test it on the PerceptronData dataset using ten-fold cross-validation.
3. Compare the performance of the two algorithms on the PerceptronData dataset and verify that they have (almost) identical performance (a sketch of both algorithms follows this list).
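To make the primal/dual relationship concrete, here is a minimal sketch of both algorithms, assuming `X` is an (N, d) NumPy array with a bias column already appended and `y` holds labels in {−1, +1}; the function names and the `max_iter` cap are illustrative choices, not prescribed by the assignment.

```python
import numpy as np

def perceptron(X, y, max_iter=100):
    """Primal perceptron: X is (N, d), y in {-1, +1}."""
    w = np.zeros(X.shape[1])
    for _ in range(max_iter):
        mistakes = 0
        for i in range(X.shape[0]):
            if y[i] * (w @ X[i]) <= 0:    # misclassified (or on the boundary)
                w += y[i] * X[i]          # standard perceptron update
                mistakes += 1
        if mistakes == 0:                 # converged: training data separated
            break
    return w

def dual_perceptron(X, y, max_iter=100):
    """Dual perceptron: keeps a mistake count alpha_i per training instance."""
    N = X.shape[0]
    alpha = np.zeros(N)
    K = X @ X.T                           # linear kernel (Gram matrix)
    for _ in range(max_iter):
        mistakes = 0
        for i in range(N):
            # decision value is sum_j alpha_j * y_j * K(x_j, x_i)
            if y[i] * np.sum(alpha * y * K[:, i]) <= 0:
                alpha[i] += 1
                mistakes += 1
        if mistakes == 0:
            break
    return alpha
```

Because the primal weight vector can always be written as w = Σ_j α_j y_j x_j, the two versions make the same sequence of updates on the same data ordering, which is why their cross-validation performance should be (almost) identical.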
Figure 1: The two-spirals dataset; colors represent class labels.
1.3 Kernelizing Dual Perceptron
1. Run the dual perceptron with the linear kernel on the Two Spirals dataset and show, using ten-fold cross-validation, that the data is not linearly separable.
2. Run the dual perceptron with the Gaussian (RBF) kernel on the Two Spirals dataset and show, using ten-fold cross-validation, that the data is separable.
Note: You will need to select an appropriate value for the scale γ of the RBF kernel; design a search strategy that tries values of γ ∈ [0, 0.25]. For both the perceptron and the dual perceptron, be sure to set an upper limit on the number of iterations.
2 Regularized Logistic Regression
In this question you will develop a regularized version of logistic regression. Let $(x_i, y_i)$, $i \in \{1, 2, \ldots, N\}$ represent the training data, where $x_i \in \mathbb{R}^d$ and $y_i \in \{-1, 1\}$. Under the logistic regression model, the probability of an instance belonging to either class is given as:

$$P(y_i \mid w, x_i) = \frac{1}{1 + e^{-y_i w^T x_i}} \tag{1}$$
where $w$ are the model parameters. The maximum likelihood solution can be obtained by minimizing the negative log-likelihood. For the regularized case, we add an L2 penalty on the norm of $w$, which results in the objective function for regularized logistic regression:

$$\min_{w} \; \frac{\lambda}{2} \|w\|^2 + \sum_{i=1}^{N} \ln\left(1 + e^{-y_i w^T x_i}\right) \tag{2}$$
1. Do you think that $w_0$ should be included in the regularization?
2. Calculate the gradient of the objective with respect to the model parameters.
3. Develop a gradient descent algorithm for learning the parameters from given training data (see the sketch after this list).
4. Contrast the performance of logistic regression with regularized logistic regression on the Spambase, Diabetes, and Breast Cancer datasets using ten-fold cross-validation.
5. Is it possible to kernelize Equation 2? If so, what would the dual objective function for regularized logistic regression be?
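As a cross-check for parts 2 and 3: differentiating Equation 2 gives $\nabla_w = \lambda w - \sum_{i=1}^{N} y_i x_i \, \sigma(-y_i w^T x_i)$, where $\sigma(z) = 1/(1 + e^{-z})$ is the sigmoid. Below is a minimal batch gradient-descent sketch built on that gradient; the step size, iteration cap, and the choice to leave $w_0$ unregularized are illustrative, not prescribed here.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_regularized_logistic(X, y, lam=1.0, lr=0.01, max_iter=1000):
    """Batch gradient descent on the objective in Equation 2.

    X: (N, d) array with a leading column of ones for the bias w0;
    y: labels in {-1, +1}; lam: the regularization weight lambda.
    """
    w = np.zeros(X.shape[1])
    for _ in range(max_iter):
        margins = y * (X @ w)                   # y_i * w^T x_i
        # gradient of the log-loss term: -sum_i y_i x_i sigmoid(-margin_i)
        grad = -X.T @ (y * sigmoid(-margins))
        # L2 penalty gradient; skip w[0] so the bias is not regularized
        grad[1:] += lam * w[1:]
        w -= lr * grad
    return w
```

Excluding $w_0$ from the penalty, as the sketch does, is the usual answer to part 1: the bias only shifts the decision boundary and contributes nothing to model complexity, so shrinking it toward zero needlessly biases the classifier when the classes are not balanced around the origin.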
Note: If you google Regularized Logistic Regression, you will find a video lecture by Andrew Ng that details the development of regularized logistic regression from simple logistic regression. I recommend watching it after you have spent some time trying to develop the solution yourself.
3 Determining Model Hyper-parameters:
In this section we will look at a strategy for setting model hyper-parameters. An example of a model hyper-parameter is λ in Equation 2, which we need to set before we can train and assess the performance of logistic regression on a given dataset. Most of the models we have worked with require only one hyper-parameter, so we test different values of the parameter over a fixed range. This is known as grid search, because we are gridding the possible values of a parameter and then selecting the value that gives us the best performance on our validation set. But what if our model requires multiple hyper-parameters to be set prior to training? In this case, we cannot treat each parameter as independent and need to try different combinations of all hyper-parameters.
Usually, we use k-fold cross-validation to assess the performance of our model. But when we also have to determine model hyper-parameters, we need to select the best values using only our training data; including the test data while searching for the hyper-parameters causes “leakage,” where the training phase has information about the test data, which biases our results (the results will be optimistic).
To facilitate this, we use a nested cross-validation strategy. In this strategy we partition our training data (produced by the outer k-fold cross-validation loop) further into m folds (commonly k = 10 and m = 5). For each iteration of the inner cross-validation we train on m − 1 folds, trying different combinations of model hyper-parameters, and select the combination that gives the highest performance, i.e., recall, precision, or accuracy (ties can be broken randomly), on the internal test fold. We then use this combination to train a model on all m folds (the original training data produced by the outer cross-validation loop) and test its performance on the outer test set. Since an m-fold cross-validation is nested within a k-fold cross-validation, we call this strategy nested cross-validation.
In this question you will be working with libSVM or the SVM classifier in sklearn. You will be using the linear and RBF kernels, and will need to set two model hyper-parameters: γ (for the RBF kernel) and C, the cost of misclassification for the SVM. Determine a good range for both hyper-parameters; usually these are set as powers of two, i.e., $C \in \{2^{-5}, 2^{-4}, \ldots, 2^{10}\}$ and $\gamma \in \{2^{-15}, 2^{-14}, \ldots, 2^{5}\}$. You can adjust these ranges as you think appropriate for the given datasets (a sketch of the nested search follows).
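One way to wire this up is with sklearn's `GridSearchCV` for the inner m = 5 search and `cross_validate` for the outer k = 10 loop. A sketch under the assumption that sklearn's bundled breast-cancer data can stand in for the assignment's copies of the datasets (the full grid is slow; shrink it for a smoke test):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_validate
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)   # stand-in dataset

# Hyper-parameter grid: powers of two, as suggested above.
param_grid = {
    "C": [2.0**e for e in range(-5, 11)],
    "gamma": [2.0**e for e in range(-15, 6)],
}

# Inner loop: 5-fold grid search over (C, gamma) on the training split only.
inner = GridSearchCV(
    SVC(kernel="rbf"),
    param_grid,
    scoring="accuracy",
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
)

# Outer loop: 10-fold cross-validation of the whole tuned model, so the
# outer test fold is never seen during hyper-parameter search (no leakage).
outer = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
results = cross_validate(
    inner, X, y, cv=outer,
    scoring=["accuracy", "precision", "recall"],
)
print({k: (v.mean(), v.std()) for k, v in results.items() if k.startswith("test_")})
```

Passing `return_estimator=True` to `cross_validate` additionally returns the fitted `GridSearchCV` object for each outer fold, whose `best_params_` attribute gives the per-fold hyper-parameters requested in the deliverables below.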
3.1 Deliverables:
1. Report the training and test performance of both kernels on the Diabetes, Breast Cancer, and Spambase datasets in a table (include the mean and standard deviation of recall, precision, and accuracy). List the best hyper-parameters that you found by optimizing accuracy over your parameter grid, for each fold.
2. Repeat the grid search from the first part, but optimize AUC instead of accuracy. Note: in order to calculate AUC, you will need the SVM to predict class probabilities instead of labels (a sketch follows this list).
3. For each dataset, provide a ROC-AUC curve across all k-folds.
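For deliverables 2 and 3: setting `probability=True` on `SVC` enables `predict_proba` (via internal Platt scaling), and passing `scoring="roc_auc"` to `GridSearchCV` makes the inner search optimize AUC. A sketch of per-fold ROC curves, with (C, γ) hard-coded where the values from your inner grid search would go:

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import roc_curve, auc
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)   # stand-in dataset
outer = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

for k, (train_idx, test_idx) in enumerate(outer.split(X, y)):
    # probability=True enables predict_proba, which AUC needs
    # instead of hard label predictions
    model = SVC(kernel="rbf", C=1.0, gamma=2.0**-5, probability=True)
    model.fit(X[train_idx], y[train_idx])
    scores = model.predict_proba(X[test_idx])[:, 1]   # P(class = 1)
    fpr, tpr, _ = roc_curve(y[test_idx], scores)
    plt.plot(fpr, tpr, alpha=0.6, label=f"fold {k} (AUC = {auc(fpr, tpr):.3f})")

plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.legend()
plt.show()
```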
4 SVMs vs Multiclass Problems:
Since SVMs are binary classifiers, how do they deal with a multiclass dataset? We will design a 1-vs-all strategy in which we train m models for an m-class problem. In this strategy, if we have three classes we will train three models: class 1 (positive) vs. all other classes (negative), class 2 (positive) vs. all other classes (negative), and class 3 (positive) vs. all other classes (negative). Each of these three problems is solved independently (see the sketch below).
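sklearn can construct this wrapper directly; a minimal sketch using `OneVsRestClassifier`, with sklearn's bundled Wine data standing in for the assignment's copy:

```python
from sklearn.datasets import load_wine
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

X, y = load_wine(return_X_y=True)    # 3 classes, so 3 binary models

# Trains one SVC per class: class i (positive) vs. the rest (negative);
# prediction picks the class whose model gives the largest decision value.
ovr = OneVsRestClassifier(SVC(kernel="linear", C=1.0))
ovr.fit(X, y)
print(len(ovr.estimators_))          # -> 3 independent binary classifiers
```

Equivalently, you can implement the strategy by hand: for each class c, relabel y == c as +1 and everything else as −1, train an SVC on the relabeled data, and at prediction time pick the class whose model reports the largest decision value.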
4.1 Deliverables:
1. Report the training and test performance of both kernels on the Wine dataset in a table (include the mean and standard deviation of recall, precision, and accuracy per class). List the best hyper-parameters that you found by optimizing accuracy over your parameter grid, for each fold and each class.
2. Provide a ROC-AUC curve for each class, across all k-folds.
3. Extra credit: Download the digits dataset, an OCR dataset for the recognition of handwritten digits from USPS. It has 256 features corresponding to the pixels of a 16 × 16 image of a digit. Use a linear and an RBF-kernel SVM model, and report the recall, precision, and accuracy per class, using cross-validation to set the hyper-parameters. Provide ROC-AUC curves for every class separately. (Note: the dataset is pre-partitioned into a training set and a test set, so you do not need the outer cross-validation loop of nested cross-validation.)