ISYE 6501
9/5/2019
Homework 2
Question 3.1: Using the same data set (credit_card_data.txt or credit_card_data-headers.txt)
as in Question 2.2, use the ksvm or kknn function to find a good classifier:
(a) using cross-validation (do this for the k-nearest-neighbors model; SVM is optional);
Answer:
To approach cross-validation for the KNN model, I leveraged kknn's built-in cross-validation
function: train.kknn. This function uses leave-one-out cross-validation (LOOCV), which means
that every data point is predicted by a model fit to all of the other data points. LOOCV is
essentially cross-validation run n times, where n is the number of data points. In my approach, I
decided to test 20 values of K (just as in week 1), except this time I introduced a kmax variable
for robustness of code (easy to change to test more values). I once again scaled the data for a
better fit. The total number of correctly predicted data points was divided by the total number
of data points to give a percentage of correct predictions for each K value.
The results indicated that low values of K performed the worst (e.g., K=1 is only about 81%
accurate) and that K>5 tends to produce fairly similar results, with peaks around K=12 and
K=15-17 at 85.3% accuracy. It is important to note that despite a decently high predictive value
here, our model quality might not necessarily be high; we would need data from outside our
training/validation set to confirm it.
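For intuition, here is a minimal sketch of what LOOCV computes for a single K, calling kknn
directly (an illustration only, assuming the data is loaded and kknn attached as in the script below):
loocv_accuracy <- function(k) {
  correct <- 0
  for (i in 1:nrow(data)) {
    # fit to every point except i, then predict point i
    m <- kknn(V11~., data[-i, ], data[i, ], k = k, scale = TRUE)
    correct <- correct + (as.integer(fitted(m) + 0.5) == data$V11[i])
  }
  correct / nrow(data)
}
loocv_accuracy(12) # should roughly match train.kknn's accuracy at K=12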
To approach cross-validation for the SVM model, I simply added the argument cross to the
ksvm call from last week's homework and set it equal to 20 to perform 20-fold cross-validation
on the data. I also found that model@cross returns the error measured by cross-validation, and
hence 1 - model@cross gives the accuracy of the model. Finally, I used a for loop with various
values of C in order to test the accuracy of each one (I tested 10 values from 0.000001 to 1000
by magnitudes of 10). As noted in last week's homework, larger values of C tend to give better
results, but accuracy saturates around C=0.01 at about 86.2%.
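For intuition, here is a minimal sketch of the mechanics that cross=20 automates: each row is
assigned to one of 20 folds, the model is trained on the other 19 folds, and the held-out fold is
scored (an illustration only, assuming the data is loaded and kernlab attached as in the script below):
set.seed(1)
folds <- sample(rep(1:20, length.out = nrow(data))) # random fold assignment
fold_acc <- rep(0, 20)
for (f in 1:20) {
  train <- data[folds != f, ]
  holdout <- data[folds == f, ]
  m <- ksvm(as.matrix(train[,1:10]), as.factor(train[,11]), type = "C-svc",
            kernel = "vanilladot", C = 0.01, scaled = TRUE)
  fold_acc[f] <- sum(predict(m, holdout[,1:10]) == holdout$V11) / nrow(holdout)
}
mean(fold_acc) # should land close to 1 - model@cross at C=0.01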
Code:
KNN
rm(list = ls())
install.packages("kknn")
library(kknn)
# load the credit card data (no header row)
data <- read.table("/Users/Vikram/Downloads/credit_card_data.txt", stringsAsFactors = FALSE, header = FALSE)
head(data)
set.seed(1)
kmax <- 20
# train.kknn performs leave-one-out cross-validation for K = 1..kmax
model <- train.kknn(V11~., data, kmax = kmax, scale = TRUE)
accuracy <- rep(0, kmax)
for (k in 1:kmax) {
  # round the continuous fitted values for K = k to 0/1 predictions
  predicted <- as.integer(fitted(model)[[k]][1:nrow(data)] + 0.5)
  accuracy[k] <- sum(predicted == data$V11) / nrow(data)
}
accuracy
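The best K can then be read off the accuracy vector directly (a small illustrative addition):
cat("Best K:", which.max(accuracy), "with accuracy", max(accuracy))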
SVM
library(kernlab)
# load the credit card data (no header row)
data <- read.table("/Users/Vikram/Downloads/credit_card_data.txt", stringsAsFactors = FALSE, header = FALSE)
head(data)
set.seed(1)
acc <- rep(0, 10)
C <- c(0.000001, 0.00001, 0.0001, 0.001, 0.01, 0.1, 1, 10, 100, 1000)
# fit a linear SVM for each C with 20-fold cross-validation;
# model@cross is the CV error, so 1 - model@cross is the CV accuracy
for (i in 1:10) {
  model <- ksvm(as.matrix(data[,1:10]), as.factor(data[,11]), type = "C-svc", kernel = "vanilladot",
                C = C[i], scaled = TRUE, cross = 20)
  acc[i] <- 1 - model@cross
}
acc[1:10]
OUTPUT
KNN
[1] 0.8149847 0.8149847 0.8149847 0.8149847 0.8516820 0.8455657 0.8470948
[8] 0.8486239 0.8470948 0.8516820 0.8516820 0.8532110 0.8516820 0.8516820
[15] 0.8532110 0.8532110 0.8532110 0.8516820 0.8501529 0.8501529
SVM
[1] 0.5478693 0.5473485 0.5476326 0.8319602 0.8625000 0.8624527 0.8627841
[8] 0.8621212 0.8623106 0.8623106
(b) splitting the data into training, validation, and test data sets (pick either KNN or SVM;
the other is optional).
Answer:
I elected to test both methods (KNN and SVM). My methodology was to first split the data into
the three different sets stated in the question: training, validation, and test sets. I chose an
arbitrary split of 70%/15%/15% for the data. (A near-limitless number of different splits could
have been used here that would produce slightly different answers; with more time, I would
write code to compare different splits and use recursive analysis to further optimize
cross-validation through the use of the optimal split.) Next, I fit all of the models I have been
experimenting with in both this week's and last week's homework assignments to the training
data. This includes 10 SVM models (C from 0.000001 to 1000 by magnitudes of 10) and 20 KNN
models (K=1 to K=20), for a total of 30 models. Then, I evaluated these models on the validation
data. C=0.01 was the best SVM model and K=17 was the best KNN model.
It should be noted that several values were close, so the selection of a single "best" model is
slightly arbitrary; the general principle is that K>5 and C≥0.01 give the best models. The final
step of the code is a conditional evaluation to check and report whether the SVM or the KNN
model is better overall. This is included so that this generalized code could be applied to other,
broader data sets where there might be a different answer. For this assignment, the C=0.01
SVM model was best, with 0.867 test accuracy. It should be noted that several KNN models
actually outperformed this SVM model on the test set, but they fell short in validation. Using
the test data to validate instead of the validation data would defeat the entire purpose of this
exercise and introduce selection bias.
Code:
rm(list = ls())
library(kernlab)
install.packages("kknn")
library(kknn)
# load the credit card data (no header row)
data <- read.table("/Users/Vikram/Downloads/credit_card_data.txt", stringsAsFactors = FALSE, header = FALSE)
head(data)
set.seed(1)
# split the data 70/15/15 into training, validation, and test sets
filter <- sample(nrow(data), size = floor(nrow(data) * 0.7))
train_set <- data[filter, ]
non_train <- data[-filter, ]
filter_val <- sample(nrow(non_train), size = floor(nrow(non_train) / 2))
val_set <- non_train[filter_val, ]
test_set <- non_train[-filter_val, ]
# slots 1-10 hold the SVM validation accuracies; 11-30 will hold the KNN ones
acc <- rep(0, 30)
C <- c(0.000001, 0.00001, 0.0001, 0.001, 0.01, 0.1, 1, 10, 100, 1000)
# train each SVM on the training set and score it on the validation set
for (i in 1:10) {
  model <- ksvm(as.matrix(train_set[,1:10]), as.factor(train_set[,11]), type = "C-svc",
                kernel = "vanilladot", C = C[i], scaled = TRUE)
  pred <- predict(model, val_set[,1:10])
  acc[i] <- sum(pred == val_set$V11) / nrow(val_set)
}
acc[1:10]
cat("The Best SVM Model is ", C[which.max(acc[1:10])], " with ", max(acc[1:10]))
# refit the best SVM from validation (C = 0.01 here) on the training data
model <- ksvm(as.matrix(train_set[,1:10]), as.factor(train_set[,11]), type = "C-svc",
              kernel = "vanilladot", C = C[which.max(acc[1:10])], scaled = TRUE)
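The remaining steps follow the methodology described above; here is a minimal sketch of them
(illustrative; names beyond those already defined above are placeholders): score the refit SVM
on the test set, fit the 20 KNN models, pick the best one on validation, and report the overall winner.
# score the refit SVM on the held-out test set
svm_test_acc <- sum(predict(model, test_set[,1:10]) == test_set$V11) / nrow(test_set)
# fit the 20 KNN models on the training set and score them on the validation set
for (k in 1:20) {
  knn_model <- kknn(V11~., train_set, val_set, k = k, scale = TRUE)
  pred <- as.integer(fitted(knn_model) + 0.5)
  acc[10 + k] <- sum(pred == val_set$V11) / nrow(val_set)
}
best_k <- which.max(acc[11:30])
cat("The Best KNN Model is K =", best_k, "with", max(acc[11:30]))
# evaluate the best KNN model on the test set and report the overall winner
knn_best <- kknn(V11~., train_set, test_set, k = best_k, scale = TRUE)
knn_test_acc <- sum(as.integer(fitted(knn_best) + 0.5) == test_set$V11) / nrow(test_set)
if (svm_test_acc >= knn_test_acc) {
  cat("SVM is better overall with test accuracy", svm_test_acc)
} else {
  cat("KNN is better overall with test accuracy", knn_test_acc)
}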