ISYE6501 HOMEWORK 10

Question 14.1

The breast cancer data set breast-cancer-wisconsin.data.txt from http://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/ (description a ... t http://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Original%29 ) has missing values.

1. Use the mean/mode imputation method to impute values for the missing data.
2. Use regression to impute values for the missing data.
3. Use regression with perturbation to impute values for the missing data.
4. (Optional) Compare the results and quality of classification models (e.g., SVM, KNN) built using (1) the data sets from questions 1, 2, 3; (2) the data that remains after data points with missing values are removed; and (3) the data set when a binary variable is introduced to indicate missing values.

Load breast-cancer-wisconsin.data.txt

    # Unzip the archive from the working directory and read the first extracted file.
    zipfile <- file.path(getwd(), "data 14.1.zip")
    extracted_files <- unzip(zipfile)
    data <- read.csv(extracted_files[1], header = FALSE, stringsAsFactors = FALSE, sep = ",")

Number of rows: 699
Number of columns: 11

1. Sample code number: id number
2. Clump Thickness: 1 - 10
3. Uniformity of Cell Size: 1 - 10
4. Uniformity of Cell Shape: 1 - 10
5. Marginal Adhesion: 1 - 10
6. Single Epithelial Cell Size: 1 - 10
7. Bare Nuclei: 1 - 10
8. Bland Chromatin: 1 - 10
9. Normal Nucleoli: 1 - 10
10. Mitoses: 1 - 10
11. Class: (2 for benign, 4 for malignant)

Breast Cancer Wisconsin Data - Preview top 5

    V1       V2  V3  V4  V5  V6  V7  V8  V9  V10  V11
    1000025   5   1   1   1   2   1   3   1    1    2
    1002945   5   4   4   5   7  10   3   2    1    2
    1015425   3   1   1   1   2   2   3   1    1    2
    1016277   6   8   8   1   3   4   3   7    1    2
    1017023   4   1   1   3   2   1   3   1    1    2

Find the missing values

Only the V7 column has missing values, marked with a question mark (?). About 2.3% of the data points (16 of 699) are missing V7, which is low enough for imputation to be reasonable.
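The claim that V7 is the only affected column can be checked by counting "?" entries in every column. The following is a minimal sketch on a toy data frame, not the author's code: `toy` reuses four columns of the five preview rows and adds one hypothetical row (id 9999999) with a "?" in V7, and the names `missing_per_column`, `columns_with_missing` and `share_rows_missing` are assumptions made for illustration.

```r
# Toy data: four columns of the five preview rows plus one HYPOTHETICAL row
# (id 9999999) whose V7 is missing. Not the real 699-row data set.
toy <- data.frame(
  V1  = c(1000025, 1002945, 1015425, 1016277, 1017023, 9999999),
  V2  = c(5, 5, 3, 6, 4, 8),
  V7  = c("1", "10", "2", "4", "1", "?"),
  V11 = c(2, 2, 2, 2, 2, 4),
  stringsAsFactors = FALSE
)

# Count "?" entries per column; numeric columns are compared as text, so they yield 0.
missing_per_column <- sapply(toy, function(col) sum(col == "?"))
print(missing_per_column)

# Columns that contain at least one "?".
columns_with_missing <- names(missing_per_column)[missing_per_column > 0]

# Share of rows with at least one "?" in any column.
share_rows_missing <- mean(rowSums(toy == "?") > 0)
```

On the real data, replacing `toy` with `data` should report a non-zero count for V7 only, if the statement above holds.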
    # Row indices where V7 is missing
    missingvalues <- which(data$V7 == "?")
    # Share of rows with a missing V7 (about 2.3%)
    ratio_missingvalues <- length(missingvalues) / nrow(data)
    # Data set with the missing rows only
    data_missingvalues <- data[missingvalues, ]
    # Data set without the missing rows; V7 can now be converted to integer
    data_no_missingvalues <- data[-missingvalues, ]
    data_no_missingvalues$V7 <- as.integer(data_no_missingvalues$V7)

Do the missing values carry any bias?

Since the goal is to predict V11, a quick way to check whether the missing values carry any bias is to compare the distribution of V11 among the rows with missing values against its distribution among the rows without missing values. V11 = 2 accounts for 87.5% of the rows with a missing V7, versus 65% of the rows without. The missing-value rows may therefore be biased toward the benign class, although with only 16 such rows this is suggestive rather than conclusive.

    # Share of V11 = 2 in the full data set
    sum(data$V11 == 2) / nrow(data)
    ## [1] 0.6552217
    # Share of V11 = 2 among rows with a missing V7
    sum(data_missingvalues$V11 == 2) / nrow(data_missingvalues)
    ## [1] 0.875
    # Share of V11 = 2 among rows without a missing V7
    sum(data_no_missingvalues$V11 == 2) / nrow(data_no_missingvalues)
    ## [1] 0.6500732

1. Use the mean/mode imputation method to impute values for the missing data.

Mode imputation is preferred for categorical variables. Here V7 appears to be an ordinal variable, so it could be treated as either categorical or continuous. I will use both mean and mode imputation, creating one column for the mean imputation and one column for the mode imputation of the missing V7 values. One quick comment: the mean is generally not an integer, so I arbitrarily round it to the nearest integer.

    # Compute the mean of the non-missing values of V7, rounded to the nearest integer
    mean_V7 <- round(mean(data_no_missingvalues$V7))
    # mean: 3.544656 => mean_V7: 4
    new_data <- data
    new_data$mean <- new_data$V7
    new_data$mean[new_data$mean == "?"] <- mean_V7
    new_data$mean <- as.numeric(new_data$mean)
    # Helper function that returns the mode (most frequent value) of a vector.
    getmode <- function(v) {
      uniqv <- unique(v)
      uniqv[which.max(tabulate(match(v, uniqv)))]
    }
    # Calculate the mode of the non-missing V7 values with the helper above.
    # mode_V7 = 1
    mode_V7 <- getmode(as.integer(data_no_missingvalues$V7))
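The preview stops right after the mode is computed. By analogy with the mean column, a mode column could be filled in the same way. The sketch below runs both imputations side by side on a toy vector so it is self-contained. `toy_v7`, `observed`, `imputed_mode` and `imputed_mean` are hypothetical names and values, not part of the original write-up, and this is only one plausible way to finish the step.

```r
# HYPOTHETICAL toy column standing in for V7; "?" marks a missing value.
toy_v7 <- c("1", "10", "2", "4", "1", "?", "1", "?")

# Same helper as in the write-up: returns the most frequent value of a vector.
getmode <- function(v) {
  uniqv <- unique(v)
  uniqv[which.max(tabulate(match(v, uniqv)))]
}

# Fill values are computed from the observed (non-missing) entries only.
observed <- as.integer(toy_v7[toy_v7 != "?"])
mode_value <- getmode(observed)
mean_value <- round(mean(observed))  # rounded so the result stays on the 1-10 integer scale

# Mode imputation: replace every "?" with the mode, then convert to integer.
imputed_mode <- toy_v7
imputed_mode[imputed_mode == "?"] <- mode_value
imputed_mode <- as.integer(imputed_mode)

# Mean imputation: same pattern with the rounded mean.
imputed_mean <- toy_v7
imputed_mean[imputed_mean == "?"] <- mean_value
imputed_mean <- as.integer(imputed_mean)

print(data.frame(original = toy_v7, imputed_mode, imputed_mean))
```

On the real data the same pattern would apply to a copy of `new_data$V7` with `mode_V7` as the fill value. Keeping the imputed values in separate columns leaves the original V7 untouched, which makes it easier to compare the two methods later.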
Uploaded On
May 19, 2022