Engineering  >  QUESTIONS & ANSWERS  >  ISYE6501 HOMEWORK 10 LATEST UPDATE (All)

ISYE6501 HOMEWORK 10 LATEST UPDATE

Document Content and Description Below

ISYE6501 HOMEWORK 10 Question 14.1 The breast cancer data set breast-cancer-wisconsin.data.txt from http://archive.ics.uci.edu/ml/ machine-learning-databases/breast-cancer-wisconsin/ (description a ... t http://archive.ics.uci.edu/ml/ datasets/Breast+Cancer+Wisconsin+%28Original%29 ) has missing values. 1. Use the mean/mode imputation method to impute values for the missing data. 2. Use regression to impute values for the missing data. 3. Use regression with perturbation to impute values for the missing data. 4. (Optional) Compare the results and quality of classification models (e.g., SVM, KNN) build using (1) the data sets from questions 1,2,3; (2) the data that remains after data points with missing values are removed; and (3) the data set when a binary variable is introduced to indicate missing values. Load breast-cancer-wisconsin.data.txt zipfile = paste(getwd(),'/data 14.1.zip', sep ="") zip = unzip(zipfile) data <- read.csv(zip[1], header = FALSE, stringsAsFactors = FALSE ,sep = ",") number of Rows: 699 number of Columns: 11 1. Sample code number: id number 2. Clump Thickness: 1 - 10 3. Uniformity of Cell Size: 1 - 10 4. Uniformity of Cell Shape: 1 - 10 5. Marginal Adhesion: 1 - 10 6. Single Epithelial Cell Size: 1 - 10 7. Bare Nuclei: 1 - 10 8. Bland Chromatin: 1 - 10 9. Normal Nucleoli: 1 - 10 10. Mitoses: 1 - 10 11. Class: (2 for benign, 4 for malignant) Breast Cancer Wisconsin Data - Preview top 5 V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 1000025 5 1 1 1 2 1 3 1 1 2 1002945 5 4 4 5 7 10 3 2 1 2 1015425 3 1 1 1 2 2 3 1 1 2 1016277 6 8 8 1 3 4 3 7 1 2 1017023 4 1 1 3 2 1 3 1 1 2 Find the missing values Only V7 column has missing values identified with the question mark ?. 2.28% of data points have missing data. So that’s an acceptable rate to perform some imputation of the missing data. missingvalues<-which(data$V7 == '?') # 2.28% of missing data ratio_missingvalues<-length(missingvalues)/nrow(data) #data set with missing data only data_missingvalues<-data[missingvalues,] #data set without missing data 1 data_no_missingvalues<-data[-missingvalues,] data_no_missingvalues$V7 = as.integer(data_no_missingvalues$V7) Does the missing values have any bias? Since the goal is to predict V11, One quick way to verify whether the missing values are carrying any bias, is to compare the distribution of V11’s values in the population with missing values and the distribution in the population without missing values. V11 = 2 represent 87.5% of the missing value data set versus 65% in the data set without missing value. So we might conclude that the missing value population might have some bias. #0.6552217 sum(data$V11 == 2)/nrow(data) ## [1] 0.6552217 #0.875 sum(data_missingvalues$V11 == 2)/nrow(data_missingvalues) ## [1] 0.875 #0.6500732 sum(data_no_missingvalues$V11 == 2)/nrow(data_no_missingvalues) ## [1] 0.6500732 1. Use the mean/mode imputation method to impute values for the missing data. Mode imputation is prefered for categorical variables. In our case, V7 is apparently an ordinal variable so we might consider to treat it as a categorical or continuous variable. I will use both methods mean and mode imputations, and I will create one column for the mean imputation and one column for the mode imputation to imputethe missing values of V7. One quick comment: taking the mean might give us a float; I arbitrary rounding the value to the nearest integer. # Compute the mean on the non missing values of V7 mean_V7<-round(mean(as.integer(data_no_missingvalues$V7))) #mean: 3.544656 => mean_V7: 4 new_data<-data new_data$mean<-new_data$V7 new_data$mean[new_data$mean== "?"] <- round(mean_V7) new_data$mean<-as.numeric(new_data$mean) # Create the function. getmode <- function(v) { uniqv <- unique(v) uniqv[which.max(tabulate(match(v, uniqv)))] } # Calculate the mode using the user function. #mode_V7 = 1 mode_V7 <- getmode(as.integer(data_no_missingvalues$V7)) [Show More]

Last updated: 3 years ago

Preview 1 out of 10 pages

Buy Now

Instant download

We Accept:

Payment methods accepted on Scholarfriends (We Accept)
Preview image of ISYE6501 HOMEWORK 10 LATEST UPDATE document

Buy this document to get the full access instantly

Instant Download Access after purchase

Buy Now

Instant download

We Accept:

Payment methods accepted on Scholarfriends (We Accept)

Reviews( 0 )

$9.00

Buy Now

We Accept:

Payment methods accepted on Scholarfriends (We Accept)

Instant download

Can't find what you want? Try our AI powered Search

177
0

Document information


Connected school, study & course


About the document


Uploaded On

May 19, 2022

Number of pages

10

Written in

All

Seller


Profile illustration for Nutmegs
Nutmegs

Member since 4 years

607 Documents Sold

Reviews Received
77
14
8
2
21
Additional information

This document has been written for:

Uploaded

May 19, 2022

Downloads

 0

Views

 177

Document Keyword Tags


$9.00
What is Scholarfriends

Scholarfriends.com Online Platform by Browsegrades Inc. 651N South Broad St, Middletown DE. United States.

We are here to help

We're available through e-mail, Twitter, Facebook, and live chat.
 FAQ
 Questions? Leave a message!

Follow us on
 Twitter

Copyright © Scholarfriends · High quality services·