ISYE 6501 Week 7 HW Latest Update

Question 10.1

Using the same crime data set uscrime.txt as in Questions 8.2 and 9.1, find the best model you can using (a) a regression tree model, and (b) a random forest model. In R, you can use the tree package or the rpart package, and the randomForest package. For each model, describe one or two qualitative takeaways you get from analyzing the results (i.e., don't just stop when you have a good model, but interpret it too).

Regression tree model

The dataset contains only 47 points, so a regression tree may not be able to produce many splits, or it may overfit, and we cannot say for sure that the model would work as well on a larger dataset. For this regression tree, I did not split the data into training and validation sets; I used all the data points to build the model.

The initial model used the variables "Po1", "Pop", "LF" and "NW", and its residual mean deviance was 47390. This tree had 7 terminal nodes.

[Figure: plot of the unpruned regression tree]

In the next step I pruned this tree to successively fewer leaf nodes (from 6 down to 2) and looked at the residual mean deviance, which kept increasing with each node I dropped. It might seem that 7 leaf nodes gives the best-fitting model, but with such a small sample this tree is overfitted.

To address this I applied cross-validation. cv.tree reports a cross-validated version of the pruning sequence: instead of computing the deviance on the full training data, it uses cross-validated values for each of the 6 successive prunings. Comparing the deviance from prune.tree alone with the cross-validated deviance shows that the cross-validated values are higher at every step. prune.tree evaluates on the training data and so under-reports the deviance; the cross-validated values are more realistic. My randomized cross-validation showed that the RMSE with 6 leaf nodes is very close to the RMSE with 7.
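The gap between training deviance and cross-validated deviance described above can be illustrated with a small sketch. This is not the author's R code and does not use the tree package or uscrime.txt: it is a Python stand-in with a hypothetical one-variable greedy regression tree (fit_tree) and 47 synthetic points, where tree depth plays the role of the number of leaf nodes. The training error can only fall as the tree grows, while the 5-fold cross-validated error is typically higher and need not keep falling.

```python
import random

def sse(ys):
    """Sum of squared deviations from the mean (a leaf's deviance)."""
    mean = sum(ys) / len(ys)
    return sum((y - mean) ** 2 for y in ys)

def fit_tree(points, depth):
    """Greedy 1-D regression tree: a leaf mean, or (cut, left, right)."""
    mean = sum(y for _, y in points) / len(points)
    if depth == 0 or len(points) < 4:
        return mean
    pts = sorted(points)
    best = None
    for i in range(2, len(pts) - 1):      # keep at least 2 points per side
        if pts[i - 1][0] == pts[i][0]:
            continue                      # cannot cut between equal x values
        cost = sse([y for _, y in pts[:i]]) + sse([y for _, y in pts[i:]])
        if best is None or cost < best[0]:
            best = (cost, i)
    if best is None:
        return mean
    i = best[1]
    cut = (pts[i - 1][0] + pts[i][0]) / 2
    return (cut, fit_tree(pts[:i], depth - 1), fit_tree(pts[i:], depth - 1))

def predict(tree, x):
    """Walk down the tree until a leaf (a plain float) is reached."""
    while isinstance(tree, tuple):
        cut, left, right = tree
        tree = left if x < cut else right
    return tree

def mse(tree, points):
    return sum((y - predict(tree, x)) ** 2 for x, y in points) / len(points)

def cv_mse(points, depth, k=5):
    """k-fold cross-validated MSE: every point is scored by a tree
    that was fitted without it."""
    folds = [points[i::k] for i in range(k)]
    total = 0.0
    for i, held_out in enumerate(folds):
        train = [p for j, fold in enumerate(folds) if j != i for p in fold]
        tree = fit_tree(train, depth)
        total += sum((y - predict(tree, x)) ** 2 for x, y in held_out)
    return total / len(points)

# Synthetic data (an assumption for illustration): a step function plus noise.
random.seed(1)
data = []
for _ in range(47):
    x = random.uniform(0, 10)
    y = (5.0 if x < 4 else 9.0) + random.gauss(0, 2)
    data.append((x, y))

train_err, cv_err = {}, {}
for depth in range(5):
    train_err[depth] = mse(fit_tree(data, depth), data)
    cv_err[depth] = cv_mse(data, depth)
    print(f"depth={depth}  train MSE={train_err[depth]:.3f}  "
          f"CV MSE={cv_err[depth]:.3f}")
```

Choosing the tree size from the cross-validated column rather than the training column is the same idea as preferring the cv.tree deviances over the prune.tree deviances.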
Based on the cross-validation results, I pruned the tree to 6 leaf nodes and calculated the R-squared of both the unpruned and pruned models, which were very close to each other, within the 0.70 to 0.72 range. With a different cross-validation sample, the minimum RMSE could occur at a different number of leaf nodes, and a regression tree built on such limited training data can still be overfitted.

Takeaway: The model shows that Po1 is the variable on which the first split happens, and LF is possibly the least important, since it is the first to be dropped in the pruned tree. NW is probably more important than Pop: within the same branch, pruning removed Pop but kept NW.

Random forest model

To choose nodesize and mtry for the random forest model, I looped over node sizes from 2 to 15 and mtry values from 1 to 10 and charted the resulting R-squared values. The best combination was mtry = 3 and nodesize = 3, which gave the highest R-squared of 0.4551208. I used these values to build the model and looked at the importance of the variables in it.

Takeaway: The random forest used more variables than the regression tree but did not produce a better R-squared. This is possibly because we do not have enough data for this method, so most of the trees were very similar to each other. The charts also suggest that increasing the number of variables tried at each split (mtry) actually decreases the accuracy of this model.

Question 10.2

Describe a situation or problem from your job, everyday life, current events, etc., for which a logistic regression model would be appropriate. List some (up to 5) predictors that you might use.

When sending out targeted emails with offers, our marketing team at a leading automotive company would use logistic regression modeling to determine which types of email offers particular groups of customers would react to.
The predictors that could be used are: customer age group, type of car they own, age of the car, frequency of services availed at the dealership, past offer redemption types, etc. Based on these, customer segments are created and the emails are formatted accordingly through Salesforce. Once the recipients click through the emails and we get back the sales and service data from the dealerships, these feed back into the model for further adjustments.

Question 10.3

1. Using the GermanCredit data set germancredit.txt from http://archive.ics.uci.edu/ml/machine-learning-databases/statlog/german/ (description at http://archive.ics.uci.edu/ml/datasets/Statlog+%28German+Credit+Data%29 ), use logistic regression to find a good predictive model for whether credit applicants are good credit risks or not. Show your model (factors used and their coefficients), the software output, and the quality of fit. You can use the glm function in R. To get a logistic regression (logit) model on data where the response is either zero or one, use family=binomial(link="logit") in your glm function call.
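As a rough illustration of what a logit model on a zero/one response involves, here is a minimal Python sketch. It is an assumption-laden stand-in rather than the glm call the question asks for: it fits the coefficients by plain gradient descent (not the fitting routine glm uses), and it runs on synthetic data with two made-up predictors, not on germancredit.txt. The names fit_logistic, predict_proba and log_loss are hypothetical helpers defined here.

```python
import math
import random

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def predict_proba(weights, row):
    """P(y = 1 | row); weights[0] is the intercept."""
    z = weights[0] + sum(w * x for w, x in zip(weights[1:], row))
    return sigmoid(z)

def log_loss(weights, X, y):
    """Mean negative log-likelihood (lower means a better fit)."""
    eps = 1e-12
    total = 0.0
    for row, label in zip(X, y):
        p = min(max(predict_proba(weights, row), eps), 1 - eps)
        total -= label * math.log(p) + (1 - label) * math.log(1 - p)
    return total / len(y)

def fit_logistic(X, y, lr=0.1, steps=2000):
    """Batch gradient descent on the mean log-loss."""
    weights = [0.0] * (len(X[0]) + 1)
    for _ in range(steps):
        grad = [0.0] * len(weights)
        for row, label in zip(X, y):
            err = predict_proba(weights, row) - label
            grad[0] += err
            for j, x in enumerate(row):
                grad[j + 1] += err * x
        weights = [w - lr * g / len(y) for w, g in zip(weights, grad)]
    return weights

# Synthetic data (an assumption): the true logit is -0.5 + 2*x1 - 1*x2.
random.seed(0)
X, y = [], []
for _ in range(200):
    x1, x2 = random.gauss(0, 1), random.gauss(0, 1)
    p = sigmoid(-0.5 + 2.0 * x1 - 1.0 * x2)
    X.append([x1, x2])
    y.append(1 if random.random() < p else 0)

weights = fit_logistic(X, y)
initial_loss = log_loss([0.0, 0.0, 0.0], X, y)   # all-zero model: ln(2)
final_loss = log_loss(weights, X, y)
accuracy = sum(
    (predict_proba(weights, row) >= 0.5) == (label == 1)
    for row, label in zip(X, y)
) / len(y)

print("coefficients (intercept, x1, x2):", [round(w, 3) for w in weights])
print(f"log-loss: {initial_loss:.4f} -> {final_loss:.4f}")
print(f"training accuracy at a 0.5 threshold: {accuracy:.3f}")
```

The printed coefficients, log-loss and accuracy correspond loosely to the "factors used and their coefficients" and "quality of fit" that the question asks for. For a credit-risk problem the 0.5 threshold would normally be revisited, since the two kinds of misclassification may not cost the same.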

Last updated: 3 months ago
