ISYE 6501 WEEK 7 HOMEWORK – SAMPLE SOLUTIONS

Document Content and Description Below

WEEK 7 HOMEWORK – SAMPLE SOLUTIONS IMPORTANT NOTE These homework solutions show multiple approaches and some optional extensions for most of the questions in the assignment. You don’t need to submit all this in your assignments; they’re included here just to help you learn more – because remember, the main goal of the homework assignments, and of the entire course, is to help you learn as much as you can, and develop your analytics skills as much as possible! Question 1 Using the same crime data set as in Homework 5 Question 2, find the best model you can using (a) a regression tree model, and (b) a random forest model. In R, you can use the tree package or the rpart package, and the randomForest package. For each model, describe one or two qualitative takeaways you get from analyzing the results (i.e., don’t just stop when you have a good model, but interpret it too). (a) Regression Tree Here’s one possible solution. Please note that a good solution doesn’t have to try all of the possibilities in the code; they’re shown to help you learn, but they’re not necessary. The file HW7-Q1a-fall.R shows how to build the regression tree. It shows two different approaches: one using part of the data for training and part for testing, and one uses all of the data for training (because we have only 47 data points). A visualization of one of the trees is below. The tree in the figure (sometimes called a “dendro gram”) shows that four factors are used in branching: Po1, Pop, NW, and LF. Notice that Pop is used in two places in the tree. Also notice that following the rightmost branches down the tree, Po1 is used twice in the branching: once at the top, and then again lower down. It turns out that the model is overfit (see the R file). The R file shows a lot more possible models, including pruning to smaller trees, using regressions for each leaf instead of averages, etc. The models show that Po1 is the primary branching factor, and when Po1 < 7.65, we can build a regression model with PCA that can account for about 30% of the variability. But for Po1 > 7.65, we don’t have a good linear regression model; none of the factors are significant. This shows that we would need to either try other types of models, or find new explanatory factors. (b) Random Forest The file HW7-Q1b-fall.R shows how to run a random forest model for this same problem. In the lessons, we saw that the random forest process avoids some of the potential for overfitting. And (as the R code shows) , cross-validation shows that it works better than the previous models we’ve found for this data set. The R code also shows the same sort of qualitative behavior as we saw from the regression tree model. The random forest model too thinks that Po1 is the most important predictive factor for Crime. And, as | Po1 < 7.65 Pop < 22.5 LF < 0.5675 NW < 7.65 Pop < 21.5 Po1 < 9.65 500 700 800 1000 700 1000 2000 we saw from the regression tree, the random forest model gives better predictive quality for data points where Po1 < 7.65 than it does for Po1 > 7.65. But unlike the regression tree, the random forest does have predictive value even for Po1 > 7.65. Question 2 Describe a situation or problem from your job, everyday life, current events, etc., for which a logistic regression model would be appropriate. List some (up to 5) predictors that you might use. Here’s one potential suggestion. In business-to-business sales and marketing, buying-propensity models can be useful in ranking customers as high value or low value. A logistic regression model could be used to predict the probability of a customer buying or not, or (using a threshold) to classify into yes or no categories. Some of the predictors that could be used to develop such a model are: 1. Size of the customer in terms of revenue/year; 2. Average yearly spend (for the last 3 years) in the relevant product category; 3. The rating of the competitor(s) from where the customer currently buys the product (a market leader would have a higher rating as compared to a new entrant); 4. The number of years associated with the customer (is the customer long-standing? Have they bought from us for the last 5/10 years or is it a new relationship?); 5. Industry rankings for the product by various rating agencies (e.g., Gartner). Using this model, sales teams can rationalize their efforts and concentrate their efforts on customers that are more likely to buy, rather than going broad and losing focus while catering to all customers. Question 3 1. Using the GermanCredit data set at http://archive.ics.uci.edu/ml/machine-learningdatabases/statlog/german / (description at http://archive.ics.uci.edu/ml/datasets/Statlog+%28German+Credit+Data%29 ), use logistic regression to find a good predictive model for whether credit applicants are good credit risks or not. Show your model (factors used and their coefficients), the software output, and the quality of fit. You can use the glm function in R. To get a logistic regression (logit) model on data where the response is either zero or one, use family=binomial(link=”logit”) in your glm function call. Here’s one possible solution. Please note that a good solution doesn’t have to try all of the possibilities in the code; they’re shown to help you learn, but they’re not necessary. The file HW7-Q3-fall.R shows a possible approach. The data includes a lot of categorical variables, and for some of them, not all of the values are significant. So, as the R code shows, we have new variables for each value of each categorical variable that’s significant. Here’s a table of significant variables and their coefficients for the model (which has AIC = 673.5):

[Show More]

Last updated: 3 years ago

Preview 1 out of 6 pages

Buy Now

Instant download

Preview image of ISYE 6501 WEEK 7 HOMEWORK – SAMPLE SOLUTIONS document

Buy this document to get the full access instantly

Instant Download Access after purchase

Buy Now

Instant download

We Accept:

Report Copyright Violation

Also available in bundle (1)

Click Below to Access Bundle(s)

BUNDLED PAPERS (Multiple versions) FOR Georgia Institute Of Technology ISYE 6501 Homeworks 1 - 15, Midterm 1 & 2 + FINAL EXAM | ISYE6501x Courseware | edX - Complete Solutions - Introduction To Analytics Modeling - GTX ISYE 6501

GTx: ISYE6501x Introduction to Analytics Modeling Midterm Quiz 2 - GT Students and Verified MM Learners latest 2021 Midterm Quiz 1 - GT Students (Launch Proctortrack first before taking the Midterm Qu...

By Nutmegs 3 years ago

$15