Question 10.1
Using the same crime data set uscrime.txt as in Questions 8.2 and 9.1, find the best model you
can using
(a) a regression tree model, and
(b) a random forest model.
In R, you can use the tree package or the rpart package, and the randomForest package. For
each model, describe one or two qualitative takeaways you get from analyzing the results (i.e., don’t
just stop when you have a good model, but interpret it too).
Regression tree model
Since the data set contains only 47 points, a regression tree may not be able to produce many splits,
or it may overfit, and we cannot say for sure that the model would work as well on a larger data set.
For this regression tree, I did not split the data into training and validation sets; instead I used
all the data points to build the model. The initial model used the variables "Po1", "Pop", "LF", and
"NW", and its residual mean deviance was 47390. This tree had 7 terminal nodes and is shown below.
ISYE 6501 Week 7 HW
In the next step I pruned this tree successively from 6 down to 2 leaf nodes to look at the residual
mean deviances, which kept increasing each time I dropped a node. It might seem that the tree with 7
leaf nodes is the best-fitting model, but because the sample is so small it is overfitted. To address
this I chose to apply cross-validation. cv.tree gives a cross-validated version of the pruning sequence.
Instead of computing the deviance on the full training data, it uses cross-validated values for each of
the 6 successive prunings. Comparing the deviance reported by prune.tree alone with the cross-validated
deviance shows that the cross-validated values are higher at every step. prune.tree alone evaluates on
the training data and so under-reports the deviance; the cross-validated values are more realistic.
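The gap between training deviance and cross-validated deviance can be illustrated with a minimal sketch. It is written in Python rather than R, uses synthetic data (not uscrime.txt), and fits only a one-split "stump" instead of a full tree; the helper names `fit_stump` and `loo_cv_deviance` are hypothetical and are not part of the tree package.

```python
import random

def sse(values):
    """Sum of squared deviations from the mean (the deviance of one leaf)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def fit_stump(xs, ys):
    """Find the single split on x that minimises total within-leaf deviance."""
    best = None
    for t in sorted(set(xs))[1:]:  # skip the smallest x so both leaves are non-empty
        left = [y for x, y in zip(xs, ys) if x < t]
        right = [y for x, y in zip(xs, ys) if x >= t]
        total = sse(left) + sse(right)
        if best is None or total < best[0]:
            best = (total, t, sum(left) / len(left), sum(right) / len(right))
    return best  # (training deviance, threshold, left mean, right mean)

def predict(stump, x):
    _, threshold, left_mean, right_mean = stump
    return left_mean if x < threshold else right_mean

def loo_cv_deviance(xs, ys):
    """Leave-one-out: refit without point i, then score the held-out point."""
    total = 0.0
    for i in range(len(xs)):
        stump = fit_stump(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        total += (ys[i] - predict(stump, xs[i])) ** 2
    return total

# Synthetic data (assumption): 47 points, a step at x = 5 plus Gaussian noise.
random.seed(0)
xs = [random.uniform(0, 10) for _ in range(47)]
ys = [(5.0 if x > 5 else 1.0) + random.gauss(0, 1) for x in xs]

train_dev = fit_stump(xs, ys)[0]
cv_dev = loo_cv_deviance(xs, ys)
print(f"training deviance: {train_dev:.2f}")
print(f"LOO-CV deviance:   {cv_dev:.2f}")
```

The training figure is computed on the same points that chose the split, so it is optimistic; the leave-one-out figure scores each point with a model that never saw it.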
My random cross-validation showed that even with 6 leaf nodes the RMSE is very close to that with 7.
So I chose to prune the tree to 6 leaf nodes and then calculated the R-squared of both the unpruned and
the pruned models, which turned out to be very close to each other, within the 0.70 to 0.72 range.
If the cross-validation sampling were done differently, the minimum RMSE could fall at a different
number of leaf nodes, and a regression tree built on such limited training data may still be overfitted.
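For reference, the R-squared used to compare the unpruned and pruned trees is one minus the ratio of the residual sum of squares to the total sum of squares. A minimal Python sketch follows; the function name `r_squared` and the four toy values are illustrative assumptions, not the crime data.

```python
def r_squared(actual, predicted):
    """R-squared = 1 - SS_res / SS_tot."""
    mean = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot

# Toy values (assumption): a two-leaf tree predicting 1.5 and 3.5.
actual = [1.0, 2.0, 3.0, 4.0]
predicted = [1.5, 1.5, 3.5, 3.5]
print(r_squared(actual, predicted))
```

When the predictions come from the same data the tree was fitted on, this value is a training R-squared and carries the same optimism as the training deviance.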
Takeaway –
The model shows that Po1 is the variable on which the first split happens, and LF is possibly the least
important one, since it is the first to be dropped in the pruned tree. It also shows that NW is probably
more important than Pop, because in the same branch pruning removed Pop but kept NW.
Random forest model
To choose the nodesize and mtry of the random forest model, I looped over node sizes from 2 to 15 and
mtry values from 1 to 10 and charted the resulting R-squared values to find the best combination. I
found that mtry = 3 and nodesize = 3 gave the highest R-squared, 0.4551208.
I applied these values to create the model and looked at the importance of the variables in the model.
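The search loop described above can be sketched generically. This Python sketch is an assumption about the shape of that loop, not the code actually used: `grid_search` is a hypothetical helper, and `toy_score` is a made-up stand-in for "fit a random forest with these settings and return its R-squared", deliberately peaked at (3, 3) only so the output is easy to check.

```python
from itertools import product

def grid_search(score_fn, node_sizes, mtry_values):
    """Return (best_score, best_nodesize, best_mtry) over all combinations."""
    best = None
    for nodesize, mtry in product(node_sizes, mtry_values):
        score = score_fn(nodesize, mtry)
        if best is None or score > best[0]:
            best = (score, nodesize, mtry)
    return best

def toy_score(nodesize, mtry):
    """Hypothetical scoring function; a real one would fit and evaluate a model."""
    return -((nodesize - 3) ** 2) - (mtry - 3) ** 2

# Same ranges as in the text: node sizes 2 to 15, mtry 1 to 10.
best_score, best_nodesize, best_mtry = grid_search(
    toy_score, range(2, 16), range(1, 11)
)
print(best_nodesize, best_mtry)
```

Because a random forest is itself randomized, a real scoring function would normally fix a seed or average over several fits so that small differences between grid cells are not just noise.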
Takeaway –
The random forest used more variables than the regression tree but did not produce better R-squared
values. Possibly this is because we do not have enough data for this method, so most of the trees were
very similar to each other. From the charts, it appears that increasing the number of variables sampled
at each split (mtry) actually decreases the accuracy of this model.
Question 10.2
Describe a situation or problem from your job, everyday life, current events, etc., for which a logistic
regression model would be appropriate. List some (up to 5) predictors that you might use.
When sending out targeted emails with offers, our marketing team at a leading automotive company
could use a logistic regression model to determine which types of email offers certain groups of
customers would respond to. The predictors that could be used are: customer age group, type of car
owned, age of the car, frequency of services used at the dealership, and past offer redemption types.
Based on these, customer segments are created and the emails are formatted accordingly through
Salesforce. Once the recipients click through them and we get back the sales and service data from the
dealerships, those results are fed back into the model for further adjustments.
Question 10.3
1. Using the GermanCredit data set germancredit.txt from
http://archive.ics.uci.edu/ml/machine-learning-databases/statlog/german/ (description at
http://archive.ics.uci.edu/ml/datasets/Statlog+%28German+Credit+Data%29 ), use logistic
regression to find a good predictive model for whether credit applicants are good credit risks
or not. Show your model (factors used and their coefficients), the software output, and the
quality of fit. You can use the glm function in R. To get a logistic regression (logit) model on
data where the response is either zero or one, use family=binomial(link="logit")
in your glm function call.
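As a small illustration of what the logit link means, the sketch below (in Python, for illustration only) turns a linear predictor into a probability with the logistic function. The intercept, coefficients, and feature values are hypothetical numbers chosen for the example; they are not fitted to germancredit.txt.

```python
import math

def logistic(z):
    """Inverse of the logit link: maps any real number to a value in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_probability(intercept, coefficients, features):
    """P(response = 1) for one observation under a logistic regression model."""
    z = intercept + sum(b * x for b, x in zip(coefficients, features))
    return logistic(z)

# Hypothetical model and applicant (assumption): linear predictor z = -1 + 1 - 1 = -1.
p = predict_probability(-1.0, [0.5, -0.25], [2.0, 4.0])
print(round(p, 4))
```

A classification is then obtained by comparing the probability with a threshold, which does not have to be 0.5 when the two kinds of error have different costs.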