Question 8.1
Describe a situation or problem from your job, everyday life, current events, etc., for which a linear
regression model would be appropriate. List some (up to 5) predictors that you might use.
Working at Canada’s biggest retail hardware store chain and building up sales analytics from scratch,
we faced a problem gathering store transaction data for various legal reasons. Close to 30% of
our 1100+ stores across the country were initially skeptical and reluctant to share their data
because of the dealer-owner cooperative business model. We were receiving sales data from around 750
stores, and on that basis we were building dashboards and making business decisions around various
financial, merchandising, and marketing goals. But we lacked complete insight, as a big chunk of the data
was still unknown to us. We could speculate, but speculation was not reliable enough. At this point, we
decided to build a regression model to predict the total sales dollars of those ‘unknown’
stores based on criteria such as –
I. Store area (in sq ft) – expected to have a positive correlation with sales
II. Primary line of business, LOB (hardware, building center, furniture, etc.) – needs to be converted to
numeric (dummy) values; stores with building centers usually have larger sales
III. Monthly average temperature of the area code (as sales could be seasonal) – helps when we are
estimating monthly sales for our monthly BI reports, or dealing with seasonality
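As a rough illustration of how such a model could be set up in R, here is a minimal sketch. The data are synthetic and the column names (`area_sqft`, `lob`, `avg_temp_c`, `sales`) are hypothetical, chosen only so the example runs; they are not the actual store data. Note that `lm` turns a factor such as `lob` into dummy variables on its own, so no manual numeric conversion is needed.

```r
# Hypothetical illustration only: synthetic store data, not real sales figures.
set.seed(1)
n <- 60
stores <- data.frame(
  area_sqft  = runif(n, 5000, 60000),
  lob        = factor(sample(c("hardware", "building_center", "furniture"),
                             n, replace = TRUE)),
  avg_temp_c = runif(n, -15, 25)
)

# Simulated monthly sales so the example runs end to end.
stores$sales <- 20000 + 3 * stores$area_sqft +
  ifelse(stores$lob == "building_center", 50000, 0) +
  800 * stores$avg_temp_c + rnorm(n, sd = 10000)

# lm() expands the factor `lob` into 0/1 dummy columns automatically.
fit <- lm(sales ~ area_sqft + lob + avg_temp_c, data = stores)
print(coef(fit))

# Predict sales for a store that does not share its data.
unknown <- data.frame(
  area_sqft  = 30000,
  lob        = factor("hardware", levels = levels(stores$lob)),
  avg_temp_c = 10
)
pred <- predict(fit, newdata = unknown)
print(pred)
```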
Question 8.2
Using crime data from http://www.statsci.org/data/general/uscrime.txt (file uscrime.txt,
description at http://www.statsci.org/data/general/uscrime.html ), use regression (a useful R function is
lm or glm) to predict the observed crime rate in a city with the following data:
M = 14.0
So = 0
Ed = 10.0
Po1 = 12.0
Po2 = 15.5
LF = 0.640
M.F = 94.0
Pop = 150
NW = 1.1
U1 = 0.120
U2 = 3.6
Wealth = 3200
Ineq = 20.1
Prob = 0.04
Time = 39.0
Show your model (factors used and their coefficients), the software output, and the quality of fit.
ISYE 6501 Week 5 HW
Note that because there are only 47 data points and 15 predictors, you’ll probably notice some
overfitting. We’ll see ways of dealing with this sort of problem later in the course.
Ans –
The uscrime dataset records the number of offenses per 100,000 population (the response, Crime, is
continuous), along with a set of possible “predictors” –
#Variable  Description
#M         percentage of males aged 14–24 in total state population
#So        indicator variable for a southern state
#Ed        mean years of schooling of the population aged 25 years or over
#Po1       per capita expenditure on police protection in 1960
#Po2       per capita expenditure on police protection in 1959
#LF        labor force participation rate of civilian urban males in the age-group 14–24
#M.F       number of males per 100 females
#Pop       state population in 1960 in hundred thousand
#NW        percentage of nonwhites in the population
#U1        unemployment rate of urban males 14–24
#U2        unemployment rate of urban males 35–39
#Wealth    wealth: median value of transferable assets or family income
#Ineq      income inequality: percentage of families earning below half the median income
#Prob      probability of imprisonment: ratio of number of commitments to number of offenses
#Time      average time in months served by offenders in state prisons before their first release
#Crime     crime rate: number of offenses per 100,000 population in 1960
To understand the data better, after loading it into a table I looked at the data summary and at a
box plot to check for possible outliers. I performed this check mostly for discovery and have not
removed any data points for this assignment. The Crime values 1969, 1674, and 1993 showed up as the
three highest values outside the whiskers of the boxplot. Grubbs’ test could be used to decide whether
to remove them as outliers, but I skipped this step.
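The outlier check described above can be sketched in R roughly as follows. The helper name `whisker_outliers` and the toy vector are my own illustration, not the crime data, and the Grubbs' test call assumes the `outliers` package is installed, which is not part of base R.

```r
# Values beyond the boxplot whiskers (1.5 * IQR rule), using base R only.
whisker_outliers <- function(x) boxplot.stats(x)$out

# Toy vector for illustration (not the crime data).
toy <- c(10, 12, 11, 13, 12, 11, 50)
flagged <- whisker_outliers(toy)
print(flagged)

# Optional: Grubbs' test on the single most extreme value, only if the
# 'outliers' package happens to be installed (an assumption).
if (requireNamespace("outliers", quietly = TRUE)) {
  print(outliers::grubbs.test(toy, type = 10))
}

# With the real data (requires network access):
# crime <- read.table("http://www.statsci.org/data/general/uscrime.txt",
#                     header = TRUE)
# summary(crime)
# whisker_outliers(crime$Crime)
```

Grubbs' test assumes roughly normal data and tests one extreme value at a time, so a boxplot flag on its own is not enough reason to drop a point.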
Next, I looked at the correlation matrix to check whether any pairs of variables are correlated with
each other. I found a strong linear correlation between Po1 and Po2, with a correlation coefficient of
0.99. Also, Wealth and Ineq are strongly negatively correlated, with a correlation coefficient of
-0.88.
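One way to pull strongly correlated pairs out of a correlation matrix is sketched below. The function name `high_cor_pairs` and the 0.8 cutoff are assumptions made for illustration, and the demo runs on synthetic columns rather than the crime data.

```r
# Return variable pairs whose absolute Pearson correlation exceeds `cutoff`.
high_cor_pairs <- function(df, cutoff = 0.8) {
  cm <- cor(df)
  # Look only above the diagonal so each pair is reported once.
  idx <- which(abs(cm) > cutoff & upper.tri(cm), arr.ind = TRUE)
  data.frame(
    var1 = rownames(cm)[idx[, "row"]],
    var2 = colnames(cm)[idx[, "col"]],
    r    = cm[idx],
    stringsAsFactors = FALSE
  )
}

# Synthetic demo: `b` is almost a copy of `a`; `c` is independent noise.
set.seed(42)
a <- rnorm(50)
toy <- data.frame(a = a, b = a + rnorm(50, sd = 0.05), c = rnorm(50))
pairs_found <- high_cor_pairs(toy)
print(pairs_found)

# With the real data (requires network access):
# crime <- read.table("http://www.statsci.org/data/general/uscrime.txt",
#                     header = TRUE)
# high_cor_pairs(crime[, names(crime) != "Crime"])
```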
I also checked the scatter plots of the predictors against Crime to get a visual idea of the
correlations, which suggested that not all of them may be significant for our model.
• lm –
In the next step, I first fit a linear regression model using all the attributes in the dataset to
create a baseline. The summary of this model shows that only 6 attributes have a p-value <= 0.1,
hence they are the only ones possibly significant enough. In real-life applications with a bigger
volume of data, this threshold would usually be 0.05 or lower, but as we have only 47 data points
and 15 predictors, I am using a looser threshold. For this model, the R-squared = 0.8031
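A baseline fit of this kind, with screening by p-value and a prediction for the city in the question, could look roughly like this in R. The helper `fit_and_screen` is a hypothetical name of my own, and the runnable demo uses synthetic data; the lines that load the real file are left commented out because they need network access. The values in `new_city` are copied from the question.

```r
# Fit response ~ all other columns, then list predictors with p-value <= alpha.
fit_and_screen <- function(df, response, alpha = 0.1) {
  fit   <- lm(as.formula(paste(response, "~ .")), data = df)
  coefs <- summary(fit)$coefficients
  keep  <- rownames(coefs) != "(Intercept)"
  pvals <- setNames(coefs[keep, "Pr(>|t|)"], rownames(coefs)[keep])
  list(fit         = fit,
       significant = names(pvals)[pvals <= alpha],
       r_squared   = summary(fit)$r.squared)
}

# The city to predict, with the values given in the question.
new_city <- data.frame(
  M = 14.0, So = 0, Ed = 10.0, Po1 = 12.0, Po2 = 15.5, LF = 0.640,
  M.F = 94.0, Pop = 150, NW = 1.1, U1 = 0.120, U2 = 3.6,
  Wealth = 3200, Ineq = 20.1, Prob = 0.04, Time = 39.0
)

# Synthetic demo (not the crime data): only x1 truly drives y.
set.seed(7)
n <- 47
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
toy$y <- 5 + 2 * toy$x1 + rnorm(n, sd = 0.5)
res <- fit_and_screen(toy, "y")
print(res$significant)
print(res$r_squared)

# With the real data (requires network access):
# crime <- read.table("http://www.statsci.org/data/general/uscrime.txt",
#                     header = TRUE)
# res <- fit_and_screen(crime, "Crime")
# summary(res$fit)
# predict(res$fit, newdata = new_city)
```

Because the baseline uses all 15 predictors on 47 rows, its R-squared is likely optimistic, as the question itself warns, so the prediction from the full model should be treated with caution.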