Georgia Tech ISYE 6501: Introduction to Analytics Modeling. Professor: Dr. Joel Sokol. Homework 3.

31 January 2018

Overview

This week’s lesson covers data preparation, including outlier identification, handling outliers, and an introduction to change detection.

Data preparation involves inspecting data visually for outliers and using a statistical test, Grubbs’ Test, to detect outliers in a univariate data set assumed to come from a normally distributed population. The null and alternative hypotheses are two mutually exclusive statements about a population, and a hypothesis test uses sample data to determine whether to reject the null hypothesis. For Grubbs’ Test, the null hypothesis states that all the data values come from the same normal distribution. The alternative hypothesis states that either the smallest or largest data value is an outlier. [1]

The CUSUM test is used for change detection:

$S_t = \max\{0,\ S_{t-1} + (x_t - \mu - C)\}$; is $S_t \ge T$?

Calculate the metric $S_t$ and declare an observed change when $S_t$ reaches or exceeds some threshold T. At each time period, observe $x_t$, see how far above the expectation it is ($x_t - \mu$), and add that to the previous period’s metric ($S_{t-1}$). Then take the max of 0 and that value: keep the running total if it is greater than 0, otherwise reset it to zero.

Random variation alone puts $x_t$ above $\mu$ some of the time (up to 50% of the time), so we subtract a value C to pull the running total down a little. The bigger the C, the harder it is for $S_t$ to get large and the LESS SENSITIVE the model is. The smaller the C, the more sensitive the model is, since $S_t$ can grow faster.

How do you choose good values for C and T so the model finds changes quickly but isn’t too sensitive? Use data, and weigh how costly delayed detections and false alarms are in your situation. Higher T = slower detection but fewer falsely detected changes. Lower T = faster detection but more falsely detected changes.
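The CUSUM recursion described above can be sketched in a few lines of base R. This is only an illustration under stated assumptions: `cusum_detect` and its `threshold` argument are hypothetical names (`threshold` plays the role of T, which is avoided as an argument name because R treats `T` as shorthand for `TRUE`), and the input series is made up.

```r
# Minimal CUSUM sketch for detecting an increase in a series.
# `cusum_detect` is a hypothetical helper, not a function from any package.
cusum_detect <- function(x, mu, C, threshold) {
  s <- numeric(length(x))
  prev <- 0
  for (t in seq_along(x)) {
    # Add how far x[t] sits above the expectation, less the slack C,
    # and reset the running total to zero if it would go negative.
    s[t] <- max(0, prev + (x[t] - mu - C))
    prev <- s[t]
  }
  hits <- which(s >= threshold)
  first_change <- if (length(hits) > 0) hits[1] else NA_integer_
  list(s = s, first_change = first_change)
}

# Toy series (made up for illustration): the level jumps from 0 to 5 at t = 4.
x <- c(0, 0, 0, 5, 5, 5)
result <- cusum_detect(x, mu = 0, C = 1, threshold = 6)
print(result$s)             # running totals S_t
print(result$first_change)  # first t at which S_t >= threshold

# A larger C makes the detector less sensitive, so the change is flagged later.
slower <- cusum_detect(x, mu = 0, C = 3, threshold = 6)
print(slower$first_change)
```

Rerunning the toy series with different `C` and `threshold` values is a quick way to see the sensitivity trade-off before tuning on real data.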
Question 5.1 – Crime Data Analysis

Using crime data from http://www.statsci.org/data/general/uscrime.txt (description at http://www.statsci.org/data/general/uscrime.html), test to see whether there is an outlier in the last column (number of crimes per 100,000 people). Use the grubbs.test function in the outliers package in R.

Setting up script and data file: First, install and load the outliers package. Then load the crime data from “uscrime.txt” and copy the crime column (the last column) into a variable named “crime.” Inspect the variable contents to ensure the data was loaded correctly, and assess the data.

     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
    342.0    658.5    831.0    905.1   1057.5   1993.0

The mean is 905.1, with a min of 342 and a max of 1993.

[1] https://support.minitab.com/en-us/minitab/18/help-and-how-to/statistics/basic-statistics/how-to/outliertest/interpret-the-results/all-statistics-and-graphs/

Next, confirm that the data follows an approximately normal distribution. This is shown through the middle portion of the Q-Q plot. After confirming a normally distributed population, use the Grubbs’ Test function, which can test for one outlier, two outliers on one tail, or two outliers on opposite tails, on our small data set.

Determine significance level: A standard default level is 0.05; a significance level of 0.05 indicates a 5% risk of concluding that a difference exists when there is no actual difference. We choose a higher significance level of 0.10 to be more certain of detecting any difference that possibly exists.

    grubbs.test(crime, type = 10, opposite = FALSE)

Calling the function with “type = 10” denotes a test for one outlier, and “opposite = FALSE” indicates checking the value with the furthest distance from the mean. In our case, the max value is the furthest from the mean, so we’re testing whether the max value is an outlier.
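The setup and inspection steps described above might look roughly like the following. To keep the sketch self-contained, the 47 crime values are typed in from the results table later in this write-up rather than downloaded. The commented-out `read.table` call is an assumption about how the file would be read, not the author’s script.

```r
# Sketch of the setup and inspection steps (base R only).
# Assumption: with network access and the outliers package available, one could run
#   install.packages("outliers"); library(outliers)
#   data  <- read.table("http://www.statsci.org/data/general/uscrime.txt", header = TRUE)
#   crime <- data[, ncol(data)]   # last column: crimes per 100,000 people
# Here the values are entered by hand from the results table in this write-up.
crime <- c(791, 1635, 578, 1969, 1234, 682, 963, 1555, 856, 705,
           1674, 849, 511, 664, 798, 946, 539, 929, 750, 1225,
           742, 439, 1216, 968, 523, 1993, 342, 1216, 1043, 696,
           373, 754, 1072, 923, 653, 1272, 831, 566, 826, 1151,
           880, 542, 823, 1030, 455, 508, 849)

# Check that all 47 observations are present and look at the spread.
print(length(crime))
print(summary(crime))

# Normal Q-Q plot with a reference line; strong departures in the tails
# hint at candidate outliers.
qqnorm(crime)
qqline(crime)
```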
Grubbs’ Test results are shown below:

    Grubbs test for one outlier
    data: crime
    G = 2.81290, U = 0.82426, p-value = 0.07887
    alternative hypothesis: highest value 1993 is an outlier

The test results reveal a p-value of 0.07887 and an alternative hypothesis that the highest value (1993) is an outlier. Since the p-value is less than our significance level of 0.10, the decision is to reject the null hypothesis and conclude that the highest data point, 1993, is an outlier. Therefore the highest-crime city is likely an outlier.

Is the lowest-crime city an outlier? Rerun the Grubbs’ Test using

    grubbs.test(crime, type = 10, opposite = TRUE)

Calling the function with “type = 10” denotes a test for one outlier, and “opposite = TRUE” indicates checking the value on the opposite side from the one furthest from the mean (in this case the min value).

    Grubbs test for one outlier
    data: crime
    G = 1.45590, U = 0.95292, p-value = 1
    alternative hypothesis: lowest value 342 is an outlier

The outlier test results for the min value (342) reveal a p-value of 1 and an alternative hypothesis that the lowest value (342) is an outlier. Since our significance level is 0.10 and the p-value is greater than the significance level, the decision is to fail to reject the null hypothesis because there is not enough evidence to conclude that an outlier exists. Therefore the lowest-crime city is not an outlier.

Is the highest-crime city an outlier? Yes, as determined in the first Grubbs’ Test cited above.

Are there others? Since we determined that the min value is not an outlier and that the max value is an outlier, the next logical step is to test the second highest value. To do this, we remove the max value from the data set and re-run the Grubbs’ Test on the new max.
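To see where these p-values come from, the sketch below converts a reported G statistic into a one-sided p-value using the t distribution. It assumes the usual relationship between the one-outlier Grubbs statistic and a t statistic with n - 2 degrees of freedom, scaled by n and capped at 1. `grubbs_p_value` is a hypothetical helper, not part of the outliers package, so its results should only be expected to land close to the reported values.

```r
# Hypothetical helper (not from the outliers package): one-sided p-value
# for a single suspected outlier, given the Grubbs statistic G and sample size n.
grubbs_p_value <- function(G, n) {
  # Convert G into the equivalent t statistic with n - 2 degrees of freedom.
  t_stat <- sqrt(n * (n - 2) * G^2 / ((n - 1)^2 - n * G^2))
  # Scale the upper-tail probability by n (any of the n points could have
  # been the extreme one) and cap the result at 1.
  min(1, n * pt(t_stat, df = n - 2, lower.tail = FALSE))
}

alpha <- 0.10
p_max <- grubbs_p_value(2.81290, 47)  # G reported for the highest value, 1993
p_min <- grubbs_p_value(1.45590, 47)  # G reported for the lowest value, 342

print(c(p_max = p_max, p_min = p_min))
print(p_max < alpha)  # TRUE means reject the null hypothesis for the max
print(p_min < alpha)  # TRUE means reject the null hypothesis for the min
```

The cap at 1 explains why the test on the minimum reports a p-value of exactly 1: the scaled tail probability exceeds 1 for such a small G.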
Creating a new dataset named “new_crime” using

    new_crime = crime[-which.max(crime)]

reveals a new max of 1969 (vs 1993).

    Grubbs test for one outlier
    data: new_crime
    G = 3.06340, U = 0.78682, p-value = 0.02848
    alternative hypothesis: highest value 1969 is an outlier

The outlier test results for the new max value of 1969 reveal a p-value of 0.02848 and an alternative hypothesis that the max value (1969) is an outlier. Since our significance level is 0.10 and the p-value is less than the significance level, the decision is to reject the null hypothesis and conclude that this is an outlier. Therefore the second highest-crime city is likely an outlier.

Repeating this again reveals a third max value of 1674. Conducting the Grubbs’ Test on this value reveals the following:

    Grubbs test for one outlier
    data: newer_crime
    G = 2.56460, U = 0.84712, p-value = 0.1781
    alternative hypothesis: highest value 1674 is an outlier

The outlier test results for the third max value (1674) reveal a p-value of 0.1781 and an alternative hypothesis that this value is an outlier. Since our significance level is 0.10 and the p-value is greater than the significance level, the decision is to fail to reject the null hypothesis because there is not enough evidence to conclude that an outlier exists. Therefore the third highest-crime city is not an outlier. This suggests there are no other outliers.

One final visual assessment of outliers is to box plot the crime data. Visual inspection reveals two data points far above the top box plot whisker, with a third data point slightly above it. There are no data points below the lower box plot whisker. This visual inspection suggests that there may be three outliers on the max side of the data set. Our analysis suggests that the two highest data points are outliers, but that the third highest data point and the min data point are not.
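The box plot reading can be cross-checked numerically. The sketch below applies the 1.5 × IQR whisker rule that R’s `boxplot()` uses by default (`range = 1.5`) to the quartiles from the summary output. It assumes those quartiles coincide with the hinges the box plot is drawn from, and it checks only a handful of extreme values taken from the results table.

```r
# Sketch of the 1.5 * IQR whisker rule, using the quartiles from summary(crime).
q1 <- 658.5    # 1st quartile from the summary output
q3 <- 1057.5   # 3rd quartile from the summary output
iqr <- q3 - q1

upper_fence <- q3 + 1.5 * iqr
lower_fence <- q1 - 1.5 * iqr

# The four highest values and the lowest value from the results table.
candidates <- c(1993, 1969, 1674, 1635, 342)
beyond <- candidates[candidates > upper_fence | candidates < lower_fence]

print(c(lower_fence = lower_fence, upper_fence = upper_fence))
print(beyond)  # values that would be drawn as individual points
```

A point beyond a fence is only a visual flag. Unlike Grubbs’ Test, the whisker rule involves no significance level, which is why the two approaches can disagree on a borderline value such as 1674.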
This is based on choosing a significance level of 0.10. Using a significance level of 0.05 would have led us to fail to reject the null hypothesis of no outliers, since the first p-value (0.07887) exceeds 0.05.

Using a loop to run through the dataset makes the entire effort much quicker. [2]

    library(outliers)  # provides grubbs.test

    # Repeatedly apply Grubbs' Test, setting aside each flagged value,
    # until the p-value is no longer below 0.10.
    grubbs.flag <- function(x) {
      outliers <- NULL
      test <- x
      grubbs.result <- grubbs.test(test)
      pv <- grubbs.result$p.value
      while (pv < 0.10) {
        # The flagged value is the third word of the alternative hypothesis,
        # e.g. "highest value 1993 is an outlier".
        outliers <- c(outliers, as.numeric(strsplit(grubbs.result$alternative, " ")[[1]][3]))
        test <- x[!x %in% outliers]
        grubbs.result <- grubbs.test(test)
        pv <- grubbs.result$p.value
      }
      return(data.frame(X = x, Outlier = (x %in% outliers)))
    }

With a significance level of 0.10, the loop function produces the following table:

     #  Value  Outlier      #  Value  Outlier      #  Value  Outlier
     1    791  FALSE       17    539  FALSE       33   1072  FALSE
     2   1635  FALSE       18    929  FALSE       34    923  FALSE
     3    578  FALSE       19    750  FALSE       35    653  FALSE
     4   1969  TRUE        20   1225  FALSE       36   1272  FALSE
     5   1234  FALSE       21    742  FALSE       37    831  FALSE
     6    682  FALSE       22    439  FALSE       38    566  FALSE
     7    963  FALSE       23   1216  FALSE       39    826  FALSE
     8   1555  FALSE       24    968  FALSE       40   1151  FALSE
     9    856  FALSE       25    523  FALSE       41    880  FALSE
    10    705  FALSE       26   1993  TRUE        42    542  FALSE
    11   1674  FALSE       27    342  FALSE       43    823  FALSE
    12    849  FALSE       28   1216  FALSE       44   1030  FALSE
    13    511  FALSE       29   1043  FALSE       45    455  FALSE
    14    664  FALSE       30    696  FALSE       46    508  FALSE
    15    798  FALSE       31    373  FALSE       47    849  FALSE
    16    946  FALSE       32    754  FALSE

[2] https://stackoverflow.com/questions/22837099/how-to-repeat-the-grubbs-test-and-flag-the-outliers
