CSC550Z: Data Mining & Distributed Computing (Summer 2019)
Week 1 Assignment Solution (100 points)
2.1 Assuming that data mining techniques are to be used in the following cases, identify whether the task required is supervised or unsupervised learning. (30 points)
2.1.a Deciding whether to issue a loan to an applicant based on demographic and financial data (with reference to a database of similar data on prior customers).
2.1.b In an online bookstore, making recommendations to customers concerning additional items to buy based on the buying patterns in prior transactions.
2.1.c Identifying a network data packet as dangerous (virus, hacker attack) based on comparisons to other packets whose threat status is known.
2.1.d Identifying segments of similar customers.
2.1.e Predicting whether a company will go bankrupt based on comparing its financial data to those of similar bankrupt and nonbankrupt firms.
2.1.f Estimating the repair time required for an aircraft based on a trouble ticket.
2.1.g Automatic sorting of mail by zip code scanning.
2.1.h Printing of customer discount coupons at the conclusion of a grocery store checkout based on what you just bought and what others have bought recently.
2.2 Describe the difference in roles assumed by the validation partition and the test partition. (10 points)
2.3 Consider the sample from a database of credit applications shown in Figure 2.13. Comment on the likelihood that it was sampled randomly, and whether it is likely to be a useful sample. (10 points)
Answer:
2.5 Using the concept of overfitting, explain why when a model is fit to training data, zero error with those data is not necessarily good. (10 points)
Answer:
2.8 Normalize the data in Table 2.3. (30 points)
Answer:
To normalize a measurement, subtract the variable's average from it and divide the difference by the variable's standard deviation.
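The definition above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `z_score_normalize` and the sample values are hypothetical (they are not the data from Table 2.3), and the sample standard deviation (n - 1 denominator) is assumed, since the text does not say which form to use.

```python
from statistics import mean, stdev


def z_score_normalize(values):
    """Return (x - average) / standard deviation for each x in values."""
    avg = mean(values)
    sd = stdev(values)  # assumption: sample standard deviation (n - 1)
    if sd == 0:
        raise ValueError("all values are identical; standard deviation is zero")
    return [(x - avg) / sd for x in values]


# Hypothetical measurements for illustration only, NOT the values in Table 2.3.
sample = [25, 35, 45, 55, 65]
normalized = z_score_normalize(sample)
print(normalized)
```

After normalization the values have an average of 0 and a standard deviation of 1, so variables measured on different scales become directly comparable. Each variable in a table would be normalized separately with its own average and standard deviation.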
2.10 Two models are applied to a dataset that has been partitioned. Model A is considerably more accurate than model B on the training data, but slightly less accurate than model B on the validation data. Which one are you more likely to consider for final deployment? (10 points)
Answer: