University of Maryland, Baltimore County - IS 733assign4

Student ID: QH45693
IS 733 - Data Mining

1(a) Briefly describe the boosting algorithm. State why it may improve classification accuracy.

1(b) What is the bias-variance trade-off for machine learning methods? Explain.

1(c) Briefly describe the bagging procedure. Discuss why it may improve the accuracy of decision tree classifiers, in terms of the bias-variance trade-off.

2(a) Using any software tool or programming language of your choice, create and print out a scatter plot of this dataset, eruption time versus waiting time. Note that for many tools, you will need to make a copy of the file and delete the header information before the data can be loaded. Ignore the first column, which contains ID numbers for each instance.

2(b) How many clusters do you see based on your scatter plot? For the purposes of this question, a cluster is a "blob" of many data points that are close together, with regions of fewer data points between it and other "blobs"/clusters.

2(c) Describe the steps of a hierarchical clustering algorithm. Based on your scatter plot, would this method be appropriate for this dataset?

Question 2b. I recommend using a high-level, data-friendly programming language such as MATLAB, R, or Python. Be sure to ignore the first column, which contains instance ID numbers. Report the following items:

• Your source code for the k-means algorithm. You do not need to report code for loading the data or for drawing a scatter plot. You need to implement the algorithm from scratch.

• A scatter plot of your final clustering, with the data points in each cluster color-coded or plotted with different symbols. Include the cluster centers in your plot.

• A plot of the k-means objective function versus iterations of the algorithm. Recall that the objective function is $E = \sum_{i=1}^{k} \sum_{p \in C_i} \lVert p - c_i \rVert^2$, where $k$ is the number of clusters, $C_i$ is the set of instances assigned to the $i$th cluster, and $c_i$ is the cluster center for the $i$th cluster. Note that the objective function should never increase from one iteration to the next. If it does, look for a bug in your code.

• Did the method manage to find the clusters that you identified in Question 2b? If not, did it help to run the method again with another random initialization
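The objective function and the check that it never increases can be illustrated with a minimal sketch in plain Python. This is one possible way to structure Lloyd-style k-means, not the assignment's reference solution. The function names (`squared_distance`, `objective`, `kmeans`), the `seed` and `max_iters` parameters, and the six toy points are all assumptions made for illustration; they are not the eruption/waiting-time dataset.

```python
import random


def squared_distance(p, c):
    """Squared Euclidean distance between point p and center c."""
    return sum((a - b) ** 2 for a, b in zip(p, c))


def objective(points, centers, labels):
    """E = sum over clusters i, sum over p in C_i, of ||p - c_i||^2."""
    return sum(squared_distance(p, centers[j]) for p, j in zip(points, labels))


def kmeans(points, k, max_iters=100, seed=0):
    """Hypothetical helper: returns (centers, labels, objective history)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # random initialization from the data
    labels = None
    history = []
    for _ in range(max_iters):
        # Assignment step: each point goes to its nearest center.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centers[j]))
            for p in points
        ]
        # Update step: each center moves to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, new_labels) if lab == j]
            if members:  # an empty cluster keeps its previous center
                centers[j] = tuple(sum(x) / len(members) for x in zip(*members))
        history.append(objective(points, centers, new_labels))
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
    return centers, labels, history


# Toy data (made up for illustration): two loose groups of 2-D points.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.1, 4.9), (4.8, 5.2)]
centers, labels, history = kmeans(points, k=2, seed=0)

# The objective should never increase between consecutive iterations.
non_increasing = all(b <= a + 1e-9 for a, b in zip(history, history[1:]))
print(history, non_increasing)
```

Plotting `history` against the iteration index gives the objective-versus-iterations curve. Rerunning with a different `seed` is one way to try another random initialization.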

Last updated: 2 years ago
