Question 4.1 – Clustering Models
Describe a situation or problem from your job, everyday life, current events, etc., for which a clustering
model would be appropriate. List some (up to 5) predictors that you might use.
Our company is exploring the use of on-demand and shared-space office facilities for our
employees. This would help us shift our facility footprint from a few large office buildings that
may not align well with our operating model to numerous smaller spaces that are presumably
better aligned with operations. We are considering the following predictors:
1. Proximity to our clients
   a. Current clients
   b. Prospective and/or former clients
2. Proximity to our staff (zip codes)
3. Proximity to major airports and interstates
4. Facility cost ($/sf)
Question 4.2 – Iris Clustering
The iris data set iris.txt contains 150 data points, each with four predictor variables and one categorical
response. The predictors are the width and length of the sepal and petal of flowers and the response is
the type of flower. The data is available from the R library datasets and can be accessed with iris once
the library is loaded. It is also available at the UCI Machine Learning Repository
(https://archive.ics.uci.edu/ml/datasets/Iris). The response values are only given to see how well a
specific method performed and should not be used to build the model.
Use the R function kmeans to cluster the points as well as possible. Report the best combination of
predictors, your suggested value of k, and how well your best clustering predicts flower type.
Examining the tabular data reveals four attributes (sepal length, sepal width, petal length, and
petal width) and three species. Plots of petal length vs petal width, sepal length vs sepal width,
petal length vs sepal width, and sepal length vs petal width collectively suggest three
clusters. Sepal length vs sepal width is not as revealing as the other three, and petal length vs
petal width shows the best cluster separation because sizes are similar within each species and
vary significantly between species. The plots are shown for comparison on the following
page.
This study source was downloaded by 100000842525582 from CourseHero.com on 05-13-2022 05:33:43 GMT -05:00
https://www.coursehero.com/file/32154435/ISYE6501-Homework-2docx/
Since the plotted data suggest three clusters, initial k-means clustering was conducted using all
four attributes and k = 3. Plotting the elbow diagram reveals a bend in the curve, with diminishing
returns around k = 3 or k = 4.
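The elbow diagram mentioned above plots the total within-cluster sum of squares against k. The following is a minimal Python sketch of that calculation, not the R code used for this homework. The 15 two-dimensional points are made-up illustrative values (not rows from iris), and the deterministic farthest-first initialization is an assumption chosen for reproducibility in place of the random starts that R's kmeans uses.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two equal-length points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(col) / n for col in zip(*points))


def farthest_first(points, k):
    """Deterministic start (an assumption): begin at the first point, then
    repeatedly add the point farthest from all centers chosen so far."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(sq_dist(p, c) for c in centers)))
    return centers


def kmeans(points, k, max_iter=100):
    """Lloyd's algorithm: alternate assignment and center-update steps
    until the assignments stop changing. Returns (centers, labels)."""
    centers = farthest_first(points, k)
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda j: sq_dist(p, centers[j])) for p in points]
        if new_labels == labels:
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old center if a cluster ends up empty
                centers[j] = mean_point(members)
    return centers, labels


def total_withinss(points, centers, labels):
    """Sum of squared distances from each point to its assigned center."""
    return sum(sq_dist(p, centers[lab]) for p, lab in zip(points, labels))


# Hypothetical (length, width) measurements forming three loose groups.
points = [
    (1.4, 0.2), (1.5, 0.2), (1.3, 0.3), (1.6, 0.4), (1.4, 0.1),
    (4.5, 1.5), (4.1, 1.3), (4.7, 1.4), (4.0, 1.2), (4.4, 1.4),
    (5.8, 2.2), (6.0, 2.5), (5.6, 2.1), (6.1, 2.3), (5.9, 2.0),
]

# Elbow data: total within-cluster sum of squares for k = 1..5.
wss = {k: total_withinss(points, *kmeans(points, k)) for k in range(1, 6)}
for k, value in wss.items():
    print(k, round(value, 3))
```

On data like this, the value drops sharply up to k = 3 and only slightly afterwards, which is the bend that the elbow diagram is meant to expose.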
Using kmeans, an expectation-maximization algorithm, with k = 3 gave a between-cluster to total
sum-of-squares ratio (between_SS / total_SS) of 88.4%, an accuracy of 89.333%, and a total
within-cluster sum of squares (squared distance to the cluster centers) of 78.85144. Adjusting k to 2
gave a lower ratio of 77.6%, an accuracy of 98.0%, and a within-cluster sum of squares
of 152.348. Adjusting k to 4 gave a higher ratio of 91.6%, an accuracy of 84%, and
a within-cluster sum of squares of 71.75951. The distance reveals how well
Plotting the predicted clusters reveals how well the data has split up among the different
species. The “Petal Width vs Petal Length” cluster assignment reveals the best clustering, as shown
below.
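The sum-of-squares percentages and accuracy figures quoted earlier can be computed from a cluster assignment roughly as follows. This is a hedged Python sketch, not the R code used here. It assumes the percentage is the between_SS / total_SS ratio that R's kmeans prints, and that accuracy means labelling each cluster with its most common species and counting matches. The nine points, their species labels, and the cluster assignment are hypothetical.

```python
from collections import Counter


def sq_dist(p, q):
    """Squared Euclidean distance between two equal-length points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(col) / n for col in zip(*points))


def between_over_total(points, clusters):
    """Share of the total sum of squares explained by the clustering."""
    grand_mean = mean_point(points)
    total_ss = sum(sq_dist(p, grand_mean) for p in points)
    within_ss = 0.0
    for c in set(clusters):
        members = [p for p, lab in zip(points, clusters) if lab == c]
        center = mean_point(members)
        within_ss += sum(sq_dist(p, center) for p in members)
    return (total_ss - within_ss) / total_ss


def majority_accuracy(clusters, species):
    """Label each cluster with its most common species, then return the
    fraction of points whose species matches their cluster's label."""
    correct = 0
    for c in set(clusters):
        counts = Counter(s for lab, s in zip(clusters, species) if lab == c)
        correct += counts.most_common(1)[0][1]
    return correct / len(species)


# Hypothetical data: one versicolor point lands in the virginica cluster.
points = [
    (1.4, 0.2), (1.5, 0.2), (1.3, 0.3),
    (4.5, 1.5), (4.1, 1.3), (5.0, 1.7),
    (5.8, 2.2), (6.0, 2.5), (5.1, 1.9),
]
species = ["setosa"] * 3 + ["versicolor"] * 3 + ["virginica"] * 3
clusters = [0, 0, 0, 1, 1, 2, 2, 2, 2]

ratio = between_over_total(points, clusters)
accuracy = majority_accuracy(clusters, species)
print(f"between_SS / total_SS = {ratio:.1%}, accuracy = {accuracy:.1%}")
```

The ratio always rises as k grows, so it cannot pick k on its own. Accuracy uses the species labels and is therefore only a check on the finished clustering, not an input to it.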