Principal Component Analysis
In lecture we discussed how PCA can be used for dimensionality reduction.
Specifically, given a high dimensional dataset, PCA allows us to:
1. Understand the rank of the data. If k principal components capture almost all of
the variance, then the data is effectively rank k.
2. Create 2D scatterplots of the data. Such plots are a rank 2 representation of our
data, and allow us to visually identify clusters of similar observations.
3. Create other low rank approximations of the data. Other than the 2D scatterplots
mentioned above, this is something we won't really do in DS100, so we've left it
as an optional exercise (question 4) at the end of this homework.
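To make these three uses concrete, here is a minimal sketch that computes PCA through the SVD of the centered data. The dataset is synthetic and made up for illustration (it is not one of the homework datasets), and the variable names are our own choices.

```python
import numpy as np

# Synthetic data (an assumption for illustration): 200 points in 5 dimensions
# that lie close to a 2-dimensional subspace, plus a little noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 5))
X = latent @ mixing + 0.01 * rng.normal(size=(200, 5))

# Center each column, then take the SVD of the centered data.
X_mean = X.mean(axis=0)
X_centered = X - X_mean
U, s, Vt = np.linalg.svd(X_centered, full_matrices=False)

# Use 1: the variance captured by each principal component is proportional
# to its squared singular value. If the first k entries of `cumulative`
# are already close to 1, the data is effectively rank k.
variance_fraction = s**2 / np.sum(s**2)
cumulative = np.cumsum(variance_fraction)
print(np.round(cumulative, 4))

# Use 2: the first two principal components are the coordinates
# you would plot in a 2D scatterplot.
pcs_2d = U[:, :2] * s[:2]

# Use 3: a rank 2 approximation of the original data, built from
# the same two components (adding the column means back in).
X_rank2 = pcs_2d @ Vt[:2] + X_mean
```

Because this synthetic data was constructed to be nearly rank 2, the cumulative variance should be very close to 1 after two components, and `X_rank2` should differ from `X` only by roughly the size of the added noise.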
A solid geometric understanding of PCA will help you understand why PCA is able to
do these three things. In this homework, we'll build that geometric intuition, and
will also look at PCA on two datasets: one where PCA works poorly, and another
where it works pretty well.
Due Date
This assignment is due Thursday, October 24th at 11:59pm PST.
Collaboration Policy
Data science is a collaborative activity. While you may talk with others about the
homework, we ask that you write your solutions individually. If you do discuss the
assignments with others, please include their names in the cell below.
Collaborators: ...
In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
import plotly.express as px
# Note: If you're having problems with the 3D scatter plots, uncomment the two lines
# below; you should see a version number that is at least 4.1.1.
# import plotly
# plotly.__version__
Question 1: PCA on 3D Data
In question 1, our goal is to see visually how PCA is simply the process of
rotating the coordinate axes of our data.
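As a rough sketch of this idea, the code below builds a synthetic elongated point cloud (a made-up stand-in, not the homework's `data3d.csv`), tilts it away from the coordinate axes, and then uses the SVD to find the new axes. Strictly speaking, the matrix of principal directions is orthogonal, so the change of axes is a rotation that may also include a reflection. All names here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(42)

# An axis-aligned cloud that is long in one direction, wide in another,
# and thin in the third (standard deviations 5, 2, and 0.5).
aligned = rng.normal(size=(500, 3)) * np.array([5.0, 2.0, 0.5])

# Tilt the cloud with an arbitrary orthogonal matrix so that it is
# no longer aligned with the x/y/z axes.
Q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
points = aligned @ Q.T

# PCA: center the data, then take the SVD. The rows of Vt are the
# principal directions, i.e. the new coordinate axes.
centered = points - points.mean(axis=0)
U, s, Vt = np.linalg.svd(centered, full_matrices=False)
V = Vt.T

# Express every point in the new axes.
rotated = centered @ V

# V is orthogonal, so each point keeps its distance from the origin.
print(np.allclose(V.T @ V, np.eye(3)))
print(np.allclose(np.linalg.norm(centered, axis=1),
                  np.linalg.norm(rotated, axis=1)))

# In the new axes the spread is sorted from largest to smallest and the
# coordinates are uncorrelated (off-diagonal covariances are ~0).
stds = rotated.std(axis=0)
cov = np.cov(rotated, rowvar=False)
print(np.round(stds, 2))
print(np.round(cov, 2))
```

Nothing about the shape of the cloud changes here: only the axes used to describe it do, which is why the first new axis lines up with the longest direction of the data and the last with the thinnest.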
The code below reads in a 3D dataset. We have named the variable surfboard
because the data resembles a surfboard when plotted in 3D space.
In [2]:
surfboard = pd.read_csv("data3d.csv")
surfboard.head(5)
The cell below will allow you to view the data as a 3D scatterplot. Rotate the data
around and zoom in and out using your trackpad or the controls at the top right of the
figure.
You should see that the data is an ellipsoid that looks roughly like a surfboard or a
hashbrown patty (https://www.google.com/search?q=hashbrown+patty&source=lnms&tbm=isch).
That is, it is pretty long in one direction,
pretty wide in another direction, and relatively thin along its third dimension. We can
think of these as the "length", "width", and "thickness" of the surfboard data.
Observe that the surfboard is not aligned with the x/y/z axes.
If you get an error that your browser does not support WebGL, you may need to restart
your kernel and/or browser.