Data Analysis
Important Reminders
You must submit this file ( A4_DataAnalysis.ipynb ) to TritonED to finish the homework.
This assignment has hidden tests: tests that are not visible here, but that will be run on your submitted assignment for grading.
This means passing all the tests you can see in the notebook here does not guarantee you have the right answer!
In particular many of the tests you can see simply check that the right variable names exist. Hidden tests check the actual values.
It is up to you to check the values, and make sure they seem reasonable.
If things start behaving strangely, restart the kernel and re-run the code as a first-line check.
For example, some cells can only be run once, because they overwrite a variable (for example, your dataframe) and change it in a way that makes a second execution fail.
Running cells out of order can also change the dataframe in ways that cause an error; re-running the notebook from the top fixes this.
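As a minimal sketch of why a cell might only run once, consider a cell that overwrites a dataframe by dropping a column. The dataframe and column names here are hypothetical and are not the assignment's data.

```python
import pandas as pd

# Hypothetical toy dataframe; the column names are illustrative only.
df = pd.DataFrame({"height": [170, 165], "notes": ["a", "b"]})

# First execution of this "cell" succeeds and overwrites df.
df = df.drop(columns=["notes"])

# A second execution fails, because the column is already gone.
try:
    df = df.drop(columns=["notes"])
    second_run_failed = False
except KeyError:
    second_run_failed = True

print(second_run_failed)
```

Restarting the kernel and running the cells from top to bottom rebuilds the dataframe from scratch, so the first drop succeeds again.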
Notes - Assignment Outline
Parts 1-6 of this assignment are modeled on a minimal example of a project notebook.
This mimics, and gets you working with, something like what you will need for your final project.
Parts 7 & 8 break from the project narrative, and are OPTIONAL (UNGRADED).
They serve instead as a couple of quick one-offs to get you working with some other methods that might be useful to incorporate into your project.
Setup
Data: the responses collected from a survey of the COGS 108 class.
There are 417 observations in the data, covering 10 different 'features'.
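One quick sanity check after loading is to confirm that the dataframe's shape matches the stated 417 observations and 10 features. The sketch below uses a placeholder dataframe, because the real file name and column names are not given here. The names `feature_0` to `feature_9` are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Placeholder standing in for the loaded survey data (not the real responses).
n_obs, n_features = 417, 10
df = pd.DataFrame(
    np.zeros((n_obs, n_features)),
    columns=[f"feature_{i}" for i in range(n_features)],  # hypothetical names
)

# Rows should equal the number of observations, columns the number of features.
n_rows, n_cols = df.shape
print(n_rows, n_cols)
```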
Research Question: Do students in different majors have different heights?
Background: Physical height has previously been shown to correlate with career choice and career success. More recently, it has been demonstrated that these correlations can actually be explained by height in high school,
as opposed to height in adulthood (1). It is currently unclear whether height correlates with choice of major at university.
Reference: 1) http://economics.sas.upenn.edu/~apostlew/paper/pdf/short.pdf
Hypothesis: We hypothesize that there will be a relation between height and chosen major.
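One possible way to test a hypothesis like this is an independent-samples t-test on the heights of two groups. The sketch below uses `ttest_ind` from `scipy.stats` on synthetic data. The group names, means, and sample sizes are made up for illustration, and this is not necessarily the analysis the assignment expects.

```python
import numpy as np
from scipy.stats import ttest_ind

# Synthetic heights (in inches) for two hypothetical majors; not survey data.
rng = np.random.default_rng(0)
h_major_a = rng.normal(loc=68, scale=3, size=50)
h_major_b = rng.normal(loc=66, scale=3, size=50)

# Two-sided t-test of the null hypothesis that the group means are equal.
t_val, p_val = ttest_ind(h_major_a, h_major_b)
print(f"t = {t_val:.2f}, p = {p_val:.3f}")
```

A small p-value would suggest that the two group means differ. It does not show that choice of major causes, or is caused by, height.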
Part 1: Load & Clean the Data
Fixing messy data makes up a large amount of the work of being a Data Scientist.
Collecting package metadata: done
Solving environment: done
## Package Plan ##
environment location: /Users/brianbarry/anaconda3
added / updated specs:
- patsy=0.5.1
The following packages will be downloaded:
package | build
---------------------------|-----------------
conda-4.6.7 | py36_0 1.6 MB
openssl-1.1.1b | h1de35cc_0 3.4 MB
patsy-0.5.1 | py36_0 376 KB
------------------------------------------------------------
Total: 5.5 MB
The following packages will be UPDATED:
conda 4.6.4-py36_0 --> 4.6.7-py36_0
openssl 1.1.1a-h1de35cc_0 --> 1.1.1b-h1de35cc_0
patsy 0.5.0-py37_0 --> 0.5.1-py36_0
Downloading and Extracting Packages
conda-4.6.7 | 1.6 MB | ##################################### | 100%
patsy-0.5.1 | 376 KB | ##################################### | 100%
openssl-1.1.1b | 3.4 MB | ##################################### | 100%
Preparing transaction: done
Verifying transaction: done
Executing transaction: done
# Run this cell to ensure you have the correct version of patsy.
# You only need to do the installation once.
# Once you have run it, you can comment out these two lines so that the cell doesn't execute every time.
import sys
!conda install --yes --prefix {sys.prefix} patsy=0.5.1
# Imports - These are all you need for the assignment: do not import additional packages
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import patsy
import statsmodels.api as sm
import scipy.stats as stats
from scipy.stats import ttest_ind, chisquare, normaltest
# Note: the statsmodels import may print out a 'FutureWarning'. That's fine.