Week 1
Why Analytics?
Data Vocabulary
Classification
Support Vector Machines
Scaling and Standardization
k-Nearest Neighbor (KNN)
Week 2
Model Validation
Validation and Test Sets
Splitting the Data
Cross-Validation
Clustering
Supervised vs. Unsupervised Learning
Week 3
Data Preparation
Introduction to Outliers
Change Detection
Week 4
Time Series Data
AutoRegressive Integrated Moving Average (ARIMA)
Generalized Autoregressive Conditional Heteroskedasticity (GARCH)
Week 5
Regression
Regression Coefficients
Causation vs. Correlation
Important Indicators in the Output
Week 6
De-Trending
Principal Component Analysis (PCA)
Week 7
Intro to CART
How to Branch
Random Forests
Logistic Regression
Confusion Matrices
Week 8
Intro to Variable Selection
Models for Variable Selection
Choosing a Variable Selection Model
Week 9
Intro to Design of Experiments
Factorial Design
Multi-Armed Bandits
Intro to Advanced Probability Distributions
Bernoulli, Binomial, and Geometric Distributions
Poisson, Exponential and Weibull Distributions
Q-Q Plot
Queuing
Simulation Basics
Prescriptive Simulation
Markov Chains
Week 10
Intro to Missing Data
Dealing with Missing Data
Imputation Methods
Intro to Optimization
Elements of Optimization Models
Modeling with Binary Variables
Week 11
Optimization for Statistical Models
Classification of Optimization Models
Stochastic Optimization
Basic Optimization Algorithms
Non-Parametric Models
Bayesian Modeling
Communities in Graphs
Neural Networks and Deep Learning
Competitive Models

Life is full of mysteries.
Although that can feel a bit overwhelming at times, the interesting thing is that we can use math
to explain a lot of what we see as the "unknown."
In fact, that's the goal of the field of analytics.
Rather than looking at our businesses or organizations and wondering what will work and what
won't, we can use analytics to sift through our data to explain why something happened or why
one idea will work while another won't.
If you're interested in learning more about how that works, you're in the right place.
In this post, I go through the content in week 1 of ISYE 6501 to make sense of what analytics is
and how we can use simple machine learning models to make better decisions.
Why Analytics?
We can use analytics to answer important questions, and we can break those questions down
into three types:
● Descriptive Questions: What happened? What effect does spin rate have on how hard
someone hits the ball? Which teachers in the school produce the best exam results?
● Predictive Questions: What will happen? How much will the global temperature
increase in the next 100 years? Which product will be most popular?
● Prescriptive Questions: What action(s) would be best? When and where should
firefighters be placed? How many delivery drivers should the pizza shop have on hand
on certain days and times?
In short, we can use analytics to make sense of the world around us and to make better
decisions in a complex world.
And we do this through something called modeling.

Modeling is a way to mathematically explain a real-world situation so that we can understand why something happened (or will happen) and what we can do about it, but people often use the word 'modeling' to mean three different things:
● Expressing a real-life situation as math
● Analyzing the math
● Turning the mathematical analysis back into a real-life solution
To give an example of how the word 'model' can be used in different ways, know that all of the
following are 'models':
● Regression
● Regression based on size, weight, and distance
● Regression estimate = 37 + 81 x Size + 76 x Weight + 4 x Distance
Later in this post, we'll look at some of the more popular machine learning models.
Data Vocabulary
Data Table: A display of information in a grid-like format of rows and columns.
Row: Contains a single record, with one value for each column
Column: Contains the name, data type, and any other attributes of the data
Structured Data: Data that can be stored in a structured way (like in the table above)
Unstructured Data: Data not easily stored or described in a table (e.g. text from social media)
Quantitative Data: Numbers with a meaning (e.g. 3 baseballs)
Categorical Data: Numbers or labels whose values don't carry quantitative meaning (e.g. an area code or country of origin)
Binary Data: Data that takes one of two values (e.g. yes or no)
Unrelated Data: No relationship between data points (e.g. players on different teams)
Time Series Data: The same data recorded over time (e.g. an athlete's performance over time)
Scaling Data: Transforming your data so that features fall within a specific range (e.g. 0-1)
Standardizing Data: Transforming your observations to have a mean of 0 and a standard deviation of 1 (so they can be described by a standard normal distribution)
Validation: Verifying that models are performing as intended
Classification
Classification is just what it sounds like: putting things into categories.
In the real world, this might look like an email service classifying an email as spam or not spam, or an artificial intelligence umpire classifying a pitch as a ball or a strike.

The simplest classification would be something like 'yes' or 'no'.
But you can have several categories as well.
For example, you might want to break a population down by household income:
● $59,999 or less per year
● $60,000-$99,999
● $100,000 and up
To use classification as a prediction, you need other data points. In the example above, we
could collect data on things like education level, gender, race, age, or a range of other options
that we could use to predict which category people will fall into.
We want to choose a good classifier when using classification to minimize our errors.

However, choosing a classifier isn't always as simple as the graph above because our data
won't always split as cleanly as this example.
Sometimes it might look like this.
In an example like this, we need to figure out what classifier minimizes our errors so we still
have something productive.
So, when using classification, you have two different classifier options:
● Hard Classifiers: Split the data into groups perfectly
● Soft Classifiers: Give as good a separation as possible
How you decide where to draw your classifying line depends a lot on the outcomes of what
you're classifying. If you're classifying whether something is explosive or not explosive, for
example, you may want to err on the side of classifying a non-explosive object as explosive
rather than the other way around.
One important thing to note is the situation where your classifier line is either completely horizontal or vertical.
If your classifier line is vertical, only the variable on the x-axis matters for classification, while if
your classifier line is horizontal, only the variable on the y-axis matters for classification.
Support Vector Machines
Support vector machines are supervised machine learning models used for classification.
The name comes from the idea that a line touching the edge of a group of points 'supports' that group, and the data points sitting on those parallel supporting lines are called support vectors.
The 'machine' part of the name refers to the fact that the algorithm automatically determines those support vectors.
The goal is to maximize (or optimize) the space between the support vectors to minimize errors
between the classes.
We'll use the following notation:
● n = the number of data points
● m = the number of attributes
● x_ij = the ith attribute of the jth data point (e.g. i = 1 is the credit score of person j, i = 2 is the income of person j)
● y_j = the response for data point j (in the graph, its color relative to the classifier line)
The classification line equation, where a_0 is the intercept and a_i is the coefficient on attribute i (if a_i is close to 0, that attribute is not relevant for classification), looks like the following.
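In standard form, using the a_i and x_i defined above, that line is:

\[
a_1 x_1 + a_2 x_2 + \dots + a_m x_m + a_0 = 0
\]

A new point is classified according to which side of this line (a hyperplane, once there are more than two attributes) it falls on.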
The goal of SVM is to maximize the margin (distance between two classification lines), keeping
everything classified correctly.
● In soft classification, you trade off maximizing the margin against minimizing the error.
The following equation seeks to minimize the error and the margin.
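One standard way to write that objective, using the notation above and coding each response y_j as +1 or -1, is:

\[
\min_{a_0,\dots,a_m} \;\; \sum_{j=1}^{n} \max\left\{0,\; 1 - \left(\sum_{i=1}^{m} a_i x_{ij} + a_0\right) y_j \right\} \; + \; \lambda \sum_{i=1}^{m} a_i^2
\]

The first sum penalizes misclassified (or barely correct) points, and the second sum is small when the margin is wide.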
The parameter λ (lambda) controls the weight: as it grows, the margin outweighs any error, and as it approaches zero, minimizing mistakes becomes much more important. We can also add a multiplier m_j per error to weigh the errors, with larger multipliers marking the errors that matter more.
Although the example I showed in the graph above is just two dimensions, you can use SVMs
on data with as many dimensions as you want. It's also important to note that the classifier line
doesn't need to be a straight line.

Finally, you don't always need to classify into hard buckets, because you can use SVMs to give recommendations as probabilities.
For example, you could build your SVM and then say there's an 87% chance that person A falls
into the $100,000+ income category.
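To make this concrete, here's a minimal soft-margin SVM sketch in Python using scikit-learn (my choice of tool here, with made-up data; the course doesn't prescribe it). Note that scikit-learn's C parameter plays roughly the inverse role of the λ above: a small C tolerates more misclassification, while a large C punishes errors heavily.

```python
# Minimal soft-margin SVM sketch with toy data (credit score, income -> class).
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

X = [[620, 40000], [590, 31000], [655, 47000], [630, 38000], [600, 42000],
     [710, 85000], [740, 98000], [690, 76000], [720, 90000], [705, 81000]]
y = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

# Scale the features first (see the next section), then fit a linear SVM.
# probability=True lets the model return soft (probabilistic) classifications.
model = make_pipeline(MinMaxScaler(),
                      SVC(kernel="linear", C=1.0, probability=True))
model.fit(X, y)

print(model.predict([[700, 88000]]))        # hard classification: 0 or 1
print(model.predict_proba([[700, 88000]]))  # soft classification: class probabilities
```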
Scaling and Standardization
One issue with SVM models is that our model may be thrown off if we have data that varies
widely in range. Remember that SVM's goal is to maximize the distance between the separating
plane and the support vectors.
If one feature is much bigger than another (e.g. X1 ranges from 0.3 to 0.6 while X2 ranges from 1,000 to 2,000), the large range
will dominate the model and throw off our results.
So, you need to scale (or normalize) your data so that all features are, for example, between 0
and 1 (the most common scaling).
However, you can also scale to a normal distribution.
When you scale to a normal distribution (standardizing), you scale the data to a mean of 0 and a
standard deviation of 1.
So, when do you use each one?
You use scaling (or normalizing) when you're working with data in a bounded range like:
● Batting average (.000-1.000)
● SAT Scores (200-800)
On the other hand, you use standardization with certain models like:
● Principal component analysis (PCA)
● Clustering
Sometimes, you just have to try both to see which one works better.
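As a quick illustration (a sketch in Python with scikit-learn, which is an assumption on my part), both transformations are one line each:

```python
# Min-max scaling vs. standardization on features with very different ranges.
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[0.3, 1000.0],
              [0.4, 1500.0],
              [0.6, 2000.0]])

# Scaling (normalizing): (x - min) / (max - min), so each column ends up in 0-1.
print(MinMaxScaler().fit_transform(X))

# Standardizing: (x - mean) / standard deviation, so each column has mean 0, sd 1.
print(StandardScaler().fit_transform(X))
```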
k-Nearest Neighbor (KNN)
The k-Nearest Neighbor (KNN) algorithm is another algorithm used to classify data.
Rather than using a line to separate data into classes, the KNN algorithm classifies data by
looking at a data point's "nearest neighbors."
Based on its neighbors, the algorithm then classifies the data.

The number of points we use as neighbors is referred to as 'k'.
There's no set number of 'k' neighbors to use - it could be 5, 7, etc - it just comes down to
testing and validating to see what returns the best results.
And you can use this model to classify data into multiple classes.
There are three important things to note about KNN when classifying points that will affect your
analysis:
● There's more than one way to measure distance (straight line is the most common, but
there are others as well)
● Some attributes might be more important than others in classification
● Unimportant attributes can be removed
In the validation stage, you nail these down.
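Here's a minimal KNN sketch in Python with scikit-learn (illustrative data; the distance metric and attribute choices correspond to the three points above):

```python
# k-Nearest Neighbors: classify a point by the classes of its k closest neighbors.
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

X = [[1, 2], [2, 1], [2, 3], [6, 5], [7, 7], [8, 6]]  # two attributes per point
y = [0, 0, 0, 1, 1, 1]                                # class labels

# k = 3 neighbors, straight-line (Euclidean) distance; since KNN is
# distance-based, scaling the attributes first matters here as well.
knn = make_pipeline(MinMaxScaler(),
                    KNeighborsClassifier(n_neighbors=3, metric="euclidean"))
knn.fit(X, y)

print(knn.predict([[3, 3], [7, 6]]))  # predicted classes for two new points
```

Changing n_neighbors, the distance metric, or which attributes you feed in is exactly what you'd tune in the validation stage.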
We can use analytics to make sense of the world around us.
One common strategy analysts use to do this is to use classification models.
Specifically, Support Vector Machines and k-Nearest Neighbors are two of the most common
classification models that we use. We can use these strategies to classify everything from whether or not an email is spam to whether a foreign object is a bomb.
They will be highly useful models for you moving forward.

This week, we cover several concepts.
We start off with model validation, work into validation and test sets, splitting data,
cross-validation, clustering, and the difference between supervised and unsupervised machine
learning models.
Let's dive in.
Model Validation
When you take your model to whoever is overseeing your project (or whoever you're trying to
convince of some hypothesis you have), the first thing they'll ask is likely, "how good is your
model?"
In other words, how accurate is it?
You answer that question by validating your model (validation is determining how well your model performs).
This could be:
● How well it predicts who will win a tournament
● How well it predicts spam
● How well it predicts a successful application
Or any problem that we're trying to solve with a model.
When validating our model, it's important to know that our data has two types of patterns:
● Real Effect: Real relationship between attributes and response
● Random Effect: Random, but looks like a real effect
If we fit a solution to our training set, we're finding something that fits both the real effects and
the random effects.
To solve this problem, we need to test our model on different data, because only the real effects will remain (the random effects in the new data will be different).
For example, if I took a set of NBA players and created a model based on their attributes, I may come up with a model that predicts their height well. However, if I then applied that model to the general population, it likely wouldn't be as accurate because NBA players are often outliers in terms of their size.

That's why we can't measure a model's effectiveness on the data it was trained on.
But there's a solution to the problem.
Validation and Test Sets
To get around this problem, we need two sets of data.
One larger set of data is used to fit the model, while a second set measures the model's effectiveness.
That's why we split the dataset into two sets: the training set and the validation set.
To further weed out any randomness in the model, we then run the chosen model on a third dataset called the test set.
In short, it goes as follows:
1. The training set is for building a model
2. The validation set is for picking a model
3. The test set is for estimating the performance of the model we picked
You can use the following flow chart to better understand when to use each.

Next, we'll discuss how to split that data.
Splitting the Data
While there are no hard rules for how big each dataset should be, there are some guidelines.
For example, when working with one model (and only need a training and test set), the rule of
thumb is:
● 70-90% training
● 10-30% test
When comparing models (and need training, validation, and test sets), the rule of thumb is a
little bit different:
● 50-70% training
● Split the rest evenly between validation and test set
Again, the image from above does a good job highlighting this.

In terms of how to actually split up the data, there are different methods.
One of those methods works as follows (assuming we have 1,000 data points and want to split them into a 60% training set, a 20% validation set, and a 20% test set).
It's called the Random method:
● Randomly choose 600 data points for the training set
● Randomly choose 200 data points for validation
● Use the remaining 200 data points for the test set
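In code, that random 60/20/20 split could look like the following (a sketch using scikit-learn's train_test_split; the data here is just a placeholder):

```python
# Randomly split 1,000 data points into 60% training, 20% validation, 20% test.
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(1000).reshape(-1, 1)        # stand-in for 1,000 data points
y = np.random.randint(0, 2, size=1000)    # stand-in responses

# Carve off 40% of the data first, then split that 40% in half.
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=42)

print(len(X_train), len(X_val), len(X_test))  # 600 200 200
```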
The second method is called the Rotation method:
● Take turns choosing points
● Training -> Validation -> Training -> Test -> Training

One advantage of the Rotation method is that the data points are split more evenly so that we
don't clump data together. For example, if we have 20 years of data, the random method might
end up putting a higher portion of data from years 1-5 into the training set than the rest of the
data.
However, rotation can introduce its own bias issues.
If you have a 5-point rotation and are working with Monday-Friday data, all Monday data would
end up in one set, all Tuesday in another, etc.
A solution to that problem would be to combine the two methods.
You could randomly split 50% of Monday into the training set, 50% of Tuesday, etc.
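A rotation split is simple enough to write directly; here's a minimal sketch of the Training -> Validation -> Training -> Test -> Training pattern from above (plain Python, purely illustrative):

```python
# Rotation split: take turns assigning points to sets in a fixed 5-point pattern.
pattern = ["train", "validation", "train", "test", "train"]  # 60/20/20 overall

n_points = 1000
assignment = [pattern[i % len(pattern)] for i in range(n_points)]

print(assignment[:10])
print({s: assignment.count(s) for s in set(assignment)})  # counts: 600/200/200

# Caveat from above: with Monday-Friday data, a 5-point rotation like this
# would send every Monday to the same set, so you'd mix in some randomness.
```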
Cross-Validation
Some people worry that important data points may get left out of the training data.
Cross-validation is a way to work around that issue, and there are several ways to carry it out. The most common is k-fold cross-validation (the 'k' here is unrelated to the 'k' in k-means clustering or k-nearest neighbor, which are models rather than validation methods).
The k-fold method allows us to use every part of the data set to train: we split the data into k equal parts, train the model on k - 1 of them, validate on the part that was held out, and then rotate which part is held out until every part has taken a turn.

After going through this process, you will have used every data point to train your models, so no data will have been left out.
You can then average the k evaluations to estimate the model's quality. Although there's no
standard number to use as k, 10 is common.
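For instance, 10-fold cross-validation in Python with scikit-learn might look like this (the model and data are placeholders, not anything prescribed by the course):

```python
# 10-fold cross-validation: every point is used for training and for validation.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, n_features=5, random_state=0)

# Each of the 10 folds takes a turn as the validation set while the other
# 9 folds train the model; averaging the 10 scores estimates model quality.
scores = cross_val_score(SVC(kernel="linear"), X, y, cv=10)
print(scores.mean())
```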
Clustering
Clustering in analytics means much the same thing as it does in everyday life.
Clustering just means taking a set of data points and dividing them up into groups so each
group contains points that are similar to each other.
One reason we would use clustering is to segment a market to improve our messaging (e.g. email marketing).
For example, some people might buy a course because it increases their income, while another
cluster of people may be most likely to buy a course because they just like to learn. Clustering
allows you to identify these people and serve them the appropriate message.
The following is the p-norm distance between two points, where p is the power in the equation.
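Written out, the p-norm distance between two points x and z with m attributes is:

\[
d_p(x, z) = \left( \sum_{i=1}^{m} \left| x_i - z_i \right|^{p} \right)^{1/p}
\]

Setting p = 2 gives the usual straight-line (Euclidean) distance, and p = 1 gives the rectilinear (Manhattan) distance.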
For the ∞-norm distance, the distance between two points is their biggest difference along any single attribute (the limit of the formula above as p grows).

One example of clustering is k-means clustering.
With k-means clustering, the goal is to partition n observations (however many data points you
have) into k clusters (however many clusters you want to create) - and each observation will
belong to the cluster with the nearest mean.
It would look similar to the image above (where n = 15 and k = 3).
K-means is a 'heuristic' algorithm, which means that it may not always find the best solution, but
it finds good clusterings and it finds them relatively quickly.
It's also an example of an expectation-maximization (EM) algorithm.
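A short k-means sketch in Python with scikit-learn (again an assumed tool, with made-up data mirroring n = 15 and k = 3):

```python
# k-means clustering: partition n = 15 points into k = 3 clusters.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Three loose groups of five 2-D points each.
X = np.vstack([rng.normal(loc=center, scale=0.5, size=(5, 2))
               for center in ([0, 0], [5, 5], [0, 5])])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

print(kmeans.labels_)           # which cluster each point was assigned to
print(kmeans.cluster_centers_)  # the mean of each cluster
```

Because k-means is a heuristic, n_init controls how many random starting points it tries, keeping the best clustering it finds.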