1. Introduction

Two definitions of Machine Learning:

  1. “The field of study that gives computers the ability to learn without being explicitly programmed.” - Arthur Samuel (older, informal definition)
  2. “A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.” - Tom Mitchell (more modern definition)

Generally speaking, any machine learning problem can be put into one of two broad classifications:

  1. Supervised learning
    • You are given a data set and already know what the correct output should look like as there is a relationship between the input and output
    • Can be further categorized into:
      • Regression Problems - trying to map input variables to some continuous function, predicting results within a continuous output
      • Classification Problems - trying to map input variables into discrete categories, predicting results in a discrete output
  2. Unsupervised learning
    • Allows us to approach problems with little or no idea of what the output should look like; we derive structure from data where we do not know the effects of the variables
    • This structure can be derived by clustering the data based upon relationships among the variables in the data

2. Model and Cost Function

Model Diagram

(x^(i), y^(i)) -> the training examples for the model, where x -> input, y -> output, and i = 1, …, m
h -> hypothesis

Linear Regression with One Variable

Hypothesis: h𝜃(x) = 𝜃0 + 𝜃1x
𝜃i's: parameters

Choose 𝜃0, 𝜃1 so that h𝜃(x) is close to y for our training examples (x, y).
Minimization problem over 𝜃0, 𝜃1:
J(𝜃0, 𝜃1) = (1 / 2m) * Σ_{i=1..m} (h𝜃(x^(i)) - y^(i))^2
This is the cost function (also called the squared error function or mean squared error).
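
As a concrete reference, here is a minimal NumPy sketch of this cost function (the course itself works in Octave; the function and variable names here are just illustrative):

```python
import numpy as np

def compute_cost(X, y, theta):
    """Squared error cost J(theta) = (1/2m) * sum((h(x) - y)^2)."""
    m = len(y)
    errors = X @ theta - y  # h_theta(x^(i)) - y^(i) for every example
    return np.sum(errors ** 2) / (2 * m)

# x0 = 1 bias column prepended to a single feature
X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.0, 2.0, 3.0])
print(compute_cost(X, y, np.array([0.0, 1.0])))  # 0.0 - this theta fits exactly
```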

Cost Function

3. Parameter Learning

Gradient Descent Algorithm
Note: it must be a simultaneous update for each 𝜃i.

Gradient descent takes the derivative of our cost function in order to get a direction to move in. The size of each step is determined by alpha, called the learning rate.

The derivative used is the partial derivative of the cost function with respect to each 𝜃i.

It is finished once you have reached the bottom of the graph (a minimum, where the derivative is zero)

Learning Rate

Gradient Descent with Linear Regression
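
To make the simultaneous update concrete, here is a hedged NumPy sketch of one gradient descent step for the one-variable case (names are illustrative, not from the course):

```python
import numpy as np

def gradient_step(x, y, theta0, theta1, alpha):
    """One step of batch gradient descent for h(x) = theta0 + theta1 * x."""
    m = len(y)
    h = theta0 + theta1 * x
    # compute BOTH partial derivatives before updating EITHER parameter,
    # so the update is simultaneous
    d0 = np.sum(h - y) / m
    d1 = np.sum((h - y) * x) / m
    return theta0 - alpha * d0, theta1 - alpha * d1
```

Repeating this step until J(𝜃0, 𝜃1) stops decreasing implements the full algorithm.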

4. Multivariate Linear Regression

Multivariate linear regression is just linear regression with more than one input variable (feature)

Multivariate Linear Regression

An easy way to write the above formula is simply h𝜃(x) = 𝜃ᵀx (treating x0 = 1 so the vector dimensions match)

Vector Form

We need to change our gradient descent algorithm in order to use multiple variables. Note that for each 𝜃j, we are just taking the partial derivative of the cost function, which gives us the x_j^(i) factor, removes the square, and cancels out the 2 in the 1/2m term.

Be sure to fully understand the equation:

  • Theta term - represents the different weights or parameters we associate with a given feature
  • Alpha - learning rate or step size
  • n - number of features
  • h(x) - hypothesis equation, what we are guessing to fit the data
  • m - number of training examples
  • x(i) - feature values from the i-th training example
  • y(i) - actual output value from the i-th training example
  • J(theta) - cost function (squared error function), measuring the difference between the model's predictions and the true values
  • *** We are taking the partial derivative of J(theta) with respect to theta-j ***

Multiple Variables Gradient Descent
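
A minimal vectorized sketch of the multivariate update (illustrative NumPy, assuming X carries a leading column of ones for x0 = 1):

```python
import numpy as np

def gradient_descent(X, y, theta, alpha=0.01, num_iters=1000):
    """Vectorized batch gradient descent for any number of features."""
    m = len(y)
    for _ in range(num_iters):
        # X.T @ (X @ theta - y) stacks the partial derivatives for every theta_j
        theta = theta - (alpha / m) * (X.T @ (X @ theta - y))
    return theta
```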

Feature Scaling: If our features are on a similar scale, gradient descent can converge more quickly. We can easily scale them by dividing each feature by its maximum value, which shrinks it into the range 0 <= xi <= 1. 'Get every feature into approximately a -1 <= xi <= 1 range.' This is not a hard rule; any range that is neither astronomically large nor small is fine - try to keep each feature roughly between -3 and 3, and no narrower than about -1/3 to 1/3.

Mean Normalization: Replace xi with xi - μi to make the features have approximately zero mean (do not apply this to x0 = 1). Basically, subtract the average value of the feature and divide by the range (max - min).

Feature Scaling and Mean Normalization
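
A small illustrative sketch of both techniques at once (mean normalization with division by the range); the function name is my own:

```python
import numpy as np

def normalize_features(X):
    """Mean normalization: subtract the mean, then divide by the range."""
    mu = X.mean(axis=0)
    feature_range = X.max(axis=0) - X.min(axis=0)
    return (X - mu) / feature_range, mu, feature_range

# note: add the x0 = 1 column AFTER scaling, so it is never normalized
```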

A good way to debug gradient descent is to plot the cost function over the number of iterations. If J(theta) is increasing then use a smaller alpha.

If alpha is sufficiently small, then J(theta) will decrease on every iteration.

Convergence

Below is another tip to improve Gradient Descent:

Polynomial Regression

Note: if we do try to fit our model with quadratic, cubic, or square-root features, we do need to feature scale, because the squared and cubed values make the feature ranges compound on one another.
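
A tiny illustration of why, with hypothetical numbers: expanding one feature into its powers makes the column ranges diverge rapidly.

```python
import numpy as np

# Hypothetical single feature expanded into polynomial terms; the column
# ranges diverge quickly, which is why feature scaling matters here.
x = np.array([1.0, 10.0, 100.0])
X_poly = np.column_stack([x, x ** 2, x ** 3])
print(X_poly.max(axis=0))  # [1.e+02 1.e+04 1.e+06] - wildly different scales
```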

5. Computing Parameters Analytically

Normal equation: Method to solve for Theta analytically

To minimize J(𝜃), take its derivative with respect to each 𝜃j and set it to zero

Normal Equation

*** Derivation taken from https://eli.thegreenplace.net/2014/derivation-of-the-normal-equation-for-linear-regression ***

*** To fully understand the last step of “we derive by each component of the vector, and then combine the resulting derivatives into a vector again,” see https://eli.thegreenplace.net/2015/the-normal-equation-and-matrix-calculus/ - it involves matrix identities, symmetric matrices, and a solid understanding of partial derivatives and matrices in general ***

THERE IS NO NEED TO DO FEATURE SCALING WITH THE NORMAL EQUATION
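
The closed-form solution is 𝜃 = (XᵀX)⁻¹Xᵀy. A one-line NumPy sketch (pinv is used so the non-invertible case discussed below is also handled):

```python
import numpy as np

def normal_equation(X, y):
    """theta = (X^T X)^(-1) X^T y; pinv gives a sensible answer even
    when X^T X is non-invertible."""
    return np.linalg.pinv(X.T @ X) @ (X.T @ y)
```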

Gradient Descent

  • Need to choose alpha
  • Needs many iterations
  • O(kn^2)
  • Works well even when n is large

Normal Equation

  • No need to choose alpha
  • Don’t need to iterate
  • Need to compute inverse O(n^3)
  • Slow if n is very large

GD vs Normal

If XᵀX is non-invertible (singular/degenerate): remember, non-invertibility is due to redundant features (linearly dependent, in mathematical terms) or too many features (m <= n). In either case we can delete some features or use regularization.

6. Hypothesis Representation and Decision Boundary

Logistic Function

The statistical logistic/sigmoid function g(z) = 1 / (1 + e^(-z)), with asymptotes at y = 0 and y = 1 towards negative infinity and infinity respectively.

NOTE: Remember that the decision boundary is a trait of the hypothesis, not of the training set. It comes from 𝜃 = …

Decision Boundary
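
A minimal sketch of the sigmoid and the resulting decision rule - y = 1 is predicted exactly when 𝜃ᵀx >= 0, which is what traces out the decision boundary (illustrative NumPy):

```python
import numpy as np

def sigmoid(z):
    """g(z) = 1 / (1 + e^(-z)), squashing any input into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def predict(theta, X):
    """h(x) = g(theta^T x) >= 0.5 exactly when theta^T x >= 0,
    so the boundary is the set of points where theta^T x = 0."""
    return (X @ theta >= 0).astype(int)
```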

7. Logistic Regression Model

Cost Function - Logistic

-log(z) if y = 1 and -log(1 - z) if y = 0 are very useful for optimizing our cost function, as they allow us to 'punish' our learning algorithm for values that deviate from the expected values, due to how they approach infinity along the y-axis.

This also ensures our cost function will be convex so that we do not have any local minima

We can rewrite our cost function so that it fits on one line. *Remember that y can always only equal 0 or 1; this is the mathematical definition of y.

Cost(h(x), y) = -y·log(h(x)) - (1 - y)·log(1 - h(x))

Simplified Cost Function and Gradient Descent
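
A hedged NumPy sketch of this one-line cost averaged over the training set (no regularization; names are illustrative):

```python
import numpy as np

def logistic_cost(theta, X, y):
    """J(theta) = (1/m) * sum over the training set of the cost above."""
    m = len(y)
    h = 1.0 / (1.0 + np.exp(-(X @ theta)))  # sigmoid hypothesis
    return (-y @ np.log(h) - (1 - y) @ np.log(1 - h)) / m
```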

Optimization algorithms:

  • Gradient descent
  • Conjugate gradient
  • BFGS
  • L-BFGS

Advantages (conjugate gradient, BFGS, L-BFGS):

  • No need to manually pick alpha
  • Often faster than gradient descent

Disadvantages:

  • More complex

Advanced Optimization

There are more efficient and sophisticated optimization algorithms than gradient descent, and we do not need to write them ourselves; we can instead call them from Octave's libraries (e.g., fminunc).

Now, if we have multiple classes to classify, we can use the one-vs-all classification approach: for each class, we predict the probability that y is a member of that class, while combining all the other classes into one 'negative' class relative to the characteristic we are looking for.

Multiclass Classification
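
A minimal sketch of the prediction side of one-vs-all, assuming each row of a hypothetical all_theta matrix holds the parameters of one already-trained classifier:

```python
import numpy as np

def one_vs_all_predict(all_theta, X):
    """Pick the class whose classifier reports the highest probability;
    all_theta has one row of parameters per class."""
    probs = 1.0 / (1.0 + np.exp(-(X @ all_theta.T)))  # m x num_classes
    return np.argmax(probs, axis=1)
```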

8. Solving the Problem of Overfitting

“Underfit”/“High Bias” - using a linear prediction model when a higher-order model would fit better
“Overfit”/“High Variance” - using too high a polynomial degree for the model

Overfitting: If we have too many features, the learned hypotheses may fit the training set very well but fail to generalize to new examples.

Overfitting

Note: This terminology is applied to both linear and logistic regression

We can help to mitigate the issue of overfitting through two different options:

  1. Reducing the number of features
  2. Regularizing the magnitude of parameters

Methods to Solve Overfitting

The idea behind regularization is to obtain a “simpler” hypothesis by penalizing parameters so that they take small values, making the hypothesis less prone to overfitting.

Regularization

Modify the cost function by adding a term at the end (the regularization term) that includes lambda (the regularization parameter). Lambda controls the trade-off between keeping the theta parameters small and fitting the training set well.

Regularization Parameter

If lambda is set too high, we will underfit the data by making our theta terms too small - the hypothesis becomes basically a flat line

Regularized linear regression:

Regularized Linear
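
A hedged sketch of the regularized squared-error cost; note that 𝜃0 is conventionally not penalized (illustrative NumPy):

```python
import numpy as np

def regularized_cost(theta, X, y, lam):
    """Squared-error cost plus (lam / 2m) * sum(theta_j^2) for j >= 1;
    by convention theta_0 is not penalized."""
    m = len(y)
    errors = X @ theta - y
    reg = (lam / (2 * m)) * np.sum(theta[1:] ** 2)
    return np.sum(errors ** 2) / (2 * m) + reg
```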

Regularized logistic regression:

Regularized Logistic

9. Neural Networks

We model a neural network as something that takes in input, performs some calculation, and spits out an output.

Sigmoid (logistic) activation function.

The network has multiple layers:
Layer 1 - input layer
Layer 2 - hidden layer (we do not observe its values)
Layer 3 - output layer

Model Representation

Essentially, our parameters are now called weights, but they act in the same way. There is not a lot of new information, just renaming of certain aspects we have already covered. The new addition is the progression x -> a -> h(x).

Neural Network

We are essentially doing the same thing as logistic regression; our features are just now a0, a1, a2, and a3 instead of x0, x1, x2, x3

These are learned features, which can be quite complex, adding flexibility in what feeds into the calculation for our output layer.

Architectures refer to how the neurons are connected.

Neuron Connections

Below is an example of predicting x1 AND x2 (the logical ‘and’ operator) and x1 OR x2 (true when x1 is true, x2 is true, or both are true).

AND / OR
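
A small runnable sketch using the weight values from the lecture example (-30, 20, 20 for AND; -10, 20, 20 for OR); the large magnitudes push the sigmoid to nearly 0 or 1:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

theta_and = np.array([-30.0, 20.0, 20.0])  # unit computing x1 AND x2
theta_or = np.array([-10.0, 20.0, 20.0])   # unit computing x1 OR x2

for x1 in (0, 1):
    for x2 in (0, 1):
        x = np.array([1.0, x1, x2])  # x0 = 1 bias unit
        print(x1, x2,
              int(sigmoid(theta_and @ x) >= 0.5),   # AND output
              int(sigmoid(theta_or @ x) >= 0.5))    # OR output
```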

Now we can make a more complex logical operator (in this case the XNOR logical operator) using a neural network.

Examples II

This can be taken further to have a multiclass classification by returning a vector.

Multiclass Classification

10. NN Cost Function and Backpropagation

Cost Function

Now we want to find parameters that minimize J(theta), so we need to compute J(theta) and the partial derivatives of J(theta) with respect to theta.

Backpropagation Algorithm

Given a training set, we want to minimize J(theta), which requires computing J(theta) along with its partial derivatives.

If you had two training examples (x(1), y(1)) and (x(2), y(2)), you would perform forward propagation on x(1), backpropagation with y(1), then forward propagation on x(2) and backpropagation with y(2).

Backpropagation Cont.

Some more walkthrough on Back-Propagation.

Backpropagation Cont. II

Now we can ‘unroll’ our parameter matrices into a single vector, making them easier to pass to the more advanced optimization algorithms

Unrolling Parameters
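
A sketch of unrolling and recovering parameter matrices, with hypothetical layer sizes of my own choosing:

```python
import numpy as np

# Unroll hypothetical weight matrices into one vector for the optimizer...
Theta1 = np.ones((5, 4))  # 5 hidden units, 3 inputs + bias
Theta2 = np.ones((1, 6))  # 1 output unit, 5 hidden units + bias
params = np.concatenate([Theta1.ravel(), Theta2.ravel()])

# ...and reshape back inside the cost function
Theta1_back = params[:20].reshape(5, 4)
Theta2_back = params[20:].reshape(1, 6)
```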

Gradient Checking: verifies that your forward/backpropagation is working correctly and computing accurate gradients.

Gradient checking is very slow and costly compared to the backprop code, which is why we turn it off once we trust our backprop implementation.

Gradient Checking
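
A minimal sketch of the two-sided numerical approximation used for gradient checking, (J(𝜃 + ε) - J(𝜃 - ε)) / 2ε per component (illustrative NumPy):

```python
import numpy as np

def numerical_gradient(J, theta, eps=1e-4):
    """Approximate each partial derivative, then compare with backprop's
    gradient; turn this OFF before actual training - it is very slow."""
    grad = np.zeros_like(theta)
    for i in range(len(theta)):
        e = np.zeros_like(theta)
        e[i] = eps
        grad[i] = (J(theta + e) - J(theta - e)) / (2 * eps)
    return grad
```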

We use random initialization of the weights to break symmetry; if all weights start at the same value, every hidden unit computes the same function.

Putting it all together:

Full Example

11. Evaluating a Learning Algorithm

Machine Learning Diagnostic: a test to find out what is and isn't working with a learning algorithm, and to gain guidance on how best to improve it.

Evaluating a Hypothesis

Model Selection and Data Sets

High Bias -> Underfit
High Variance -> Overfit

As we increase the degree of polynomial d, the training error tends to decrease. The cross-validation error has a more quadratic (U-shaped) appearance: it decreases and then increases again. By looking at the cross-validation error, we can determine whether the model is underfit or overfit.

Bias (underfit): J train(theta) will be high, and J crossVal(theta) ≈ J train(theta)

Variance (overfit): J train(theta) will be low, and J crossVal(theta) >> J train(theta)

Diagnosing Bias vs Variance
Regularization and Bias/Variance
Learning Curves

Being sure you know what the actual problem is (bias vs. variance) is the main component of trying to “decide what to do next”

12. Machine Learning System Design

Prioritization

A major takeaway from this section is to list out all the possible ways you could approach the problem. Far too often, someone will come up with an idea and instantly devote six months to it when there may be an alternative approach that would be more effective.

“Use evidence to guide our learning, not gut feeling.”

Error Analysis

“Key test that I often ask myself are first, can a human expert look at the features x and confidently predict the value of y because thats sort of a certification that y can be predicted accurately from the features x and second, can we actually get a large training set, and train the learning algorithm with a lot of parameters in the training set and if you can do both then thats more often give you a very kind performance learning algorithm.”

13. Support Vector Machines

We take the logistic regression cost terms and modify each a little bit into straight-line segments that mimic the logistic curves.

Adjust Logistic
Adjust Logistic II

These new functions are called cost1(z) and cost0(z) (denoting when y equals 1 versus when y equals 0).

SVM Hypothesis

We now add a term C, multiplied onto the first term A of the objective, to play the role our lambda value played in the regularization term. If C = 1/lambda, then the two formulations are equivalent.

Support Vector Machine

This is the overall optimization objective function for the SVM; minimizing it gives us the parameters learned by the SVM.

Unlike logistic regression, the support vector machine doesn't output a probability; it predicts one if 𝜃ᵀx >= 0 and zero otherwise.

Decision Boundary

The SVM tries to establish a large margin around its decision boundary, which is why it is also called a large margin classifier.

Acting purely as a large margin classifier (e.g., with C very large), it can be susceptible to outliers.

Non-Linear Decision Boundary

Kernels:

Kernel
Kernel and Similarity
Landmarks
SVM w/ Kernels
SVM w/ Kernels II
SVM parameters
SVM parameters II
Kernel (similarity) functions
Logistic Regression vs SVM

14. Clustering, Dimensionality Reduction and PCA

In unsupervised learning, we are given training data X with no corresponding y values. We can handle this with clustering algorithms; the K-means algorithm is a good starting example.

K-Means Clustering

  • Works by placing cluster centroids in the data set (either at random positions or by randomly selecting data points)
  • For this example we choose 2 centroids; every other data point then joins whichever cluster centroid it is closer to
  • Next, each centroid moves to the ‘averaged middle’ of its data points, and the process repeats until the centroids have converged

K-Means Step 1
K-Means Step 2
K-Means Step 3
K-Means Step 4

  • The algorithm is outlined below as a two-step process:
    1. Cluster assignment step
    2. Move centroid

K-Means Algorithm
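
A compact, hedged NumPy sketch of the two-step loop (it assumes no cluster ends up empty; names are illustrative):

```python
import numpy as np

def k_means(X, k, num_iters=100):
    """Alternate the two steps; assumes no cluster ends up empty."""
    rng = np.random.default_rng(0)
    centroids = X[rng.choice(len(X), size=k, replace=False)]  # init on points
    for _ in range(num_iters):
        # 1. cluster assignment: each point joins its closest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 2. move centroid: each centroid moves to the mean of its points
        centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    return centroids, labels
```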

K-means optimization objective

  • This algorithm optimizes its version of the cost function (the distortion) in order to minimize the distance between the centroids and the data points
  • When we initialize our centroids, there is the possibility that they converge at a local minimum instead of the global minimum
  • In order to mitigate this, we run random initialization many times and pick the clustering that gave the lowest cost
  • When choosing the value of K we can use the elbow method, but in general this is not a reliable method
    • You should instead evaluate K-means based on a metric for how well it performs for the later, downstream purpose
    • EX: T-shirts are pricey to order in 5 sizes instead of 3, so choose K = 3

K-Means Optimization Objective

Before we do dimensionality reduction, we want to do some data preprocessing: feature scaling and mean normalization.

Data Preprocessing

Now we can reduce data from a higher dimension (say 3D) to a lower dimension (say 2D)

  • Now PCA works to minimize the squared projection errors by finding a vector u(1) (or vectors u(1) and u(2) if we are going from 3D to 2D) to project onto.
  • In the example below, we are projecting data with 3 coordinates onto a 2D plane

Data Compression

Principal Component Analysis

  • This analysis works by calculating the projection error between our data points and the surface we are projecting them onto
  • Then we are simply doing another optimization, minimizing this error
  • NOTE: this is different from linear regression - LR measures error distance strictly vertically, while PCA measures it orthogonally to the surface being projected onto
    • This can account for differences in the result despite the two being very visually similar

PCA Problem Formation

PCA Algorithm Visual

PCA Algorithm

  • Reduce data from n dimensions to k dimensions
  • Compute the “covariance matrix” Sigma:
    • Covariance Matrix
  • Compute the eigenvectors of the matrix Sigma (Σ):
    • We can do this easily by using the svd() function
    • This gives us 3 matrices U, S, V, where U is an n x n matrix whose columns are the u vectors
    • PCA Summary
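
A hedged NumPy sketch of these steps (X is assumed to be already mean-normalized and scaled):

```python
import numpy as np

def pca(X, k):
    """Assumes X is already mean-normalized (and scaled if needed)."""
    m = X.shape[0]
    Sigma = (X.T @ X) / m            # covariance matrix, n x n
    U, S, Vt = np.linalg.svd(Sigma)  # columns of U are the u vectors
    U_reduce = U[:, :k]
    Z = X @ U_reduce                 # project n-dim data down to k dims
    return Z, U_reduce, S
```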

Number of Principal Components

  • In order to choose k we will have to test multiple values
  • We take the Average Squared Projection Error and the Total Variation in the data and evaluate their ratio:
    • Error and Variance Evaluation
  • If this ratio is <= 0.01, we say “99% of variance is retained,” so we can fix a target value in order to determine our k
  • We can also use the svd() function from earlier and use the S matrix it calculates

Choosing K
Choosing K II
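
A sketch of choosing k from the S matrix: take the smallest k whose cumulative share of sum(S) reaches the retained-variance target (illustrative names):

```python
import numpy as np

def smallest_k(S, retained=0.99):
    """Smallest k with sum(S[:k]) / sum(S) >= retained
    ("99% of variance is retained")."""
    ratios = np.cumsum(S) / np.sum(S)
    return int(np.searchsorted(ratios, retained) + 1)
```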

Application of PCA

  • Compression
    • Reduces memory/disk needed to store data
    • Speeds up learning algorithm
    • Done by choosing k by % of variance retained
  • Visualization
    • K = 2 or 3
  • Do Not Use…
    • To try to reduce overfitting - it may work okay, but regularization is a better idea
    • Always run with the original/raw data first, and only reduce if that does not do what you want

15. Anomaly Detection and Recommender Systems

The main idea consists of creating a density estimate p(x) of the dataset, then checking whether a test point x is anomalous by determining whether it lies inside our density estimate: we compare p(x) to a threshold epsilon and flag an anomaly if p(x) < epsilon.

Density Estimation

Using the mean and variance, we can fit a Gaussian (normal) distribution to x in order to plot a bell curve. The integral of the curve is always 1 (by the nature of the equation).

  • As you decrease the variance, the graph gets narrower (or wider if increased)
  • Mean value will determine where the graph is centered

Gaussian (Normal) Distribution
Gaussian (Normal) Distribution Example

Anomaly detection algorithm

Anomaly Detection Algorithm
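
A minimal sketch of the density estimation model: fit a per-feature Gaussian, then take the product of the densities (illustrative names):

```python
import numpy as np

def fit_gaussian(X):
    """Per-feature mean and variance from the (non-anomalous) training set."""
    return X.mean(axis=0), X.var(axis=0)

def p(x, mu, var):
    """p(x) = product over features of the univariate Gaussian density."""
    coef = 1.0 / np.sqrt(2 * np.pi * var)
    return np.prod(coef * np.exp(-((x - mu) ** 2) / (2 * var)))

# flag x as an anomaly when p(x) < epsilon
```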

Having labeled examples (y = 0 or 1) gives us a real-number evaluation, so we can more successfully evaluate our learning algorithm. This allows us to do cross-validation and use different evaluation metrics (precision, recall, F1 score).

Anomaly Detection vs. Supervised Learning

Anomaly detection

  • Fraud detection
  • Manufacturing (engines)
  • Monitoring machines in a data center

Supervised Learning

  • Email spam classification
  • Weather prediction
  • Cancer classification

Sometimes our features give us a graph with a non-Gaussian distribution.

  • We can transform our features to try to make the graph appear more Gaussian
  • EX: replace x1 with log(x1), or x2 with sqrt(x2)

Multivariate Gaussian distribution

  • By modeling p(x) all in one go instead of feature-by-feature, we can capture more information from the data.
  • Our parameters are: the mean vector mu and Sigma (the covariance matrix)
  • This can alter our distribution to look more like an ellipse instead of a circle

Multivariate Gaussian Distribution
Anomaly Detection w/ Multivariate Gaussian

The relationship to the old model: the old per-feature variance terms correspond to the diagonal of the new Sigma term. The new formulation lets us compute p(x) in one go, using the determinant of Sigma.

Original vs Gaussian
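
For contrast, a hedged sketch of the full multivariate Gaussian density, where the covariance matrix Sigma replaces the per-feature variances:

```python
import numpy as np

def multivariate_p(x, mu, Sigma):
    """Density of x under a Gaussian with mean mu and covariance Sigma."""
    n = len(mu)
    diff = x - mu
    norm = ((2 * np.pi) ** (n / 2)) * np.sqrt(np.linalg.det(Sigma))
    return np.exp(-0.5 * diff @ np.linalg.inv(Sigma) @ diff) / norm
```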

Recommender Systems

  • We can use content-based recommender systems, which use ratings to create our recommendations.
  • This brings us back to our gradient descent and minimization problems.
  • Note: the sum over i : r(i, j) = 1 means that we only sum over the items that user j has actually rated
  • We also add a sum over j = 1, …, n_u in order to include every user's recommendations

Problem Formulation
GD Update

  • When given x or theta, we can estimate the other
  • There is also a way to minimize both simultaneously
  • The optimization equations for each individually use the same gradient term; the regularization terms are the difference
    • Therefore we add both regularization terms to the combined equation
  • This gives us our collaborative filtering algorithm

Optimization Objective
Filtering Algorithm

Low rank matrix factorization

  • If we factor our ratings into matrices of item features and user parameters, we can easily create predictions, as sketched below
  • Now we can use mean normalization in order to make recommendations on items we have little data on

Low Rank - Collaborative Filtering
Mean Normalization
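
An illustrative sketch of the factorization view, with made-up sizes (nothing here is learned; it only shows where the predictions come from):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))      # 5 items x 3 learned features (made up)
Theta = rng.normal(size=(4, 3))  # 4 users x 3 learned parameters (made up)
predictions = X @ Theta.T        # entry (i, j): predicted rating of item i by user j
```
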
16. Large Scale Machine Learning and Pipeline

Previously, we learned that in many cases it is not a question of which algorithm performs best, but of who has more data. If we have a learning curve showing high variance, adding more training examples will decrease our error, while with high bias, adding more data won't do much.

Gradient Descent

  • While batch GD can be very accurate, it can be very computationally expensive: with, say, 300,000,000 training examples, we have to run through all 300 million before we move one step
  • Normal Gradient Descent is usually called Batch gradient descent

Stochastic Gradient Descent

  • Here we only take one training example and compute the gradient
  • Each step may not take us directly toward the global minimum, but on average the path heads there
  • Especially if we gradually lower our step size

Mini Batch

  • A sort-of in-between method: we use neither all m examples nor only 1, but b examples at a time (maybe b = 10 or b = 50)

Checking for Convergence
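
A hedged NumPy sketch of mini-batch gradient descent for linear regression (setting b = 1 reduces it to stochastic GD, b = m to batch GD; names are illustrative):

```python
import numpy as np

def minibatch_gd(X, y, theta, alpha=0.01, b=10, epochs=10):
    """b examples per update; b = 1 is stochastic GD, b = m is batch GD."""
    m = len(y)
    rng = np.random.default_rng(0)
    for _ in range(epochs):
        order = rng.permutation(m)  # shuffle the training set each pass
        for start in range(0, m, b):
            idx = order[start:start + b]
            grad = X[idx].T @ (X[idx] @ theta - y[idx]) / len(idx)
            theta = theta - alpha * grad
    return theta
```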

Online Learning: as new data comes in, we get information corresponding to a user/content interaction and continuously update our model in order to keep optimizing our algorithm

Map Reduce: We split the training set into smaller subsets, compute on each, and combine the results at the end. This takes pressure off our computational expense if we disperse the work between different machines/cores.

Pipeline: The idea of splitting a project into smaller tasks that individual groups can focus on

  • We can create data using artificial data synthesis in order to increase our training data size
    • Can be done by adding background noise to audio or distortion to an image
  • We also need to evaluate how helpful doing certain tasks will actually be to the overall objective
  • Does getting more data really help?
    • Be sure to check if its low bias (use learning curves)
  • How much work would it take to get more data?
    • Collecting and labeling it yourself can sometimes be easier than artificial data synthesis
    • Also crowd sourcing
  • Deciding on what part of pipeline should have most time dedicated to it
    • Manually correct different aspects and see how your accuracy improves
    • Can be a great way to see what requires more time and to avoid wasted time