1. Introduction
Two definitions of Machine Learning:
- “The field of study that gives computers the ability to learn without being explicitly programmed.” - Arthur Samuel (older, informal definition)
- “A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.” - Tom Mitchell (more modern definition)
Generally speaking, any machine learning problem can be put into one of two broad classifications:
- Supervised learning
- You are given a data set in which you already know what the correct output should look like, with the idea that there is a relationship between the input and the output
- Can be further categorized into:
- Regression Problems - trying to map input variables to some continuous function, predicting results within a continuous output
- Classification Problems - trying to map input variables into discrete categories, predicting results in a discrete output
- Unsupervised learning
- Allows us to approach problems with little or no idea what the output should look like; we derive structure from data where we do not necessarily know the effect of each variable
- This structure can be derived from clustering data based upon relationships among the variables in the data
2. Model and Cost Function
(x(i), y(i)) -> the i-th example of the training set for the model; x -> input and y -> output, with i = 1, …, m; h -> hypothesis
Linear Regression with One Variable
Hypothesis: h𝜃(x) = 𝜃0 + 𝜃1x, where the 𝜃i's are the parameters
Choose 𝜃0, 𝜃1 so that h𝜃(x) is close to y for our training examples (x, y). This gives a minimization problem over 𝜃0, 𝜃1 of J(𝜃0, 𝜃1) = (1/2m) * Sum[i=1 to m] (h𝜃(x(i)) − y(i))². This J is the Cost Function (also called the Squared Error Function; it is the Mean Squared Error halved).
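As a concrete illustration (a minimal numpy sketch of my own, not from the lecture materials; the array shapes are my assumption), the cost can be computed like this:

```python
import numpy as np

def compute_cost(X, y, theta):
    """Squared-error cost J(theta) = (1/2m) * sum((h(x) - y)^2).

    X: (m, 2) matrix whose first column is all ones (the x0 = 1 term),
    y: (m,) vector of targets, theta: (2,) parameter vector.
    """
    m = len(y)
    residuals = X @ theta - y          # h_theta(x^(i)) - y^(i) for each example
    return residuals @ residuals / (2 * m)

# Tiny made-up example: points on the line y = 1 + 2x give zero cost.
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([1.0, 3.0, 5.0])
print(compute_cost(X, y, np.array([1.0, 2.0])))  # 0.0
```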
3. Parameter Learning
Note: It must be a simultaneous update for each theta-i
Gradient descent takes the derivative of our cost function in order to get a direction to move in. The size of each step is determined by alpha, called the learning rate.
The derivative used is the partial derivative of the Cost Function with respect to each theta-i.
The algorithm is finished once it converges at the bottom of the graph, i.e. a minimum of the cost function where the derivative is zero.
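To make the simultaneous-update point concrete, here is a small numpy sketch (my own illustration with made-up data): both partial derivatives are computed from the old parameter values before either parameter is overwritten.

```python
import numpy as np

def gradient_descent_step(theta0, theta1, x, y, alpha):
    """One step of gradient descent for h(x) = theta0 + theta1 * x.

    Both partial derivatives use the OLD theta values; only then are the
    parameters overwritten -- the simultaneous update the note refers to.
    """
    errors = theta0 + theta1 * x - y              # h(x^(i)) - y^(i)
    temp0 = theta0 - alpha * errors.mean()        # dJ/dtheta0 = (1/m) sum(errors)
    temp1 = theta1 - alpha * (errors * x).mean()  # dJ/dtheta1 = (1/m) sum(errors * x)
    return temp0, temp1                           # assign both at once

x = np.array([0.0, 1.0, 2.0])
y = np.array([1.0, 3.0, 5.0])
theta0, theta1 = 0.0, 0.0
for _ in range(2000):
    theta0, theta1 = gradient_descent_step(theta0, theta1, x, y, alpha=0.1)
print(theta0, theta1)  # approaches 1.0 and 2.0
```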
4. Multivariate Linear Regression
Multivariate Linear Regression is just Linear Regression with more than one variable
An easy way to write the above formula is simply h𝜃(x) = 𝜃ᵀx (theta transpose times x).
We need to change our gradient descent algorithm in order to use multiple variables. Note that for each theta-j, we are just taking the partial derivative of the cost function, which brings down the xj term, removes the square, and cancels the 2 against the 1/2m factor.
Be sure to fully understand the equation (a vectorized sketch follows this list):
- Theta term - represents the different weights or parameters we associate with a given feature
- Alpha - learning rate or step size
- n - number of features
- h(x) - hypothesis equation, what we are guessing to fit the data
- m - number of training examples
- x(i) - feature values from training set
- y(i) - actual target values from the training set (what the hypothesis is trying to predict)
- J(theta) - cost function or squared error function, telling you the difference between true values and model
- *** We are taking the partial derivative of J(theta) with respect to theta-j ***
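Putting the terms above together, here is a hedged numpy sketch of the vectorized multivariate update (my own illustration; X is assumed to carry a leading column of ones for x0):

```python
import numpy as np

def gradient_descent(X, y, alpha=0.01, iters=20000):
    """Vectorized gradient descent for h(x) = theta^T x.

    For every j at once: theta_j -= alpha * (1/m) * sum((h(x) - y) * x_j),
    which is exactly the partial derivative of J(theta) w.r.t. theta_j.
    """
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(iters):
        theta -= alpha / m * X.T @ (X @ theta - y)
    return theta

# Made-up data with two features (plus the x0 = 1 column).
X = np.array([[1.0, 1.0, 2.0],
              [1.0, 2.0, 0.0],
              [1.0, 3.0, 1.0],
              [1.0, 4.0, 3.0]])
y = X @ np.array([0.5, 2.0, -1.0])   # exact linear target
print(gradient_descent(X, y))        # approaches [0.5, 2.0, -1.0]
```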
Feature Scaling: If we have our features on a similar scale, then gradient descent can converge more quickly. We can easily scale them by dividing each feature by its maximum value, which shrinks it into the range 0 <= xi <= 1. 'Get every feature into approximately a -1 <= xi <= 1 range.' This is not a hard-set rule; any range that is not extremely large or extremely small works. Try to keep each feature's range between roughly -3 to 3 at the largest and -1/3 to 1/3 at the smallest.
Mean Normalization: Replace xi with xi − μi to make features have approximately zero mean (do not apply to x0 = 1). Basically we are making each feature average out to around zero, so subtract the mean μi (the average value of xi) and divide by the range (max − min).
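A minimal sketch of both techniques together (my own numpy illustration; the housing-style numbers are made up):

```python
import numpy as np

def mean_normalize(X):
    """Scale each feature to roughly [-1, 1] with ~zero mean.

    x_j <- (x_j - mu_j) / (max_j - min_j). The x0 = 1 column, if present,
    should be added AFTER scaling, so it is not normalized away.
    """
    mu = X.mean(axis=0)                            # average value of each feature
    feature_range = X.max(axis=0) - X.min(axis=0)  # max - min per feature
    return (X - mu) / feature_range, mu, feature_range

# Made-up features with very different scales (e.g. size in ft^2, #bedrooms).
X = np.array([[2104.0, 3.0], [1416.0, 2.0], [852.0, 1.0]])
X_norm, mu, rng = mean_normalize(X)
print(X_norm)   # every column now has zero mean and a range of 1
```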
A good way to debug gradient descent is to plot the cost function over the number of iterations. If J(theta) is increasing then use a smaller alpha.
If alpha is sufficiently small, then J(theta) will decrease on every iteration.
Below is another tip to improve Gradient Descent:
Note: if we do fit our model with quadratic, cubic, or square-root terms, we definitely need to feature scale, because the feature ranges compound on one another once values are squared or cubed.
5. Computing Parameters Analytically
Normal equation: Method to solve for Theta analytically
To minimize - take derivative and set to zero
***taken from https://eli.thegreenplace.net/2014/derivation-of-the-normal-equation-for-linear-regression to calculate the derivation ***
*** To fully understand the last step of “we derive by each component of the vector, and then combine the resulting derivatives into a vector again,” see https://eli.thegreenplace.net/2015/the-normal-equation-and-matrix-calculus/ *** - it involves the use of matrix identities, symmetric matrices, and a working understanding of partial derivatives and matrices in general
THERE IS NO NEED TO DO FEATURE SCALING WITH THE NORMAL EQUATION
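A minimal numpy sketch of the normal equation (my own illustration; using the pseudoinverse pinv, which also covers the non-invertible case discussed below):

```python
import numpy as np

def normal_equation(X, y):
    """Solve for theta analytically: theta = (X^T X)^{-1} X^T y.

    pinv (the pseudoinverse) is used instead of a plain inverse so the
    computation still works when X^T X is non-invertible.
    """
    return np.linalg.pinv(X.T @ X) @ X.T @ y

X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])  # leading column of ones
y = np.array([1.0, 3.0, 5.0])
print(normal_equation(X, y))  # [1. 2.] -- no alpha and no iterations needed
```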
Gradient Descent
- Need to choose alpha
- Needs many iterations
- O(kn^2)
- Works well even when n is large
Normal Equation
- No need to choose alpha
- Don’t need to iterate
- Need to compute inverse O(n^3)
- Slow if n is very large
If XᵀX is non-invertible (singular or degenerate): remember that non-invertibility is due to redundant features (linearly dependent, in mathematical terms) or too many features (m <= n), in which case we either delete some features or use regularization.
6. Hypothesis Representation and Decision Boundary
The hypothesis uses the statistical logistic/sigmoid function, g(z) = 1 / (1 + e^(−z)), which has asymptotes at y = 0 and y = 1 as z goes towards negative infinity and infinity respectively.
NOTE: Remember that the decision boundary is a trait of the hypothesis, not of the training set. It comes from theta = …
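A small sketch tying the sigmoid to the decision boundary (my own numpy illustration; the theta values are hypothetical):

```python
import numpy as np

def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + e^-z), asymptotes at 0 and 1."""
    return 1.0 / (1.0 + np.exp(-z))

def predict(theta, X):
    """Predict y = 1 whenever h(x) = g(theta^T x) >= 0.5,
    i.e. whenever theta^T x >= 0 -- that inequality IS the decision
    boundary, and it depends only on theta, not on the training set."""
    return (X @ theta >= 0).astype(int)

# Hypothetical theta: boundary x1 + x2 = 3 (predict 1 when x1 + x2 >= 3).
theta = np.array([-3.0, 1.0, 1.0])
X = np.array([[1.0, 1.0, 1.0],   # x1 + x2 = 2 -> predict 0
              [1.0, 2.0, 2.0]])  # x1 + x2 = 4 -> predict 1
print(predict(theta, X))         # [0 1]
```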
7. Logistic Regression Model
-log(h(x)) if y = 1 and -log(1 - h(x)) if y = 0 are very useful for optimizing our cost function, as they allow us to 'punish' our learning algorithm for values that deviate from the expected values, due to how they approach infinity along the y-axis.
This also ensures our cost function will be convex so that we do not have any local minima
We can rewrite our cost function so that it only takes one line. *Remember that y always equals exactly 0 or 1; this is the mathematical definition of y for binary classification.
Cost(h(x),y) = -ylog(h(x)) - (1-y)log (1 - h(x))
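A quick numpy sketch of that one-line cost (my own illustration; the data is made up):

```python
import numpy as np

def logistic_cost(theta, X, y):
    """J(theta) = (1/m) * sum(-y*log(h) - (1-y)*log(1-h)) with h = g(X theta).

    The single-line form works because y is always exactly 0 or 1,
    switching on one log term and zeroing out the other.
    """
    m = len(y)
    h = 1.0 / (1.0 + np.exp(-(X @ theta)))
    return (-y @ np.log(h) - (1 - y) @ np.log(1 - h)) / m

X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([0.0, 0.0, 1.0])
print(logistic_cost(np.array([0.0, 0.0]), X, y))  # log(2) ~ 0.693 at theta = 0
```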
Optimization algorithms:
- Gradient descent
- Conjugate gradient
- BFGS
- L-BFGS
Advantages (Conjugate gradient, BFGS, L-BFGS):
- No need to manually pick alpha
- Often faster than gradient descent
Disadvantages:
- More complex
These algorithms are more efficient and sophisticated than gradient descent, but we do not need to write them ourselves; we can instead call them from Octave libraries (e.g., fminunc).
Now, if we have multiple classes to classify, we can use the one-vs-all classification approach. For each class we predict the probability that y is a member of that class, while combining all the other classes into one 'negative' class compared to the characteristic we are looking for.
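A rough sketch of the prediction side of one-vs-all (my own numpy illustration; the Thetas matrix is hypothetical, as if one logistic classifier had already been trained per class):

```python
import numpy as np

def one_vs_all_predict(Thetas, x):
    """One-vs-all: one logistic classifier per class.

    Thetas is a (K, n) matrix -- row k is the theta vector trained to
    separate class k (positive) from everything else (the lumped
    'negative' class). We pick the class whose classifier is most
    confident, i.e. the one with the highest h_theta(x).
    """
    scores = 1.0 / (1.0 + np.exp(-(Thetas @ x)))  # probability per class
    return int(np.argmax(scores))

# Hypothetical parameters for 3 classes over features [1, x1, x2].
Thetas = np.array([[ 1.0, -2.0,  0.5],
                   [-1.0,  1.5, -0.5],
                   [ 0.0,  0.2,  1.0]])
print(one_vs_all_predict(Thetas, np.array([1.0, 2.0, 0.5])))  # class 1 wins here
```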
8. Solving the Problem of Overfitting
“Underfit”/“High Bias” - using a linear prediction model when a higher-order model would be more fitting
“Overfit”/“High Variance” - using too high a degree of polynomial for the model
Overfitting: If we have too many features, the learned hypotheses may fit the training set very well but fail to generalize to new examples.
Note: This terminology is applied to both linear and logistic regression
We can help to mitigate the issue of overfitting through two different options:
- Reducing the number of features
- Regularizing the magnitude of parameters
The idea behind regularization is getting a “simpler” hypothesis by penalizing parameters so that they take small values, which makes the hypothesis less prone to overfitting.
Modify cost function to add a term at the end (regularization term) that will include lambda (regularization parameter). It controls the tradeoff of keeping theta parameters small and fitting the training set well.
If lambda is set too high, then we will underfit the data by making our theta terms too small - the hypothesis becomes basically a flat line.
Regularized linear regression:
Regularized logistic regression:
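The two cases share the same regularization machinery. As a rough sketch (my own numpy illustration, written for the linear case; the logistic version is identical except that h passes through the sigmoid), the gradient gains a (lambda/m) * theta_j term for j >= 1, while theta_0 is left unpenalized:

```python
import numpy as np

def regularized_gd_step(theta, X, y, alpha, lam):
    """One gradient step on J(theta) + (lambda/2m) * sum_{j>=1} theta_j^2.

    theta_0 (the bias term) is deliberately left out of the penalty.
    """
    m = len(y)
    grad = X.T @ (X @ theta - y) / m     # unregularized gradient
    reg = (lam / m) * theta              # shrink parameters toward zero...
    reg[0] = 0.0                         # ...but never penalize theta_0
    return theta - alpha * (grad + reg)

X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([1.0, 3.0, 5.0])
theta = np.zeros(2)
for _ in range(2000):
    theta = regularized_gd_step(theta, X, y, alpha=0.1, lam=0.1)
print(theta)   # slightly shrunk relative to the unregularized [1.0, 2.0]
```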
9. Neural Networks
We are modeling a neural network by taking in input and performing some calculation and spitting out an output.
Sigmoid (logistic) activation function.
The network has multiple layers:
- Layer 1 - input layer
- Layer 2 - hidden layer (we do not observe its values)
- Layer 3 - output layer
Essentially, our parameters are now called weights, but they act in the same way. There is not a lot of new information, just renaming of certain aspects we have already covered. The new addition is the progression x -> a -> h(x).
We are essentially doing the same thing as logistic regression, our features are just now a0 a1 a2 and a3 instead of just x0 x1 x2 x3
These are learned features, and they can be fairly complex, which adds flexibility in what gets fed into the calculation for our output layer.
Architectures refer to how the neurons are connected.
Below is an example of predicting x1 AND x2 (the logical 'and' operator) and x1 OR x2 (true when x1 is true, x2 is true, or both are true).
Now we can make a more complex logical operator (in this case the XNOR logical operator) using a neural network.
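The sketch below (my own numpy illustration, using the AND/OR/(NOT x1 AND NOT x2) weight values commonly quoted from the lectures) builds XNOR out of two hidden units and one output unit:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def neuron(theta, x1, x2):
    """One sigmoid unit over inputs [1, x1, x2]."""
    return sigmoid(theta[0] + theta[1] * x1 + theta[2] * x2)

AND     = np.array([-30.0, 20.0, 20.0])   # fires only when x1 = x2 = 1
OR      = np.array([-10.0, 20.0, 20.0])   # fires when either input is 1
NEITHER = np.array([ 10.0, -20.0, -20.0]) # (NOT x1) AND (NOT x2)

for x1 in (0, 1):
    for x2 in (0, 1):
        a1 = neuron(AND, x1, x2)          # hidden unit 1
        a2 = neuron(NEITHER, x1, x2)      # hidden unit 2
        xnor = neuron(OR, a1, a2)         # output layer: a1 OR a2 = XNOR
        print(x1, x2, round(xnor))        # 1 when x1 == x2, else 0
```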
This can be taken further to have a multiclass classification by returning a vector.
10. NN Cost Function and Backpropagation
Now we want to find parameters that minimize J(Θ). So, given a training set, we need to compute J(Θ) and the partial derivatives of J(Θ) with respect to each parameter in Θ.
If you had two training examples (x1, y1) and (x2, y2), you would perform forward propagation on x1, then backpropagation using y1, then forward propagation on x2 and backpropagation using y2.
Some more walkthrough on Back-Propagation.
Now we can ‘unroll’ our parameter matrices into a single vector to make them easier to pass to our more advanced optimization algorithms.
Gradient Checking: Make sure your forward/back propagation is working correctly and accurately.
Gradient checking is very slow and costly compared to the backprop code, which is why we turn it off, once backprop is verified, and train with backprop alone.
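A minimal sketch of the two-sided numerical gradient used for checking (my own numpy illustration):

```python
import numpy as np

def numerical_gradient(J, theta, eps=1e-4):
    """Two-sided difference approximation used for gradient checking:
    dJ/dtheta_i ~ (J(theta + eps*e_i) - J(theta - eps*e_i)) / (2*eps).

    Two cost evaluations per parameter -- that is why it is far too slow
    to train with, and is switched off after verifying backprop.
    """
    grad = np.zeros_like(theta)
    for i in range(len(theta)):
        step = np.zeros_like(theta)
        step[i] = eps
        grad[i] = (J(theta + step) - J(theta - step)) / (2 * eps)
    return grad

# Sanity check on a cost with a known gradient: J = sum(theta^2), dJ = 2*theta.
theta = np.array([1.0, -2.0, 3.0])
print(numerical_gradient(lambda t: np.sum(t ** 2), theta))  # ~[2, -4, 6]
```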
We can use random initialization to break symmetry and ensure that we have weights that will work; if all weights start out equal, every hidden unit computes the same function.
Putting it all together:
11. Evaluating a Learning Algorithm
Machine Learning Diagnostic: Testing what is and isn't working with a learning algorithm and gaining guidance on how to best improve it.
High Bias -> Underfit High Variance -> Overfit
As we increase the degree of polynomial d, the training error tends to decrease. The cross-validation error has a more quadratic (U-shaped) appearance: it decreases and then rises again. By looking at the cross-validation error, we can determine whether the model is underfit or overfit.
Bias (underfit): Jtrain(theta) will be high and Jcv(theta) ≈ Jtrain(theta)
Variance (overfit): Jtrain(theta) will be low and Jcv(theta) >> Jtrain(theta)
Knowing what the actual problem is (bias vs. variance) is the main component in trying to “decide what to do next”.
12. Machine Learning System Design
A major takeaway from this section is the idea to list out all the possible ways you can approach the problem. Far too often someone will come up with an idea then instantly devote 6 months of time to that when there may be an alternative approach that would be more effective.
“Use evidence to guide our learning, not gut feeling.”
“A key test I often ask myself is: first, can a human expert look at the features x and confidently predict the value of y? Because that's sort of a certification that y can be predicted accurately from the features x. And second, can we actually get a large training set and train a learning algorithm with a lot of parameters on it? If you can do both, then that will more often give you a high-performance learning algorithm.”
13. Support Vector Machines
We take logistic regression and modify it a little bit into a straight line that mimics logistic regression.
These new functions are called cost subscript 1 of z and cost subscript 0 of z (denoting when y is equal to 1 versus when y is equal to zero).
We now add a constant C multiplied on the first term A, playing the role that lambda plays in our regularization term. If C = 1/lambda then the two optimization problems are equivalent.
This is our overall optimization objective function for the SVM and if we minimize that function then we have the parameters learned by the SVM.
Unlike logistic regression, the support vector machine doesn't output a probability: it outputs one if theta transpose x is greater than or equal to zero and predicts zero otherwise.
Tries to establish a large margin between its decision boundary. It is also called the large margin classifier.
If C is very large, the large margin classifier can be susceptible to outliers.
Kernels: similarity functions (e.g., the Gaussian kernel) computed between x and chosen landmark points, giving new features f(i) = similarity(x, l(i)) that let the SVM learn complex non-linear decision boundaries.
14. Clustering, Dimensionality Reduction and PCA
Unsupervised learning, we give the training data X with no corresponding y values. We can do this with Clustering algorithms. K-means algorithm is a good starting example.
K-Means Clustering
- Works by placing cluster centroids in our data set (placed randomly, or by randomly selecting data points)
- For this example we will choose 2 centroids; all the other data points then join whichever cluster centroid they are closer to
- Next, each centroid moves to the ‘averaged middle’ of its data points, and the process repeats until the centroids have converged
- The algorithm is outlined below as a two-step process (see the sketch after this list)
- Cluster assignment step
- Move centroid
K-means optimization objective
- This algorithm is optimizing its own version of the cost function in order to minimize the distance between the centroids and the data points
- There is the possibility that when we are initializing our centroids they can converge at a local minimum instead of the global minimum
- In order to mitigate this, we use random initialization many times and pick the clustering that gave the lowest cost
- When choosing the value of K we can use the elbow method, but in general this is not a reliable method
- You should instead evaluate K-means based on a metric for how well it performs for that later purpose
- EX: T-shirts are pricey to order in 5 sizes instead of 3 so choose k = 3
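As a sketch of the two-step loop outlined above (my own numpy illustration; made-up 2-D data, and empty clusters are not handled):

```python
import numpy as np

def k_means(X, k, iters=100, seed=0):
    """Plain K-means: the two alternating steps from the outline above.

    Cluster assignment: give each point to its closest centroid.
    Move centroid: move each centroid to the mean of its points.
    """
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]  # random data points
    for _ in range(iters):
        # Cluster assignment step: index of the nearest centroid per point.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move centroid step: each centroid goes to its cluster's mean.
        centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    return centroids, labels

X = np.vstack([np.random.default_rng(1).normal(0, 0.5, (20, 2)),
               np.random.default_rng(2).normal(5, 0.5, (20, 2))])
centroids, labels = k_means(X, k=2)
print(centroids)   # one centroid near (0, 0), the other near (5, 5)
```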
Before we do Dimensionality Reduction, we want to do some data preprocessing. This means feature scaling and mean normalization.
Now we can reduce data from a higher dimension (say 3D) to a lower dimension (say 2D)
- Now PCA works to minimize the squared projection errors and find a vector u(1) (or u(1) and u(2) if we are going from 3D to 2D) to project onto.
- In the example below, we are projecting data with 3 coordinates onto a 2D plane
Principal Component Analysis
- This analysis works by calculating the projection error between our data points and the surface we are projecting them onto
- Then we are simply doing another optimization of this cost function
- NOTE: this is different than linear regression - LR measures error distance strictly vertically while PCA is orthogonal to the plane being projected onto
- Can account for differences despite being very visually similar
PCA Algorithm
- Reduce data from n-dim to k-dim
- Compute the “covariance matrix”: Sigma = (1/m) * Sum[i=1 to m] (x(i)) * (x(i))ᵀ
- Compute the eigenvectors of the matrix Sigma
- We can do this easily by using the svd() function
- This gives us 3 matrices U, S, V, where U is an n x n matrix whose columns are the u vectors we want
Number of Principal Components
- In order to choose k we will have to test multiple values
- We take the ratio of the Average Squared Projection Error to the Total Variation in the data and evaluate it
- If this ratio is <= 0.01 we say “99% of variance is retained”, so we can use a set threshold in order to determine our k
- We can also use that svd() function from earlier: with the S matrix it calculates, the variance retained is Sum[i=1 to k] S(i,i) / Sum[i=1 to n] S(i,i)
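A compact numpy sketch of the whole pipeline, assuming the data is already mean-normalized (my own illustration):

```python
import numpy as np

def pca(X, k):
    """PCA via SVD of the covariance matrix, as in the outline above.

    X is assumed to be already mean-normalized. Returns the projected
    data Z and the fraction of variance retained by the first k components.
    """
    m = X.shape[0]
    Sigma = X.T @ X / m                  # covariance matrix (n x n)
    U, S, _ = np.linalg.svd(Sigma)       # U's columns are the u vectors
    U_reduce = U[:, :k]                  # first k principal components
    Z = X @ U_reduce                     # project n-dim data down to k-dim
    variance_retained = S[:k].sum() / S.sum()
    return Z, variance_retained

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 1)) @ np.array([[2.0, 1.0, 0.5]])  # ~1-dim data in 3D
X -= X.mean(axis=0)
Z, retained = pca(X, k=1)
print(retained)   # ~1.0: one component keeps essentially all the variance
```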
Application of PCA
- Compression
- Reduces memory/disk needed to store data
- Speeds up learning algorithm
- Done by choosing k by % of variance retained
- Visualization
- K = 2 or 3
- Do Not Use…
- Trying to reduce overfitting, it may work okay but regularization is a better idea
- You should always run with the original/raw data first, and only if that does not do what you want should you reduce the dimension
15. Anomaly Detection and Recommender Systems
The main idea consists of creating a density estimation p(x) of the dataset. Then we check whether our x-test is anomalous by determining whether it falls inside our density estimate, i.e. by comparing p(x-test) to epsilon.
Using the mean and variance we can fit a (normal) Gaussian distribution to x in order to plot a bell curve. The integral of the curve will always be 1 (by the nature of the equation).
- As you decrease the variance, the graph gets narrower (or wider if increased)
- Mean value will determine where the graph is centered
Anomaly detection algorithm
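A minimal numpy sketch of the algorithm (my own illustration; the epsilon threshold and data are made up):

```python
import numpy as np

def fit_gaussians(X):
    """Estimate the mean and variance of each feature from the normal data."""
    return X.mean(axis=0), X.var(axis=0)

def p(x, mu, var):
    """Density estimate p(x) = product over features of the univariate
    Gaussian pdf; an example is flagged anomalous when p(x) < epsilon."""
    coeff = 1.0 / np.sqrt(2 * np.pi * var)
    return np.prod(coeff * np.exp(-((x - mu) ** 2) / (2 * var)))

rng = np.random.default_rng(0)
X = rng.normal(loc=[5.0, 10.0], scale=[1.0, 2.0], size=(500, 2))
mu, var = fit_gaussians(X)
epsilon = 1e-4                                        # hypothetical threshold
print(p(np.array([5.0, 10.0]), mu, var) < epsilon)    # False: typical point
print(p(np.array([0.0, 25.0]), mu, var) < epsilon)    # True: anomaly
```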
Classifying an example (y = 0 or 1) helps us have real-number evaluation so that we can more successfully evaluate our learning algorithm. This allows us to do cross validation and use different evaluation metrics (precision, recall, F1 score)
Anomaly Detection vs. Supervised Learning
Anomaly detection
- Fraud detection
- Manufacturing (engines)
- Monitoring machines in a data center
Supervised Learning
- Email spam classification
- Weather prediction
- Cancer classification
Sometimes our features give us a graph that is a non-gaussian distribution.
- We can change our features in order to try to alter the graph to appear more gaussian
- EX: x1 := log(x1) or x2 := sqrt(x2)
Multivariate Gaussian distribution
- By modeling p(x) all in one go instead of separately we can gain more information from the graph.
- Our parameters are: mean and Sigma (covariance matrix)
- This can alter our distribution to look more like an ellipse instead of a circle
The relationship to the old model is the difference between the per-feature variance terms and our new Sigma term. This new term allows us to compute p(x) in one go, since we are just taking the determinant of Sigma.
Recommender Systems
- We can use content-based recommender systems to use ratings to create our recommendations.
- This is going back to our gradient descent and minimization problems.
- Note: the sum over i:r(i,j)=1 means that we only sum over the terms for which we have rated content
- We also add a sum over j = 1 to n_u in order to include every user's recommendations
- When given x or theta, we can estimate the other
- There is also a way to minimize both simultaneously
- The optimization equations for each individually use the same gradient term; the regularization terms are where they differ
- Therefore we add both to our equation
- This gives us our collaborative filtering algorithm
Low rank matrix factorization
- If we factor our content recommendations into matrices, we can easily create predictions
- Now we can use mean normalization in order to make recommendations on things we have little data on
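A tiny numpy sketch of the low-rank factorization idea (my own illustration; the learned X and Theta factors are hypothetical):

```python
import numpy as np

# Hypothetical learned factors: X holds a feature row per movie, Theta a
# parameter row per user (both found by collaborative filtering). The full
# prediction matrix is then a single product.
X = np.array([[0.9, 0.1],       # movie 1: mostly "romance"
              [0.1, 1.0]])      # movie 2: mostly "action"
Theta = np.array([[5.0, 0.0],   # user 1 likes romance
                  [0.0, 5.0]])  # user 2 likes action
predictions = X @ Theta.T       # predictions[i, j] = rating of movie i by user j
print(predictions)
```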
16. Large Scale Machine Learning and Pipeline
Previously, we learned that in many cases it is not a question of which algorithm runs the best but instead who has more data. If we have a learning curve with high variance, we can add more training examples in order to decrease our error, while with high bias adding more data won’t do much.
Gradient Descent
- While GD can be very accurate, it can be very computationally expensive if we have say 300,000,000 training examples as we have to run through all 300 million before we move one step
- Normal Gradient Descent is usually called Batch gradient descent
Stochastic Gradient Descent
- Here we only take one training example and compute the gradient
- This may not take us directly to the global minimum, but on average it moves toward it (the parameters end up oscillating around the minimum)
- Especially if we gradually lower our step size
Mini Batch
- A sort-of in-between method: we use neither all m examples nor only 1, but instead b examples (maybe b = 10 or b = 50)
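A rough numpy sketch of mini-batch gradient descent for linear regression (my own illustration; setting b = 1 recovers stochastic gradient descent and b = m recovers batch):

```python
import numpy as np

def minibatch_gd(X, y, alpha=0.05, b=10, epochs=50, seed=0):
    """Mini-batch gradient descent: each step uses only b examples,
    so parameters start improving long before a full pass over the data."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(epochs):
        order = rng.permutation(m)            # shuffle the training set
        for start in range(0, m, b):
            batch = order[start:start + b]
            Xb, yb = X[batch], y[batch]
            theta -= alpha / len(batch) * Xb.T @ (Xb @ theta - yb)
    return theta

rng = np.random.default_rng(1)
X = np.column_stack([np.ones(1000), rng.normal(size=1000)])
y = X @ np.array([1.0, 2.0]) + rng.normal(scale=0.1, size=1000)
print(minibatch_gd(X, y))   # close to [1.0, 2.0]
```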
Checking for Convergence: plot the cost averaged over, say, the last 1000 examples as the algorithm runs, rather than recomputing the full J(theta) over the whole training set.
Online Learning: We get information corresponding to a user/content and forever update our cost function in order to better optimize our algorithm as new data comes in
Map Reduce: We are basically splitting the training set into smaller subsets and combining our results at the end. This takes pressure off our computational expense if we disperse the work between different machines/cores.
Pipeline: The idea of splitting a project into smaller tasks that individual groups can focus on.
- We can create data using artificial data synthesis in order to increase our training data size
- Can be done by adding background noise to audio or distortion to an image
Also need to evaluate how helpful doing certain tasks will actually be to the overall objective:
- Does getting more data really help?
- Be sure to check that it's a low-bias (high-variance) model first (use learning curves)
- How much work would it take to get more data?
- Collecting and labeling it yourself can sometimes be easier than artificial data synthesis
- Also crowd sourcing
- Deciding on what part of pipeline should have most time dedicated to it
- Manually provide correct output for each stage in turn (ceiling analysis) and see how the overall accuracy improves
- Can be a great way to see what requires more time and to avoid wasted time