Gradient
Derivatives
- We can also interpret the slope as:
- If I nudge x, how does y change?

Pretend that multivariable funciton is univariate, then take derivative as normal. This is a partial derivative.

- The gradient extends the derivative to multiple variables.
- Vector-valued function (always outputs a vector).
- The gradient of f(θ) w.r.t. θ is a vector. Each element is the partial derivative for component of θ.

Using the Gradient
Gradient w.r.t. θ encodes where θ should "move".
- Left: θ should move in the positive direction
- Right: θ should move in negative direction

- Left plot: gradient - , want θ +
- Right plt: graident +, want θ -
- Idea: Move θ in negative direction of gradient.
Gradient descent algorithm: nudge θ in negative gradient direction until θ converges.
Batch gradient descent update rule:

θ: Model weights
L: loss function
α: Learning rate, usually samll constant
y: Ture values from training data
Update model weights using update rule:

Gradient Descent for MSE
- For constant model using MSE loss, we have:

So the update rule is:

The Learning Rate

- α called the learning rate of gradient descent.
- Higher values can converge faster since θ changes more each iteration.
- But α too high might cause GD to never converge!
- α too low makes GD take many iterations to converge.
Question:

Answer:

In Matrix Form:

Update Rule:

Stochastic Gradient Descent (SGD)
- Can perform update using one data point instead of all the data points.

SGD usually converges faster than BGD in practice!
SGD Algorighm
- Initialize model weights to zero
- Shuffle input data.
- For each point in the shuffled input data, perform update:
- Repeat steps 2 and 3 until convergence.

Each individual update is called an iteration.
Each full pass through the data is called an epoch.
Example:
- BGD update rule for constant model, MSE loss:

- SGD update rule for constant model, MSE loss:

Why does SGD perform well?
Intuition: SGD "does more" with each data point.
- E.g. Sample of size 1 million
- BGD requires ≥1 million calculations for one update
- SGD will have updated weights 1 million times!
- However:
- BGD will always move towards optima
- SGD will sometimes move away because it only considers one point at a time.
- SGD still does generally better in spite of this.
Convexity
For a convex funtion f, any local mimnimum is also a global minimum.
- If loss function convex, gradient descent will always find the globally optimal minimizer.
Formally, f is convex if:

- RTA: If I draw a line between two points on curve, all values on curve need to be on or below line.
- MSE loss in convex:

Huber Loss
- What if we want the outlier-robust properties of MAE?
- Use Huber Loss:

- Behaves like MAE loss when points far from θ
- But like MSE loss when points close to θ

- ∂ is a parameter that determines where Huber loss transitions from MSE to MAE.
- No analytic solution (unlike MSE and MAE)
- But differentiable everywhere, so we can use GD!
- Try taking the derivative and running GD for practice.
'Computer Science π > Machine LearningπΌ' μΉ΄ν κ³ λ¦¬μ λ€λ₯Έ κΈ
| Foundation of Machine Learning (0) | 2023.05.31 |
|---|---|
| SQL in Pandas Review (0) | 2023.05.30 |
| Text Fields Review (0) | 2023.05.30 |
| EDA Review (0) | 2023.05.30 |
| Data Cleaning Review (0) | 2023.05.28 |