Gradient Descent


Gradient Descent is a way for machines to learn from data by adjusting their predictions to get closer to the correct answer. It's like walking down a staircase one small step at a time until you reach the bottom, which is the optimal solution.

Pseudo-code
function update_weights(x, y, m, c, alpha):
  n = len(x)

  # Step 1: Calculate the cost (average error), useful for monitoring progress
  J = 0
  for i from 0 to n-1:
    prediction = m * x[i] + c
    error = y[i] - prediction
    J += error^2
  J = J / n   # mean squared error

  # Step 2: Calculate gradients (direction to move)
  dJ_dm = 0
  dJ_dc = 0
  for i from 0 to n-1:
    prediction = m * x[i] + c
    error = y[i] - prediction
    dJ_dm += -2 * x[i] * error
    dJ_dc += -2 * error

  # Step 3: Update parameters (take a step downhill)
  m = m - alpha * (dJ_dm / n)
  c = c - alpha * (dJ_dc / n)

  return m, c

function gradient_descent(x, y, m, c, alpha, iterations):
  for step from 1 to iterations:
    m, c = update_weights(x, y, m, c, alpha)
    # Think: each loop is one small step downhill

  return m, c
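The pseudo-code above can be run almost verbatim in Python. Here is a minimal sketch, with toy data chosen for illustration (points on the line y = 2x + 1):

```python
# Runnable sketch of the pseudo-code above, fitting y = m*x + c to toy data.
# Names mirror the pseudo-code; the data and settings are illustrative.

def update_weights(x, y, m, c, alpha):
    n = len(x)
    # Gradients of the mean squared error with respect to m and c
    dJ_dm = 0.0
    dJ_dc = 0.0
    for i in range(n):
        error = y[i] - (m * x[i] + c)
        dJ_dm += -2 * x[i] * error
        dJ_dc += -2 * error
    # Take one step downhill, scaled by the learning rate
    m = m - alpha * (dJ_dm / n)
    c = c - alpha * (dJ_dc / n)
    return m, c

def gradient_descent(x, y, m, c, alpha, iterations):
    for _ in range(iterations):
        m, c = update_weights(x, y, m, c, alpha)
    return m, c

# Toy data lying exactly on the line y = 2x + 1
x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [1.0, 3.0, 5.0, 7.0, 9.0]
m, c = gradient_descent(x, y, m=0.0, c=0.0, alpha=0.05, iterations=2000)
print(round(m, 2), round(c, 2))  # should end up close to 2 and 1
```

Starting from the flat line m = 0, c = 0, the repeated small steps recover the true slope and intercept.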
                
  • x: The input data (like hours studied, size of a house, etc.). Think of it as the 'given information'.
  • y: The actual correct output. This is what we want our model to match (like real exam scores or house prices).
  • m: The slope of the line. It controls how steep the line is. Imagine tilting a stick up or down.
  • c: The y-intercept. It shifts the line up or down. Think of sliding the line vertically.
  • alpha: The learning rate. It controls how big each step is when adjusting the model. Small steps = slow but safe, big steps = fast but risky.
  • J: The cost (error). It measures how wrong the model is. Think of it like a 'score' where lower is better.
  • 0: Often used as a starting value (like m = 0, c = 0). This means we start with a flat line.
  • 0.01: A typical small learning rate. It ensures we take careful steps instead of jumping wildly.
  • n: The number of data points. We divide by n to get the average error.
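To see concretely how alpha trades safety for speed, here is a small illustrative sketch (the cost function, starting point, and step counts are all toy choices) that minimizes the one-parameter cost J(m) = (m - 3)^2 with a careful learning rate and with an oversized one:

```python
# Illustrative one-parameter example: minimize J(m) = (m - 3)**2,
# whose gradient is dJ/dm = 2 * (m - 3). The minimum sits at m = 3.

def descend(alpha, steps=30):
    m = 0.0  # start far from the minimum
    for _ in range(steps):
        m = m - alpha * 2 * (m - 3)  # one gradient step
    return m

small = descend(alpha=0.1)  # careful steps: settles near 3
big = descend(alpha=1.1)    # oversized steps: overshoots further each time
print(abs(small - 3) < 0.01)  # True: converged
print(abs(big - 3) > 100)     # True: the error grew instead of shrinking
```

With alpha = 0.1 each step shrinks the distance to the minimum; with alpha = 1.1 each step lands on the far side of the valley, further away than before, which is exactly the overshooting described above.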
Recipe Steps

1. Start with data points. Imagine dots on a graph (like study hours vs exam scores).

2. Guess a line. Start with a random line (your first guess). It will probably be wrong.

3. Measure how wrong you are. Calculate how far your line is from the real points (this is the cost J).

4. Adjust the line slightly. Move it in the direction that reduces the error, like adjusting a tilted board to better match the points.

5. Repeat many times. Each step improves the line little by little, like walking downhill toward the lowest point.

6. Stop when changes are tiny. You have reached the best-fit line.
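The recipe above can be sketched in Python with an explicit version of the last step, stopping once the updates to m and c become tiny (the data, learning rate, and tolerance are toy choices):

```python
# Sketch of the recipe with an explicit stopping rule: quit the loop
# once the parameter updates become tiny.

x = [1.0, 2.0, 3.0, 4.0]
y = [3.0, 5.0, 7.0, 9.0]   # points on y = 2x + 1

m, c = 0.0, 0.0            # an initial (wrong) guess: a flat line
alpha, tol = 0.05, 1e-9
n = len(x)

for step in range(100_000):
    # Gradients of the mean squared error
    dm = sum(-2 * xi * (yi - (m * xi + c)) for xi, yi in zip(x, y)) / n
    dc = sum(-2 * (yi - (m * xi + c)) for xi, yi in zip(x, y)) / n
    new_m, new_c = m - alpha * dm, c - alpha * dc
    # Stop when the changes are tiny
    if abs(new_m - m) < tol and abs(new_c - c) < tol:
        m, c = new_m, new_c
        break
    m, c = new_m, new_c

print(round(m, 3), round(c, 3))  # best-fit line, close to m=2, c=1
```

In practice people stop on a small change in the cost J, a small gradient, or a fixed iteration budget; the tiny-update test here is just one simple choice.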

Questions & Answers

Q: What is the goal of gradient descent?
A: To minimize the cost (error) and find the best model parameters.

Q: Why does repeating the update reduce the error?
A: Because each update moves downhill on the error graph toward the lowest point.

Q: What happens if the learning rate is too large?
A: The algorithm may overshoot and never settle, like jumping over the bottom of a hill.

Q: Is one update enough?
A: No, we keep updating until the error becomes small or stops improving.
- Always move in the direction that reduces error the most.
- Like walking down a foggy mountain. You cannot see the bottom, but you feel the slope and step downward.
- Too small = very slow, too big = unstable.
- Not normalizing data can make learning unstable.
- Gradient descent may find a local minimum, not always the global one.
- Plotting the cost over time helps check if learning is working.
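As a sketch of the last tip, the loop below records the cost J at every iteration; on a healthy run the recorded values shrink steadily (the data and learning rate are toy choices, and the history list could be handed to any plotting tool):

```python
# Track the cost J each iteration so you can plot or inspect it
# to confirm that learning is actually working.

x = [1.0, 2.0, 3.0]
y = [2.0, 4.0, 6.0]        # points on y = 2x
m, c, alpha = 0.0, 0.0, 0.05
history = []

for _ in range(200):
    errors = [yi - (m * xi + c) for xi, yi in zip(x, y)]
    J = sum(e * e for e in errors) / len(x)   # mean squared error
    history.append(J)
    # One gradient-descent step
    m -= alpha * sum(-2 * xi * e for xi, e in zip(x, errors)) / len(x)
    c -= alpha * sum(-2 * e for e in errors) / len(x)

# A healthy run shows the cost shrinking over time:
print(history[0] > history[-1])  # True
```

If the recorded costs bounce around or grow, that is the classic sign that alpha is too large or the data needs normalizing.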


Related Algorithms
Decision Tree (ID3 / CART)

A Decision Tree builds a flowchart-like structure that splits data step by step using features, until it can make a clear prediction.

K-Means Clustering

K-Means Clustering groups data points into k clusters by repeatedly assigning points to the nearest center and updating those centers.

Longest Common Subsequence (LCS)

The Longest Common Subsequence (LCS) finds the longest sequence of characters that appear in the same order in both sequences (not necessarily next to each other).
