Classification is a type of machine learning task where the goal is to predict the class or category of a given input. This is as opposed to linear regression, which I wrote about last week. Unlike linear regression, which aims to output a number (based on the square footage of my house and the past year's worth of data, what can I expect my house to sell for?), classification only has a handful of possible outputs at the end (is this a dog or a cat?). In this article, I'll focus on the most widely used classification method, logistic regression.
First, let's briefly look at why we can't just use linear regression and assign a threshold.
Linear regression is not ideal for classification tasks because:

- It focuses on finding the best-fit line, which may shift with additional training examples, thus altering the decision boundary (the point at which we decide the classification).
- It can lead to incorrect classifications, especially for categorical data. For example, tumors larger than a certain size might still be classified as malignant, even if they are benign.
The above image illustrates how linear regression fails to correctly classify data points for categorical data. This is because linear regression gives us a continuous best-fit line rather than one of the two categories.
While linear regression might sometimes get lucky and classify the data correctly, it's not reliable for categorical data. If we're trying to classify a tumor, accuracy becomes very important.
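To make this concrete, here's a minimal sketch with made-up tumor sizes and labels (purely for illustration) that fits an ordinary least-squares line to 0/1 labels and shows how a single very large example shifts the point where the line crosses a 0.5 threshold:

```python
import numpy as np

# Toy data: tumor size (cm) and label (0 = benign, 1 = malignant), purely illustrative
sizes = np.array([1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5])
labels = np.array([0, 0, 0, 0, 1, 1, 1, 1])

def threshold_crossing(x, y):
    """Fit a least-squares line to (x, y) and return the x where it crosses 0.5."""
    slope, intercept = np.polyfit(x, y, 1)
    return (0.5 - intercept) / slope

print(f"Decision boundary: {threshold_crossing(sizes, labels):.2f} cm")

# Add one very large malignant tumor: the fitted line tilts and the boundary
# shifts to the right, even though the original examples haven't changed
sizes_outlier = np.append(sizes, 15.0)
labels_outlier = np.append(labels, 1)
print(f"Decision boundary with outlier: {threshold_crossing(sizes_outlier, labels_outlier):.2f} cm")
```

With the extra example, the crossing point moves past the smallest malignant tumor in the original data, so that tumor would now be misclassified as benign.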
Binary classification refers to the type of classification where there are only two possible outcomes, such as "Yes/No" or "True/False." The tumor example above is binary if the only possible outcomes are benign or malignant.
Classification algorithms can also have more than two categories. For example, we may want to further classify the tumor as benign, malignant type 1, or malignant type 2. This may be more useful, but it is no longer binary.
Logistic regression is a more suitable alternative for classification tasks. It uses a logistic function to model the probability of a binary outcome and can effectively create a decision boundary for classification.
Though we're not using linear regression as stated previously, you can think of logistic regression as building on top of linear regression.
Mathematically, the sigmoid is defined by the following equation, where $e$ is a well-known constant in the field of mathematics (Euler's number, approximately 2.718):

$$g(z) = \frac{1}{1 + e^{-z}}$$

In the sigmoid, the value of $z$ is the familiar linear regression equation:

$$z = w \cdot x + b$$
Before moving forward, let's see the sigmoid in NumPy code:
import numpy as np
import matplotlib.pyplot as plt

def sigmoid(z):
    """
    Compute the sigmoid of z

    Args:
        z (ndarray): A scalar, numpy array of any size.

    Returns:
        g (ndarray): sigmoid(z), with the same shape as z
    """
    g = 1 / (1 + np.exp(-z))

    return g

# Generate an array of evenly spaced values between -10 and 10
z_tmp = np.arange(-10, 11)

# Use the function implemented above to get the sigmoid values
y = sigmoid(z_tmp)

# Code for pretty printing the two arrays next to each other
np.set_printoptions(precision=3)
print("Input (z), Output (sigmoid(z))")
print(np.c_[z_tmp, y])

# Plot z vs sigmoid(z)
# subplots is used to create a grid of 1 row and 1 column of plots: 1 plot
# subplots(2,2) would create 4 total plots
fig, ax = plt.subplots(1, 1, figsize=(5, 3))
ax.plot(z_tmp, y, c="b")
ax.set_title("Sigmoid function")
ax.set_ylabel('sigmoid(z)')
ax.set_xlabel('z')
Input (z), Output (sigmoid(z))
[[-1.000e+01 4.540e-05]
[-9.000e+00 1.234e-04]
[-8.000e+00 3.354e-04]
[-7.000e+00 9.111e-04]
[-6.000e+00 2.473e-03]
[-5.000e+00 6.693e-03]
[-4.000e+00 1.799e-02]
[-3.000e+00 4.743e-02]
[-2.000e+00 1.192e-01]
[-1.000e+00 2.689e-01]
[ 0.000e+00 5.000e-01]
[ 1.000e+00 7.311e-01]
[ 2.000e+00 8.808e-01]
[ 3.000e+00 9.526e-01]
[ 4.000e+00 9.820e-01]
[ 5.000e+00 9.933e-01]
[ 6.000e+00 9.975e-01]
[ 7.000e+00 9.991e-01]
[ 8.000e+00 9.997e-01]
[ 9.000e+00 9.999e-01]
[ 1.000e+01 1.000e+00]]
Notice that the y axis is always between zero and one, making this ideal for classification. Notice also that as $z$ approaches $\infty$, the sigmoid approaches 1, and as $z$ approaches $-\infty$, the sigmoid approaches 0.
The trouble with applying the cost function that we used for linear regression (squared error cost) to the sigmoid is that it produces a non-convex cost surface with several local minima that gradient descent could settle into. We instead want a cost function that produces a convex surface so that gradient descent can approach a single global minimum.
To get a convex curve, we'll apply some logarithms. The cost function is defined as:

$$J(w,b) = \frac{1}{m} \sum_{i=1}^{m} L\left(f_{w,b}(x^{(i)}), y^{(i)}\right)$$

Where:

- $f_{w,b}(x^{(i)}) = g(w \cdot x^{(i)} + b)$ is the model's prediction for example $i$ (the sigmoid applied to the linear equation), and
- $L$ is the loss function, the cost associated with a single training example.

The cost function looks complex, but if we have two cases that we're classifying (we'll use 0 and 1), then the loss for each training example becomes:

$$L\left(f_{w,b}(x^{(i)}), y^{(i)}\right) = -y^{(i)} \log\left(f_{w,b}(x^{(i)})\right) - \left(1 - y^{(i)}\right) \log\left(1 - f_{w,b}(x^{(i)})\right)$$
Mathematically this works because if $y^{(i)}$ is 0 then the first term cancels, and if $y^{(i)}$ is 1 then the second term cancels.
This achieves the result of making sure the cost function is convex, meaning that it has a single global minimum, making it easier to optimize using gradient descent.
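To see the cancellation in action, here's a tiny sketch (the prediction values are made up for illustration) that evaluates the single-example loss for both labels:

```python
import numpy as np

def loss(f_wb, y):
    """Logistic loss for a single example with prediction f_wb and label y."""
    return -y * np.log(f_wb) - (1 - y) * np.log(1 - f_wb)

# When y = 1, only the first term is active: a confident correct prediction is cheap,
# while a confident wrong one is heavily penalized
print(loss(0.99, 1))  # ~0.01
print(loss(0.01, 1))  # ~4.6

# When y = 0, only the second term is active
print(loss(0.01, 0))  # ~0.01
print(loss(0.99, 0))  # ~4.6
```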
Here is an example implementation of the logistic regression cost function using NumPy:
import numpy as np
import matplotlib.pyplot as plt

def sigmoid(z):
    """
    Compute the sigmoid of z

    Parameters
    ----------
    z : array_like
        A scalar or numpy array of any size.

    Returns
    -------
    g : array_like
        sigmoid(z)
    """
    # clip is used just to protect against overflow
    z = np.clip(z, -500, 500)
    g = 1.0 / (1.0 + np.exp(-z))

    return g

def compute_cost_logistic(X, y, w, b):
    """
    Computes cost

    Args:
        X (ndarray (m,n)): Data, m examples with n features
        y (ndarray (m,)) : target values
        w (ndarray (n,)) : model parameters
        b (scalar)       : model parameter

    Returns:
        cost (scalar): cost
    """
    m = X.shape[0]
    cost = 0.0
    # iterate over each training example
    for i in range(m):
        z_i = np.dot(X[i], w) + b
        f_wb_i = sigmoid(z_i)
        cost += -y[i] * np.log(f_wb_i) - (1 - y[i]) * np.log(1 - f_wb_i)
    # divide by the number of training examples
    cost = cost / m

    return cost
X_train = np.array([[0.5, 1.5], [1,1], [1.5, 0.5], [3, 0.5], [2, 2], [1, 2.5]]) #(m,n)
y_train = np.array([0, 0, 0, 1, 1, 1])
w_array1 = np.array([1,1])
b_1 = -3
w_array2 = np.array([1,1])
b_2 = -4
print("Cost for b = -3 : ", compute_cost_logistic(X_train, y_train, w_array1, b_1))
print("Cost for b = -4 : ", compute_cost_logistic(X_train, y_train, w_array2, b_2))
Cost for b = -3 : 0.36686678640551745
Cost for b = -4 : 0.5036808636748461
The gradient descent functions here are almost identical to the ones for linear regression, except that we use $j$ as a subscript to denote the feature. Also, keep in mind that the definition of $f_{w,b}(x^{(i)})$ is now different: it's the sigmoid of the linear equation rather than the linear equation itself. The gradients are:

$$\frac{\partial J(w,b)}{\partial w_j} = \frac{1}{m} \sum_{i=1}^{m} \left(f_{w,b}(x^{(i)}) - y^{(i)}\right) x_j^{(i)}$$

$$\frac{\partial J(w,b)}{\partial b} = \frac{1}{m} \sum_{i=1}^{m} \left(f_{w,b}(x^{(i)}) - y^{(i)}\right)$$

It's the same concept other than that. We apply simultaneous updates on each pass of these functions, and our goal is to have gradient descent converge to a global minimum.
So let's apply this to some code! :)
import copy, math
import numpy as np

def sigmoid(z):
    """
    Compute the sigmoid of z

    Parameters
    ----------
    z : array_like
        A scalar or numpy array of any size.

    Returns
    -------
    g : array_like
        sigmoid(z)
    """
    z = np.clip(z, -500, 500)  # protect against overflow
    g = 1.0 / (1.0 + np.exp(-z))

    return g

def compute_gradient_logistic(X, y, w, b):
    """
    Computes the gradient for logistic regression

    Args:
        X (ndarray (m,n)): Data, m examples with n features
        y (ndarray (m,)) : target values
        w (ndarray (n,)) : model parameters
        b (scalar)       : model parameter

    Returns:
        dj_dw (ndarray (n,)): The gradient of the cost w.r.t. the parameters w.
        dj_db (scalar)      : The gradient of the cost w.r.t. the parameter b.
    """
    m, n = X.shape
    dj_dw = np.zeros((n,))                        # (n,)
    dj_db = 0.

    for i in range(m):
        f_wb_i = sigmoid(np.dot(X[i], w) + b)     # (n,)(n,) = scalar
        err_i = f_wb_i - y[i]                     # scalar
        for j in range(n):
            dj_dw[j] = dj_dw[j] + err_i * X[i, j] # scalar
        dj_db = dj_db + err_i
    dj_dw = dj_dw / m                             # (n,)
    dj_db = dj_db / m                             # scalar

    return dj_db, dj_dw

def gradient_descent(X, y, w_in, b_in, alpha, num_iters):
    """
    Performs batch gradient descent

    Args:
        X (ndarray (m,n))  : Data, m examples with n features
        y (ndarray (m,))   : target values
        w_in (ndarray (n,)): Initial values of model parameters
        b_in (scalar)      : Initial value of model parameter
        alpha (float)      : Learning rate
        num_iters (scalar) : number of iterations to run gradient descent

    Returns:
        w (ndarray (n,))   : Updated values of parameters
        b (scalar)         : Updated value of parameter
        J_history (list)   : Cost at each iteration, for graphing later
    """
    # A list to store cost J at each iteration, primarily for graphing later
    J_history = []
    w = copy.deepcopy(w_in)  # avoid modifying global w within function
    b = b_in

    for i in range(num_iters):
        # Calculate the gradient and update the parameters
        dj_db, dj_dw = compute_gradient_logistic(X, y, w, b)

        # Update parameters using w, b, alpha and the gradient
        w = w - alpha * dj_dw
        b = b - alpha * dj_db

        # Save cost J at each iteration (compute_cost_logistic was defined earlier)
        if i < 100000:  # prevent resource exhaustion
            J_history.append(compute_cost_logistic(X, y, w, b))

        # Print the cost at intervals, 10 times over the run (or every iteration if num_iters < 10)
        if i % math.ceil(num_iters / 10) == 0:
            print(f"Iteration {i:4d}: Cost {J_history[-1]} ")

    return w, b, J_history  # return final w, b and J history for graphing
Whew! Now that we have the functions written, let's run gradient descent for 10,000 iterations and see how it performs. Remember, the whole point of running gradient descent is to find the weights and bias that we'll use in our model.
w_tmp = np.zeros_like(X_train[0])
b_tmp = 0.
alph = 0.9
iters = 10000
w_out, b_out, _ = gradient_descent(X_train, y_train, w_tmp, b_tmp, alph, iters)
print(f"\nupdated parameters: w:{w_out}, b:{b_out}")
Iteration 0: Cost 0.6509898706978229
Iteration 1000: Cost 0.01898509708803807
Iteration 2000: Cost 0.009462945855308616
Iteration 3000: Cost 0.006299017770009604
Iteration 4000: Cost 0.004720359852320092
Iteration 5000: Cost 0.003774414835001944
Iteration 6000: Cost 0.0031443437189356085
Iteration 7000: Cost 0.0026945747305561125
Iteration 8000: Cost 0.0023574030434657897
Iteration 9000: Cost 0.0020952495092537446
updated parameters: w:[8.21 8.01], b:-22.305241709195236
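Before moving on to predictions, here's an optional sketch (assuming X_train, y_train, and the learned w_out and b_out from the cells above are still in scope) that plots the training points along with the learned decision boundary, i.e. the line where $w \cdot x + b = 0$:

```python
import numpy as np
import matplotlib.pyplot as plt

# Plot the training points, marked by their labels
fig, ax = plt.subplots(1, 1, figsize=(5, 4))
ax.scatter(X_train[y_train == 0, 0], X_train[y_train == 0, 1], marker='o', label='y = 0')
ax.scatter(X_train[y_train == 1, 0], X_train[y_train == 1, 1], marker='x', label='y = 1')

# On the boundary, w0*x0 + w1*x1 + b = 0, so x1 = -(w0*x0 + b) / w1
x0_vals = np.linspace(0, 3.5, 50)
x1_vals = -(w_out[0] * x0_vals + b_out) / w_out[1]
ax.plot(x0_vals, x1_vals, c='g', label='decision boundary')

ax.set_xlabel('$x_0$')
ax.set_ylabel('$x_1$')
ax.legend()
plt.show()
```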
Now comes the easy part: all we need to do to make our prediction is apply the sigmoid function. Remember, the z value in the sigmoid is the dot product of the input features and the weights we calculated earlier, plus the bias.
def predict(X, w, b):
    """
    Predict whether the label is 0 or 1 using learned logistic
    regression parameters w and b

    Args:
        X : (ndarray Shape (m,n)) data, m examples by n features
        w : (ndarray Shape (n,))  values of parameters of the model
        b : (scalar)              value of bias parameter of the model

    Returns:
        p : (ndarray (m,)) The predictions for X using a threshold at 0.5
    """
    # number of examples and features
    m, n = X.shape
    p = np.zeros(m)

    # Loop over each example
    for i in range(m):
        f_wb_i = sigmoid(np.dot(X[i], w) + b)
        # Apply the threshold
        p[i] = 1 if f_wb_i >= 0.5 else 0

    return p
Now we just call the function on some test inputs:
# Test your predict code
np.random.seed(1)
tmp_w = np.random.randn(2)
tmp_b = 0.3
tmp_X = np.random.randn(4, 2) - 0.5
tmp_p = predict(tmp_X, tmp_w, tmp_b)
print(f'Output of predict: shape {tmp_p.shape}, value {tmp_p}')
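And to close the loop on the earlier training example, here's a quick sanity check (again assuming X_train, y_train, w_out, and b_out are still in scope) that runs predict against the data we trained on:

```python
# Compare predictions on the training set with the true labels
p_train = predict(X_train, w_out, b_out)
print(f"Predictions: {p_train}")
print(f"Labels:      {y_train}")
print(f"Training accuracy: {np.mean(p_train == y_train) * 100:.0f}%")
```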
And there we have it: we've implemented logistic regression. To reiterate, we couldn't directly use linear regression (which I posted on last week) for classification, but we can use logistic regression, which builds on the same idea.