In my previous post introducing neural networks, I covered the basics: their structure, activation functions, and a brief introduction to TensorFlow. Now we can build on those concepts and explore some more advanced topics in neural network architecture and training.
Let's begin by revisiting the process of building a neural network, comparing implementations in TensorFlow and NumPy. This will help solidify our understanding of the underlying operations before we move on to more complex topics.
TensorFlow, with its high-level Keras API (remember that Keras was originally a separate ML library that Google eventually integrated into TensorFlow), makes it straightforward to build a neural network. Notice that during the model compilation step we're using Adam, an optimizer we will define in more detail later. The name BinaryCrossentropy for the loss function has its roots in statistics: "binary" emphasizes that there are two possible classes, and "cross-entropy" is a statistical term for the discrepancy between the true value and the predicted value.
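For reference, for a single training example with true label $y \in \{0, 1\}$ and predicted probability $a$, the binary cross-entropy loss is the familiar logistic loss:
$$ L(a, y) = -\left[ y \log(a) + (1 - y) \log(1 - a) \right] $$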
import tensorflow as tf
from tensorflow.keras.layers import Dense, Input

# Define the model
model = tf.keras.Sequential([
    tf.keras.Input(shape=(2,), batch_size=32),
    Dense(3, activation='sigmoid', name='layer1'),
    Dense(1, activation='sigmoid', name='layer2')
])

# Compile the model
model.compile(
    loss=tf.keras.losses.BinaryCrossentropy(),
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.01),
)

# Train the model
model.fit(
    X_train, y_train,
    epochs=10,
)
This TensorFlow code defines a simple neural network with one hidden layer (3 neurons) and an output layer (1 neuron), both using sigmoid activation functions.
Now, let's look at how we might implement the same network structure using NumPy:
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def forward_pass(X, W1, b1, W2, b2):
    # Layer 1 (Dense layer with sigmoid activation)
    Z1 = np.dot(W1, X) + b1
    A1 = sigmoid(Z1)
    # Layer 2 (Output layer with sigmoid activation)
    Z2 = np.dot(W2, A1) + b2
    A2 = sigmoid(Z2)
    return A2  # This is the output of the forward pass

# Initialize weights and biases
W1 = np.random.randn(3, 2)  # 3x2 for 3 neurons in hidden layer, 2 input features
b1 = np.zeros((3, 1))
W2 = np.random.randn(1, 3)  # 1x3 for 1 output neuron, 3 neurons in hidden layer
b2 = np.zeros((1, 1))

# Forward pass (X_train should have shape (2, m): features in rows, examples in columns)
output = forward_pass(X_train, W1, b1, W2, b2)

# Note: This doesn't include the training process, which would involve
# implementing backpropagation and gradient descent
This NumPy implementation explicitly shows the matrix operations happening in each layer of the network.
While the TensorFlow implementation abstracts away many details, the NumPy implementation gives us a clearer picture of what's happening "under the hood". In both cases, the data flows through the same sequence of linear transformations and sigmoid activations.
The key difference is that TensorFlow handles much of the complexity for us, including the backpropagation (more on this later) and optimization processes during training.
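To make "backpropagation and gradient descent" a bit more concrete, here is a minimal sketch of a single full-batch training step for the two-layer network above, assuming binary cross-entropy loss and the same (2, m) / (1, m) data layout as the forward pass. The function name train_step and the learning_rate default are illustrative, not from any library:
def train_step(X, y, W1, b1, W2, b2, learning_rate=0.01):
    m = X.shape[1]  # number of training examples (columns)

    # Forward pass (same as forward_pass above, but keeping intermediates)
    Z1 = np.dot(W1, X) + b1
    A1 = sigmoid(Z1)
    Z2 = np.dot(W2, A1) + b2
    A2 = sigmoid(Z2)

    # Backward pass: gradients of binary cross-entropy w.r.t. each parameter
    dZ2 = A2 - y                                   # (1, m)
    dW2 = np.dot(dZ2, A1.T) / m                    # (1, 3)
    db2 = np.sum(dZ2, axis=1, keepdims=True) / m   # (1, 1)
    dZ1 = np.dot(W2.T, dZ2) * A1 * (1 - A1)        # (3, m), sigmoid derivative
    dW1 = np.dot(dZ1, X.T) / m                     # (3, 2)
    db1 = np.sum(dZ1, axis=1, keepdims=True) / m   # (3, 1)

    # Gradient descent update
    W1 -= learning_rate * dW1
    b1 -= learning_rate * db1
    W2 -= learning_rate * dW2
    b2 -= learning_rate * db2
    return W1, b1, W2, b2
Calling this function in a loop over many epochs is, in essence, what model.fit does for us.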
We previously discussed the sigmoid activation function and its use in binary classification. But when it comes to activation functions, sigmoid is not the only game in town.
The sigmoid function, defined as
$$ g(z) = \frac{1}{1+e^{-z}} $$
maps input to a value between 0 and 1. This makes it particularly useful for binary classification tasks where we want to interpret the output as a probability.
Key characteristics of the sigmoid function: its output always lies between 0 and 1, it is smooth and differentiable everywhere, and it saturates (flattens out) for large positive or negative inputs, which can make gradients very small.
ReLU, defined as
$$ f(x) = \max(0, x) $$
has become the go-to activation function for hidden layers in modern deep learning models.
ReLU is a simple function, but it has a useful property: it flattens out in only one region (for negative inputs), as shown in the image below, whereas the sigmoid saturates at both extremes. Fewer flat regions mean fewer places where the gradient is close to zero, so gradient descent makes progress more consistently and training tends to converge faster.
The linear activation function (often described as using "no activation function") is what we're used to from linear regression: the output is simply
$$ z = \vec{w} \cdot \vec{x} + b $$
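For comparison, here is a quick sketch of all three activation functions in NumPy (simple illustrative helpers, not library functions):
import numpy as np

def sigmoid(z):
    # Squashes any real number into the range (0, 1)
    return 1 / (1 + np.exp(-z))

def relu(z):
    # Passes positive values through unchanged, zeroes out negatives
    return np.maximum(0, z)

def linear(z):
    # "No activation": the output is just the weighted sum itself
    return z

z = np.array([-2.0, 0.0, 2.0])
print(sigmoid(z))  # roughly [0.12, 0.5, 0.88]
print(relu(z))     # [0., 0., 2.]
print(linear(z))   # [-2., 0., 2.]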
While our previous post focused on binary classification, many real-world problems involve multiple classes. This is where the softmax function comes in.
The softmax function takes a vector of arbitrary real-valued scores and squashes it to a vector of values between 0 and 1 that sum to 1. This makes it perfect for multi-class classification where we want to interpret the outputs as probabilities.
The softmax function is defined as:
$$ a_j = \frac{e^{z_j}}{\sum_{k=1}^{N} e^{z_k}} = P(y = j | \vec{x}) $$
where
$$ z_j = \vec{w}_j \cdot \vec{x} + b_j \quad j = 1, \dots, N $$
Suppose we have 4 possible outputs with probabilities $a_1$ through $a_4$. Using the softmax, these four probabilities must sum to 1. Below are the expressions for all four; notice the denominator is the same in each.
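Written out from the general softmax formula above:
$$ a_1 = \frac{e^{z_1}}{e^{z_1} + e^{z_2} + e^{z_3} + e^{z_4}}, \quad a_2 = \frac{e^{z_2}}{e^{z_1} + e^{z_2} + e^{z_3} + e^{z_4}} $$
$$ a_3 = \frac{e^{z_3}}{e^{z_1} + e^{z_2} + e^{z_3} + e^{z_4}}, \quad a_4 = \frac{e^{z_4}}{e^{z_1} + e^{z_2} + e^{z_3} + e^{z_4}} $$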
One important takeaway is that if we use a softmax regression model on only two possible outcomes, we arrive at the same answer as using the sigmoid. This means that logistic regression is really just a special case of softmax regression.
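To see why, take $N = 2$ and look at the probability of the first class:
$$ a_1 = \frac{e^{z_1}}{e^{z_1} + e^{z_2}} = \frac{1}{1 + e^{z_2 - z_1}} = \frac{1}{1 + e^{-(z_1 - z_2)}} $$
which is the sigmoid applied to the difference $z_1 - z_2$, and the second probability is simply $1 - a_1$, exactly as in logistic regression.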
Switching over to the loss and cost function, the loss now becomes a piecewise function: we take the negative log of whichever softmax probability corresponds to the true class.
$$ loss(a_1, \dots, a_N, y) = \begin{cases} -\log a_1 & \text{if } y = 1 \\ -\log a_2 & \text{if } y = 2 \\ \vdots & \\ -\log a_N & \text{if } y = N \end{cases} $$
Here's a simple implementation of the softmax function:
def my_softmax(z):
    a = np.zeros_like(z, dtype=float)  # ensure a float result even if z holds integers
    n = len(z)
    for j in range(n):
        denominator = 0
        for k in range(n):
            denominator += np.exp(z[k])
        a[j] = np.exp(z[j]) / denominator
    return a
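As a quick sanity check (the scores below are illustrative values only), we can run my_softmax on a small vector of scores and compute the piecewise loss above for a given true class:
z = np.array([2.0, 1.0, 0.1, -1.0])
a = my_softmax(z)
print(a)         # roughly [0.64, 0.23, 0.10, 0.03]
print(a.sum())   # ~1.0

y = 0                  # index of the true class (0-based here; the math above uses 1-based labels)
loss = -np.log(a[y])   # the piecewise loss picks out -log of the true class's probability
print(loss)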
In TensorFlow, we can easily incorporate softmax into our model:
from tensorflow.keras.models import Sequential

model = Sequential(
    [
        Dense(25, activation='relu'),
        Dense(15, activation='relu'),
        Dense(4, activation='softmax')
    ]
)
model.compile(
    loss=tf.keras.losses.SparseCategoricalCrossentropy(),
    optimizer=tf.keras.optimizers.Adam(0.001),
)
model.fit(
    X_train, y_train,
    epochs=10
)
Notice that we've now used SparseCategoricalCrossentropy, the loss used for multiclass classification when the labels are provided as integer class indices (0, 1, 2, ...) rather than one-hot vectors.
In our previous TensorFlow example, we used the Adam optimizer. Adam (Adaptive Moment Estimation) is an advanced optimization algorithm that adapts the learning rate for each parameter.
Adam keeps track of moving averages of past gradients (and their squares) and uses them to adapt a separate learning rate for each parameter. This often leads to faster convergence and better performance across a wide range of tasks.
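As a rough sketch of the idea (not TensorFlow's actual implementation, and assuming the standard default hyperparameters), the Adam update for a single parameter array looks roughly like this:
import numpy as np

def adam_update(w, grad, m, v, t, learning_rate=0.001,
                beta1=0.9, beta2=0.999, eps=1e-8):
    # t is the 1-based step count; m and v start as zero arrays shaped like w
    m = beta1 * m + (1 - beta1) * grad       # moving average of gradients (first moment)
    v = beta2 * v + (1 - beta2) * grad**2    # moving average of squared gradients (second moment)
    m_hat = m / (1 - beta1**t)               # bias correction for the early steps
    v_hat = v / (1 - beta2**t)
    # Per-parameter step: directions with large v_hat (noisy or steep) get smaller steps
    w = w - learning_rate * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v
TensorFlow maintains m, v, and the step counter t for every parameter automatically; all we supply is the learning rate.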
While we previously focused on dense (fully connected) layers, convolutional layers offer significant advantages for certain types of data, particularly in image processing tasks.
Convolutional layers work by having each neuron look at only a limited region (a window) of the input rather than the whole thing. This approach has several advantages: computation is faster, far fewer parameters are needed (so less training data is required), and the network can be less prone to overfitting.
Here's an example of how a CNN might be used to analyze EKG data for heart disease prediction:
CNN architecture for EKG data analysis. This diagram shows a neural network with two convolutional hidden layers and a final output layer using the sigmoid activation function for binary classification of heart disease indicators.
In this example, each unit in the first convolutional hidden layer looks only at a small window of the EKG time series, units in the second convolutional layer each look at a window of the first layer's activations, and the final sigmoid unit outputs the probability that the EKG indicates heart disease.
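A Keras sketch of this kind of architecture might look like the following. The layer sizes, kernel sizes, and the 200-sample EKG window are illustrative assumptions, not values taken from the diagram:
import tensorflow as tf
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Input, Conv1D, Flatten, Dense

ekg_model = Sequential([
    Input(shape=(200, 1)),                        # 200 EKG samples, 1 channel (assumed)
    Conv1D(8, kernel_size=5, activation='relu'),  # each unit sees a window of 5 samples
    Conv1D(4, kernel_size=3, activation='relu'),  # windows over the previous layer's activations
    Flatten(),
    Dense(1, activation='sigmoid')                # probability of heart disease
])
ekg_model.compile(
    loss=tf.keras.losses.BinaryCrossentropy(),
    optimizer=tf.keras.optimizers.Adam(0.001),
)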
When working with neural networks, especially deep ones, numerical accuracy becomes crucial. Roundoff errors due to how computers store floating-point numbers can accumulate over many computations.
To address this, particularly in large or deep models, we can work with logits. Instead of applying the softmax activation in the output layer, we output raw scores (logits) by specifying a linear activation, let the loss function handle the softmax internally in a more numerically stable way, and apply softmax afterwards whenever we need probabilities.
In TensorFlow, this looks like:
model = Sequential(
    [
        Dense(25, activation='relu'),
        Dense(15, activation='relu'),
        Dense(4, activation='linear')  # Note: linear activation here
    ]
)
model.compile(
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    optimizer=tf.keras.optimizers.Adam(0.001),
)
When predicting, we then need to apply softmax separately:
logits = model.predict(X_test)
probabilities = tf.nn.softmax(logits).numpy()
This approach can lead to more stable and accurate results, especially in deep networks.
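If we want hard class predictions rather than probabilities, we can then take the index of the largest probability (equivalently, of the largest logit) for each example:
import numpy as np

predicted_classes = np.argmax(probabilities, axis=1)  # one class index per example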
While we've discussed binary and multiclass classification, there's another important type of classification task: multilabel classification. In multilabel classification, each input can be associated with multiple output labels simultaneously. For example, a movie can be both action and comedy.
One way to deal with this is simply to train a separate neural network for each label, but that may or may not be viable depending on the number of labels.
In multilabel classification, each output is an independent binary (yes/no) decision, so several labels can be active for the same input at once.
A classic example of multilabel classification is image tagging. An image might contain multiple objects, and the model needs to identify all of them.
[IMAGE PLACEHOLDER: image-57.png] Figure 2: Illustration of multilabel classification. A single input can be associated with multiple labels, in contrast to multiclass classification, where each input belongs to exactly one class.
Let's consider an example of multilabel classification for image recognition:
Example of multilabel classification in image recognition. This image demonstrates how a single image might be classified with multiple labels such as "Car", "Bus", and "Pedestrian".
In this example, the model needs to detect the presence of multiple objects in the image. Each output label (Car, Bus, Pedestrian) is treated as a separate binary classification problem.
For multilabel classification, we typically use a neural network with one output unit per label, each using a sigmoid activation so that every label gets its own independent probability.
Neural network structure for multilabel classification. This diagram shows how the output layer of a neural network is structured for multilabel classification, with each output unit corresponding to a different label.
The key differences in the network architecture for multilabel classification are that the output layer has one sigmoid unit per label (rather than a single softmax spread across mutually exclusive classes) and that the loss is binary cross-entropy applied to each output independently.
Let's consider a practical example of multilabel classification using a movie genre prediction task. In this scenario, a movie can belong to multiple genres simultaneously (e.g., a film could be both "Action" and "Comedy").
Here's a simple implementation:
import numpy as np
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

# Sample data: movie features (e.g., keywords, runtime, release year)
X = np.array([
    [1, 120, 2020],
    [0, 90, 2019],
    [1, 110, 2021],
    [0, 100, 2018]
])

# Labels: each movie can belong to multiple genres
# Genres: Action, Comedy, Drama, Sci-Fi
y = np.array([
    [1, 1, 0, 0],  # Action and Comedy
    [0, 1, 1, 0],  # Comedy and Drama
    [1, 0, 0, 1],  # Action and Sci-Fi
    [0, 0, 1, 0]   # Only Drama
])

# Define the model
model = Sequential([
    Dense(8, activation='relu', input_shape=(3,)),
    Dense(4, activation='sigmoid')  # 4 output neurons for 4 genres
])

# Compile the model
model.compile(
    loss=tf.keras.losses.BinaryCrossentropy(),
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.01),
)

# Train the model
model.fit(X, y, epochs=100, verbose=0)

# Make predictions
new_movie = np.array([[1, 105, 2023]])  # New movie features
predictions = model.predict(new_movie)

print("Genre Predictions:")
genres = ['Action', 'Comedy', 'Drama', 'Sci-Fi']
for genre, pred in zip(genres, predictions[0]):
    print(f"{genre}: {pred:.2f}")
In this example, each of the four output neurons uses a sigmoid activation and the model is trained with binary cross-entropy, so every genre gets its own independent probability between 0 and 1 rather than the four outputs having to sum to 1.
The output might look something like this:
Genre Predictions:
Action: 0.72
Comedy: 0.31
Drama: 0.15
Sci-Fi: 0.58
This output indicates that the model predicts the new movie is likely to be an Action and Sci-Fi film, with a lower probability of being a Comedy, and unlikely to be a Drama.
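To turn these probabilities into actual genre labels, a common approach (assumed here; the cutoff can be tuned) is to apply a threshold such as 0.5 to each output independently:
threshold = 0.5
predicted_genres = [genre for genre, pred in zip(genres, predictions[0]) if pred >= threshold]
print(predicted_genres)  # e.g. ['Action', 'Sci-Fi'] for the output above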
This example demonstrates how multilabel classification allows a single input (a movie in this case) to be associated with multiple output labels (genres) simultaneously, each with its own probability.
By building on our previous introduction to neural networks, we've now explored more advanced concepts like multiclass classification with softmax, the Adam optimizer, convolutional neural networks, and techniques for improving numerical stability.
These concepts form the foundation for many state-of-the-art neural network architectures used in industry today. As you continue your journey in deep learning, you'll find these concepts recurring and combining in novel ways to solve increasingly complex problems.
Remember, the field of neural networks and deep learning is vast and rapidly evolving. Keep experimenting, stay curious, and never stop learning!