The softmax activation function is used in neural networks to build a multi-class classifier, that is, a model that assigns an instance to exactly one class when there are more than two possible classes.

In this article, I am going to explain why we use softmax and how it works.


We use softmax as the output function of the last layer of a neural network (if the network has n layers, the n-th layer is the softmax function). This matters because the purpose of the last layer is to turn the scores produced by the network into values that humans can interpret.

To understand the softmax function, we must look at the output of the (n-1)th layer. In this layer, the inputs are multiplied by weights, passed through an activation function, and aggregated into a vector that contains one value for every class of the model.

For the sake of this example, we may interpret the scores produced by layer n-1 as the number of votes in a weighted voting system in which some neurons are more important than others.
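The weighted-voting picture can be sketched in a few lines of NumPy. This is only an illustration: the function name `layer_scores`, the input values, the weights, and the choice of ReLU as the activation function are all my assumptions, not something fixed by the description above.

```python
import numpy as np

def layer_scores(inputs, weights, bias):
    """Return one score per class: a weighted sum followed by ReLU.

    ReLU is an assumed activation; any other activation could be used here.
    """
    return np.maximum(0.0, weights @ inputs + bias)

# Hypothetical values: 3 input neurons voting for 3 classes.
inputs = np.array([1.0, 2.0, 0.5])
weights = np.array([
    [1.0, 0.0, 2.0],   # how much each neuron counts for class 0
    [0.5, 1.0, 0.0],   # ... for class 1
    [0.0, 0.0, 1.0],   # ... for class 2
])
bias = np.zeros(3)

scores = layer_scores(inputs, weights, bias)
print(scores)
```

Each row of `weights` decides how strongly every neuron "votes" for one class, and the resulting vector of scores is what gets passed to softmax.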

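Arg max

Here comes the tricky part. The softmax function is, in fact, a smooth ("soft") version of the arg max function. Arg max does not return the largest value from the input, but the position of the largest value. Softmax does not pick a single position; it assigns every position a probability of being the one that holds the largest value.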

We interpret the result of the softmax function as the probability of the class, so the softmax function works in the following way:

Given a vector of numbers (the scores from the (n-1)th layer), it returns, for every position i, the probability that the largest value is the i-th element of the vector. For example, if I have an input:

X = [13, 31, 5]

and pass it to the softmax function, I am going to get:

array([1.52299795e-08, 9.99999985e-01, 5.10908895e-12])

That is a 99.9999985% probability that the largest value in the array is in the second position.
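The calculation above can be reproduced with a short NumPy sketch. I am assuming the standard definition of softmax here (exponentiate every score and divide by the sum of the exponentials). The function name `softmax` and the trick of subtracting the maximum before exponentiating are my choices for this sketch, not necessarily how the output above was produced.

```python
import numpy as np

def softmax(scores):
    """Turn a vector of scores into probabilities that sum to 1."""
    scores = np.asarray(scores, dtype=np.float64)
    # Subtracting the maximum does not change the result,
    # but it keeps np.exp from overflowing on large scores.
    exps = np.exp(scores - scores.max())
    return exps / exps.sum()

X = [13, 31, 5]
probabilities = softmax(X)

print(probabilities)       # one probability per position
print(np.argmax(X))        # arg max: only the position of the largest value
print(probabilities.sum())
```

Arg max answers with a single index (1, the second position), while softmax spreads the answer over all positions and puts almost all of the probability on that same index.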

Why not 100%?

That may look strange. I mean, we see what the largest value is. Why didn’t it return 100%?

To understand it, we must remember that neural networks are universal function approximators. We can build a neural network that approximates the value of any mathematical function, but that is just an approximation, not an exact result. We use softmax to embrace that uncertainty and turn it into a probability interpretable by people. Mathematically, softmax exponentiates every score and divides it by the sum of all the exponentials; because the exponential of any number is positive, even the weakest class keeps a tiny non-zero share and the winner never reaches exactly 100%.

If you ever see 1 as the result of the softmax function, it is most likely caused by the limitations of floating-point arithmetic, which cannot represent some values exactly in computer memory.
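To see how the gap between the scores controls the confidence, here is a small sketch that compares the input used earlier with a second vector in which the scores are close together. The vector `[1, 2, 3]` is a made-up example of mine, and `softmax` is a minimal implementation of the standard definition.

```python
import numpy as np

def softmax(scores):
    """Standard softmax: exp of each score divided by the sum of exps."""
    scores = np.asarray(scores, dtype=np.float64)
    exps = np.exp(scores - scores.max())  # shift by max for numerical safety
    return exps / exps.sum()

far_apart = softmax([13, 31, 5])   # one score dominates the others
close = softmax([1, 2, 3])         # hypothetical scores that are close together

print(far_apart)
print(close)
```

When one score is far ahead, nearly all of the probability goes to it. When the scores are close, the probability is shared: for `[1, 2, 3]` the largest score gets only about 67%. In both cases every class keeps a probability greater than zero.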

softmax([0, 100, 0])
# array([3.72007598e-44, 1.00000000e+00, 3.72007598e-44])
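The rounding effect can be checked directly. The sketch below assumes 64-bit floats (NumPy's default) and uses my own minimal `softmax` implementation. It shows that the two small probabilities are still greater than zero, but their sum is far below the smallest step a float64 can represent next to 1, so the middle value is stored as exactly 1.

```python
import numpy as np

def softmax(scores):
    """Standard softmax computed in float64."""
    scores = np.asarray(scores, dtype=np.float64)
    exps = np.exp(scores - scores.max())
    return exps / exps.sum()

probs = softmax([0, 100, 0])

# The losing classes still get a tiny, non-zero probability.
print(probs[0] > 0)

# Their combined share is much smaller than the float64 machine epsilon,
# so "1 minus that share" cannot be told apart from 1.
eps = np.finfo(np.float64).eps
print(probs[0] + probs[2] < eps)

# The stored value is therefore exactly 1.0.
print(probs[1] == 1.0)
```

All three checks print `True`: the exact mathematical result is slightly below 1, but the difference is lost when the number is stored.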