Most Intuitive Explanation of Activation Functions

Shashwat Tiwari · Published in Good Audience · Apr 14, 2019


Hi Folks!

Neurons in the human brain fire a lot of signals when they get excited, and the rate at which those signals are emitted tells us the intensity of the original stimulus. Human actions are tightly coupled to this firing: the stronger the stimulus, the higher the firing frequency and the more likely an action is taken.

The whole idea is that these action potentials can be thought of as activation functions in a neural network. Which path gets fired depends on the activation functions in the preceding layers, just as a physical movement depends on the action potentials at the neuron level.

Image source: https://ask.learncbse.in/t/what-is-a-neuron-write-the-structure-and-functions-of-a-neuron/26157

Neural networks are trained by updating and adjusting the neurons' weights and biases, typically using stochastic gradient descent together with the back-propagation algorithm in a supervised learning setup.

Each neuron in an artificial neural network has one or more inputs, such as x1, x2, x3, and produces an output value y that is passed to the next layer. Activation functions are what introduce non-linearity into the network: their job is to convert the input signal of a node into an output signal, and that output signal is then used as the input to the next layer in the stack.

Concretely, in an ANN we first compute the sum of the products of the inputs with their corresponding weights W (plus a bias), then apply the activation function to get the output of the layer, and feed that output as input to the next layer.
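As a rough sketch of that flow (the input values, weights, bias, and the choice of sigmoid as the activation are illustrative, not taken from the post):

import numpy as np

def sigmoid(z):
    # a common activation: squashes any real number into (0, 1)
    return 1 / (1 + np.exp(-z))

x = np.array([1.0, 2.0, 3.0])    # inputs x1, x2, x3
w = np.array([0.4, -0.2, 0.1])   # corresponding weights W
b = 0.5                          # bias

z = np.dot(w, x) + b             # weighted sum of inputs
a = sigmoid(z)                   # activation converts the sum into the node's output
print(a)                         # this output feeds the next layer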

The Need for Activation Functions

Before jumping into why we need activation functions, let's talk about linear functions. In mathematics, a linear function is just a polynomial function of degree one. Linear functions are very limited for solving complex problems and have little ability to learn complicated mappings from data. An ANN without an activation function behaves like simple linear regression, no matter how many layers it has, and such a network cannot learn complex data such as images, videos, audio, and speech.
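To make this limitation concrete, stacking layers without an activation in between collapses into a single linear layer. The small numpy demo below (with made-up weight matrices) illustrates this:

import numpy as np

# two made-up linear "layers" with no activation between them
W1, b1 = np.array([[1.0, 2.0], [0.0, 1.0]]), np.array([0.5, -0.5])
W2, b2 = np.array([[2.0, -1.0]]), np.array([0.1])

x = np.array([3.0, 4.0])

two_layers = W2 @ (W1 @ x + b1) + b2    # layer 2 applied on top of layer 1

# the same mapping expressed as one combined linear layer
W, b = W2 @ W1, W2 @ b1 + b2
one_layer = W @ x + b

print(two_layers, one_layer)            # identical results: [19.6] [19.6]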

Desirable Properties of Activation Functions

1. Non-Linearity

We discussed earlier that the purpose of an activation function is to introduce non-linearity into your deep learning network. It lets your model's target variable vary non-linearly with the independent variables; non-linear here means that the output cannot be reproduced from a linear combination of the inputs.

2. Monotonic

When the activation function is monotonic, the error surface associated with a single-layer model is guaranteed to be convex, which makes training easier to converge.

3. Range

The range of an activation function can be finite or infinite, and this matters for gradient-based training methods. When the range is finite, training tends to be more stable, because each pattern presentation significantly affects only a limited set of weights. When the range is infinite, a single presentation can significantly affect most of the weights, so smaller learning rates are usually needed.

4. Continuously differentiable

This property is desirable for gradient-based optimization. The binary step activation function, for example, is not differentiable at 0, and its derivative is 0 everywhere else, so gradient-based methods can make no progress with it.

In the rest of this post, we will take a slightly different approach: we will walk through some of the popular activation functions and use small wrapper functions to inspect the output each one produces.

Intuitive Explanation of Different Activation Functions

Sigmoid - Key Points

  • The sigmoid activation function is one of the most widely used activation functions. It is defined as sigmoid(x) = 1 / (1 + e^(-x)).
  • Basically, Sigmoid Function is a smooth and continuously differentiable function.
  • One cool feature the sigmoid has over the step function and the linear function is that it is non-linear in nature.
  • The sigmoid function gives an ‘S’ shaped curve. This curve has a finite limit of ‘0’ as x approaches −∞ ,‘1’ as x approaches +∞.
  • Besides its nice features, it also has some drawbacks. The main one is that sigmoid values only range from 0 to 1, which means the function is not symmetric around the origin and all of its outputs are positive.
  • To get a clearer picture of the sigmoid's output, the code below uses a small wrapper for playing with it. The sigmoid function takes a numpy array as an argument and returns the sigmoid of each element.
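The wrapper's definition is not shown in the post; a minimal sketch consistent with the output below could be:

import numpy as np

def sigmoid(x):
    # element-wise logistic function: 1 / (1 + e^(-x))
    return 1 / (1 + np.exp(-x))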
x = np.array([1,2,3,4,5,6])
x = sigmoid(x)
print(x)
[ 0.73105858 0.88079708 0.95257413 0.98201379 0.99330715 0.99752738]
  • There is no magic here: the sigmoid formula is simply translated into a numpy expression. As you can observe, every value in the output array lies in (0, 1), which is exactly what the sigmoid activation function is meant to do.

ReLU - Key Points

  • ReLU, the Rectified Linear Unit, is one of the most popular activation functions and a favourite of many data scientists and deep learning engineers. The ReLU function is zero for negative values and grows linearly for positive values: relu(x) = max(0, x).
  • ReLU is a non-linear function, so it easily back-propagates errors and allows multiple layers of neurons to be activated through it.
  • So why is ReLU so popular? Because the ReLU activation does not activate all the neurons of a layer at the same time.
  • If we feed negative values into ReLU, they are converted to zero, so only a few neurons are activated. This makes the network sparse, which is efficient and easy to compute; the wrapper below demonstrates this.
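Again, the relu wrapper itself is not shown in the post; a minimal version consistent with the output below could be:

import numpy as np

def relu(x):
    # element-wise max(0, x): negatives become 0, positives pass through
    return np.maximum(0, x)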
x = np.array([[2, -7, 5], [-6, 2, 0]]) 
x = relu(x)
print(x)
[[2 0 5]
 [0 2 0]]
  • As expected, the negative values we fed into the wrapper function are mapped to exactly zero, while the positive values pass through unchanged.

Tanh - Key Points

  • The hyperbolic tangent (tanh) function is very similar to the sigmoid activation. It has the same S-shaped curve but squashes values into the range -1 to 1: tanh(x) = (e^x − e^(−x)) / (e^x + e^(−x)).
  • One of the advantages of tanh is that its outputs are zero-centered, which helps the neurons in the next layer during propagation.
  • As the equation above suggests, when we pass the weighted sum of the inputs through the tanh activation, the values are rescaled to lie between -1 and 1.
  • Large negative numbers are scaled towards -1 and large positive numbers towards 1, as the wrapper below shows.
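The tanh wrapper is not shown in the post; numpy already provides the function, so a minimal version could be:

import numpy as np

def tanh(x):
    # element-wise hyperbolic tangent: (e^x - e^(-x)) / (e^x + e^(-x))
    return np.tanh(x)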
x = np.array([1,2,3,-4,-5,-6])
x = tanh(x)
print(x)
[ 0.76159416 0.96402758 0.99505475 -0.9993293 -0.9999092 -0.99998771]
  • As the output of our wrapper shows, every value now lies between -1 and 1, regardless of the values in the input numpy array.

ArcTan - Key Points

  • The arctan activation function produces output values in the range (−π/2, π/2).
  • With arctan, the derivative converges towards zero only quadratically for large input values. Compare this with the sigmoid, whose derivative converges towards zero exponentially, which can cause problems during back-propagation.
  • ArcTan is often considered preferable to the tanh activation function because it has a better ability to differentiate between similar input values. The wrapper below shows the output range.
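As before, the arctan wrapper is not shown in the post; a minimal version could simply delegate to numpy:

import numpy as np

def arctan(x):
    # element-wise inverse tangent, bounded in (-pi/2, pi/2)
    return np.arctan(x)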
x = np.array([1,2,3,-4,-5,-6])
x = arctan(x)
print(x)
[ 0.78539816 1.10714872 1.24904577 -1.32581766 -1.37340077 -1.40564765]
  • As discussed, the output values all fall within the specified range, i.e. (−π/2, π/2).

Binary Step Activation Function - Key Points

  • The step function is one of the simplest activation functions. It is used for making predictions when we are dealing with a binary classifier.
  • If the prediction we need is a plain "yes" or "no", the binary step activation function is a natural choice.
  • However, the gradient of the step function is zero almost everywhere, which makes it much less useful for training: during back-propagation the gradients squash to zero, so the model's weights barely improve. The step wrapper below shows the binary output it produces.
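The step wrapper is not shown in the post; a minimal version that matches the 0/1 output below (treating 0 as the "off" class) could be:

import numpy as np

def step(x):
    # 1 where x is strictly positive, 0 otherwise
    return (x > 0).astype(int)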
x = np.array([[2, -7, 5], [-6, 2, 0]]) 
x = step(x)
print(x)
[[1 0 1]
[0 1 0]]
  • As we can observe, the output comes in the form of binary classes, i.e. 0 or 1.

Gaussian Activation Function - Key Points

  • The Gaussian activation function comes from a special class of functions known as radial basis functions (RBFs), which are used in RBF networks.
  • These functions are bell-shaped curves that are continuous and smooth.
  • The output of a Gaussian activation node is meant to be interpreted as "1" or "0", depending on how close the input is to a chosen mean value, as the gaussian wrapper below illustrates.
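The gaussian wrapper is not shown in the post; a sketch that reproduces the 0/1 output below applies the bell curve e^(-x^2) and then rounds, where the rounding step is an assumption on my part to match the printed integers:

import numpy as np

def gaussian(x):
    # bell curve centred at 0; rounding to the nearest integer is assumed here
    # so that the output matches the 0/1 values printed below
    return np.round(np.exp(-x ** 2)).astype(int)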
x = np.array([[2, -7, 5], [-6, 2, 0]]) 
x = gaussian(x)
print(x)
[[0 0 0]
[0 0 1]]
  • As expected, the output values can be read as binary values, either "0" or "1".

Which Activation Function to Use? -Big Question for Now!

We have gone through a lot of activation functions, but one thing I would like to point out is that there is no rule of thumb for which activation to use in a neural network.

However, based on the properties of the problem, we may be able to make a better choice and improve the efficiency of our network. Here are some notes that can help with that decision.

  • Sigmoid and tanh activations are generally avoided in deep hidden layers because of the vanishing gradient problem.
  • Sigmoid and related functions work well when dealing with binary classification problems.
  • Keep in mind that ReLU should only be used in the hidden layers of a neural network.
  • We can rely on ReLU for most cases. If we run into the problem of dead neurons in our network, we can switch to Leaky ReLU.
  • Last but not least, you can begin with ReLU and then move on to other activation functions if ReLU does not give you good results.

End Notes

In this post, I have discussed various types of activation functions, along with small wrapper functions for each, to give you a better intuition for how they behave.


If you like this post, please follow me and press that clap button as many times as you think I deserve. If you notice any mistakes in the reasoning, formulas, animations, or code, please let me know.

Also, check out this superb post on the Story Behind the Convolutional Neural Networks (CNN) with PyTorch, Part I.

Cheers!
