The ReLU (Rectified Linear Unit) activation function is differentiable everywhere except at 0, where its graph has a sharp corner and the derivative is therefore undefined.
This non-differentiability at 0 is distinct from the better-known "dying ReLU" problem, which comes from the gradient being exactly zero for all negative inputs: neurons with ReLU activation can become stuck in a state where they always output zero, and because the gradient is zero their weights never get updated, which prevents them from learning.
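To make the zero-gradient region concrete, here is a minimal NumPy sketch of ReLU and its subgradient (the function names and input values are mine, chosen purely for illustration):

    import numpy as np

    def relu(x):
        # ReLU: max(0, x), applied elementwise
        return np.maximum(0.0, x)

    def relu_grad(x):
        # Subgradient convention: 1 for x > 0, 0 for x <= 0
        # (the value used at exactly 0 is a convention, since the true derivative is undefined there)
        return (x > 0).astype(x.dtype)

    x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
    print(relu(x))       # [0.  0.  0.  0.5 2. ]
    print(relu_grad(x))  # [0. 0. 0. 1. 1.]  <- no gradient flows back for x <= 0

Because the gradient is identically zero on the negative side, a neuron whose pre-activations stay negative receives no weight updates, which is exactly the dying-ReLU scenario described above.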
To address the zero gradient on the negative side (and, in some variants, the non-differentiability at 0), several alternatives to ReLU have been proposed:
Leaky ReLU: Leaky ReLU introduces a small slope (usually a small positive constant, like 0.01) for the negative input range. This small gradient allows information to flow through the neuron even when the input is negative, helping to mitigate the "dying ReLU" problem.
Leaky ReLU(x) = x, if x > 0
Leaky ReLU(x) = 0.01x, if x <= 0
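A minimal NumPy sketch of the Leaky ReLU definition above, with the slope hard-coded to the usual 0.01 (the function names are mine, not a standard API):

    import numpy as np

    def leaky_relu(x, slope=0.01):
        # x for positive inputs, slope * x otherwise
        return np.where(x > 0, x, slope * x)

    def leaky_relu_grad(x, slope=0.01):
        # Gradient is 1 on the positive side and `slope` on the negative side,
        # so a small amount of signal always flows back through the neuron.
        return np.where(x > 0, 1.0, slope)

    x = np.array([-2.0, 0.0, 3.0])
    print(leaky_relu(x))       # [-0.02  0.    3.  ]
    print(leaky_relu_grad(x))  # [0.01 0.01 1.  ]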
Parametric ReLU (PReLU): PReLU extends Leaky ReLU by making the slope of the negative range a learnable parameter. This allows the network to adaptively adjust the slope during training.
PReLU(x) = x, if x > 0
PReLU(x) = a * x, if x <= 0, where 'a' is a learnable parameter.
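Since the slope 'a' is learned, it needs its own gradient. Below is a small NumPy sketch of PReLU's forward pass and the gradient with respect to 'a' (in practice a framework's autograd handles this; the names here are illustrative):

    import numpy as np

    def prelu(x, a):
        # x for positive inputs, a * x otherwise; `a` is trained like any other weight
        return np.where(x > 0, x, a * x)

    def prelu_grad_a(x):
        # d PReLU / d a: 0 on the positive side, x on the negative side
        return np.where(x > 0, 0.0, x)

    x = np.array([-2.0, 0.5, 3.0])
    a = 0.25  # a common initial value; updated by gradient descent during training
    print(prelu(x, a))      # [-0.5  0.5  3. ]
    print(prelu_grad_a(x))  # [-2.  0.  0.]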
Exponential Linear Unit (ELU): ELU is another alternative that addresses the dying ReLU problem. For negative inputs it smoothly saturates toward -a instead of cutting off abruptly at zero, and with a = 1 it is differentiable everywhere, including at 0.
ELU(x) = x, if x > 0
ELU(x) = a * (exp(x) - 1), if x <= 0, where 'a' is a hyperparameter.
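A matching NumPy sketch of ELU; note that with a = 1.0 the derivative from the left at 0 (a * exp(0) = 1) equals the derivative from the right (1), which is what makes it differentiable there:

    import numpy as np

    def elu(x, a=1.0):
        # x for positive inputs, a * (exp(x) - 1) otherwise
        return np.where(x > 0, x, a * (np.exp(x) - 1.0))

    def elu_grad(x, a=1.0):
        # 1 for x > 0, a * exp(x) for x <= 0
        return np.where(x > 0, 1.0, a * np.exp(x))

    x = np.array([-2.0, 0.0, 2.0])
    print(elu(x))       # approx. [-0.8647  0.      2.    ]
    print(elu_grad(x))  # approx. [ 0.1353  1.      1.    ]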
These ReLU variants help mitigate the dying ReLU problem by keeping some gradient flowing for negative inputs, and they are widely used in deep neural networks to improve training stability and convergence. The choice between ReLU, Leaky ReLU, PReLU, and ELU usually comes down to the specific problem and empirical experimentation to determine which activation function works best for a given network architecture; a quick way to try each of them is sketched below.
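In a framework such as PyTorch (used here only as an example; the text above does not assume any particular library, and all four activations exist elsewhere too), swapping one activation for another is a one-line change:

    import torch
    import torch.nn as nn

    # All four activations ship with torch.nn; swapping them is a one-line change.
    activations = {
        "ReLU": nn.ReLU(),
        "LeakyReLU": nn.LeakyReLU(negative_slope=0.01),
        "PReLU": nn.PReLU(),        # the negative slope is a learnable parameter
        "ELU": nn.ELU(alpha=1.0),
    }

    x = torch.randn(4, 8)
    for name, act in activations.items():
        model = nn.Sequential(nn.Linear(8, 16), act, nn.Linear(16, 1))
        print(name, model(x).shape)  # torch.Size([4, 1]) in every case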