Deep learning part 2 – Recurrent neural networks (RNN)

Feedforward networks were covered in detail in the previous post; in this post we go through recurrent networks.

Recurrent networks are used to learn patterns in sequences of data, such as text, handwriting, speech, and time series. They can also be used for vision applications, where an image is decomposed into a series of patches and treated as a sequence.

Recurrent networks have two sources of information: they take the current input together with the input perceived one step back in time. Because the output depends on both current and previous inputs, recurrent networks are often said to have memory. What differentiates recurrent networks from feedforward networks is the feedback loop; a recurrent net can be seen as a (very deep) feedforward net with shared weights.

A simple RNN architecture is shown in the diagram below.

✓ x_0, x_1, …, x_T are the inputs, and h_0, h_1, …, h_T are the hidden states of the recurrent network.

✓ The hidden states are recursively estimated as below:

h_t = f(W x_t + U h_{t-1}),    y_t = g(V h_t)

where W is the input-to-hidden weight matrix, U is the hidden-to-hidden weight matrix, V is the hidden-to-label weight matrix, and f and g are nonlinearities (typically tanh or sigmoid for f, and softmax for g).
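The recursion above can be sketched in a few lines of NumPy. This is a minimal illustration, not a trained network: the dimensions, the tanh nonlinearity, and the random weights are assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions: input size 3, hidden size 4, output size 2.
n_in, n_hid, n_out = 3, 4, 2

W = rng.normal(scale=0.1, size=(n_hid, n_in))   # input-to-hidden weights
U = rng.normal(scale=0.1, size=(n_hid, n_hid))  # hidden-to-hidden weights
V = rng.normal(scale=0.1, size=(n_out, n_hid))  # hidden-to-label weights

def rnn_forward(xs, h0=None):
    """Run the recursion h_t = tanh(W x_t + U h_{t-1}) over a sequence."""
    h = np.zeros(n_hid) if h0 is None else h0
    hs, ys = [], []
    for x in xs:
        h = np.tanh(W @ x + U @ h)
        hs.append(h)
        ys.append(V @ h)  # pre-softmax output (logits) at each step
    return hs, ys

xs = rng.normal(size=(5, n_in))  # a sequence of 5 input vectors
hs, ys = rnn_forward(xs)
print(len(hs), hs[0].shape, ys[0].shape)  # 5 (4,) (2,)
```

Note that the same W and U are applied at every time step; this weight sharing is what makes the unfolded network trainable with backpropagation.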

✓ The weights are estimated by minimizing a loss function, e.g. the cross-entropy between the predicted labels and the true labels, summed over the time steps.
✓ The backpropagation through time (BPTT) algorithm is used to estimate the weights.
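As a concrete example of such a loss, the sketch below computes a summed cross-entropy over one sequence. The two-class logits and labels are made-up toy values; the exact loss used depends on the task.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax for a single logit vector."""
    e = np.exp(z - z.max())
    return e / e.sum()

def sequence_loss(logits, targets):
    """Cross-entropy summed over the time steps of one sequence.

    logits  : list of per-step output vectors (V h_t in the text)
    targets : list of integer class labels, one per step
    """
    loss = 0.0
    for z, t in zip(logits, targets):
        p = softmax(z)
        loss -= np.log(p[t])
    return loss

# Toy two-class example: confident correct predictions give a small loss.
logits = [np.array([4.0, -4.0]), np.array([-4.0, 4.0])]
print(sequence_loss(logits, [0, 1]))  # small (correct labels)
print(sequence_loss(logits, [1, 0]))  # large (wrong labels)
```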

Backpropagation through time (BPTT)
✓ Firstly, the RNN is unfolded in time.
✓ A deep neural network with shared weights W and U is obtained.
✓ The unfolded RNN is trained using normal backpropagation, which was explained in the previous post.
✓ In practice, the number of unfolding steps is limited to 5–10.
✓ It is computationally more efficient to propagate gradients after a few training examples (batch training).
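The steps above can be sketched as a truncated BPTT pass: unfold only the last k steps, run the forward recursion, then accumulate gradients for the shared weights W and U while walking back through the unfolded steps. The dimensions and random values are illustrative, and for simplicity the hidden state before the truncation window is assumed to be zero (in practice it is carried over from the previous window).

```python
import numpy as np

rng = np.random.default_rng(1)
n_in, n_hid = 2, 3
W = rng.normal(scale=0.5, size=(n_hid, n_in))   # input-to-hidden
U = rng.normal(scale=0.5, size=(n_hid, n_hid))  # hidden-to-hidden

def truncated_bptt_grads(xs, dloss_dh_last, k=5):
    """Backpropagate through at most the last k unfolded steps.

    xs            : full input sequence
    dloss_dh_last : gradient of the loss w.r.t. the final hidden state
    Returns gradients of the loss w.r.t. the shared weights W and U.
    """
    xs = xs[-k:]  # unfold only the last k steps
    # Forward pass, storing hidden states for the backward pass.
    hs = [np.zeros(n_hid)]  # state before the window, assumed zero here
    for x in xs:
        hs.append(np.tanh(W @ x + U @ hs[-1]))
    dW = np.zeros_like(W)
    dU = np.zeros_like(U)
    dh = dloss_dh_last
    # Backward pass: gradients for the shared W and U accumulate over steps.
    for t in reversed(range(len(xs))):
        da = dh * (1.0 - hs[t + 1] ** 2)  # through the tanh nonlinearity
        dW += np.outer(da, xs[t])
        dU += np.outer(da, hs[t])
        dh = U.T @ da                     # propagate to the previous step
    return dW, dU

xs = rng.normal(size=(20, n_in))
dW, dU = truncated_bptt_grads(xs, np.ones(n_hid), k=5)
print(dW.shape, dU.shape)  # (3, 2) (3, 3)
```

The repeated multiplication by U.T in the backward loop is exactly where the gradient magnitudes can shrink or blow up, which motivates the two problems discussed next.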

✶ As we propagate the gradients back in time, their magnitude usually decreases quickly; this is called the vanishing gradient problem.

✶ Sometimes the gradients start to grow exponentially during backpropagation through the recurrent weights (the exploding gradient problem). This happens rarely, but the huge gradients lead to big weight updates that destroy what has been learned so far, which is one reason why simple RNNs are unstable. Clipping or normalizing the gradients avoids these huge weight changes.
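A common form of this remedy is to rescale the gradients whenever their overall norm exceeds a threshold. A minimal sketch, with a made-up threshold and toy gradient values:

```python
import numpy as np

def clip_by_global_norm(grads, max_norm=1.0):
    """Rescale a list of gradient arrays so their global norm is <= max_norm."""
    total = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if total > max_norm:
        scale = max_norm / total
        grads = [g * scale for g in grads]
    return grads

# An exploded gradient is scaled down; a small one passes through unchanged.
big = [np.full((2, 2), 100.0)]
small = [np.full((2, 2), 0.1)]
print(np.linalg.norm(clip_by_global_norm(big)[0]))    # 1.0
print(np.linalg.norm(clip_by_global_norm(small)[0]))  # ≈ 0.2
```

Clipping by the global norm preserves the direction of the update and only shrinks its magnitude, so training can continue in the intended direction without the destructive jump.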

✶ In practice, learning long-term dependencies in data is difficult for the simple RNN architecture. Special RNN architectures such as the long short-term memory (LSTM) address this problem.
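To give a flavour of how the LSTM differs from the simple RNN step shown earlier, here is a sketch of a single LSTM step. Gates (sigmoids) decide what to forget, store, and output, and the cell state is updated additively rather than by repeated matrix multiplication, which helps gradients survive over many steps. The dimensions and random weights are illustrative assumptions; bias terms are omitted for brevity.

```python
import numpy as np

rng = np.random.default_rng(2)
n_in, n_hid = 3, 4

# One weight matrix per gate, acting on the concatenated [x_t, h_{t-1}].
# (Illustrative random weights; a real network learns these with BPTT.)
Wf, Wi, Wo, Wc = (rng.normal(scale=0.1, size=(n_hid, n_in + n_hid))
                  for _ in range(4))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev):
    """One LSTM step: gates control what is forgotten, stored, and output."""
    z = np.concatenate([x, h_prev])
    f = sigmoid(Wf @ z)                   # forget gate
    i = sigmoid(Wi @ z)                   # input gate
    o = sigmoid(Wo @ z)                   # output gate
    c = f * c_prev + i * np.tanh(Wc @ z)  # additive cell-state update
    h = o * np.tanh(c)
    return h, c

h = c = np.zeros(n_hid)
for x in rng.normal(size=(10, n_in)):
    h, c = lstm_step(x, h, c)
print(h.shape, c.shape)  # (4,) (4,)
```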