Recurrent Networks I

Consider the following two networks:

(Fig. 1: two networks, one feedforward and one with a recurrent hidden-unit connection)

The network on the left is a simple feedforward network of the kind we have already met. The right-hand network has an additional connection from the hidden unit to itself. What difference could this seemingly small change to the network make?

Each time a pattern is presented, the unit computes its activation just as in a feedforward network. However, its net input now contains a term which reflects the state of the network (the hidden unit activation) before the pattern was seen. When we present subsequent patterns, the hidden and output units' states will be a function of everything the network has seen so far.

The network's behavior is thus based on its history, and so we must think of pattern presentation as happening in time.
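This effect is easy to see numerically. The sketch below simulates the right-hand network of Fig. 1 with made-up weights (w_in, w_rec, and w_out are hypothetical values, not taken from the text); the hidden unit's net input includes its own previous activation:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical weights for the right-hand network of Fig. 1:
# input -> hidden, hidden -> hidden (recurrent), hidden -> output.
w_in, w_rec, w_out = 0.5, 0.8, 1.0

h = 0.0  # hidden-unit state before any pattern has been seen
for x in [1.0, 0.0, 0.0]:               # a short input sequence
    h = sigmoid(w_in * x + w_rec * h)   # net input includes previous h
    y = sigmoid(w_out * h)
    print(f"x={x}  h={h:.3f}  y={y:.3f}")
```

Even after the input drops to zero, the hidden state keeps reflecting the earlier 1.0 input through the recurrent term, so each output is a function of the whole sequence seen so far.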

Network topology

Once we allow feedback connections, our choice of network topology becomes much freer: we can connect any unit to any other, even to itself. Two of our basic requirements for computing activations and errors in the network are now violated. When computing activations, we required that before computing y_i, we had to know the activations of all units in its anterior set of nodes, A_i. For computing errors, we required that before computing delta_j, we had to know the errors of all units in its posterior set of nodes, P_j.

For an arbitrary unit in a recurrent network, we now define its activation at time t as:

y_i(t) = f_i(net_i(t-1))

At each time step, therefore, activation propagates forward through one layer of connections only. Once some level of activation is present in the network, it will continue to flow around the units, even in the absence of any new input whatsoever. We can now present the network with a time series of inputs and require that it produce an output based on this series. These networks can be used to model many new kinds of problems; however, they also present many new and difficult issues in training.
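The one-step update rule is simple to simulate. The sketch below (all weights are hypothetical random values) applies y_i(t) = f_i(net_i(t-1)) to a small fully connected recurrent net; note that activation keeps circulating even though no new input arrives after the first step:

```python
import numpy as np

# Hypothetical 3-unit fully recurrent net: W[i, j] is the weight
# from unit j to unit i; any unit may connect to any other.
rng = np.random.default_rng(0)
W = rng.normal(scale=0.5, size=(3, 3))

y = np.array([1.0, 0.0, 0.0])   # initial activation; no further input
for t in range(1, 6):
    y = np.tanh(W @ y)          # y_i(t) = f_i(net_i(t-1))
    print(t, np.round(y, 3))
```

At each iteration, activation moves through exactly one layer of connections, and the units remain active long after the initial input is gone.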

Before we address the new issues in training and operation of recurrent neural networks, let us first look at some sample tasks which have been attempted (or solved) by such networks.

The Simple Recurrent Network

One way to meet these requirements is illustrated below in a network known variously as an Elman network (after Jeff Elman, its originator) or as a Simple Recurrent Network. At each time step, the hidden layer activations are copied into a copy layer. Processing is done as follows:

  1. Copy inputs for time t to the input units
  2. Compute hidden unit activations using net input from input units and from copy layer
  3. Compute output unit activations as usual
  4. Copy new hidden unit activations to copy layer
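The four steps above can be sketched as follows (the layer sizes and random weights are hypothetical; only the data flow matters):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical Elman network: 2 inputs, 3 hidden units, 1 output.
rng = np.random.default_rng(1)
W_ih = rng.normal(size=(3, 2))   # input  -> hidden
W_ch = rng.normal(size=(3, 3))   # copy   -> hidden
W_ho = rng.normal(size=(1, 3))   # hidden -> output

copy = np.zeros(3)               # copy layer starts empty
for x in [np.array([1.0, 0.0]), np.array([0.0, 1.0])]:
    # 1. copy inputs for time t to the input units
    # 2. hidden activations from the input units and the copy layer
    h = sigmoid(W_ih @ x + W_ch @ copy)
    # 3. output activations as usual
    y = sigmoid(W_ho @ h)
    # 4. copy the new hidden activations to the copy layer
    copy = h.copy()
```

Because the copy step is outside the trainable part of the network, the forward pass itself contains no cycles.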

In computing the activation, we have eliminated cycles, and so our requirement that the activations of all anterior nodes be known is met. Likewise, in computing errors, all trainable weights are feedforward only, so we can apply the standard backpropagation algorithm as before. The weights from the copy layer to the hidden layer play a special role in error computation. The error signal they receive comes from the hidden units, and so depends on the error at the hidden units at time t. The activations in the copy layer, however, are simply the activations of the hidden units at time t-1. Thus, in training, we are considering a gradient of an error function which is determined by the activations at both the present and the previous time steps.
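Concretely, the gradient for a copy-to-hidden weight pairs a hidden-unit error at time t with a copy-layer activation, i.e. a hidden activation from time t-1. A sketch with made-up numbers (delta_h and copy are hypothetical values, not derived in the text):

```python
import numpy as np

# Hypothetical backpropagated errors at the hidden units at time t ...
delta_h = np.array([0.10, -0.20, 0.05])
# ... and copy-layer contents, i.e. hidden activations from time t-1.
copy = np.array([0.60, 0.30, 0.90])

# Gradient for the copy -> hidden weight matrix: each entry pairs an
# error at time t with an activation from time t-1.
grad_W_ch = np.outer(delta_h, copy)
print(grad_W_ch)
```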

A generalization of this approach is to copy the input and hidden unit activations for a number of previous time steps. The more context (copy layers) we maintain, the more history we are explicitly including in our gradient computation. This approach has become known as Back Propagation Through Time. It can be seen as an approximation to the ideal of computing a gradient which takes into account not just the most recent inputs, but all inputs seen so far by the network. The figure below illustrates one version of the process:

The inputs and hidden unit activations of the last three time steps are stored. The solid arrows show how each set of activations is determined from the input and hidden unit activations on the previous time step. A backward pass, illustrated by the dashed arrows, is performed to determine a separate value of delta (the error of a unit with respect to its net input) for each unit at each time step. Because each earlier layer is a copy of the layer one level up, we introduce the new constraint that the weights at each level be identical. The partial derivative of the negative error with respect to w_ij is then simply the sum of the partials calculated for the copy of w_ij between each pair of layers.
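This weight-sharing constraint can be checked on a toy case. Below, a single linear recurrent unit (a hypothetical setup, not from the text: h(t) = w * h(t-1) + x(t), with squared error at the final step only) is unrolled for three steps; the gradient for the shared weight w is the sum of the partials computed for its copy at each level, and it matches a numerical estimate:

```python
# Minimal BPTT sketch for one linear recurrent unit (hypothetical setup):
# h(t) = w * h(t-1) + x(t),  E = 0.5 * (h(T) - target)^2.
xs, target, w = [1.0, 0.5, -0.3], 1.0, 0.7

def forward(w):
    h = [0.0]                        # no activation before the series
    for x in xs:
        h.append(w * h[-1] + x)
    return h

h = forward(w)
delta = h[-1] - target               # delta at the final time step
grad = 0.0
for t in range(len(xs), 0, -1):
    grad += delta * h[t - 1]         # partial for the copy of w at level t
    delta *= w                       # pass delta back one time step

# Numerical check of the summed gradient.
eps = 1e-6
E = lambda v: 0.5 * (forward(v)[-1] - target) ** 2
numeric = (E(w + eps) - E(w - eps)) / (2 * eps)
print(grad, numeric)                 # the two estimates should agree
```

Summing the per-level partials is exactly what keeps the unrolled copies of w identical after each update.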

Back Prop Through Time

Elman networks and their generalization, Back Propagation Through Time, both seek to approximate the computation of a gradient based on all past inputs, while retaining the standard backprop algorithm. BPTT has been used in a number of applications (e.g. ECG modeling). The main task is to produce particular output sequences in response to specific input sequences. The downside of BPTT is that it requires a large amount of storage, computation, and training examples in order to work well. In the next section we will see how we can compute the true temporal gradient using a method known as Real Time Recurrent Learning.

