Delta-Bar-Delta (Jacobs)
Since the cost surface for multi-layer networks can be complex, choosing a learning rate can be difficult: a rate that works well in one region of the cost surface may work poorly in another. Delta-Bar-Delta is a heuristic algorithm for modifying the learning rate as training progresses:
- Each weight has its own learning rate.
- For each weight, the gradient at the current time step is compared with an exponential average of the previous gradients.
- If they agree in sign, the learning rate is increased.
- If they disagree in sign, the learning rate is decreased.
- Should be used with batch updates only.
Let

$g_{ij}(t)$ = gradient of $E$ with respect to $w_{ij}$ at time $t$,

then define the exponentially averaged gradient

$$\bar{g}_{ij}(t) = (1 - \beta)\, g_{ij}(t) + \beta\, \bar{g}_{ij}(t - 1).$$

Then the learning rate $\mu_{ij}$ for weight $w_{ij}$ at time $t+1$ is given by

$$\mu_{ij}(t+1) = \begin{cases} \mu_{ij}(t) + \kappa & \text{if } \bar{g}_{ij}(t-1)\, g_{ij}(t) > 0,\\ (1 - \gamma)\, \mu_{ij}(t) & \text{if } \bar{g}_{ij}(t-1)\, g_{ij}(t) < 0,\\ \mu_{ij}(t) & \text{otherwise,} \end{cases}$$

where $\beta$, $\kappa$, and $\gamma$ are chosen by hand.
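As a concrete illustration, here is a minimal NumPy sketch of one Delta-Bar-Delta step for an array of weights, following the update rule above. The function name and the default values for beta, kappa, and gamma are illustrative choices, not from the source:

```python
import numpy as np

def delta_bar_delta_step(w, g, g_bar, mu, beta=0.7, kappa=0.01, gamma=0.1):
    """One Delta-Bar-Delta update (illustrative sketch).

    w     : weight array
    g     : current batch gradient dE/dw
    g_bar : exponential average of past gradients (from the previous step)
    mu    : per-weight learning rates
    """
    # Sign agreement between the current gradient and the averaged past gradient
    agree = g_bar * g

    # Increase rates additively where signs agree,
    # decrease them multiplicatively where they disagree,
    # and leave them unchanged where the product is zero.
    mu = np.where(agree > 0, mu + kappa, mu)
    mu = np.where(agree < 0, (1.0 - gamma) * mu, mu)

    # Gradient-descent step with per-weight learning rates
    w = w - mu * g

    # Update the exponential average (the "bar" in Delta-Bar-Delta)
    g_bar = (1.0 - beta) * g + beta * g_bar

    return w, g_bar, mu
```

In a training loop this would be called once per full-batch gradient evaluation, with `g_bar` initialized to zeros and `mu` to a small constant.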
Downsides:
- Choosing the parameters $\beta$, $\kappa$, and $\gamma$ is not easy.
- Doesn't work for online (per-example) updates: single-example gradients are too noisy for the sign comparison to be reliable.