Sequence Modeling: RNNs
Deep-dive into recurrent architectures, unfolding computational graphs, and the BPTT algorithm for sequence learning.
1 · Unfolding Graphs
Unfolding is the operation that maps a circuit with recurrent connections to a computational graph with repeated pieces, one per time step. This lets the model handle variable-length histories with a fixed-size state, applying the same transition function with the same parameters at every step.
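For example, with transition function \( f \) and parameters \( \boldsymbol{\theta} \) shared across steps, the recurrence \( \mathbf{h}^{(t)} = f(\mathbf{h}^{(t-1)}, \mathbf{x}^{(t)}; \boldsymbol{\theta}) \) unfolds after three steps into

\( \mathbf{h}^{(3)} = f(f(f(\mathbf{h}^{(0)}, \mathbf{x}^{(1)}; \boldsymbol{\theta}), \mathbf{x}^{(2)}; \boldsymbol{\theta}), \mathbf{x}^{(3)}; \boldsymbol{\theta}) \),

a graph with one copy of \( f \) per time step but a single shared parameter vector.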
3 · RNN Forward Propagation
Assuming a hyperbolic tangent activation and softmax output, the standard RNN update equations are:
- \( \mathbf{a}^{(t)} = \mathbf{b} + \mathbf{W}\mathbf{h}^{(t-1)} + \mathbf{U}\mathbf{x}^{(t)} \)
- \( \mathbf{h}^{(t)} = \tanh(\mathbf{a}^{(t)}) \)
- \( \mathbf{o}^{(t)} = \mathbf{c} + \mathbf{V}\mathbf{h}^{(t)} \)
- \( \hat{\mathbf{y}}^{(t)} = \operatorname{softmax}(\mathbf{o}^{(t)}) \)
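As a concrete illustration, here is a minimal NumPy sketch of these four equations. The parameter names mirror the notation above; the shapes and random initialization are assumptions made only to keep the demo runnable.

```python
import numpy as np

def softmax(o):
    e = np.exp(o - o.max())              # shift by the max for numerical stability
    return e / e.sum()

def rnn_forward(xs, h0, U, W, V, b, c):
    """Run the four update equations above over a sequence xs."""
    h, hs, yhats = h0, [], []
    for x in xs:
        a = b + W @ h + U @ x            # a(t) = b + W h(t-1) + U x(t)
        h = np.tanh(a)                   # h(t) = tanh(a(t))
        o = c + V @ h                    # o(t) = c + V h(t)
        yhats.append(softmax(o))         # yhat(t) = softmax(o(t))
        hs.append(h)
    return hs, yhats

# Hypothetical sizes, chosen only to make the sketch runnable.
rng = np.random.default_rng(0)
n_x, n_h, n_y, tau = 3, 5, 2, 4
U, W, V = (rng.normal(size=s) for s in [(n_h, n_x), (n_h, n_h), (n_y, n_h)])
hs, yhats = rnn_forward(rng.normal(size=(tau, n_x)), np.zeros(n_h),
                        U, W, V, np.zeros(n_h), np.zeros(n_y))
```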
4 · BPTT Algorithm
Computing the gradient requires a forward pass moving left to right through the unrolled graph, followed by a backward pass moving right to left. Both runtime and memory cost are \( O(\tau) \) for a sequence of length \( \tau \): the states computed in the forward pass must be stored until the backward pass reuses them. The runtime cannot be reduced by parallelizing across time steps, because each state depends on the one before it.
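Below is a sketch of the backward pass paired with the `rnn_forward` sketch above, assuming one-hot targets and a negative log-likelihood loss, so that \( \partial L / \partial \mathbf{o}^{(t)} = \hat{\mathbf{y}}^{(t)} - \mathbf{y}^{(t)} \). The sequential loop over \( t \) is exactly the part that cannot be parallelized.

```python
import numpy as np

def bptt(xs, ys_true, hs, yhats, h0, U, W, V):
    """Backward pass through time for the forward equations above.
    ys_true are one-hot targets; hs and yhats come from the forward pass."""
    tau = len(xs)
    dU, dW, dV = np.zeros_like(U), np.zeros_like(W), np.zeros_like(V)
    db, dc = np.zeros(W.shape[0]), np.zeros(V.shape[0])
    dh_next = np.zeros(W.shape[0])          # gradient flowing back from step t+1
    for t in reversed(range(tau)):
        do = yhats[t] - ys_true[t]          # dL/do(t) for softmax + NLL
        dc += do
        dV += np.outer(do, hs[t])
        dh = V.T @ do + dh_next             # local plus recurrent contribution
        da = (1.0 - hs[t] ** 2) * dh        # through tanh: diag(1 - h(t)^2)
        db += da
        h_prev = hs[t - 1] if t > 0 else h0
        dW += np.outer(da, h_prev)
        dU += np.outer(da, xs[t])
        dh_next = W.T @ da                  # pass gradient back to step t-1
    return dU, dW, dV, db, dc
```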
5 · Teacher Forcing
Teacher forcing is a training technique in which the ground-truth output \( \mathbf{y}^{(t)} \) is fed as input at time \( t+1 \), rather than the model's own (potentially erroneous) output. For models whose only recurrence runs from the output back into the hidden state, this decouples the time steps and allows training to be parallelized across them.
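A minimal sketch of that decoupling in the output-to-hidden case: each training step sees only \( \mathbf{x}^{(t)} \) and the ground-truth \( \mathbf{y}^{(t-1)} \), so the loop iterations are independent. Here \( \mathbf{W} \) maps the fed-back output into the hidden layer, a reuse of the symbol purely for illustration, and all shapes are assumptions.

```python
import numpy as np

def softmax(o):
    e = np.exp(o - o.max())
    return e / e.sum()

def teacher_forced_step(x_t, y_prev, U, W, V, b, c):
    """One step of a model whose only recurrence is output-to-hidden:
    the ground-truth y(t-1) is fed in where the model's output would go."""
    h = np.tanh(b + W @ y_prev + U @ x_t)
    return softmax(c + V @ h)

def teacher_forced_losses(xs, ys_true, U, W, V, b, c):
    """Per-step NLL with one-hot targets; each iteration depends only on
    (x(t), y(t-1)), so the steps could be computed in parallel."""
    return [-np.log(teacher_forced_step(xs[t], ys_true[t - 1], U, W, V, b, c)
                    @ ys_true[t])
            for t in range(1, len(xs))]
```

At test time the ground truth is unavailable, so the model's own output is fed back in its place.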