Sequential models like LSTMs are designed for temporal dependencies, while convolutional networks capture spatial structures. When data involves both—such as videos, radar maps, or medical scans—a hybrid approach is needed. Convolutional Long Short-Term Memory networks, or ConvLSTMs, are designed precisely for this task.
From LSTMs to Convolutional LSTMs
LSTMs are a class of recurrent neural networks capable of retaining long-term temporal dependencies.
They do this through a system of gates—input, forget, and output—that regulate how information flows through time. In a standard LSTM, these operations are defined over vectors using matrix multiplications. The assumption is that the input is a one-dimensional sequence, such as words in a sentence or readings in a time series.
This assumption breaks when the input has spatial dimensions. Flattening an image or a frame into a vector destroys spatial locality. Each pixel loses its relationship with its neighbors. A network that processes such flattened inputs cannot easily model spatial dependencies, which are essential for understanding motion, texture, and structure in visual or geospatial data.
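To see why flattening destroys locality, consider a small sketch in plain Python (the 64×64 frame size is just an illustrative choice): in row-major order, horizontally adjacent pixels stay adjacent, but vertically adjacent pixels end up a full row apart.

```python
# Flattening an H×W frame in row-major order maps pixel (r, c) to index r*W + c.
H, W = 64, 64

def flat_index(r, c, width=W):
    """Row-major index of pixel (r, c) after flattening."""
    return r * width + c

# Horizontally adjacent pixels remain neighbors in the flattened vector...
left, right = flat_index(10, 20), flat_index(10, 21)
# ...but vertically adjacent pixels are separated by a full row width.
above, below = flat_index(10, 20), flat_index(11, 20)

print(right - left)   # 1
print(below - above)  # 64
```

A dense layer sees indices 640 and 704 as unrelated coordinates, even though they were touching pixels in the original frame.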
Convolutional LSTMs solve this by replacing dense operations with convolutions. Instead of treating the input as a vector, they preserve it as a tensor with height, width, and channel dimensions. The hidden and cell states are also tensors, maintaining spatial structure over time. The recurrent connections then operate through convolutional kernels, allowing the model to learn spatial patterns and their evolution over time.
The Mathematical Formulation
The structure of a ConvLSTM cell mirrors that of a standard LSTM, but every matrix multiplication is replaced by a convolution.
\[ i_t = \sigma(W_{xi} * X_t + W_{hi} * H_{t-1} + b_i) \]
\[ f_t = \sigma(W_{xf} * X_t + W_{hf} * H_{t-1} + b_f) \]
\[ C_t = f_t \odot C_{t-1} + i_t \odot \tanh(W_{xc} * X_t + W_{hc} * H_{t-1} + b_c) \]
\[ o_t = \sigma(W_{xo} * X_t + W_{ho} * H_{t-1} + b_o) \]
\[ H_t = o_t \odot \tanh(C_t) \]
Here, "*" denotes convolution instead of matrix multiplication. Each gate's output has the same spatial dimensions as the input, ensuring that both spatial and temporal information propagate through the network. This design allows the model to learn how spatial patterns change over time, rather than simply tracking scalar features.
Implementing a ConvLSTM Cell
Implementing a ConvLSTM in PyTorch requires replacing the linear transformations in a typical LSTM cell with convolutional layers.
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    def __init__(self, input_channels, hidden_channels, kernel_size):
        super().__init__()
        padding = kernel_size // 2  # "same" padding keeps spatial dimensions
        self.hidden_channels = hidden_channels
        # A single convolution computes all four gates in one pass.
        self.conv = nn.Conv2d(
            in_channels=input_channels + hidden_channels,
            out_channels=4 * hidden_channels,
            kernel_size=kernel_size,
            padding=padding,
        )

    def forward(self, x, h_prev, c_prev):
        # Concatenate input and previous hidden state along the channel axis.
        combined = torch.cat([x, h_prev], dim=1)
        gates = self.conv(combined)
        # Split into input, forget, output gates and the candidate update.
        i, f, o, g = torch.chunk(gates, 4, dim=1)
        i = torch.sigmoid(i)
        f = torch.sigmoid(f)
        o = torch.sigmoid(o)
        g = torch.tanh(g)
        c = f * c_prev + i * g   # new cell state
        h = o * torch.tanh(c)    # new hidden state
        return h, c
This cell can be stacked into a multi-layer ConvLSTM network, or used inside a larger model such as a video predictor or a spatiotemporal autoencoder. The main advantage is that convolution operations can exploit local spatial correlations while recurrence handles temporal evolution.
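As a quick usage sketch, the cell can be unrolled over a sequence of frames. The shapes below (batch of 2, five 32×32 single-channel frames, 8 hidden channels) are arbitrary illustrative choices, and the cell definition is repeated in compact form so the snippet runs standalone:

```python
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    # Same cell as above, condensed so this sketch is self-contained.
    def __init__(self, input_channels, hidden_channels, kernel_size):
        super().__init__()
        self.hidden_channels = hidden_channels
        self.conv = nn.Conv2d(input_channels + hidden_channels,
                              4 * hidden_channels, kernel_size,
                              padding=kernel_size // 2)

    def forward(self, x, h_prev, c_prev):
        gates = self.conv(torch.cat([x, h_prev], dim=1))
        i, f, o, g = torch.chunk(gates, 4, dim=1)
        c = torch.sigmoid(f) * c_prev + torch.sigmoid(i) * torch.tanh(g)
        h = torch.sigmoid(o) * torch.tanh(c)
        return h, c

# Toy sequence: batch of 2, 5 frames, 1 channel, 32x32 pixels.
cell = ConvLSTMCell(input_channels=1, hidden_channels=8, kernel_size=3)
frames = torch.randn(2, 5, 1, 32, 32)
h = torch.zeros(2, 8, 32, 32)  # hidden state starts at zero
c = torch.zeros(2, 8, 32, 32)  # cell state starts at zero
outputs = []
for t in range(frames.size(1)):
    h, c = cell(frames[:, t], h, c)
    outputs.append(h)

print(torch.stack(outputs, dim=1).shape)  # torch.Size([2, 5, 8, 32, 32])
```

Stacking layers amounts to feeding one cell's hidden-state sequence as the next cell's input sequence.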
Understanding Spatiotemporal Dynamics
ConvLSTMs do not just track sequences of numbers; they model how patterns evolve in space and time.
Each hidden state encodes the spatial layout of the system at a given time step. Convolutions ensure that information flows between nearby pixels, allowing the network to detect local motion or shape deformations. Over time, the recurrent connections integrate these spatial updates into coherent temporal dynamics.
For instance, in a weather prediction model, each grid cell may represent a geographic region's precipitation level. A ConvLSTM can learn how storms move across space, capturing both their intensity and trajectory. A traditional LSTM operating on flattened grids would fail to capture such movement, because flattening discards the spatial adjacency that defines it.
Practical Considerations
ConvLSTMs are computationally expensive.
Convolutions within recurrent loops increase both memory and processing costs. They also require careful initialization and regularization to avoid vanishing or exploding gradients over long sequences. In practice, ConvLSTMs are often combined with downsampling and upsampling operations to control complexity. Encoder–decoder structures with ConvLSTM bottlenecks are common in video prediction and weather forecasting.
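A rough back-of-envelope calculation (illustrative numbers only) shows why downsampling before the recurrence helps: the per-step cost of a ConvLSTM scales with the number of spatial positions, so halving height and width twice cuts the recurrent cost by a factor of 16.

```python
# Approximate multiply-adds per ConvLSTM step: 4 gate maps, each produced by a
# k×k convolution over (c_in + c_hidden) channels at every spatial position.
def convlstm_step_flops(height, width, c_in, c_hidden, k=3):
    return height * width * 4 * c_hidden * (c_in + c_hidden) * k * k

full = convlstm_step_flops(64, 64, 1, 16)  # recurrence at full resolution
down = convlstm_step_flops(16, 16, 1, 16)  # after two stride-2 downsamplings
print(full // down)  # 16x cheaper per step at quarter resolution
```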
Despite their cost, ConvLSTMs are far more parameter-efficient than applying a dense LSTM to flattened frames. Because convolutional kernels are shared across spatial positions, the parameter count is independent of the input's height and width, making ConvLSTMs scalable to large spatial inputs.
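The gap is easy to quantify. Using hypothetical numbers (64×64 single-channel frames, 16 hidden channels, 3×3 kernels), compare the gate convolution's parameter count to a dense LSTM whose hidden vector must cover the same grid:

```python
# Hypothetical setting: 64×64 single-channel frames, 16 hidden channels, 3×3 kernels.
H, W, C_in, C_hid, k = 64, 64, 1, 16, 3

# ConvLSTM: one convolution produces all four gates (weights + biases).
conv_params = (4 * C_hid) * ((C_in + C_hid) * k * k) + 4 * C_hid

# Dense LSTM on flattened frames: the hidden state must itself span the grid.
in_dim = H * W * C_in
hid_dim = H * W * C_hid
dense_params = 4 * (in_dim + hid_dim) * hid_dim + 4 * hid_dim

print(conv_params)   # 9856
print(dense_params)  # 18253873152
```

The convolutional cell needs under ten thousand parameters where the dense equivalent needs over eighteen billion, and the conv count does not grow at all if the frames get larger.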
Applications and Extensions
ConvLSTMs are used in diverse domains that require modeling spatiotemporal processes.
Video frame prediction is one of the most common applications, where the network learns to predict the next frame given a sequence of past frames. In meteorology, ConvLSTMs have been used to forecast rainfall patterns or wind motion. In medicine, they have been applied to dynamic MRI and ultrasound imaging, where temporal coherence matters.
Extensions of ConvLSTMs include variants with attention mechanisms, 3D convolutions for volumetric data, and bidirectional recurrence for improved temporal context. Research continues to explore hybrid models that combine ConvLSTMs with transformers or graph neural networks to further improve spatiotemporal understanding.
Closing Remarks
ConvLSTMs bridge the gap between convolutional and recurrent architectures.
They bring the spatial inductive bias of CNNs into the temporal domain of RNNs. By doing so, they enable models to learn rich, structured representations of data that change over both space and time. Although computationally demanding, their ability to preserve spatial relationships while capturing temporal evolution makes them indispensable for a wide range of real-world problems.
November 2025