torch_geometric.nn.attention.QFormer

class QFormer(input_dim: int, hidden_dim: int, output_dim: int, num_heads: int, num_layers: int, dropout: float = 0.0, activation: Callable = ReLU())[source]

Bases: Module

The Querying Transformer (Q-Former) from “BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models” paper.

Parameters:
  • input_dim (int) – The number of features in the input.

  • hidden_dim (int) – The hidden dimension of the feed-forward network (FFN) in each encoder layer.

  • output_dim (int) – The final output dimension.

  • num_heads (int) – The number of attention heads in each multi-head attention layer.

  • num_layers (int) – The number of sub-encoder-layers in the encoder.

  • dropout (float, optional) – The dropout probability in each encoder layer. (default: 0.0)

  • activation (Callable, optional) – The activation function in each encoder layer. (default: torch.nn.ReLU())

Note

This is a simplified version of the original Q-Former implementation.

forward(x: Tensor) → Tensor[source]

Forward pass.

Parameters:

x (torch.Tensor) – Input sequence to the encoder layer. \(\mathbf{X} \in \mathbb{R}^{B \times N \times F}\), with batch-size \(B\), sequence length \(N\), and feature dimension \(F\).

Return type:

Tensor
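
A minimal usage sketch follows. The shapes and hyper-parameter values are illustrative only, and it is assumed that the output keeps the batch and sequence dimensions while projecting the feature dimension from input_dim to output_dim:

    import torch
    from torch_geometric.nn.attention import QFormer

    # Toy input: batch size B=4, sequence length N=16, feature dimension F=32.
    x = torch.randn(4, 16, 32)

    model = QFormer(
        input_dim=32,   # must match the feature dimension F of the input
        hidden_dim=64,  # dimension of the FFN in each encoder layer
        output_dim=8,   # final projected feature dimension
        num_heads=4,    # illustrative choice; input_dim is assumed to be divisible by num_heads
        num_layers=2,
        dropout=0.1,
    )

    out = model(x)
    print(out.shape)  # assumed to be torch.Size([4, 16, 8]), i.e. [B, N, output_dim]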