torch_geometric.nn.attention.QFormer

class QFormer(input_dim: int, hidden_dim: int, output_dim: int, num_heads: int, num_layers: int, dropout: float = 0.0, activation: Callable = ReLU())[source]

Bases: Module

The Querying Transformer (Q-Former) from “BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models” paper.

Parameters:
  • input_dim (int) – The number of features in the input.

  • hidden_dim (int) – The hidden dimension of the feed-forward network (FFN) in each encoder layer.

  • output_dim (int) – The final output dimension.

  • num_heads (int) – The number of attention heads in each multi-head attention layer.

  • num_layers (int) – The number of sub-encoder-layers in the encoder.

  • dropout (float, optional) – The dropout probability in each encoder layer. (default: 0.0)

  • activation (Callable, optional) – The activation function in each encoder layer. (default: torch.nn.ReLU())

Note

This is a simplified version of the original Q-Former implementation.

forward(x: Tensor) → Tensor[source]

Forward pass.

Parameters:

x (torch.Tensor) – Input sequence to the encoder layer. \(\mathbf{X} \in \mathbb{R}^{B \times N \times F}\), with batch-size \(B\), sequence length \(N\), and feature dimension \(F\).

Return type:

Tensor
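
A minimal usage sketch follows. The shapes and hyper-parameter values are illustrative only, and it is assumed that the output keeps the batch and sequence dimensions while projecting the feature dimension from input_dim to output_dim:

    import torch
    from torch_geometric.nn.attention import QFormer

    # Toy input: batch size B=4, sequence length N=16, feature dimension F=32.
    x = torch.randn(4, 16, 32)

    model = QFormer(
        input_dim=32,   # must match the feature dimension F of the input
        hidden_dim=64,  # dimension of the FFN in each encoder layer
        output_dim=8,   # final projected feature dimension
        num_heads=4,    # illustrative choice; input_dim is assumed to be divisible by num_heads
        num_layers=2,
        dropout=0.1,
    )

    out = model(x)
    print(out.shape)  # assumed to be torch.Size([4, 16, 8]), i.e. [B, N, output_dim]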