torch_geometric.datasets.ProteinMPNNDataset

class ProteinMPNNDataset(root: str, size: str = 'small', split: str = 'train', datacut: str = '2030-01-01', rescut: float = 3.5, homo: float = 0.7, max_length: int = 10000, num_units: int = 150, transform: Optional[Callable] = None, pre_transform: Optional[Callable] = None, pre_filter: Optional[Callable] = None, force_reload: bool = False)

Bases: InMemoryDataset

The ProteinMPNN dataset from the “Robust deep learning based protein sequence design using ProteinMPNN” paper.

Parameters:
  • root (str) – Root directory where the dataset should be saved.

  • size (str, optional) – Which version of the PDB training data to load. If "small", loads the small dataset (229.4 MB). If "large", loads the large dataset (64.1 GB). (default: "small")

  • split (str, optional) – If "train", loads the training dataset. If "valid", loads the validation dataset. If "test", loads the test dataset. (default: "train")

  • datacut (str, optional) – Date cutoff to filter the dataset. (default: "2030-01-01")

  • rescut (float, optional) – PDB resolution cutoff, in ångströms. (default: 3.5)

  • homo (float, optional) – Homology cutoff. (default: 0.7)

  • max_length (int, optional) – Maximum length of the protein complex. (default: 10000)

  • num_units (int, optional) – Number of units of the protein complex. (default: 150)

  • transform (callable, optional) – A function/transform that takes in a torch_geometric.data.Data object and returns a transformed version. The data object will be transformed before every access. (default: None)

  • pre_transform (callable, optional) – A function/transform that takes in a torch_geometric.data.Data object and returns a transformed version. The data object will be transformed before being saved to disk. (default: None)

  • pre_filter (callable, optional) – A function that takes in a torch_geometric.data.Data object and returns a boolean value, indicating whether the data object should be included in the final dataset. (default: None)

  • force_reload (bool, optional) – Whether to re-process the dataset. (default: False)
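
Example:

The parameters above can be combined as in the following minimal sketch. It is illustrative only: build_dataset_kwargs and load_dataset are hypothetical helpers (not part of PyTorch Geometric), the root path "data/ProteinMPNN" is an arbitrary choice, and the accepted values for size and split are taken from the parameter descriptions above. Calling load_dataset requires torch_geometric to be installed and is expected to download the selected dataset on first use.

```python
from typing import Any, Dict


def build_dataset_kwargs(
    root: str,
    size: str = "small",
    split: str = "train",
    rescut: float = 3.5,
    max_length: int = 10000,
) -> Dict[str, Any]:
    """Hypothetical helper: validate and collect constructor arguments."""
    # Accepted values as listed in the parameter descriptions.
    if size not in {"small", "large"}:
        raise ValueError(f"unknown size: {size!r}")
    if split not in {"train", "valid", "test"}:
        raise ValueError(f"unknown split: {split!r}")
    return {
        "root": root,
        "size": size,
        "split": split,
        "rescut": rescut,
        "max_length": max_length,
    }


def load_dataset(**kwargs: Any):
    """Hypothetical helper: instantiate the dataset from keyword arguments."""
    # Imported lazily so the sketch can be read without PyG installed.
    from torch_geometric.datasets import ProteinMPNNDataset

    return ProteinMPNNDataset(**kwargs)


# Arguments for the validation split of the small dataset,
# with a stricter resolution cutoff than the default.
valid_kwargs = build_dataset_kwargs(
    root="data/ProteinMPNN",  # assumed path, choose any writable directory
    split="valid",
    rescut=3.0,
)
# valid_dataset = load_dataset(**valid_kwargs)  # triggers download/processing
```

Since cutoffs such as rescut and max_length are applied when the dataset is built, passing force_reload=True may be needed if a processed copy created with other settings already exists under root; whether this is required depends on how the processed files are cached.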