Cross-validation and data splitting

class GroupKFoldShuffle(n_splits=5, *, shuffle=False, random_state=None)
split(X, y=None, groups=None)

Generate indices to split data into training and test sets.

Parameters

X : array-like of shape (n_samples, n_features)

Training data, where n_samples is the number of samples and n_features is the number of features.

y : array-like of shape (n_samples,), default=None

The target variable for supervised learning problems.

groups : array-like of shape (n_samples,), default=None

Group labels for the samples, used when splitting the dataset into training and test sets.

Yields

train : ndarray

The training set indices for that split.

test : ndarray

The testing set indices for that split.
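To illustrate the idea behind a shuffled group split, here is a minimal pure-Python sketch. It is not the class's implementation: the function name `group_kfold_shuffle` is hypothetical, it returns plain lists rather than ndarrays, and it deals groups into folds round-robin, which is an assumption made for brevity. The property it demonstrates is that every sample sharing a group label lands on the same side of each split.

```python
import random


def group_kfold_shuffle(groups, n_splits=5, shuffle=False, random_state=None):
    """Yield (train, test) index lists so that no group spans both sides.

    Hypothetical illustration only; the documented class may assign
    groups to folds differently.
    """
    unique = sorted(set(groups))
    if len(unique) < n_splits:
        raise ValueError("n_splits cannot exceed the number of distinct groups")
    if shuffle:
        # Shuffle the group labels, not the samples, so groups stay intact.
        random.Random(random_state).shuffle(unique)
    for k in range(n_splits):
        # Round-robin assignment of groups to fold k (an assumption).
        held_out = set(unique[k::n_splits])
        test = [i for i, g in enumerate(groups) if g in held_out]
        train = [i for i, g in enumerate(groups) if g not in held_out]
        yield train, test


groups = ["a", "a", "b", "c", "c", "c", "d", "e"]
for train, test in group_kfold_shuffle(groups, n_splits=3, shuffle=True, random_state=42):
    print("train:", train, "test:", test)
```

Passing a fixed `random_state` makes the fold assignment reproducible, while different seeds give different group-to-fold assignments.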

get_scaffold(smi)

Generate the Bemis-Murcko scaffold for a given molecule.

Parameters:

smi (Union[str, Mol]) – A SMILES string or an RDKit molecule object representing the molecule for which to generate the scaffold.

Return type:

str

Returns:

A SMILES string representing the Bemis-Murcko scaffold of the input molecule. If the scaffold cannot be generated, the input SMILES string is returned.

get_random_clusters(smiles_list)

Generate a list of consecutive integers starting at 0, one per entry in the input list.

Parameters:

smiles_list (List[str]) – A list of SMILES strings.

Return type:

List[int]

Returns:

A list of integers from 0 to len(smiles_list) - 1.

get_butina_clusters(smiles_list, cutoff=0.65)

Cluster a list of SMILES strings using the Butina clustering algorithm.

Parameters:
  • smiles_list (List[str]) – List of SMILES strings

  • cutoff (float) – The cutoff value to use for clustering. Default is 0.65.

Return type:

List[int]

Returns:

List of cluster labels corresponding to each SMILES string in the input list.
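The following is a simplified sketch of Butina-style clustering, included to show how a cutoff turns pairwise distances into cluster labels. Several things here are assumptions rather than facts about this function: fingerprints are represented as toy sets of integers, the cutoff is treated as a Tanimoto distance threshold, and the names `tanimoto_distance` and `butina_labels` are hypothetical. Production implementations such as RDKit's include refinements that this sketch omits.

```python
def tanimoto_distance(a, b):
    """Return 1 - |a & b| / |a | b| for two sets of 'on' bits."""
    union = len(a | b)
    if union == 0:
        return 0.0
    return 1.0 - len(a & b) / union


def butina_labels(fingerprints, cutoff=0.65):
    """Assign one integer cluster label per fingerprint (simplified)."""
    n = len(fingerprints)
    # Each item counts as its own neighbour.
    neighbors = [{i} for i in range(n)]
    for i in range(n):
        for j in range(i):
            if tanimoto_distance(fingerprints[i], fingerprints[j]) <= cutoff:
                neighbors[i].add(j)
                neighbors[j].add(i)
    # Visit candidate centroids from most to fewest neighbours.
    order = sorted(range(n), key=lambda i: len(neighbors[i]), reverse=True)
    labels = [-1] * n
    next_label = 0
    for centre in order:
        if labels[centre] != -1:
            continue  # already absorbed into an earlier cluster
        for member in neighbors[centre]:
            if labels[member] == -1:
                labels[member] = next_label
        next_label += 1
    return labels


# Toy fingerprints: each set lists the indices of the bits that are set.
fps = [{1, 2, 3}, {1, 2, 3, 4}, {1, 2, 4}, {10, 11}, {10, 11, 12}, {20}]
print(butina_labels(fps, cutoff=0.65))
```

Under this reading, a smaller cutoff demands closer neighbours and so tends to produce more, tighter clusters.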

get_bemis_murcko_clusters(smiles_list)

Cluster a list of SMILES strings based on their Bemis-Murcko scaffolds.

Parameters:

smiles_list (List[str]) – List of SMILES strings

Return type:

ndarray

Returns:

Array of cluster labels corresponding to each SMILES string in the input list.
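Scaffold-based clustering amounts to giving every distinct scaffold its own integer label. The sketch below shows that mapping on scaffold strings assumed to be precomputed, so it does not call RDKit. The helper name `labels_from_scaffolds` is hypothetical, it returns a plain list rather than an ndarray, and the actual function may number its labels differently.

```python
def labels_from_scaffolds(scaffolds):
    """Map each scaffold string to an integer label, in order of first appearance."""
    label_of = {}
    labels = []
    for scaffold in scaffolds:
        # A scaffold seen for the first time gets the next unused label.
        labels.append(label_of.setdefault(scaffold, len(label_of)))
    return labels


# Scaffold SMILES assumed to have been computed beforehand.
scaffolds = ["c1ccccc1", "c1ccncc1", "c1ccccc1", "c1ccncc1", "C1CCCCC1"]
print(labels_from_scaffolds(scaffolds))
```

Molecules that share a scaffold receive the same label, which is what lets a group-aware splitter keep them in the same fold.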

get_kmeans_clusters(smiles_list, n_clusters=10)

Cluster a list of SMILES strings using the KMeans clustering algorithm.

Parameters:
  • smiles_list (List[str]) – List of SMILES strings

  • n_clusters (int) – The number of clusters to use for clustering. Default is 10.

Return type:

ndarray

Returns:

Array of cluster labels corresponding to each SMILES string in the input list.

cross_validate(df, model_list, y_col, group_list, n_outer=5, n_inner=5)

Perform cross-validation on a dataset using multiple models and grouping strategies.

Parameters:
  • df (DataFrame) – The input dataframe containing the data.

  • model_list (List[Tuple[str, Callable[[str], object]]]) – A list of tuples where each tuple contains a model name and a callable that returns a model instance.

  • y_col (str) – The name of the target column.

  • group_list (List[Tuple[str, Callable[[Series], Series]]]) – A list of tuples where each tuple contains a group name and a callable that assigns groups based on the SMILES column.

  • n_outer (int) – The number of outer folds for cross-validation. Default is 5.

  • n_inner (int) – The number of inner folds for cross-validation. Default is 5.

Return type:

List[dict]

Returns:

The metric values for each fold, model, and group.
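To make the shapes of `model_list` and `group_list` concrete, here is a self-contained sketch of one plausible loop structure. It rests on several assumptions: `n_outer` is treated as the number of repeated cycles and `n_inner` as the number of group-aware folds per cycle, rows are plain dicts with a `"SMILES"` key instead of a DataFrame, the metric is mean absolute error, and `MeanBaseline`, `group_folds`, and `cross_validate_sketch` are hypothetical names. None of this is taken from the function's source.

```python
import random
from statistics import mean


class MeanBaseline:
    """Hypothetical stand-in model that always predicts the training mean."""

    def __init__(self, y_col):
        self.y_col = y_col
        self.mean_ = 0.0

    def fit(self, rows):
        self.mean_ = mean(r[self.y_col] for r in rows)

    def predict(self, rows):
        return [self.mean_ for _ in rows]


def group_folds(groups, n_splits, rng):
    """Yield (train, test) index lists; a group never spans both sides."""
    unique = sorted(set(groups))
    rng.shuffle(unique)
    for k in range(n_splits):
        held_out = set(unique[k::n_splits])
        test = [i for i, g in enumerate(groups) if g in held_out]
        train = [i for i, g in enumerate(groups) if g not in held_out]
        yield train, test


def cross_validate_sketch(rows, model_list, y_col, group_list,
                          n_outer=5, n_inner=5, seed=0):
    rng = random.Random(seed)
    smiles = [r["SMILES"] for r in rows]  # column name is an assumption
    records = []
    for cycle in range(n_outer):
        for group_name, group_fn in group_list:
            groups = group_fn(smiles)
            for fold, (train, test) in enumerate(group_folds(groups, n_inner, rng)):
                for model_name, model_factory in model_list:
                    model = model_factory(y_col)
                    model.fit([rows[i] for i in train])
                    preds = model.predict([rows[i] for i in test])
                    truth = [rows[i][y_col] for i in test]
                    mae = mean(abs(p - t) for p, t in zip(preds, truth))
                    records.append({"cv_cycle": cycle, "group": group_name,
                                    "fold": fold, "model": model_name, "mae": mae})
    return records


rows = [{"SMILES": s, "y": float(i)}
        for i, s in enumerate(["C", "CC", "CCC", "CCCC", "CCO", "CCN"])]
model_list = [("mean_baseline", MeanBaseline)]
group_list = [("random", lambda smiles: list(range(len(smiles))))]
for record in cross_validate_sketch(rows, model_list, "y", group_list, n_outer=2, n_inner=3):
    print(record)
```

Each record carries the cycle, grouping strategy, fold, and model name alongside the metric, so results can be aggregated per model or per splitting strategy afterwards.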