Cross-validation and data splitting

class GroupKFoldShuffle(n_splits=5, *, shuffle=False, random_state=None)

split(X, y=None, groups=None)

Generate indices to split data into training and test sets.

Parameters:

X (array-like of shape (n_samples, n_features)) – Training data, where n_samples is the number of samples and n_features is the number of features.
y (array-like of shape (n_samples,), default=None) – The target variable for supervised learning problems.
groups (array-like of shape (n_samples,), default=None) – Group labels for the samples, used while splitting the dataset into training and test sets.
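To make the idea of a shuffled, group-aware split concrete, here is a minimal pure-Python sketch. It is not the class's implementation: the function name `group_kfold_shuffle_indices` is hypothetical, and dealing groups round-robin into folds is a simplifying assumption (a real splitter may balance folds by group size instead). The property it illustrates is that all samples sharing a group label land on the same side of every split.

```python
import random
from typing import Hashable, Iterator, List, Optional, Sequence, Tuple


def group_kfold_shuffle_indices(
    groups: Sequence[Hashable],
    n_splits: int = 5,
    shuffle: bool = False,
    random_state: Optional[int] = None,
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, test_indices) pairs that never split a group."""
    # Unique group labels in first-seen order.
    unique_groups = list(dict.fromkeys(groups))
    if n_splits > len(unique_groups):
        raise ValueError("n_splits cannot exceed the number of distinct groups")
    if shuffle:
        # Shuffle the groups, not the samples, so groups stay intact.
        random.Random(random_state).shuffle(unique_groups)
    # Assumption: assign groups to folds round-robin.
    fold_of_group = {g: i % n_splits for i, g in enumerate(unique_groups)}
    for fold in range(n_splits):
        train = [i for i, g in enumerate(groups) if fold_of_group[g] != fold]
        test = [i for i, g in enumerate(groups) if fold_of_group[g] == fold]
        yield train, test


if __name__ == "__main__":
    demo_groups = ["a", "a", "b", "b", "c", "d", "d", "e"]
    for train_idx, test_idx in group_kfold_shuffle_indices(
        demo_groups, n_splits=3, shuffle=True, random_state=0
    ):
        print("train:", train_idx, "test:", test_idx)
```

With `shuffle=False` the fold assignment depends only on the order in which groups first appear; with `shuffle=True` a fixed `random_state` makes the assignment reproducible.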

get_scaffold(smi)

Generate the Bemis-Murcko scaffold for a given molecule.

Parameters: smi (Union[str, Mol]) – A SMILES string or an RDKit molecule object representing the molecule for which to generate the scaffold.
Return type: str
Returns: A SMILES string representing the Bemis-Murcko scaffold of the input molecule. If the scaffold cannot be generated, the input SMILES string is returned.

get_random_clusters(smiles_list)

Generate a list of consecutive integers starting at 0, one per entry in the input list, so that each molecule is assigned to its own group.

get_butina_clusters(smiles_list, cutoff=0.65)

Cluster a list of SMILES strings using the Butina clustering algorithm.

Parameters:

Return type: List[int]
Returns: List of cluster labels corresponding to each SMILES string in the input list.
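For readers unfamiliar with Butina clustering, the following pure-Python sketch shows the general idea. It is not this function's implementation: molecules are represented by hypothetical sets of integer feature IDs standing in for fingerprints, the name `butina_labels_sketch` is made up for illustration, and `cutoff` is assumed here to be a Tanimoto distance threshold.

```python
from typing import List, Set


def tanimoto_distance(a: Set[int], b: Set[int]) -> float:
    """Return 1 - |a & b| / |a | b| (0.0 for two empty sets)."""
    union = len(a | b)
    if union == 0:
        return 0.0
    return 1.0 - len(a & b) / union


def butina_labels_sketch(features: List[Set[int]], cutoff: float = 0.65) -> List[int]:
    """Assign one cluster label per item using a Butina-style procedure."""
    n = len(features)
    # Step 1: for every item, list the other items within the distance cutoff.
    neighbors: List[List[int]] = [[] for _ in range(n)]
    for i in range(n):
        for j in range(i):
            if tanimoto_distance(features[i], features[j]) <= cutoff:
                neighbors[i].append(j)
                neighbors[j].append(i)
    # Step 2: visit items from most to fewest neighbors.
    order = sorted(range(n), key=lambda i: len(neighbors[i]), reverse=True)
    labels = [-1] * n
    next_label = 0
    for centroid in order:
        if labels[centroid] != -1:
            continue  # already absorbed into an earlier cluster
        # Step 3: the item becomes a centroid and claims its unassigned neighbors.
        labels[centroid] = next_label
        for j in neighbors[centroid]:
            if labels[j] == -1:
                labels[j] = next_label
        next_label += 1
    return labels


if __name__ == "__main__":
    toy = [{1, 2, 3, 4}, {1, 2, 3, 5}, {1, 2, 3, 6}, {10, 11, 12}, {10, 11, 13}]
    print(butina_labels_sketch(toy, cutoff=0.65))  # [0, 0, 0, 1, 1]
```

Under this assumption, a smaller cutoff yields more and tighter clusters; at the extreme, every item becomes a singleton.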

get_bemis_murcko_clusters(smiles_list)

Cluster a list of SMILES strings based on their Bemis-Murcko scaffolds.

Parameters: smiles_list (List[str]) – List of SMILES strings
Return type: ndarray
Returns: Array of cluster labels corresponding to each SMILES string in the input list.
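The labeling step of scaffold-based clustering can be sketched in a few lines. This sketch assumes the scaffold SMILES have already been computed (for example with get_scaffold) and only shows how identical scaffolds might be mapped to a shared integer label. The helper name `labels_from_scaffolds` is hypothetical, and it returns a plain list rather than an ndarray.

```python
from typing import Dict, List


def labels_from_scaffolds(scaffolds: List[str]) -> List[int]:
    """Give every distinct scaffold string its own integer label."""
    label_of: Dict[str, int] = {}
    labels: List[int] = []
    for scaffold in scaffolds:
        if scaffold not in label_of:
            # First time this scaffold is seen: allocate the next label.
            label_of[scaffold] = len(label_of)
        labels.append(label_of[scaffold])
    return labels


if __name__ == "__main__":
    # Toy scaffold strings: benzene, pyridine, benzene.
    print(labels_from_scaffolds(["c1ccccc1", "c1ccncc1", "c1ccccc1"]))  # [0, 1, 0]
```

Labels produced this way can be passed as `groups` to a group-aware splitter so that molecules sharing a scaffold never appear in both the training and test sets.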

get_kmeans_clusters(smiles_list, n_clusters=10)

Cluster a list of SMILES strings using the KMeans clustering algorithm.

Parameters:

Return type: ndarray
Returns: Array of cluster labels corresponding to each SMILES string in the input list.

cross_validate(df, model_list, y_col, group_list, n_outer=5, n_inner=5)

Perform cross-validation on a dataset using multiple models and grouping strategies.

Parameters:

df (DataFrame) – The input dataframe containing the data.
model_list (List[Tuple[str, Callable[[str], object]]]) – A list of tuples where each tuple contains a model name and a callable that returns a model instance.
y_col (str) – The name of the target column.
group_list (List[Tuple[str, Callable[[Series], Series]]]) – A list of tuples where each tuple contains a group name and a callable that assigns groups based on the SMILES column.
n_outer (int) – The number of outer folds for cross-validation. Default is 5.
n_inner (int) – The number of inner folds for cross-validation. Default is 5.

Return type: List[dict]
Returns: The metric values for each fold, model, and group.
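The shapes of `model_list` and `group_list` are easier to see in a small example. The sketch below only builds the two lists and calls each callable once; it does not call cross_validate itself. Everything in it is a stand-in: `MeanBaseline`, `one_group_per_molecule`, `group_by_length`, and the column name `"pIC50"` are hypothetical, plain lists replace pandas Series, and treating the string passed to the model callable as the target column name is an assumption.

```python
from typing import Callable, List, Tuple


class MeanBaseline:
    """Hypothetical stand-in model that records the column it should predict."""

    def __init__(self, y_col: str):
        self.y_col = y_col


def one_group_per_molecule(smiles: List[str]) -> List[int]:
    """Toy grouping: every molecule is its own group (a purely random split)."""
    return list(range(len(smiles)))


def group_by_length(smiles: List[str]) -> List[int]:
    """Toy grouping: molecules whose SMILES have equal length share a group."""
    return [len(s) for s in smiles]


# Each entry pairs a display name with a callable, as the parameter docs describe.
model_list: List[Tuple[str, Callable[[str], object]]] = [
    ("mean_baseline", MeanBaseline),
]
group_list: List[Tuple[str, Callable[[List[str]], List[int]]]] = [
    ("random", one_group_per_molecule),
    ("length", group_by_length),
]

smiles = ["CCO", "c1ccccc1", "CCN"]
for group_name, group_fn in group_list:
    groups = group_fn(smiles)  # one label per SMILES string
    for model_name, model_factory in model_list:
        model = model_factory("pIC50")  # assumption: the argument is y_col
        print(group_name, model_name, groups, type(model).__name__)
```

Passing callables rather than ready-made objects lets a fresh model be built and fresh groups be computed for each combination being evaluated.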