Cross-validation and data splitting
- class GroupKFoldShuffle(n_splits=5, *, shuffle=False, random_state=None)
- split(X, y=None, groups=None)
Generate indices to split data into training and test set.
Parameters
- X: array-like of shape (n_samples, n_features)
Training data, where n_samples is the number of samples and n_features is the number of features.
- y: array-like of shape (n_samples,), default=None
The target variable for supervised learning problems.
- groups: array-like of shape (n_samples,), default=None
Group labels for the samples used while splitting the dataset into train/test set.
Yields
- train: ndarray
The training set indices for that split.
- test: ndarray
The testing set indices for that split.
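The splitting contract described above (each group label lands entirely in either the training or the test indices, with an optional seeded shuffle of the groups) can be illustrated with a minimal standard-library sketch. This is not the library's implementation: the function name `group_kfold_shuffle_split` and the round-robin assignment of groups to folds are assumptions made for illustration only.

```python
import random
from typing import Hashable, Iterator, List, Optional, Sequence, Tuple


def group_kfold_shuffle_split(
    groups: Sequence[Hashable],
    n_splits: int = 5,
    shuffle: bool = False,
    random_state: Optional[int] = None,
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train, test) index lists; no group appears on both sides."""
    # Unique group labels in first-seen order (dicts preserve insertion order).
    unique_groups = list(dict.fromkeys(groups))
    if n_splits < 2 or n_splits > len(unique_groups):
        raise ValueError("n_splits must be between 2 and the number of groups")
    if shuffle:
        # A seeded local RNG keeps the shuffle reproducible.
        random.Random(random_state).shuffle(unique_groups)
    # Assumption: deal the groups round-robin into n_splits folds.
    fold_of_group = {g: i % n_splits for i, g in enumerate(unique_groups)}
    for fold in range(n_splits):
        test = [i for i, g in enumerate(groups) if fold_of_group[g] == fold]
        train = [i for i, g in enumerate(groups) if fold_of_group[g] != fold]
        yield train, test


groups = ["a", "a", "b", "b", "c", "c", "d", "e"]
splits = list(
    group_kfold_shuffle_split(groups, n_splits=3, shuffle=True, random_state=0)
)
```

Every sample index appears in exactly one test set across the three splits, and samples sharing a group label are never separated.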
- get_scaffold(smi)
Generate the Bemis-Murcko scaffold for a given molecule.
- Parameters:
smi (Union[str, Mol]) – A SMILES string or an RDKit molecule object representing the molecule for which to generate the scaffold.
- Return type:
- Returns:
A SMILES string representing the Bemis-Murcko scaffold of the input molecule. If the scaffold cannot be generated, the input SMILES string is returned.
- get_random_clusters(smiles_list)
Assign each molecule its own cluster label by generating the integers from 0 up to, but not including, the length of the input list.
- get_butina_clusters(smiles_list, cutoff=0.65)
Cluster a list of SMILES strings using the Butina clustering algorithm.
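For readers unfamiliar with Butina clustering, the following is a minimal pure-Python sketch of the basic algorithm. It is not the library's code, which presumably relies on RDKit fingerprints. Several things here are assumptions: fingerprints are represented as sets of "on" bit indices, `cutoff` is treated as a Tanimoto distance threshold (the source does not say whether 0.65 is a distance or a similarity), and the names `tanimoto_distance` and `butina_cluster_labels` are hypothetical.

```python
from typing import FrozenSet, List, Sequence


def tanimoto_distance(a: FrozenSet[int], b: FrozenSet[int]) -> float:
    """Return 1 - |a & b| / |a | b| for two sets of 'on' fingerprint bits."""
    union = len(a | b)
    return 0.0 if union == 0 else 1.0 - len(a & b) / union


def butina_cluster_labels(
    fps: Sequence[FrozenSet[int]], cutoff: float = 0.65
) -> List[int]:
    """Return one integer cluster label per fingerprint."""
    n = len(fps)
    # Neighbour list: every item within `cutoff` distance (including itself).
    neighbours = [
        [j for j in range(n) if tanimoto_distance(fps[i], fps[j]) <= cutoff]
        for i in range(n)
    ]
    # Visit candidate centres from most to fewest neighbours.
    # sorted() is stable, so ties keep their original index order.
    order = sorted(range(n), key=lambda i: -len(neighbours[i]))
    labels = [-1] * n
    next_label = 0
    for centre in order:
        if labels[centre] != -1:
            continue  # already absorbed by an earlier cluster
        for j in neighbours[centre]:
            if labels[j] == -1:
                labels[j] = next_label
        next_label += 1
    return labels


# Toy fingerprints (hypothetical bit sets, not derived from real molecules).
fps = [
    frozenset({1, 2, 3, 4}),
    frozenset({1, 2, 3, 5}),
    frozenset({1, 2, 3, 4, 5}),
    frozenset({10, 11, 12}),
    frozenset({10, 11, 13}),
    frozenset({20}),
]
labels = butina_cluster_labels(fps, cutoff=0.65)
# labels == [0, 0, 0, 1, 1, 2]
```

The resulting labels can serve as the `groups` argument of a group-aware splitter, so that similar molecules stay on the same side of each split.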
- get_bemis_murcko_clusters(smiles_list)
Cluster a list of SMILES strings based on their Bemis-Murcko scaffolds.
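One way to turn scaffolds into cluster labels is to give every distinct scaffold string its own integer. The sketch below assumes the scaffold SMILES have already been computed (for example with get_scaffold) and uses a hypothetical helper name, `labels_from_scaffolds`. Whether the library numbers its clusters in first-seen order is not stated in the source.

```python
from typing import Dict, List, Sequence


def labels_from_scaffolds(scaffolds: Sequence[str]) -> List[int]:
    """Map each scaffold string to an integer label in first-seen order."""
    label_of: Dict[str, int] = {}
    # setdefault stores the next unused integer the first time a scaffold appears
    # and returns the stored label on every later occurrence.
    return [label_of.setdefault(s, len(label_of)) for s in scaffolds]


# Hypothetical precomputed scaffolds: benzene, pyridine, cyclohexane.
scaffolds = ["c1ccccc1", "c1ccncc1", "c1ccccc1", "C1CCCCC1", "c1ccncc1"]
cluster_labels = labels_from_scaffolds(scaffolds)
# cluster_labels == [0, 1, 0, 2, 1]
```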
- get_kmeans_clusters(smiles_list, n_clusters=10)
Cluster a list of SMILES strings using the KMeans clustering algorithm.
- cross_validate(df, model_list, y_col, group_list, n_outer=5, n_inner=5)
Perform cross-validation on a dataset using multiple models and grouping strategies.
- Parameters:
df (DataFrame) – The input dataframe containing the data.
model_list (List[Tuple[str, Callable[[str], object]]]) – A list of tuples where each tuple contains a model name and a callable that returns a model instance.
y_col (str) – The name of the target column.
group_list (List[Tuple[str, Callable[[Series], Series]]]) – A list of tuples where each tuple contains a group name and a callable that assigns groups based on the SMILES column.
n_outer (int) – The number of outer folds for cross-validation. Default is 5.
n_inner (int) – The number of inner folds for cross-validation. Default is 5.
- Return type:
- Returns:
A dataframe containing the metric values for each fold, model, and group.
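The shapes of `model_list` and `group_list` are easier to see in a small example. The sketch below mimics the loop structure with plain Python lists and dicts in place of pandas objects, so it is a stand-in rather than the library's behaviour. Assumptions: `n_outer` is read as the number of reshuffled repeats and `n_inner` as the number of folds per repeat; the column name `"SMILES"`, the `MeanModel` class with its `fit`/`predict` interface, the result keys, and the helper names are all hypothetical.

```python
import random
from statistics import mean
from typing import Dict, List, Sequence


class MeanModel:
    """Toy regressor that always predicts the training-set mean (hypothetical)."""

    def fit(self, y_train: Sequence[float]) -> None:
        self.mean_ = mean(y_train)

    def predict(self, n: int) -> List[float]:
        return [self.mean_] * n


def group_folds(groups: Sequence[int], n_splits: int, seed: int):
    """Yield (train, test) index lists that keep each group on one side."""
    unique = sorted(set(groups))
    random.Random(seed).shuffle(unique)
    fold_of = {g: i % n_splits for i, g in enumerate(unique)}
    for fold in range(n_splits):
        test = [i for i, g in enumerate(groups) if fold_of[g] == fold]
        train = [i for i, g in enumerate(groups) if fold_of[g] != fold]
        yield train, test


def cross_validate_sketch(rows, model_list, y_col, group_list, n_outer=5, n_inner=5):
    """Return one result dict per (repeat, group strategy, fold, model)."""
    results: List[Dict[str, object]] = []
    smiles = [r["SMILES"] for r in rows]
    for outer in range(n_outer):  # assumption: outer loop = reshuffled repeats
        for group_name, group_func in group_list:
            groups = group_func(smiles)
            folds = group_folds(groups, n_inner, seed=outer)
            for fold, (train, test) in enumerate(folds):
                for model_name, make_model in model_list:
                    model = make_model(y_col)  # the callable receives a str
                    model.fit([rows[i][y_col] for i in train])
                    preds = model.predict(len(test))
                    mae = mean(abs(p - rows[i][y_col]) for p, i in zip(preds, test))
                    results.append({"cv_cycle": outer, "group": group_name,
                                    "fold": fold, "model": model_name, "mae": mae})
    return results


smiles_list = ["C", "CC", "CCC", "CCCC", "CCCCC", "CCCCCC"]
rows = [{"SMILES": s, "logS": float(i)} for i, s in enumerate(smiles_list)]
model_list = [("mean", lambda y_col: MeanModel())]
group_list = [("random", lambda smiles: list(range(len(smiles))))]
results = cross_validate_sketch(rows, model_list, "logS", group_list, n_outer=2, n_inner=3)
```

With two repeats, one grouping strategy, three folds, and one model, `results` holds six rows, one per fold, model, and group, which mirrors the long-format output the function is documented to return.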