Metadata-Version: 2.4
Name: sfcv
Version: 0.1.2
Summary: Step Forward Cross Validation for Bioactivity Prediction
Author: Manas Mahale
Author-email: manas.m.mahale@gmail.com
License: Apache 2.0
Requires-Python: >=3.11
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: anyio==4.8.0
Requires-Dist: appnope==0.1.4
Requires-Dist: argon2-cffi==23.1.0
Requires-Dist: argon2-cffi-bindings==21.2.0
Requires-Dist: arrow==1.3.0
Requires-Dist: asttokens==3.0.0
Requires-Dist: async-lru==2.0.4
Requires-Dist: attrs==25.1.0
Requires-Dist: babel==2.17.0
Requires-Dist: beautifulsoup4==4.13.3
Requires-Dist: black==25.1.0
Requires-Dist: bleach==6.2.0
Requires-Dist: bokeh==3.6.3
Requires-Dist: certifi==2025.1.31
Requires-Dist: cffi==1.17.1
Requires-Dist: charset-normalizer==3.4.1
Requires-Dist: chemplot==1.3.1
Requires-Dist: click==8.1.8
Requires-Dist: comm==0.2.2
Requires-Dist: contourpy==1.3.1
Requires-Dist: cycler==0.12.1
Requires-Dist: debugpy==1.8.12
Requires-Dist: decorator==5.1.1
Requires-Dist: deepchem==2.8.1.dev20250210221623
Requires-Dist: defusedxml==0.7.1
Requires-Dist: executing==2.2.0
Requires-Dist: fastjsonschema==2.21.1
Requires-Dist: fonttools==4.56.0
Requires-Dist: fqdn==1.5.1
Requires-Dist: h11==0.14.0
Requires-Dist: httpcore==1.0.7
Requires-Dist: httpx==0.28.1
Requires-Dist: idna==3.10
Requires-Dist: iniconfig==2.0.0
Requires-Dist: ipykernel==6.29.5
Requires-Dist: ipython==8.32.0
Requires-Dist: ipython-genutils==0.2.0
Requires-Dist: ipywidgets==8.1.5
Requires-Dist: isoduration==20.11.0
Requires-Dist: jedi==0.19.2
Requires-Dist: Jinja2==3.1.5
Requires-Dist: joblib==1.4.2
Requires-Dist: joypy==0.2.6
Requires-Dist: json5==0.10.0
Requires-Dist: jsonpointer==3.0.0
Requires-Dist: jsonschema==4.23.0
Requires-Dist: jsonschema-specifications==2024.10.1
Requires-Dist: jupyter==1.1.1
Requires-Dist: jupyter-console==6.6.3
Requires-Dist: jupyter-events==0.12.0
Requires-Dist: jupyter-highlight-selected-word==0.2.0
Requires-Dist: jupyter-lsp==2.2.5
Requires-Dist: jupyter_client==8.6.3
Requires-Dist: jupyter_contrib_core==0.4.2
Requires-Dist: jupyter_core==5.7.2
Requires-Dist: jupyter_nbextensions_configurator==0.6.4
Requires-Dist: jupyter_server==2.15.0
Requires-Dist: jupyter_server_terminals==0.5.3
Requires-Dist: jupyterlab==4.3.5
Requires-Dist: jupyterlab_pygments==0.3.0
Requires-Dist: jupyterlab_server==2.27.3
Requires-Dist: jupyterlab_widgets==3.0.13
Requires-Dist: kiwisolver==1.4.8
Requires-Dist: lightgbm==4.6.0
Requires-Dist: llvmlite==0.44.0
Requires-Dist: lxml==5.3.1
Requires-Dist: MarkupSafe==3.0.2
Requires-Dist: matplotlib==3.10.0
Requires-Dist: matplotlib-inline==0.1.7
Requires-Dist: mistune==3.1.1
Requires-Dist: MolVS==0.1.1
Requires-Dist: mordredcommunity==2.0.6
Requires-Dist: mpmath==1.3.0
Requires-Dist: mypy-extensions==1.0.0
Requires-Dist: narwhals==1.26.0
Requires-Dist: nbclient==0.10.2
Requires-Dist: nbconvert==7.16.6
Requires-Dist: nbformat==5.10.4
Requires-Dist: nest-asyncio==1.6.0
Requires-Dist: networkx==3.4.2
Requires-Dist: notebook==7.3.2
Requires-Dist: notebook_shim==0.2.4
Requires-Dist: numba==0.61.0
Requires-Dist: numpy==1.26.4
Requires-Dist: overrides==7.7.0
Requires-Dist: packaging==24.2
Requires-Dist: pandas==2.2.3
Requires-Dist: pandocfilters==1.5.1
Requires-Dist: parso==0.8.4
Requires-Dist: pathspec==0.12.1
Requires-Dist: patsy==1.0.1
Requires-Dist: pexpect==4.9.0
Requires-Dist: pillow==11.1.0
Requires-Dist: platformdirs==4.3.6
Requires-Dist: plotly==6.0.0
Requires-Dist: pluggy==1.5.0
Requires-Dist: prometheus_client==0.21.1
Requires-Dist: prompt_toolkit==3.0.50
Requires-Dist: psutil==7.0.0
Requires-Dist: ptyprocess==0.7.0
Requires-Dist: pure_eval==0.2.3
Requires-Dist: pycparser==2.22
Requires-Dist: Pygments==2.19.1
Requires-Dist: pynndescent==0.5.13
Requires-Dist: pyparsing==3.2.1
Requires-Dist: pytest==8.3.4
Requires-Dist: python-dateutil==2.9.0.post0
Requires-Dist: python-json-logger==3.2.1
Requires-Dist: pytz==2025.1
Requires-Dist: PyYAML==6.0.2
Requires-Dist: pyzmq==26.2.1
Requires-Dist: rdkit==2024.9.5
Requires-Dist: referencing==0.36.2
Requires-Dist: requests==2.32.3
Requires-Dist: rfc3339-validator==0.1.4
Requires-Dist: rfc3986-validator==0.1.1
Requires-Dist: rpds-py==0.22.3
Requires-Dist: scikit-learn==1.6.1
Requires-Dist: scikit-posthocs==0.11.2
Requires-Dist: scipy==1.15.1
Requires-Dist: seaborn==0.13.2
Requires-Dist: Send2Trash==1.8.3
Requires-Dist: six==1.17.0
Requires-Dist: sniffio==1.3.1
Requires-Dist: soupsieve==2.6
Requires-Dist: stack-data==0.6.3
Requires-Dist: statannotations==0.7.1
Requires-Dist: statsmodels==0.14.4
Requires-Dist: sympy==1.13.3
Requires-Dist: terminado==0.18.1
Requires-Dist: threadpoolctl==3.5.0
Requires-Dist: tinycss2==1.4.0
Requires-Dist: tornado==6.4.2
Requires-Dist: tqdm==4.67.1
Requires-Dist: tqdm_joblib==0.0.4
Requires-Dist: traitlets==5.14.3
Requires-Dist: types-python-dateutil==2.9.0.20241206
Requires-Dist: typing_extensions==4.12.2
Requires-Dist: tzdata==2025.1
Requires-Dist: umap==0.1.1
Requires-Dist: umap-learn==0.5.7
Requires-Dist: uri-template==1.3.0
Requires-Dist: urllib3==2.3.0
Requires-Dist: wcwidth==0.2.13
Requires-Dist: webcolors==24.11.1
Requires-Dist: webencodings==0.5.1
Requires-Dist: websocket-client==1.8.0
Requires-Dist: widgetsnbextension==4.0.13
Requires-Dist: xgboost==2.1.4
Requires-Dist: xyzservices==2025.1.0
Dynamic: author
Dynamic: author-email
Dynamic: description
Dynamic: description-content-type
Dynamic: license
Dynamic: license-file
Dynamic: requires-dist
Dynamic: requires-python
Dynamic: summary

# Step Forward Cross Validation for Bioactivity Prediction

This repository contains the code to reproduce the results of the [SFCV Paper](),
including model predictions, tables, and figures.
Efforts have been made to keep this project reproducible.
If you run into undefined behaviour or errors while installing or benchmarking, please open an issue.

![Human Coded](https://img.shields.io/badge/Coded_By-Human-00bfff)

## Install via PyPI

```shell
pip install sfcv
```

or

```shell
pip install git+https://github.com/Manas02/sfcv.git@main
```

## Environment Setup

This project uses [venv](https://docs.python.org/3/library/venv.html) to manage its Python
environment with `Python 3.11`. The following command creates a virtual environment in the `.venv` directory.

### Create Venv

```shell
python3.11 -m venv .venv
```

### Install Requirements

```shell
source .venv/bin/activate
pip install -r requirements.txt
```

---
## Dataset

Landrum & Riniker [[Paper](https://pubs.acs.org/doi/10.1021/acs.jcim.4c00049) | [Data](https://github.com/rinikerlab/overlapping_assays/tree/main/datasets/source_data)]

### 1. Download Datasets and Standardize SMILES

Open and run [00_Data_source_and_standardize.ipynb](https://github.com/Manas02/sfcv/blob/main/notebook/00_Data_source_and_standardize.ipynb)
to download the dataset above and standardize the SMILES in those files.

### 2. Predicting LogP, LogD and Computing MCE-18

Next, run [01_Data_add_LogP_LogD_MCE18.ipynb](https://github.com/Manas02/sfcv/blob/main/notebook/01_Data_add_LogP_LogD_MCE18.ipynb)
to predict and add
CrippenLogP ([rdkit](https://www.rdkit.org/docs/GettingStartedInPython.html#descriptor-calculation)) and
LogD ([Code](https://gist.github.com/PatWalters/7aebcd5b87ceb466db91b11e07ce3d21)), and to
compute [MCE-18](https://pubs.acs.org/doi/abs/10.1021/acs.jmedchem.9b00004).

### 3. Comparing the Number of Compounds Before and After Standardization and Deduplication

Then run [02_Table_mol_per_target_before_after_standardization.ipynb](https://github.com/Manas02/sfcv/blob/main/notebook/02_Table_mol_per_target_before_after_standardization.ipynb)
to generate the table and parity plot. The results are saved in the `benchmark/results/tables` and
`benchmark/results/figures` directories.

### 4. Comparing and Plotting the Distributions of Properties in Dataset

Run [03_Plots_Table_target_properties.ipynb](https://github.com/Manas02/sfcv/blob/main/notebook/03_Plots_Table_target_properties.ipynb)
to get the summary of properties as a table and to plot the distributions.

---
# Method

## 1. Data Splitting

### 1. Implementing `SortedStepForwardCV` and `UnsortedStepForwardCV`

Run [04_Implementation_SFCV.ipynb](https://github.com/Manas02/sfcv/blob/main/notebook/04_Implementation_SFCV.ipynb)
to visualise how `SortedStepForwardCV` and `UnsortedStepForwardCV` work.
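
As a rough illustration of the general idea behind step-forward cross-validation, the sketch below orders items by a property, cuts them into contiguous bins, and lets each fold train on all earlier bins and test on the next one. This is an assumption about the generic technique, not the `sfcv` API: the helper name `step_forward_folds` and its parameters `values`, `n_bins`, and `sort` are hypothetical, and `sort=False` (keep the given order) only stands in for an unsorted variant. The notebook above remains the reference for the actual implementation.

```python
from typing import Iterator, Sequence


def step_forward_folds(
    values: Sequence[float], n_bins: int, sort: bool = True
) -> Iterator[tuple[list[int], list[int]]]:
    """Yield (train_idx, test_idx) pairs for an expanding-window split.

    Hypothetical helper, not part of sfcv. Fold i trains on bins 0..i-1
    and tests on bin i, so the training set grows with every fold.
    """
    if n_bins < 2:
        raise ValueError("n_bins must be at least 2")
    order = list(range(len(values)))
    if sort:
        # Order indices by the sorting property (e.g. a LogP-like value).
        order.sort(key=lambda i: values[i])
    n = len(order)
    # Integer bin boundaries; bin sizes differ by at most one.
    bounds = [b * n // n_bins for b in range(n_bins + 1)]
    for i in range(1, n_bins):
        train = order[: bounds[i]]
        test = order[bounds[i] : bounds[i + 1]]
        yield train, test


# Toy property values for six compounds (made up for illustration).
toy_values = [3.1, 0.5, 2.2, 1.7, 4.0, 0.9]
sorted_folds = list(step_forward_folds(toy_values, n_bins=3))
unsorted_folds = list(step_forward_folds(toy_values, n_bins=3, sort=False))
for train_idx, test_idx in sorted_folds:
    print("train:", train_idx, "test:", test_idx)
```

With three bins this yields two folds. In the sorted case the test compounds always have higher property values than anything seen in training, which is what makes the split "forward".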

### 2. Implementing `ScaffoldSplitCV`

Run [05_Implementation_ScaffoldSplitCV.ipynb](https://github.com/Manas02/sfcv/blob/main/notebook/05_Implementation_ScaffoldSplitCV.ipynb)
to check how ScaffoldSplitCV works. The algorithm groups molecules by their chemical scaffolds, shuffles these groups,
and sequentially assigns entire scaffold groups to the training set until a target fraction is reached, with the
remaining groups forming the test set.
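
The grouping, shuffling, and sequential assignment described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the `ScaffoldSplitCV` code: the function name `scaffold_group_split`, its parameters, and the toy scaffold labels are hypothetical, and scaffold keys are taken as precomputed strings (in practice they would come from a cheminformatics toolkit).

```python
import random
from collections import defaultdict
from typing import Hashable, Sequence


def scaffold_group_split(
    scaffolds: Sequence[Hashable], train_frac: float = 0.8, seed: int = 0
) -> tuple[list[int], list[int]]:
    """Split indices so that no scaffold ends up in both train and test.

    Hypothetical helper, not part of sfcv.
    """
    # Group molecule indices by their scaffold key.
    groups: dict[Hashable, list[int]] = defaultdict(list)
    for idx, scaffold in enumerate(scaffolds):
        groups[scaffold].append(idx)

    # Shuffle whole groups, never individual molecules.
    group_list = list(groups.values())
    random.Random(seed).shuffle(group_list)

    # Add entire groups to train until the target size is reached;
    # every remaining group goes to test.
    target = train_frac * len(scaffolds)
    train: list[int] = []
    test: list[int] = []
    for members in group_list:
        if len(train) < target:
            train.extend(members)
        else:
            test.extend(members)
    return train, test


# Toy scaffold labels for eight molecules (made up for illustration).
toy_scaffolds = ["A", "A", "B", "C", "C", "C", "D", "B"]
train_idx, test_idx = scaffold_group_split(toy_scaffolds, train_frac=0.6, seed=1)
print("train:", sorted(train_idx), "test:", sorted(test_idx))
```

Because whole groups are assigned at once, the training set can overshoot the target fraction by up to one group, so the realised train/test ratio is only approximately `train_frac`.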

### 3. Implementing `RandomSplitCV`

Run [06_Implementation_RandomSplitCV.ipynb](https://github.com/Manas02/sfcv/blob/main/notebook/06_Implementation_RandomSplitCV.ipynb)
to check how RandomSplitCV works.

### 4. Validating That the Splits Produce (Almost) Equal Numbers of Test Compounds per Fold

Run [07_Validate_train_test_split.ipynb](https://github.com/Manas02/sfcv/blob/main/notebook/07_Validate_train_test_split.ipynb)
to visualise the number of test-set molecules in each fold for every target.

### 5. Plotting Chemical Space by Split Type

Run [08_Plots_chemical_space_across_split.ipynb](https://github.com/Manas02/sfcv/blob/main/notebook/08_Plots_chemical_space_across_split.ipynb)
to visualise chemical space for each split type.

### 6. Plotting and Comparing the Distribution of Sorting Properties per Split Type and Fold Across Targets

Run [09_Plots_Table_split_properties.ipynb](https://github.com/Manas02/sfcv/blob/main/notebook/09_Plots_Table_split_properties.ipynb)
to visualise the distributions of sorting properties for each split type and fold, averaged over all targets.

---
## 2. Metrics

### 1. Implementing Discovery Yield

Run [10_Implimentation_Discovery_Yield.ipynb](https://github.com/Manas02/sfcv/blob/main/notebook/10_Implimentation_Discovery_Yield.ipynb)
to work through and visualise an illustrative example of discovery yield.

### 2. Implementing Novelty Error

Run [11_Implimentation_Novelty_Error.ipynb](https://github.com/Manas02/sfcv/blob/main/notebook/11_Implimentation_Novelty_Error.ipynb)
to work through and visualise an illustrative example of novelty error.

### 3. Implementing Benchmark

Run [12_Implementation_Benchmark.ipynb](https://github.com/Manas02/sfcv/blob/main/notebook/12_Implementation_Benchmark.ipynb)
to see how benchmarking was performed.

---

## 3. Results

### 1. Extract Results

Run [13_Table_extract_results.ipynb](https://github.com/Manas02/sfcv/blob/main/notebook/13_Table_extract_results.ipynb)
to extract the results into a digestible format.

### 2. Plot Results

Run [14_Plots_results.ipynb](https://github.com/Manas02/sfcv/blob/main/notebook/14_Plots_results.ipynb),
[15_Plots_Result_hERG.ipynb](https://github.com/Manas02/sfcv/blob/main/notebook/15_Plots_Result_hERG.ipynb),
[16_Plots_Result_MAPK.ipynb](https://github.com/Manas02/sfcv/blob/main/notebook/16_Plots_Result_MAPK.ipynb), and
[17_Plots_Result_VEGFR.ipynb](https://github.com/Manas02/sfcv/blob/main/notebook/17_Plots_Result_VEGFR.ipynb) to
visualise the results.
