Metadata-Version: 2.4
Name: spaed
Version: 1.0.5
Summary: A module for the segmentation of phage endolysin domains based on the PAE matrix from AlphaFold.
Project-URL: Homepage, https://github.com/Rousseau-Team/spaed.git
Author-email: Alexandre Boulay <alexandre.boulay.6@ulaval.ca>
License-File: LICENSE
Keywords: bacteriophage,bioinformatics,delineation,domain,phage,protein,segmentation
Classifier: License :: OSI Approved :: GNU General Public License v3 (GPLv3)
Classifier: Programming Language :: Python :: 3
Requires-Python: >3.10
Requires-Dist: numpy
Requires-Dist: pandas
Requires-Dist: scipy
Description-Content-Type: text/markdown

<p align="center">
  <img src="img/title.png" border="0"/>
  <h2 align="center">Segmentation of PhAge Endolysin Domains</h2>
</p>

SPAED identifies domains in phage endolysins. It takes as input the PAE (predicted aligned error) file(s) produced by AlphaFold and outputs a csv file with the domain delineations.

Additional scripts are provided to visualize predicted domains with PyMOL and to obtain their amino acid sequences. 

## Installation & usage

Check out [www.spaed.ca](https://spaed.ca) to launch SPAED quickly!

First create a virtual environment, then: 

**From PyPI**:
```
pip install spaed  ### note the spelling of spaed
```

ex. `spaed pae_path --output_file spaed_predictions.csv`


**From source**:
```
git clone https://github.com/Rousseau-Team/spaed.git

pip install numpy pandas scipy
```

ex. `python spaed/src/spaed/spaed.py pae_path`


## Advanced usage
Optional dependency for structure visualisation: PyMOL (`conda install -c conda-forge -c schrodinger pymol-bundle`). Python >3.10 is required; 3.12.9 is known to work.\
ex. (installed from pip): `pymol_vis pred_path pdb_path --output_folder pymol_output --output_type {pse|png|both}`\
ex. (installed from source): `python spaed/src/spaed/pymol_vis.py pred_path pdb_path --output_folder pymol_output --output_type {pse|png|both}`

**Positional arguments**:
- **pae_path** - Path to a single PAE file, or to a folder of PAE files, in json format as output by AlphaFold2/3 or ColabFold.
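
To inspect a PAE file before running SPAED, the matrix can be read with the standard library. The sketch below is not SPAED's own loader: the function name `load_pae_matrix` is hypothetical, and the key names `predicted_aligned_error` and `pae` (as well as the one-element list wrapper) are assumptions about common AlphaFold/ColabFold json layouts that should be checked against your own files.

```python
import json


def load_pae_matrix(source):
    """Return the PAE matrix as a list of rows.

    `source` is either a path to a json file or an already parsed json object.
    """
    if isinstance(source, str):
        with open(source) as handle:
            data = json.load(handle)
    else:
        data = source

    # Assumption: some files wrap the record in a one-element list.
    if isinstance(data, list):
        data = data[0]

    # Assumption: the matrix is stored under one of these keys.
    for key in ("predicted_aligned_error", "pae"):
        if key in data:
            matrix = data[key]
            break
    else:
        raise KeyError("no PAE matrix found under the expected keys")

    # A PAE matrix has one row and one column per residue.
    size = len(matrix)
    if any(len(row) != size for row in matrix):
        raise ValueError("PAE matrix is not square")
    return matrix
```

The number of rows returned is the protein length, which is the value the size-based parameters below are compared against.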


**Optional arguments**:
- **output_file** - File in which to save the table of segmented domains in csv format. (default spaed_predictions.csv)
- **fasta_path** - Path to a fasta file or a folder containing fasta files. If specified, spaed will save the sequences corresponding to predicted domains, linkers and disordered regions into new fasta files named "spaed_predicted_{seq_type}.faa" in the same output folder as output_file. Ensure fasta names or headers correspond to entries in the PAE files.
- **RATIO_NUM_CLUSTERS** - The maximum number of clusters initially generated by hierarchical clustering is len(protein) // RATIO_NUM_CLUSTERS. For a protein 400 residues long, at most 40 clusters will be generated. (default 10)
- **MIN_DOMAIN_SIZE** - Minimum size a domain can have. (default 30)
- **PAE_SCORE_CUTOFF** - Cutoff on the PAE score used to make adjustments to predicted domains/linkers/disordered regions. Residues with PAE score < PAE_SCORE_CUTOFF are considered close together. (default 4)
- **MIN_DISORDERED_SIZE** - Minimum size a terminal disordered region must have to be considered a separate entity from the domain it is next to. (default 20)
- **FREQ_DISORDERED** - For a given residue in the PAE matrix, the number of residues that can align to it with a low PAE score while it is still considered "not part of a domain". Values < MIN_DOMAIN_SIZE are logical; the higher the value, the more lenient the algorithm becomes towards non-domain regions (more will be predicted). (default 6)
- **PROP_DISORDERED** - Proportion of residues in a given region that must meet the FREQ_DISORDERED criterion for the region to be considered a terminal disordered region. The greater the value, the stricter the criterion for predicting the region as disordered. (default 80%)
- **FREQ_LINKER** - For a given residue in the PAE matrix, the number of residues that can align to it with a low PAE score while it is still considered part of the linker. Values < MIN_DOMAIN_SIZE are logical as they are less than the expected size of the nearest domain. Increasing the value leads to a more lenient assignment of residues to the linker. (default 20)
- **version** - Display installed SPAED version number.
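
To make the parameters above more concrete, here is a minimal sketch of the arithmetic they describe. It is an illustration only, not SPAED's implementation: the function names are hypothetical, and the choices to count along rows, to exclude the residue itself, and to use `<=` for the frequency test are assumptions.

```python
def max_initial_clusters(protein_length, ratio_num_clusters=10):
    """Upper bound on the initial cluster count: len(protein) // RATIO_NUM_CLUSTERS."""
    return protein_length // ratio_num_clusters


def low_pae_partner_counts(pae, pae_score_cutoff=4):
    """For each residue i, count the other residues j with pae[i][j] < cutoff.

    Assumption: counts are taken along rows and the residue itself is excluded.
    """
    return [
        sum(1 for j, score in enumerate(row) if j != i and score < pae_score_cutoff)
        for i, row in enumerate(pae)
    ]


def candidate_non_domain(pae, pae_score_cutoff=4, freq_disordered=6):
    """Flag residues with at most `freq_disordered` low-PAE partners.

    Assumption: the comparison is inclusive (<=).
    """
    counts = low_pae_partner_counts(pae, pae_score_cutoff)
    return [count <= freq_disordered for count in counts]


# Toy 3-residue matrix: residues 0 and 1 are confidently placed relative to
# each other, residue 2 is not confidently placed relative to anything.
toy_pae = [
    [0, 1, 20],
    [1, 0, 20],
    [20, 20, 0],
]
toy_counts = low_pae_partner_counts(toy_pae)
```

Under these assumptions, raising `freq_disordered` flags more residues as candidates, which matches the description of FREQ_DISORDERED as becoming more lenient as it increases.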

If you are interested in the disordered regions at the N- or C-terminus, consider increasing FREQ_DISORDERED ([4-30]), decreasing MIN_DISORDERED_SIZE ([10-30]) or decreasing PROP_DISORDERED ([50-95]). This will result in more (and longer) terminal disordered regions being detected, but also many false positives. I would not change them all at the same time, as this will probably increase the sensitivity too much.

If you are interested in linkers or have a protein that is less well folded, consider modifying the FREQ_LINKER parameter ([4-30]). This value is used to adjust the boundaries of the linkers and as such, a higher value will result in longer linkers. However, linkers that were missed will still not be detected.


## Outputs
A csv file containing the proteinID, protein length, number of predicted domains, domain delineations, linker delineations and terminal disordered region delineations. Within a cell, delineations are separated by a ";".\
Ex.

|        | length | # domains |    domains     | linkers | disordered |
| ------ | ------ | --------- | -------------- | ------- | ---------- |
| prot 1 | 251    | 2         | 1-120;130-251  | 121-129 |            |
| prot 2 | 386    | 2         | 86-203;217-386 | 204-216 | 1-85       |
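
The delineation strings in this table are easy to turn into numeric ranges for downstream analysis. The sketch below uses only the standard library; the helper name `parse_delineations` is hypothetical, and the column headers in the inline example are assumed from the table above, so check them against the header of your own output file.

```python
import csv
import io


def parse_delineations(cell):
    """Convert '1-120;130-251' into [(1, 120), (130, 251)]; an empty cell gives []."""
    cell = (cell or "").strip()
    if not cell:
        return []
    regions = []
    for part in cell.split(";"):
        start, end = part.split("-")
        regions.append((int(start), int(end)))
    return regions


# Inline stand-in for a SPAED output file (headers assumed from the table above).
example_csv = """,length,# domains,domains,linkers,disordered
prot 1,251,2,1-120;130-251,121-129,
prot 2,386,2,86-203;217-386,204-216,1-85
"""

parsed = {}
for row in csv.DictReader(io.StringIO(example_csv)):
    protein_id = row[""]  # the first column has no header in this example
    parsed[protein_id] = {
        column: parse_delineations(row[column])
        for column in ("domains", "linkers", "disordered")
    }
```

To read a real file, replace `io.StringIO(example_csv)` with an open file handle for the csv written by SPAED.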

## Citation

Boulay, A. et al. SPAED: Harnessing AlphaFold Output for Accurate Segmentation of Phage Endolysin Domains. 2025.04.25.650745 Preprint at https://doi.org/10.1101/2025.04.25.650745 (2025).

