Metadata-Version: 2.4
Name: lncrnapi
Version: 0.1.5
Summary: A CLI tool for predicting lncRNA–Protein interactions using transformer embeddings and CatBoost
Home-page: https://github.com/raghavagps/lncrnapi
Author: Gajendra P.S. Raghava
Author-email: raghava@iiitd.ac.in
Keywords: lncrna protein interaction prediction bioinformatics catboost transformers
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: GNU General Public License v3 (GPLv3)
Classifier: Operating System :: OS Independent
Classifier: Intended Audience :: Science/Research
Classifier: Topic :: Scientific/Engineering :: Bio-Informatics
Requires-Python: >=3.9
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: torch>=2.6.0
Requires-Dist: transformers>=4.40.0
Requires-Dist: catboost>=1.2
Requires-Dist: joblib>=1.2
Requires-Dist: tqdm>=4.65
Requires-Dist: numpy>=1.24
Requires-Dist: pandas>=2.0
Requires-Dist: safetensors>=0.6
Dynamic: author
Dynamic: author-email
Dynamic: classifier
Dynamic: description
Dynamic: description-content-type
Dynamic: home-page
Dynamic: keywords
Dynamic: license-file
Dynamic: requires-dist
Dynamic: requires-python
Dynamic: summary

# 🧬 lncrnaPI: lncRNA–Protein Interaction Prediction

lncrnaPI is a command-line tool for predicting **lncRNA–Protein interactions** using **pre-trained language models (DNABERT-2 and ESM-2)** for sequence embedding and a **CatBoost classifier** for interaction probability estimation.

---

## 🚀 Overview

lncrnaPI scores candidate interactions between **lncRNA** and **protein sequences** supplied as FASTA files, which makes it suitable for large batches of sequences.  
It embeds each sequence with a pre-trained transformer model and passes the embeddings to a pre-trained **CatBoost** model, which returns an interaction probability.

---

## 📦 Features

- Supports **FASTA** input for lncRNA and protein sequences.  
- Generates embeddings using:
  - 🧬 **DNABERT-2** (`zhihan1996/DNABERT-2-117M`) for lncRNAs  
  - 🧫 **ESM-2** (`facebook/esm2_t30_150M_UR50D`) for proteins  
- Predicts interaction probabilities using a **CatBoost classifier**.  
- Supports GPU acceleration (**CUDA** / **MPS**) for faster inference.  
- Outputs results in **CSV** format.

---

## 🧰 Installation

Install the package with pip:

```bash
pip install lncrnapi
```
---

## ⚙️ Usage

Run the script directly from the command line:

```bash
lncrnapi \
    --lncrna_fasta /path/to/lncrnas.fasta \
    --protein_fasta /path/to/proteins.fasta \
    --model_path /path/to/saved_model.joblib \
    --output_file /path/to/results.csv
```

### **Arguments**

| Argument | Description | Required |
|-----------|--------------|-----------|
| `--lncrna_fasta` | Path to the FASTA file containing lncRNA sequences. | ✅ |
| `--protein_fasta` | Path to the FASTA file containing protein sequences. | ✅ |
| `--model_path` | Path to the pre-trained CatBoost model file (`.cbm`, `.joblib`, or `.pkl`). | ✅ |
| `--output_file` | Path to save the CSV file with predicted probabilities. | ✅ |

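The four flags in the table map naturally onto Python's `argparse`. The block below is a minimal sketch of such a parser, not the package's actual source: the helper name `build_parser` and the help strings are hypothetical, and only the flag names come from this README.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for the four documented flags (hypothetical helper)."""
    parser = argparse.ArgumentParser(
        prog="lncrnapi",
        description="Predict lncRNA-protein interaction probabilities.",
    )
    parser.add_argument("--lncrna_fasta", required=True,
                        help="FASTA file containing lncRNA sequences")
    parser.add_argument("--protein_fasta", required=True,
                        help="FASTA file containing protein sequences")
    parser.add_argument("--model_path", required=True,
                        help="CatBoost model file (.cbm, .joblib, or .pkl)")
    parser.add_argument("--output_file", required=True,
                        help="CSV file to write predicted probabilities to")
    return parser


# Parse an explicit argument list instead of sys.argv for demonstration.
args = build_parser().parse_args([
    "--lncrna_fasta", "lncrnas.fasta",
    "--protein_fasta", "proteins.fasta",
    "--model_path", "saved_model.joblib",
    "--output_file", "results.csv",
])
print(args.model_path)
```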
---

## 🧠 How It Works

1. **Model Loading**  
   The tool loads the DNABERT-2 and ESM-2 models from Hugging Face.

2. **FASTA Parsing**  
   Extracts sequence IDs and corresponding sequences from input FASTA files.

3. **Embedding Generation**  
   Computes a mean-pooled embedding for each sequence from the transformer's hidden states.

4. **Prediction**  
   Concatenates embeddings (lncRNA + protein) and predicts the interaction probability using the CatBoost model.

5. **Output**  
   Generates a `.csv` file containing:
   - `LncRNA_ID`
   - `Protein_ID`
   - `Interaction_Probability`

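Steps 2 to 4 can be illustrated with a small, dependency-free sketch. It is not the tool's implementation: the helper names `parse_fasta`, `mean_pool`, and `pair_features` are hypothetical, plain Python lists stand in for the transformer's hidden-state tensors, and excluding padding positions through an attention mask is an assumption about how the pooling is done.

```python
from typing import Dict, List


def parse_fasta(text: str) -> Dict[str, str]:
    """Map each sequence ID (first word after '>') to its full sequence."""
    records: Dict[str, List[str]] = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Fall back to a placeholder ID if the header is empty.
            current = (line[1:].split() or ["unnamed"])[0]
            records[current] = []
        elif current is not None:
            records[current].append(line)
    return {seq_id: "".join(parts) for seq_id, parts in records.items()}


def mean_pool(hidden_states: List[List[float]],
              attention_mask: List[int]) -> List[float]:
    """Average token vectors over positions where the mask is 1."""
    kept = [vec for vec, keep in zip(hidden_states, attention_mask) if keep]
    if not kept:
        raise ValueError("no unmasked tokens to pool")
    return [sum(column) / len(kept) for column in zip(*kept)]


def pair_features(lnc_vec: List[float], prot_vec: List[float]) -> List[float]:
    """Concatenate the lncRNA embedding followed by the protein embedding."""
    return list(lnc_vec) + list(prot_vec)


fasta = ">lnc001 example\nACGT\nACGT\n>lnc002\nTTGA\n"
sequences = parse_fasta(fasta)

# Three token vectors of width 2; the last position is padding (mask 0).
tokens = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]
lnc_embedding = mean_pool(tokens, [1, 1, 0])
features = pair_features(lnc_embedding, [0.5])
print(sequences, lnc_embedding, features)
```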
---

## 📊 Example Output

| LncRNA_ID | Protein_ID | Interaction_Probability |
|------------|-------------|--------------------------|
| lnc001 | P12345 | 0.9421 |
| lnc002 | Q8N6T7 | 0.3175 |

---

## ⚡ Hardware Acceleration

The script automatically detects and uses available hardware:

- ✅ **CUDA GPU** (NVIDIA)
- ✅ **MPS** (Apple Silicon)
- ⚠️ **CPU** (fallback)

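A device-selection rule along these lines could look like the sketch below. The function `select_device` is hypothetical, and the priority order (CUDA, then MPS, then CPU) is an assumption taken from the order of the list above. The availability flags are passed in as booleans so the example runs without PyTorch installed.

```python
def select_device(cuda_available: bool, mps_available: bool) -> str:
    """Return the preferred device name: CUDA first, then MPS, else CPU."""
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"


# With PyTorch installed, the flags would come from
# torch.cuda.is_available() and torch.backends.mps.is_available().
print(select_device(cuda_available=False, mps_available=True))
```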
---

## 🧩 Model Formats Supported

| Format | Description |
|---------|-------------|
| `.cbm` | Native CatBoost model format |
| `.joblib` | Joblib-serialized model |
| `.pkl` | Pickle-based serialized model |

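One way to support all three formats is to dispatch on the file extension, as in the sketch below. The helper `load_model` is hypothetical and may differ from what the tool does internally. `catboost` and `joblib` are imported lazily, so only the branch that is actually used needs its library. Pickle and joblib files can execute arbitrary code when loaded, so load only model files from a source you trust.

```python
import pickle
from pathlib import Path


def load_model(model_path: str):
    """Load a model based on its file extension (hypothetical helper)."""
    path = Path(model_path)
    if not path.is_file():
        raise FileNotFoundError(f"Model file not found: {path}")

    suffix = path.suffix.lower()
    if suffix == ".cbm":
        # Native CatBoost format: create an empty classifier, then load into it.
        from catboost import CatBoostClassifier
        model = CatBoostClassifier()
        model.load_model(str(path))
        return model
    if suffix == ".joblib":
        import joblib
        return joblib.load(path)
    if suffix == ".pkl":
        with path.open("rb") as handle:
            return pickle.load(handle)
    raise ValueError(f"Unsupported model format: {suffix}")
```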
---

## 🛠 Troubleshooting

| Issue | Possible Cause | Solution |
|-------|----------------|-----------|
| `Model file not found` | Wrong `--model_path` | Check the file path |
| `No sequences found in FASTA` | Invalid FASTA format | Ensure `>` headers are present |
| `safetensors` error | Missing library | Install with `pip install safetensors` |
| Slow performance | Inference is running on the CPU | Run on a machine with a CUDA or MPS device |

---

## 📁 Example Output File

```bash
$ head results.csv
LncRNA_ID,Protein_ID,Interaction_Probability
lnc001,P12345,0.9421
lnc002,Q8N6T7,0.3175
lnc003,O76074,0.7814
```

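Because the output is a plain CSV with the three columns shown, it can be post-processed with the standard library. The sketch below filters pairs by predicted probability. The function `high_confidence_pairs` is hypothetical, and the default threshold of 0.5 is an arbitrary illustration, not a cutoff recommended by the tool.

```python
import csv
import io


def high_confidence_pairs(csv_text: str, threshold: float = 0.5):
    """Return (lncRNA ID, protein ID, probability) rows at or above threshold."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        (row["LncRNA_ID"], row["Protein_ID"],
         float(row["Interaction_Probability"]))
        for row in reader
        if float(row["Interaction_Probability"]) >= threshold
    ]


# Same rows as the example file; a real run would read results.csv instead.
sample = """LncRNA_ID,Protein_ID,Interaction_Probability
lnc001,P12345,0.9421
lnc002,Q8N6T7,0.3175
lnc003,O76074,0.7814
"""
hits = high_confidence_pairs(sample)
for lnc_id, protein_id, probability in hits:
    print(lnc_id, protein_id, probability)
```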
---

## 📜 Citation

If you use this tool in your research, please cite:

> **Your Name et al.**  
> *A Deep Learning Framework for lncRNA–Protein Interaction Prediction Using Transformer-Based Sequence Embeddings* (2025)

---

## 🧩 Repository Structure

```
├── predict_interaction.py       # Main CLI script
├── README.md                    # Documentation
└── example/
    ├── lncrnas.fasta
    ├── proteins.fasta
    └── saved_model.joblib
```

---
