# StemSage

**StemSage: Uncovering Stem-loop Motifs from RBP Binding Regions**

Version: 0.8.6

> "Forever Young. Forever Passionate. Forever Humbled."

## 📖 Introduction

StemSage is a bioinformatics pipeline for analyzing RNA stem-loop motifs in RBP (RNA-binding protein) binding regions. It combines RNA secondary-structure prediction with machine learning to identify statistically significant stem-loop patterns in RNA sequences, providing insight into RNA-protein interactions.

## ✨ Features

- **Multi-input Support**: Accepts BED files or FASTA sequences as input
- **Comprehensive Analysis**: Five-stage pipeline combining RNA structure prediction with machine learning
- **Flexible Workflow**: Supports multiple analysis workflows based on input types
- **Advanced Visualization**: Generates detailed motif visualization reports and interactive plots
- **Parallel Processing**: Supports multi-threading for faster analysis
- **Statistical Validation**: Rigorous statistical testing with multiple correction methods
- **Motif Discovery**: Identifies and maps significant stem-loop motifs to genomic sequences

## 🚀 Installation

### Prerequisites

- Python 3.7+
- Conda environment
- ViennaRNA package for RNA structure prediction

### Dependencies

```bash
conda activate StemSage
pip install numpy pandas scikit-learn xgboost matplotlib seaborn biopython viennarna
```

### System Dependencies

StemSage requires the following system tools:

```bash
# Ubuntu/Debian
sudo apt-get install bedtools vienna-rna

# CentOS/RHEL
sudo yum install bedtools vienna-rna

# macOS
brew install bedtools vienna-rna

# Check bedtools
bedtools --version

# Check RNAfold
RNAfold --version
```

### Install StemSage

```bash
# Install from conda
conda install stemsage

# Install from pip
pip install stemsage

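# Install from source (run from inside a clone of the repository)
git clone https://github.com/PrinceWang2018/stemsage
cd stemsage
conda activate stemsage
pip install -e .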
```



## 📋 Usage

### Basic Command Structure

```bash
stemsage --positive_bed <positive.bed> --genome_fa <genome.fa> --out <output_dir> [options]
```

### Common Use Cases

#### Case 1: Only Positive BED File

```bash
stemsage --positive_bed peaks.bed --genome_fa hg38.fa --extend 25 --out ./results
```

#### Case 2: Positive and Negative BED Files

```bash
stemsage --positive_bed positive.bed --negative_bed negative.bed --genome_fa hg38.fa --out ./results
```

#### Case 3: Only Positive FASTA File

```bash
stemsage --positive_fasta positive.fasta --genome_fa hg38.fa --out ./results
```

#### Case 4: Positive and Negative FASTA Files

```bash
stemsage --positive_fasta positive.fasta --negative_fasta negative.fasta --out ./results
```

### Advanced Options

#### Model Selection

```bash
# Use Random Forest instead of default XGBoost
stemsage --positive_bed peaks.bed --genome_fa hg38.fa --model random_forest --out ./results
```

#### Stem-loop Detection Parameters

```bash
stemsage --positive_bed peaks.bed --negative_bed control.bed --genome_fa hg38.fa \
    --out ./results --max_motifs 10 --min_stem_length 2 --max_stem_length 5 \
    --threads 6 --cluster_similarity 0.8
```

#### Performance Optimization

```bash
stemsage --positive_bed peaks.bed --genome_fa hg38.fa --out ./results \
    --threads 12 --fast_clustering --similar_matching True \
    --similarity_threshold 0.9 --max_length_diff 1
```



## 🔧 Parameters

### Input Options

- `--positive_bed`: BED file of positive (RBP-binding) regions
- `--negative_bed`: BED file of negative (control) regions
- `--positive_fasta`: FASTA file of positive sequences
- `--negative_fasta`: FASTA file of negative sequences

### Required Arguments

- `--genome_fa`: Genome reference FASTA file (a transcript-region FASTA is recommended; Case 4 above omits it because both positive and negative FASTA files are supplied)
- `--out`: Output directory for all results and temporary files

### Analysis Parameters

- `--extend`: Base pairs to extend BED regions on both ends (default: 50)
- `--threads`: Number of threads for parallel processing (default: 1)
- `--model`: Machine learning model: 'xgboost' or 'random_forest' (default: xgboost)
- `--test_size`: Test set size ratio for model evaluation (default: 0.2)
- `--random_state`: Random state for reproducibility (default: 42)
- `--cv_folds`: Number of cross-validation folds (default: 5)
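
As an illustration of what `--extend` implies for a BED record, the sketch below widens a 0-based, half-open interval on both sides. The function name `extend_interval` and the optional `chrom_size` clamp are hypothetical and not taken from the StemSage source; the pipeline's own handling of chromosome ends may differ.

```python
def extend_interval(chrom, start, end, extend=50, chrom_size=None):
    """Widen a 0-based, half-open BED interval by `extend` bp on each side.

    Hypothetical helper for illustration only; not part of StemSage.
    """
    new_start = max(0, start - extend)      # BED starts cannot be negative
    new_end = end + extend
    if chrom_size is not None:
        new_end = min(new_end, chrom_size)  # stay within the chromosome
    return chrom, new_start, new_end


print(extend_interval("chr1", 30, 120))             # ('chr1', 0, 170)
print(extend_interval("chr1", 900, 980, 50, 1000))  # ('chr1', 850, 1000)
```

Clamping at zero matters for peaks near the start of a contig, where a naive subtraction would produce an invalid negative coordinate.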

### Stem-loop Detection

- `--max_motifs`: Maximum number of motifs to display in visualization (default: 5)
- `--min_stem_length`: Minimum stem length for stem-loop detection (default: 1)
- `--max_stem_length`: Maximum stem length before splitting (default: 5; 0 disables splitting)
- `--similar_matching`: Enable similar sequence matching for motifs (default: True)
- `--similarity_threshold`: Similarity threshold for motif matching (default: 0.9)
- `--max_length_diff`: Maximum length difference for motif matching (default: 1)
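
To make the role of `--min_stem_length` concrete, here is a minimal sketch of hairpin detection on a dot-bracket structure string such as the one RNAfold produces. It is an assumption about how such a filter could work, not StemSage's actual algorithm; `pair_table` and `find_stem_loops` are hypothetical names, and stem splitting (`--max_stem_length`) is not modeled.

```python
def pair_table(structure):
    """Map each paired position to its partner in a dot-bracket string."""
    stack, pairs = [], {}
    for i, ch in enumerate(structure):
        if ch == "(":
            stack.append(i)
        elif ch == ")":
            j = stack.pop()
            pairs[i], pairs[j] = j, i
    return pairs


def find_stem_loops(structure, min_stem_length=1):
    """Return (stem_start, stem_end, stem_length, loop_length) per hairpin."""
    pairs = pair_table(structure)
    hits = []
    for i, j in pairs.items():
        # A hairpin closes where everything between the pair is unpaired.
        if i < j and set(structure[i + 1:j]) <= {"."}:
            left, right = i, j
            # Walk outward while the next pair stacks directly on this one.
            while pairs.get(left - 1) == right + 1:
                left, right = left - 1, right + 1
            stem_length = i - left + 1
            if stem_length >= min_stem_length:
                hits.append((left, right, stem_length, j - i - 1))
    return sorted(hits)


structure = "..(((...)))..((....))."
print(find_stem_loops(structure))                     # [(2, 10, 3, 3), (13, 20, 2, 4)]
print(find_stem_loops(structure, min_stem_length=3))  # [(2, 10, 3, 3)]
```

Raising the minimum stem length drops the short two-pair hairpin, which is the kind of trade-off between sensitivity and noise that this parameter controls.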

### Output Control

- `--verbose`: Enable verbose logging output
- `--keep_temp`: Keep temporary files after processing
- `--version`, `-v`: Show version information and exit

## 🗂️ Output Structure

```text
output_directory/
├── Stemage_pipeline.log              # Main pipeline log file
├── rna_features_dataset.csv          # Feature matrix (Step1)
├── feature_differences.png           # Feature analysis plots
├── feature_correlation.png
├── feature_distributions.png
├── xgboost_stem_classifier.pkl       # Trained model (Step2)
├── feature_importance_xgboost.csv    # Feature importance
├── model_analysis_xgboost.png        # Model performance plots
├── stem_patterns_analysis.csv        # Pattern analysis (Step3)
├── stem_patterns_characteristics.csv # Stem characteristics
├── stem_patterns_motifs.csv          # Identified motifs
├── stem_patterns_summary.json        # Analysis summary
├── motif_mapping_detailed.csv        # Motif mappings (Step4)
├── motif_mapping_statistics.csv      # Mapping statistics
├── motif_mapping_sequence_summary.csv
├── motif_mapping_analysis.png        # Mapping visualization
├── motif_visualization_report.html   # Interactive report (Step5)
└── Motif/                            # Detailed motif information
    ├── Motif_1.meme
    ├── Motif_1.pdf
    ├── Motif_1.pwm.txt
    ├── Motif_1.svg
    └── ...
```



## 🔄 Pipeline Overview

StemSage implements a comprehensive five-stage analysis pipeline:

### Stage 1: Feature Extraction (`step1.py`)

- **Input**: Positive and negative structure files
- **Processing**:
  - Extracts comprehensive RNA structural features
  - Calculates k-mer frequencies and GC content
  - Identifies stem-loop characteristics
  - Multi-threaded feature extraction
- **Output**: `rna_features_dataset.csv`
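
Two of the sequence-level features named above, GC content and k-mer frequencies, can be sketched with the standard library alone. This is an illustrative sketch: the function names, the choice of `k=2`, and the T-to-U normalization are assumptions, and the actual columns in `rna_features_dataset.csv` may be defined differently.

```python
from collections import Counter


def gc_content(seq):
    """Fraction of G and C bases in a sequence (0.0 for an empty sequence)."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0


def kmer_frequencies(seq, k=2):
    """Relative frequency of each overlapping k-mer, using the RNA alphabet."""
    seq = seq.upper().replace("T", "U")  # assumption: treat DNA input as RNA
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()} if total else {}


print(gc_content("GGCAU"))
print(kmer_frequencies("ACGACG", k=2))
```

Normalizing counts to frequencies keeps features comparable between regions of different lengths.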

### Stage 2: Machine Learning (`step2.py`)

- **Input**: Feature matrix from Stage 1
- **Processing**:
  - Trains XGBoost or Random Forest classifier
  - Performs cross-validation and model evaluation
  - SHAP analysis for feature importance
  - Statistical analysis of feature differences
- **Output**: Trained model, feature importance, performance plots
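
For readers unfamiliar with the AUC metric reported during model evaluation, the following sketch computes it directly from its definition: the probability that a randomly chosen positive example scores higher than a randomly chosen negative one, with ties counted as one half. It is a plain-Python illustration with a hypothetical `roc_auc` helper, not the evaluation code used in `step2.py`.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    # Each correctly ordered pair counts 1, each tie counts 0.5.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]
print(round(roc_auc(labels, scores), 3))  # 0.889
```

In this toy example eight of the nine positive-negative pairs are ordered correctly, so the AUC is 8/9.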

### Stage 3: Pattern Mining (`step3.py`)

- **Input**: Positive and negative sequence structures
- **Processing**:
  - Extracts detailed stem-loop structures
  - Statistical analysis of pattern enrichment
  - Multiple testing correction (FDR, Bonferroni)
  - Motif clustering and consensus identification
- **Output**: Significant patterns, motif characteristics, statistical summaries
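
The two multiple-testing corrections named above can be sketched as follows. These are the standard Bonferroni and Benjamini-Hochberg (FDR) procedures written in plain Python for illustration; the function names are hypothetical and StemSage's own implementation may rely on a library instead.

```python
def bonferroni(p_values):
    """Multiply each p-value by the number of tests, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


p = [0.01, 0.04, 0.03, 0.005]
print([round(x, 3) for x in bonferroni(p)])          # [0.04, 0.16, 0.12, 0.02]
print([round(x, 3) for x in benjamini_hochberg(p)])  # [0.02, 0.04, 0.04, 0.02]
```

Bonferroni controls the family-wise error rate and is the more conservative of the two; Benjamini-Hochberg controls the false discovery rate and typically retains more patterns.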

### Stage 4: Motif Mapping (`step4.py`)

- **Input**: Significant motifs from Stage 3
- **Processing**:
  - Maps motifs back to original sequences
  - Flexible matching with similarity thresholds
  - Count consistency validation between stages
  - Detailed sequence-level mapping
- **Output**: Motif-sequence mappings, statistics, visualizations
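
One way to picture flexible matching with `--similarity_threshold` and `--max_length_diff` is a sliding-window scan that scores each window against the motif. The sketch below uses `difflib.SequenceMatcher.ratio()` as the similarity measure, which is an assumption made for illustration; the metric actually used in `step4.py` is not documented here, and `map_motif` is a hypothetical name.

```python
from difflib import SequenceMatcher


def map_motif(motif, sequence, similarity_threshold=0.9, max_length_diff=1):
    """Return (start, window, similarity) for windows resembling the motif."""
    hits = []
    shortest = max(1, len(motif) - max_length_diff)
    longest = len(motif) + max_length_diff
    for length in range(shortest, longest + 1):
        for start in range(len(sequence) - length + 1):
            window = sequence[start:start + length]
            score = SequenceMatcher(None, motif, window, autojunk=False).ratio()
            if score >= similarity_threshold:
                hits.append((start, window, round(score, 3)))
    return sorted(hits)


# Exact matching only: one hit at position 2.
print(map_motif("GGCAUGCC", "AAGGCAUGCCAA",
                similarity_threshold=1.0, max_length_diff=0))
# Relaxed matching also reports overlapping near-matches around that position.
for hit in map_motif("GGCAUGCC", "AAGGCAUGCCAA"):
    print(hit)
```

With relaxed settings, several overlapping windows around the same site pass the threshold, so a practical implementation would need to merge or rank overlapping hits before counting them.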

### Stage 5: Visualization (`step5.py`)

- **Input**: All previous analysis results
- **Processing**:
  - Creates comprehensive HTML reports
  - Interactive motif visualizations
  - Structure and sequence alignments
  - Exportable results for publication
- **Output**: Interactive HTML report, motif visualizations

## 💡 Recommendations

### Genome Reference Preparation

For better results, use a transcript-region FASTA file as the genome reference:

```bash
# Extract gene-level regions from the GTF as BED6 (GTF is 1-based, BED is 0-based, hence $4-1)
awk -F'\t' '$3 == "gene" {OFS="\t"; print $1, $4-1, $5, ".", ".", $7}' annotation.gtf > genes.bed

# Extract sequences using bedtools
bedtools getfasta -fi genome.fa -bed genes.bed -fo transcript_regions.fa -s
```

### Parameter Tuning

- **For high-throughput data**: Use `--threads` with higher values
- **For precise motif discovery**: Use stricter similarity thresholds (`--similarity_threshold 0.95`)
- **For novel motif discovery**: Use larger `--max_motifs` and enable `--similar_matching`

### Performance Optimization

- Use SSD storage for large datasets
- Allocate sufficient RAM (≥16GB recommended for large analyses)
- Consider using compute clusters for very large datasets

## 🐛 Troubleshooting

### Common Issues

1. **Memory Errors**:

   ```bash
   # Reduce the thread count and enable fast clustering
   stemsage --positive_bed peaks.bed --genome_fa hg38.fa --out ./results --threads 4 --fast_clustering
   ```

2. **File Not Found Errors**:

   - Ensure all input files are accessible
   - Check file paths and permissions
   - Verify that the genome FASTA index (`.fai`) exists, or that its directory is writable so `bedtools getfasta` can create one

3. **Dependency Issues**:

   ```bash
   # Reinstall ViennaRNA if RNAfold fails
   conda install -c bioconda viennarna
   ```

4. **Low Mapping Ratios**:

   - Adjust similarity thresholds: `--similarity_threshold 0.8`
   - Increase length difference tolerance: `--max_length_diff 2`
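
Before digging into dependency problems, it can help to confirm that the external tools are visible on `PATH`. The following pre-flight check is a small sketch using only the standard library; `check_tools` is a hypothetical helper and not a StemSage command.

```python
import shutil


def check_tools(tools=("bedtools", "RNAfold")):
    """Map each tool name to its resolved path, or None if it is not on PATH."""
    return {tool: shutil.which(tool) for tool in tools}


for tool, path in check_tools().items():
    print(f"{tool}: {path or 'NOT FOUND on PATH'}")
```

If a tool is reported as missing inside the conda environment but present in your login shell, the environment's `PATH` is the likely culprit.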

### Log Files

Check `Stemage_pipeline.log` for detailed error information and processing status.

## 📊 Interpretation of Results

### Key Outputs to Examine

1. **Model Performance** (`model_analysis_*.png`):
   - As a rule of thumb, AUC > 0.7 indicates useful separation of positive and negative regions (0.5 is chance level)
   - Check feature importance for biological insights
2. **Pattern Significance** (`stem_patterns_analysis.csv`):
   - Focus on patterns with `adjusted_p_value < 0.05`
   - High enrichment values indicate strong positive association
3. **Motif Mappings** (`motif_mapping_*.csv`):
   - Verify count consistency between Step3 and Step4
   - Examine sequence contexts of significant motifs
4. **Visual Reports** (`motif_visualization_report.html`):
   - Interactive exploration of top motifs
   - Structure and sequence conservation analysis
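
The significance filter described for `stem_patterns_analysis.csv` can be applied with a few lines of Python. In this sketch only the `adjusted_p_value` column name comes from the description above; the `pattern` and `enrichment` columns and the sample rows are invented for illustration, so check the header of your own output file before reusing it.

```python
import csv
import io


def significant_patterns(csv_file, alpha=0.05, column="adjusted_p_value"):
    """Return the rows whose adjusted p-value is below `alpha`."""
    reader = csv.DictReader(csv_file)
    return [row for row in reader if float(row[column]) < alpha]


# Invented sample data; real files are read with open("stem_patterns_analysis.csv").
SAMPLE = (
    "pattern,enrichment,adjusted_p_value\n"
    "((((....)))),3.2,0.001\n"
    "(((...))),1.1,0.40\n"
)

for row in significant_patterns(io.StringIO(SAMPLE)):
    print(row["pattern"], row["enrichment"], row["adjusted_p_value"])
```

Passing an open file handle instead of the `io.StringIO` object applies the same filter to a real results file.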

## 📝 Citation

If you use StemSage in your research, please cite:

```bibtex
@software{stemsage2024,
  title = {StemSage: Comprehensive RNA Stem-loop Analysis Pipeline},
  author = {Wang, Zixiang},
  year = {2024},
  url = {https://github.com/PrinceWang2018/stemsage},
  note = {Version 0.8.6}
}
```

## 👥 Authors

- **Zixiang Wang** - *Author* - wangzixiang@sdu.edu.cn
- **Shandong University** - *Organization*

## 📄 License

License information has not yet been specified.

## 🤝 Contributing

Contributions are welcome! Please feel free to submit pull requests or open issues for bugs and feature requests.

## 🔗 Links

- **GitHub Repository**: https://github.com/PrinceWang2018/stemsage
- **Documentation**: https://github.com/PrinceWang2018/stemsage/README.md
- **Issue Tracker**: https://github.com/PrinceWang2018/stemsage/issues

------

*StemSage - Uncovering the structural language of RNA-protein interactions*
