Metadata-Version: 2.4
Name: esloco
Version: 1.0.0
Summary: Estimating Long-Read Coverage for Whole-Genome Sequencing
Author: aweich
Project-URL: Repository, https://github.com/aweich/esloco.git
Classifier: Programming Language :: Python :: 3
Classifier: Operating System :: OS Independent
Requires-Python: >=3.12.2
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: biopython
Requires-Dist: numpy
Requires-Dist: pandas
Requires-Dist: joblib
Requires-Dist: tqdm_joblib
Requires-Dist: psutil
Requires-Dist: matplotlib
Requires-Dist: seaborn
Requires-Dist: scipy
Requires-Dist: plotly
Requires-Dist: kaleido
Dynamic: license-file

<br>
<p align="center">
    <img src="docs/esloco_logo.png" alt="Illustration" width="200">
</p>

<h3 align="center">
<br>

Simulation-guided <ins>es</ins>timation of <ins>lo</ins>cal <ins>co</ins>verage for long-read DNA sequencing (<em>esloco</em>)

</h3>
<p align="center">
<code>esloco</code> is a flexible tool for estimating local coverage in long-read DNA sequencing experiments, designed to support both <em>a priori</em> experimental planning and <em>post hoc</em> interpretation of sequencing results.
 <br>
    <a href="">Publication (not available yet)</a>
</p>

### Summary

<p align="center">
    <img src="docs/VIS_Simulation_Illustration.png" alt="Illustration" width="50%">
</p>

The illustration above demonstrates the workflow of `esloco`, which iteratively uses repeated simulations to estimate DNA sequencing coverage over target regions. The process involves two main steps:

1. **Generating target coordinates**: Barcoded genomes are generated by converting the reference genome into a single coordinate space. Randomized insertions (`[I]` mode) or user-defined regions of interest (`[ROI]` mode) are assigned as target regions.

2. **Simulating reads**: For each barcode, reads are drawn from a user-defined or empirical read length distribution and randomly placed until the desired whole-genome coverage is reached. Reads falling into blocked regions are excluded. 

3. **Calculating target coverage**: Local coverage for each target is calculated based on on-target bases of overlapping reads. 

> [!NOTE]
> The simulation assumes a uniform placement of long-reads during DNA sequencing. Referring to the Monte-Carlo method, `esloco` is based on the idea that random sampling can approximate deterministic outcomes, with target coverage primarily influenced by read length distribution and whole-genome coverage.

---

### Installation

The simulation and all its dependencies can be installed from `PyPI` using `pip` via

```bash
pip install esloco
```
---

or by cloning this repository and manually installing the dependencies mentioned in the `.toml` file. 

### Usage

Run the simulation on the command line using

```bash
esloco --config <configfile>
```

After the simulation, some basic plots and an interactive overview can be created automatically from the simulation's output tables using

```bash
plot_esloco --config <configfile>
```

---

### Configuration File

Below are the configuration options for the simulation, divided into three sections: `[COMMON]`, `[ROI]`, and `[I]`.  Mandatory fields marked with **(*)**. 

#### `[COMMON]`

| **Option**                     | **Description**                                                                                     | **Default**                                                                 |
|---------------------------------|-----------------------------------------------------------------------------------------------------|-------------------------------------------------------------------------------------|
| `mode`                         | Running mode; can be `I` or `ROI`. Corresponding section must be provided in the same configuration file.                                                      |          **(*)**                                                                     |
| `reference_genome_path`        | Path to the reference genome.                                                           |         **(*)**                                                                  |
| `sequenced_data_path`          | Custom FASTA/FASTQ file with reads.                                                                      | `None`                                                                          |
| `output_path`                  | Output path for simulation results.                                                                | `./output/`                                                                          |
| `experiment_name`              | Name of the experiment.                                                                             | `default_experiment`                                                                      |
| `output_path_plots`            | Output path for plots.                                                                              | output_path                                                                    |
| `min_overlap_for_detection`    | Minimum overlap required between target region and read to be counted.                             | `1`                                                                                |
| `min_read_lengths`    | Minimum length of reads drawn from the generated or provided read length dsitribution.                             | `1`     |
| `chr_restriction`              | Chromosome restriction; `unrestricted` uses all chromosomes, otherwise excludes "M" or "_" chromosomes. | `None`                                                                             |
| `barcode_weights`              | Changes barcode ratios.                                                                             | `None`                                                                |
| `n_barcodes`                   | Number of barcodes (i.e., genomes).                                                                | `1`                                                                                |
| `iterations`                   | Number of iterations.                                                                               | `1`                                                                                |
| `scaling`                      | Scaling factor; set to `0.5` if insertion/ROI is expected on one allele in a diploid genome.        | `1.0`                                                                                |
| `parallel_jobs`                | Number of cores assigned for parallelization; each job performs one iteration individually.         | `1`                                                                               |
| `coverages`                    | List of coverages used for the simulation.                                                         | `[1]`                                                                      |
| `mean_read_lengths`            | List of mean read lengths used for the simulation; each combination of coverages and lengths is performed per iteration. | `[1000]`                                                              |
| `blocked_regions_bedpath`      | BED file with regions that will be blocked from read generation.                                    | `None`                                                               |
 `no_cov_plots`      | Prevents the drawing of coverage plots during the first iteration. `True` significantly speeds up the simulation for small iteration numbers or high (`>25`) coverages.                                    | `False`                                                               |
 `seed`      | If defined, previous runs can be exactly reproduced.                                    | `random int`    
#### `[ROI]`

| **Option**      | **Description**                     | **Default** |
|------------------|-------------------------------------|---------------------|
| `roi_bedpath`    | BED file with regions of interest. Gene IDs containing underscores (`_`) are not supported. | **(*)**        |

#### `[I]`

| **Option**                     | **Description**                                                                                     | **Default**                                                                 |
|---------------------------------|-----------------------------------------------------------------------------------------------------|-------------------------------------------------------------------------------------|
| `insertion_number_distribution`| Distribution of insertion numbers; if not `poisson`, numbers are fixed.                            | `poisson`                                                                          |
| `insertion_length`             | Length of the inserted sequence.                                                                   | `1000`                                                                            |
| `bedpath`                      | BED file that limits the insertion placement.                                                      | `None`                                               |
| `insertion_numbers`            | Number of insertions; if `insertion_number_distribution=poisson`, this is the mean of the distribution, and the actual number is drawn randomly. | `5`                                                                                |

---


### Simulation output

#### Data

The simulation has four main output files:

1. `{experiment_name}_log.log`: Log files with key processes, resources, and debugging information.

2. `{experiment_name}_barcode_distribution_table.csv`: Barcode distribution per condition and iteration.

3. `{experiment_name}_matches_table.csv`: Detections per ROI/insertion, including overlap details.

    - `[I]`: `Barcode_{Barcode number}_insertion_{insertion number}_{iteration number}` (e.g., Barcode_0_insertion_1_0).

    - `[ROI]`: `{roi}_{Barcode number}_{iteration number}` (e.g., TRGVA_0_0).
4. `{experiment_name}_summary.csv`: Brief summary statistics of detections per ROI/insertion.

5. `{experiment_name}_insertion_locations.bed`: Insertion coordinates (Only in `[I]` mode).

#### Figures

During the simulation, basic coverage plots are generated for the first iteration of each parameter set to monitor expected coverage and target region locations. Also, if no read length distribution is provided, the generated distribution is also saved in `.npy` format. 

For a more comprehensive analysis, `plot_esloco --config {configfile}` can generate additional overview plots. These are summarized in an interactive `{experiment_name}_report.html`. 

<details>
<summary>Click to show exemplary overview of report figures</summary>

<p align="center">
    <div>
        <code>{experiment_name}_Barplot_Barcode_absolute_numbers</code>:<br>
        <a href="/docs/MeanCountBarplot.png" target="_blank">
            <img src="/docs/MeanCountBarplot.png" alt="Barplot Absolute Numbers" width="90%">
        </a>
    </div>
    <br>
    <div>
        <code>{experiment_name}_lineplot_otb_matches</code>:<br>
        <a href="/docs/lineplot.png" target="_blank">
            <img src="/docs/lineplot.png" alt="Lineplot" width="90%">
        </a>
    </div>
    <br>
    <div>
        <code>{mean_read_length}_{coverage}_coverage</code>:<br>
        <a href="/docs/coverage.png" target="_blank">
            <img src="/docs/coverage.png" alt="Coverage Plot" width="90%">
        </a>
    </div>
</p>

</details>

---

## Advanced Usage

##### Custom read length distributions

If there is `FASTA`/`FASTQ` file provided in the `sequenced_data_path` option of the `configurationfile.ini`, the simulation will use it as its read length distribution, thus dicarding any options provided as `mean_read_lengths`. 

> [!TIP] 
> If your read length input files are `.fasta.gz` instead of `.fasta`, you can decompress
them beforehand to improve simulation speed.

##### Blocked regions `[COMMON]`

If there is a `blocked_regions_bedpath` provided, the simulation will use the locations, as defined by the first three columns (`chr` - `start` - `end`), as well as optional `id` and `weight` columns. The weight column ultimately determines how _strict_ the blocking of the defined region will be performed. If no `weight` is assigned, the default of 1, i.e. 100% will be applied. If only 50% of the reads that fall into this region are supposed to be blocked, `weight` should be set to 0.5. This allows for more complex non-uniformly distributed setups or partial monosomies.   

##### Barcode Weighting `[COMMON]`

###### Cellular proportions `[COMMON]`

One key feature of the simulation is the option to manually assign weights to individual barcodes, allowing to simulate different cellular compositions and over- or under-representations of specific subsets. These weights can be directly assigned in the `configfile`. Here, `barcode_weights` can be set to any `directory` structure with the `barcode_numbers` as keys and their assigned proportions as values (e.g. `barcode_weights={"0": 10}`). In this example, a read drawn at random is 10 times more likely to be originating from/assigned to barcode 0. 

>[!Note] 
> Weights of individual barcodes are calculated as 
>
>$$S_{k} = \frac{W_{k} \cdot N}{D}$$
>
>where S<sub>k</sub> is the weighted share of barcode k, and D is the common denominator, given by:
>
>$$D = \left( \sum_{i} W_{i} + N - \lvert W \rvert \right) \cdot N$$
>
>W<sub>i</sub>​ refers to the individual barcode weights specified in the user-defined weights dictionary, N is the total number of barcodes, and ∣W∣ represents the number of barcodes with assigned weights. 
>
>In summary, barcodes with assigned weights contribute proportionally, while unweighted barcodes are evenly distributed. 

###### Selective blocking/masking `[COMMON]`

It is also possible to select barcodes that will be affected by the pre-defined [Blocked regions](#blocked-regions-common). For this, an optional `barcode` column can be added to the bed file, containing a list of barcodes (e.g. ["0", "1"]). 

>[!TIP] 
> If the goal is to assign different blocking values to different subsets of barcodes, this can be achieved by adding an `id` column and multiple rows for the same blocked region, which forces the simulation run each set of blocked coordinates individually. 

##### Full chromosome mode `[COMMON]` `[I]` `[ROI]`

Generally, if any of the `BED` files provided contain entries with only a chromosome defined and start and stop set to 0, the simulation will use the full chromosome. 

##### Fixed insertion numbers and/or locations `[I]`

It is also possible to fix the number and/or location of insertions. For this, the `insertion_number_distribution` in the `configfile` needs to be set to anything else than `poisson`, which is the default. By doing this, the `insertion_numbers` option will use the defined value as a fixed value, and not as the mean value of a poisson distribution. For a fixed number of insertions at specific locations, this can be achieved by providing a `bedpath` in the `[I]` section of the `configfile`, which has the exact same number of entries as `insertion_numbers`.

---
## Scaling up

Sometimes, multiple back-to-back simulations are required. To handle this, `esloco` can be easily wrapped inside a `bash` loop that processes a directory containing multiple configuration files:

```bash
#!/bin/bash

for config in path/to/configs/*; do
    echo "Processing: $config"
    esloco --config "$config"
    plot_esloco --config "$config"
done
```

## Limitations

- Uniform read placement
    > `esloco` currently places reads uniformly at random along the genome, implying that all regions are equally likely to be sequenced. In practice, this assumption is violated due to  sequencing biases, such as GC content effects, which still exist in long-read technologies. While `esloco` does not explicitly account for such biases, its support for [weighted blocking of regions](#blocked-regions-common) allows users to incorporate empirical correction factors derived from prior sequencing data, offering a route for customization in bias-aware applications.

- Short worker timeout / memory leak warning
    > In some instances, the simulation will return the following warning: "A worker stopped while some jobs were given to the executor. This can be caused by a too short worker timeout or by a memory leak." to the CLI. Usually, this has happened before when using thousands of barcodes during `Insertion [I]` mode and only a limited amount of CPUs. Usually, neither the simulation nor its results should be affected by the warning.  

## Citation & Contribution

If you like this tool, please consider giving it a star. If you are using this tool for your research, please cite our upcoming publication. 
