# Documentation

## General instructions for running map3C

map3C processes paired-end read alignments produced by certain local aligners. Given a BAM file from one of these aligners, it generates contacts and cleaned alignments. The compatible aligners are:

1. [BWA MEM](https://github.com/lh3/bwa) - for non-bisulfite-converted reads
2. [BWA MEM2](https://github.com/bwa-mem2/bwa-mem2) - for non-bisulfite-converted reads
3. [Biscuit](https://github.com/huishenlab/biscuit) - for bisulfite-converted reads
4. [BSBolt](https://github.com/NuttyLogic/BSBolt) - for bisulfite-converted reads

### Mapping reads

Example commands for paired-end reads for the compatible aligners are shown below. These commands are generic, and specific assays may benefit from different parameters. Whatever parameters are used, the output must always meet these requirements:

1. The output should be a BAM file
2. The output should include R1 and R2 alignments, distinguished by either:
   * The flag of the alignment's BAM entry
   * An added "_1" or "_2" in the alignment's read name. This is useful for aligning R1 and R2 FASTQ files separately and merging the 2 resulting BAMs into a single BAM.
3. The BAM should be sorted by read name
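
For the read-name option in requirement 2, the following is a minimal sketch of tagging mates when R1 and R2 are aligned separately. It assumes the "_1"/"_2" marker is appended as a suffix to the read name; `tag_mate` and the file names in the usage comment are hypothetical helpers for illustration, not part of map3C. The merged BAM must still satisfy requirement 3.

```bash
# Append a mate suffix ("_1" or "_2") to the read name (column 1) of every
# alignment line in SAM text read from stdin. Header lines (starting with "@")
# pass through unchanged.
tag_mate() {
    awk -v suffix="$1" 'BEGIN { FS = OFS = "\t" }
        /^@/ { print; next }
        { $1 = $1 suffix; print }'
}

# Hypothetical usage (placeholder file names; assumes samtools is on PATH):
#   samtools view -h sample_R1.bam | tag_mate _1 | samtools view -b -o sample_R1.tagged.bam -
#   samtools view -h sample_R2.bam | tag_mate _2 | samtools view -b -o sample_R2.tagged.bam -
```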

```{bash}
# BWA MEM
# samtools sort -n sorts by read name, as required above
bwa mem -5SPM /path/to/ref.fa sample_R1.fastq.gz sample_R2.fastq.gz \
    | samtools sort -n -o sample.bam -O BAM

# BWA MEM2
bwa-mem2 mem -5SPM /path/to/ref.fa sample_R1.fastq.gz sample_R2.fastq.gz \
    | samtools sort -n -o sample.bam -O BAM

# Biscuit
biscuit align -SPM -b 0 /path/to/ref.fa sample_R1.fastq.gz sample_R2.fastq.gz \
    | samtools sort -n -o sample.bam -O BAM

# BSBolt
bsbolt Align -SPM -DB /path/to/ref_DB -F1 sample_R1.fastq.gz -F2 sample_R2.fastq.gz -O /path/to/sample
```
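
Before calling contacts, it may be worth confirming that a BAM declares read-name sorting. The sketch below only inspects the sort order declared in the SAM header (`SO:queryname` on the `@HD` line); it does not verify the actual record order, and a BAM taken straight from an aligner may keep mates together without declaring it. `is_queryname_sorted` is a hypothetical helper name.

```bash
# Succeed (exit 0) only if the SAM header on stdin declares SO:queryname
# in its @HD line; otherwise exit 1.
is_queryname_sorted() {
    awk -F'\t' '
        /^@HD/ { for (i = 2; i <= NF; i++) if ($i == "SO:queryname") found = 1 }
        END { exit !found }'
}

# Hypothetical usage (assumes samtools is on PATH):
#   samtools view -H sample.bam | is_queryname_sorted \
#       || echo "sample.bam does not declare read-name sorting" >&2
```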

### Calling contacts with map3C

Install the following Conda environment:

* [map3C_tools](envs/map3C_tools.yml)

Run the following commands to install map3C:

```{bash}
conda activate map3C_tools
# Optionally pin a specific map3C version
pip install map3C
```

Generate cut site locations for the restriction enzyme of interest (in this case, MboI):

```{bash}
conda activate map3C_tools
map3C restriction-sites --cut-seqs GATC --reference /path/to/ref.fa --output /path/to/map3C_run/txt/MboI.txt
```

Finally, run map3C to generate contacts and a BAM file with mapping artifacts removed:

```{bash}
conda activate map3C_tools
map3C call-contacts \
    --bam sample.bam \
    --out-prefix sample \
    --chrom-sizes /path/to/map3C_run/txt/hg38.chrom.sizes \
    --restriction-sites /path/to/map3C_run/txt/MboI.txt \
    --restriction-enzymes MboI \
    --reference-name hg38 \
    --mate-annotation flag \
    --trim-reads \
    --min-mapq 30 \
    --max-molecule-size 750 \
    --max-inter-align-gap 20 \
    --min-inward-dist-enzyme 1000 \
    --min-outward-dist-enzyme 1000 \
    --min-same-strand-dist-enzyme 1000 \
    --min-inward-dist-enzymeless 1000 \
    --min-outward-dist-enzymeless 1000 \
    --min-same-strand-dist-enzymeless 1000 \
    --max-cut-site-split-algn-dist 20 \
    --max-cut-site-whole-algn-dist 500
```
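
The `--chrom-sizes` file is not described further here. As a sketch, assuming map3C accepts the conventional two-column, tab-separated layout (sequence name, length), such a file can be derived from a FASTA index. `fai_to_chrom_sizes` is a hypothetical helper; the assumption about the expected layout should be checked against `map3C call-contacts --help`.

```bash
# Keep the first two tab-separated columns (sequence name, length) of a
# FASTA index (.fai) read from stdin.
fai_to_chrom_sizes() {
    cut -f1,2
}

# Hypothetical usage (assumes samtools is on PATH):
#   samtools faidx /path/to/ref.fa
#   fai_to_chrom_sizes < /path/to/ref.fa.fai > /path/to/map3C_run/txt/hg38.chrom.sizes
```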

The command above is generic and can be modified for different assays. For a detailed description of these and additional parameters, run:

```
conda activate map3C_tools
map3C call-contacts --help
```

### Output of map3C

The above command produces 3 files:

* `sample_map3C.bam`

  This contains the input alignments that have a) passed the MAPQ filter for `map3C call-contacts` and b) had multimapping artifacts between soft-clipped alignments from the same read trimmed out (if the `--trim-reads` option was used).
  
* `sample_map3C.pairs.gz`

  This contains chromatin contacts. Please refer to the [Pairtools](https://pairtools.readthedocs.io/en/latest/) package for helpful tools for further processing, QC, and usage of this file, as well as understanding its format. This file, at a minimum, contains an added column relative to the default PAIRS format: "contact_class". The possible values of this column are:

  | Value                | Description |
  | :------------------- | :---------- |
  | {enzyme}             | The name of a specific restriction enzyme (RE), like MboI or NlaIII. This indicates that the contact's 2 alignments were soft-clipped and adjacent on the same read. In addition, each was proximal to a cut site in the reference genome for the RE listed. |
  | gap                  | This indicates that the contact's 2 alignments were from different reads. In addition, each was proximal to a cut site in the reference genome for a specific RE. |
  | enzymeless_chimera   | This indicates that the contact's 2 alignments were soft-clipped and adjacent on the same read. In addition, at least one was **not** proximal to an RE cut site in the reference genome. *These are useful for finding structural variant (SV) breakpoints at 1 bp resolution.* |
  | enzymeless_gap       | This indicates that the contact's 2 alignments were from different reads. In addition, at least one was **not** proximal to an RE cut site in the reference genome. |


* `sample_alignment_stats.txt`

  This file contains in-depth statistics about the number and characteristics of the alignments in the input BAM file. These statistics are generally not needed for typical QC analyses of the alignments, and other tools like [samtools](https://www.htslib.org/) are recommended instead. Detailed documentation of these statistics will follow in a later update.
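
As a quick look at the `contact_class` column described above, the sketch below tallies its values in a pairs file. It assumes the file carries a `#columns:` header line that names `contact_class`, as in the PAIRS format; `count_contact_classes` is a hypothetical helper, not a map3C command.

```bash
# Tally contact_class values from pairs-format text on stdin.
# The column position is looked up in the "#columns:" header line.
count_contact_classes() {
    awk '
        /^#columns:/ {
            # Header fields are whitespace-separated; field 1 is "#columns:"
            # itself, so header field i names body column i - 1.
            n = split($0, names, /[ \t]+/)
            for (i = 2; i <= n; i++) if (names[i] == "contact_class") col = i - 1
            next
        }
        /^#/ { next }          # skip all other header lines
        col {
            split($0, fields, "\t")
            counts[fields[col]]++
        }
        END { for (c in counts) print c "\t" counts[c] }
    ' | LC_ALL=C sort
}

# Usage:
#   gzip -dc sample_map3C.pairs.gz | count_contact_classes
```

The output is one tab-separated line per class (value, count), which can be compared across samples alongside the Pairtools statistics.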

## Specific instructions for assay-specific Snakemake pipelines

Look in the [`instructions/`](instructions/) directory for assay-specific processing instructions with Snakemake pipelines to improve performance on HPC systems.

