Metadata-Version: 2.4
Name: samsift
Version: 0.3.1
Summary: Advanced filtering and tagging of SAM/BAM alignments using Python expressions
Home-page: https://github.com/karel-brinda/samsift
Author: Karel Brinda
Author-email: karel.brinda@inria.fr
License: MIT
Keywords: SAM,BAM,sequencing,alignment,filtering,tagging,genomics
Classifier: Development Status :: 5 - Production/Stable
Classifier: Programming Language :: Python :: 3 :: Only
Classifier: Operating System :: Unix
Classifier: Environment :: Console
Classifier: Intended Audience :: Science/Research
Classifier: Topic :: Scientific/Engineering :: Bio-Informatics
Requires-Python: >=3.8
Description-Content-Type: text/x-rst
License-File: LICENSE
Requires-Dist: pysam
Dynamic: license-file

SAMsift
=======

.. image:: https://github.com/karel-brinda/samsift/actions/workflows/ci.yml/badge.svg
        :target: https://github.com/karel-brinda/samsift/actions/workflows/ci.yml

.. image:: https://img.shields.io/badge/install%20with-bioconda-brightgreen.svg?style=flat-square
        :target: https://anaconda.org/bioconda/samsift

.. image:: https://badge.fury.io/py/samsift.svg
        :target: https://badge.fury.io/py/samsift

.. image:: https://zenodo.org/badge/DOI/10.5281/zenodo.1048211.svg
        :target: https://doi.org/10.5281/zenodo.1048211

SAMsift is a program for advanced filtering and tagging of SAM/BAM alignments
using Python expressions.


Getting started
---------------

.. code-block:: bash

       # clone this repo and add it to PATH
       git clone https://github.com/karel-brinda/samsift
       cd samsift
       export PATH=$(pwd)/samsift:$PATH

       # filtering: keep only alignments with score >94, save them as filtered.bam
       samsift -i tests/test.bam -o filtered.bam -f 'AS>94'
       # filtering: keep only unaligned reads
       samsift -i tests/test.bam -f 'FLAG & 0x04'
       # filtering: keep only aligned reads
       samsift -i tests/test.bam -f 'not(FLAG & 0x04)'
       # filtering: keep only sequences containing ACCAGAGGAT
       samsift -i tests/test.bam -f 'SEQ.find("ACCAGAGGAT")!=-1'
       # filtering: keep only sequences containing A and T only (defined using regular expressions)
       samsift -i tests/test.bam -f 're.match(r"^[AT]*$", SEQ)'
       # filtering: sample alignments with 25% rate
       samsift -i tests/test.bam -f 'random.random()<0.25'
       # filtering: sample alignments with 25% rate with a fixed RNG seed
       samsift -i tests/test.bam -f 'random.random()<0.25' -0 'random.seed(42)'
       # filtering: keep only alignments of reads specified in tests/qnames.txt
       samsift -i tests/test.bam -0 'q=open("tests/qnames.txt").read().splitlines()' -f 'QNAME in q'
       # filtering: keep only first 5000 reads from chr1 and 5000 reads from chr2
       samsift -i tests/test.bam -0 'c={"chr1":5000,"chr2":5000}' -f 'c[RNAME]>0' -c 'c[RNAME]-=1' -m nonstop-remove
       # tagging: add tags 'ln' with sequence length and 'ab' with average base quality
       samsift -i tests/test.bam -c 'ln=len(SEQ);ab=1.0*sum(QUALa)/ln'
       # tagging: add a tag 'ii' with the number of the current alignment
       samsift -i tests/test.bam -0 'i=0' -c 'i+=1;ii=i'
       # updating: removing sequences and base qualities
       samsift -i tests/test.bam -c 'a.query_sequence=""'
       # updating: switching all reads to unaligned
       samsift -i tests/test.bam -c 'a.flag|=0x4;a.reference_start=-1;a.cigarstring="";a.reference_id=-1;a.mapping_quality=0'


Installation
------------

**Using Bioconda:**

.. code-block:: bash

        # add all necessary Bioconda channels
        conda config --add channels defaults
        conda config --add channels conda-forge
        conda config --add channels bioconda

        # install samsift
        conda install samsift


**Using PIP from PyPI:**

.. code-block:: bash

   pip install --upgrade samsift


**Using PIP from GitHub:**

.. code-block:: bash

   pip install --upgrade git+https://github.com/karel-brinda/samsift


Command-line parameters
-----------------------

.. USAGE-BEGIN

.. code-block::

	Program: samsift (advanced filtering and tagging of SAM/BAM alignments using Python expressions)
	Version: 0.3.1
	Author:  Karel Brinda <karel.brinda@inria.fr>

	Usage:   samsift.py [-i FILE] [-o FILE] [-f [PY_EXPR ...]] [-c [PY_CODE ...]] [-m STR]
	                    [-0 [PY_CODE ...]] [-d [PY_EXPR ...]] [-t [PY_EXPR ...]]

	Basic options:
	  -h, --help        show this help message and exit
	  -v, --version     show program's version number and exit
	  -i FILE           input SAM/BAM file [-]
	  -o FILE           output SAM/BAM file [-]
	  -f [PY_EXPR ...]  filtering expression [True]
	  -c [PY_CODE ...]  code to be executed (e.g., assigning new tags) [None]
	  -m STR            mode: strict (stop on first error)
	                          nonstop-keep (keep alignments causing errors)
	                          nonstop-remove (remove alignments causing errors) [strict]

	Advanced options:
	  -0 [PY_CODE ...]  initialization [None]
	  -d [PY_EXPR ...]  debugging expression to print [None]
	  -t [PY_EXPR ...]  debugging trigger [True]


.. USAGE-END

Algorithm
---------

.. code-block:: python

        exec(INITIALIZATION)
        for ALIGNMENT in ALIGNMENTS:
                if eval(DEBUG_TRIGGER):
                        print(eval(DEBUG_EXPR))
                if eval(FILTER):
                        exec(CODE)
                        print(ALIGNMENT)


**Python expressions and code.** All expressions and code must be valid
`Python 3 <https://docs.python.org/3/>`_. Expressions are evaluated
using the `eval <https://docs.python.org/3/library/functions.html#eval>`_
function and code is executed using the `exec
<https://docs.python.org/3/library/functions.html#exec>`_ function.
Initialization can be used for importing Python modules, setting global
variables (e.g., counters) or loading data from disk. Some modules (namely
``datetime``, ``math``, ``random``, and ``re``) are loaded without an explicit request,
and the internal RNG seed is set to 42.

*Example* (printing all alignments):

.. code-block:: bash

        samsift -i tests/test.bam -f 'True'
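
The loop from the beginning of this section can be mimicked in plain Python.
The following is a minimal sketch, not SAMsift's actual implementation:
alignments are represented by dictionaries instead of PySam objects, and the
function ``sift`` and its parameters are hypothetical names chosen for
illustration.

.. code-block:: python

        def sift(records, filter_expr="True", code="", init=""):
            """Yield the records passing filter_expr, after running code on them."""
            # One shared namespace, so that state created by the initialization
            # (e.g., counters) persists across records.
            namespace = {}
            exec(init, namespace)
            for record in records:
                # Expose the fields of the current record as variables.
                # (Stale values from earlier records are not cleared in this sketch.)
                namespace.update(record)
                if eval(filter_expr, namespace):
                    exec(code, namespace)
                    # Copy the possibly modified fields back into a new record.
                    yield {key: namespace[key] for key in record}


        # Stand-ins for alignments (assumed data, not taken from tests/test.bam).
        records = [
            {"QNAME": "r1", "AS": 96},
            {"QNAME": "r2", "AS": 90},
            {"QNAME": "r3", "AS": 99},
        ]
        kept = list(sift(records, filter_expr="AS>94", code="i+=1", init="i=0"))
        print([r["QNAME"] for r in kept])  # prints ['r1', 'r3']

A single namespace is used for both ``eval`` and ``exec`` so that a counter
such as ``i`` keeps its value from one record to the next.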

**SAM fields.** Expressions and code can access variables mirroring the fields
from the alignment section of the `SAM specification
<https://samtools.github.io/hts-specs/SAMv1.pdf>`_, i.e., ``QNAME``, ``FLAG``,
``RNAME``, ``POS`` (1-based), ``MAPQ``, ``CIGAR``, ``RNEXT``, ``PNEXT``,
``TLEN``, ``SEQ``, and ``QUAL``. Several additional variables are defined to
simplify access to some useful information: ``QUALa`` stores the base qualities
as an integer array; ``SEQs``, ``QUALs``, and ``QUALsa`` skip soft-clipped
bases; and ``RNAMEi`` and ``RNEXTi`` store the reference ids as integers.

*Example* (keeping only the alignments with leftmost position <= 10000):

.. code-block:: bash

        samsift -i tests/test.bam -f 'POS<=10000'
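
To illustrate what the derived variables represent, the sketch below recomputes
an integer quality array and soft-clip-trimmed strings from the raw SAM fields.
It assumes the Phred+33 encoding of ``QUAL`` defined by the SAM specification.
The helpers ``qual_to_ints`` and ``strip_soft_clips`` are hypothetical and are
not part of SAMsift, so the way SAMsift itself computes these values may differ.

.. code-block:: python

        import re


        def qual_to_ints(qual):
            """Decode a Phred+33 quality string into a list of integers."""
            return [ord(char) - 33 for char in qual]


        def strip_soft_clips(text, cigar):
            """Drop the soft-clipped prefix and suffix of a SEQ or QUAL string."""
            ops = re.findall(r"(\d+)([MIDNSHP=X])", cigar)
            # Hard-clipped bases (H) are not present in SEQ, so ignore them.
            ops = [(int(length), op) for length, op in ops if op != "H"]
            left = ops[0][0] if ops and ops[0][1] == "S" else 0
            right = ops[-1][0] if len(ops) > 1 and ops[-1][1] == "S" else 0
            return text[left:len(text) - right]


        # Made-up example record.
        SEQ, QUAL, CIGAR = "ACGTACGT", "IIIIFFFF", "2S5M1S"
        print(qual_to_ints(QUAL))                           # [40, 40, 40, 40, 37, 37, 37, 37]
        print(strip_soft_clips(SEQ, CIGAR))                 # GTACG
        print(qual_to_ints(strip_soft_clips(QUAL, CIGAR)))  # [40, 40, 37, 37, 37]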


SAMsift internally uses the `PySam <http://pysam.readthedocs.io/>`_ library and
the representation of the current alignment (an instance of the class
`pysam.AlignedSegment
<http://pysam.readthedocs.io/en/latest/api.html#pysam.AlignedSegment>`_) is
available as a variable ``a``. Therefore, the previous example is equivalent to

.. code-block:: bash

        samsift -i tests/test.bam -f 'a.reference_start+1<=10000'


The ``a`` variable can also be used for modifying the current alignment record.

*Example* (removing the sequence and the base qualities from every record):

.. code-block:: bash

        samsift -i tests/test.bam -c 'a.query_sequence=""'


**SAM tags.** Every SAM tag is translated to a variable with the same name.

*Example* (removing alignments with a score smaller than or equal to the sequence length):

.. code-block:: bash

        samsift -i tests/test.bam -f 'AS>len(SEQ)'

If ``CODE`` is provided, all two-letter variables except ``re`` (the Python regex
module) are back-translated to tags after the code execution.

*Example* (adding a tag ``ab`` carrying the average base quality):

.. code-block:: bash

        samsift -i tests/test.bam -c 'ab=1.0*sum(QUALa)/len(QUALa)'
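
The back-translation rule can be sketched as follows, assuming that all
variables live in a single dictionary after the code has run. The helper
``collect_tags`` and the sample values are hypothetical; they only illustrate
the rule of selecting two-letter names other than ``re``.

.. code-block:: python

        import re


        def collect_tags(namespace):
            """Return the two-letter variables that would be written back as tags."""
            return {
                name: value
                for name, value in namespace.items()
                if len(name) == 2 and name != "re"
            }


        # Made-up variables for one alignment: a module, two fields, and one tag.
        namespace = {"re": re, "SEQ": "ACGT", "QUALa": [40, 40, 37, 37], "AS": 96}
        exec("ln=len(SEQ);ab=1.0*sum(QUALa)/ln", namespace)
        print(collect_tags(namespace))  # {'AS': 96, 'ln': 4, 'ab': 38.5}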

**Errors.** If an error occurs while an expression is being evaluated or code
is being executed (e.g., due to accessing an undefined tag), the behavior of
SAMsift depends on the specified mode (``-m``). In the strict mode (``-m
strict``, default), SAMsift immediately interrupts the computation and
reports an error. With the ``-m nonstop-keep`` option, SAMsift continues
processing the alignments and keeps the error-causing alignments in the
output. With the ``-m nonstop-remove`` option, all error-causing alignments
are skipped and omitted from the output.
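
The three modes can be sketched with a ``try``/``except`` around the filter
evaluation. This is an illustration under stated assumptions rather than
SAMsift's code: alignments are represented by dictionaries, and
``sift_with_mode`` is a hypothetical function name.

.. code-block:: python

        def sift_with_mode(records, filter_expr, mode="strict"):
            """Filter records; the mode decides what happens when evaluation fails."""
            kept = []
            for record in records:
                try:
                    keep = bool(eval(filter_expr, {}, dict(record)))
                except Exception:
                    if mode == "strict":
                        raise  # stop on the first error
                    # nonstop-keep keeps the record, nonstop-remove drops it
                    keep = (mode == "nonstop-keep")
                if keep:
                    kept.append(record)
            return kept


        # Made-up records; the second one lacks the AS tag, so 'AS>94' fails on it.
        records = [{"QNAME": "r1", "AS": 96}, {"QNAME": "r2"}, {"QNAME": "r3", "AS": 90}]
        for mode in ("nonstop-keep", "nonstop-remove"):
            names = [r["QNAME"] for r in sift_with_mode(records, "AS>94", mode)]
            print(mode, names)
        # nonstop-keep ['r1', 'r2']
        # nonstop-remove ['r1']

The same mechanism explains the ``chr1``/``chr2`` example in Getting started:
looking up a reference name that is missing from the dictionary ``c`` raises an
error, so ``-m nonstop-remove`` drops alignments from all other references.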


Similar programs
----------------

* `samtools view <http://www.htslib.org/doc/samtools.html>`_ can filter alignments based on FLAGS, read group tags, and CIGAR strings.
* `sambamba view <http://lomereiter.github.io/sambamba/docs/sambamba-view.html>`_ supports, in addition to the filters of SAMtools, filtering using `simple Perl-like expressions <https://github.com/lomereiter/sambamba/wiki/%5Bsambamba-view%5D-Filter-expression-syntax>`_. However, it is not possible to use floats or compare different tags.
* `BamQL <https://github.com/BoutrosLaboratory/bamql>`_ provides a simple query language for filtering SAM/BAM files.
* `bamPals <https://github.com/zeeev/bamPals>`_ adds tags XB, XE, XP and XL.
* `SamJavascript <http://lindenb.github.io/jvarkit/SamJavascript.html>`_ can filter alignments using JavaScript expressions.
* `Picard FilterSamReads <https://broadinstitute.github.io/picard/command-line-overview.html#FilterSamReads>`_ can also filter alignments using JavaScript expressions.


Issues
------

Please use `GitHub issues <https://github.com/karel-brinda/samsift/issues>`_.


Changelog
---------

See `Releases <https://github.com/karel-brinda/samsift/releases>`_.


Licence
-------

`MIT <https://github.com/karel-brinda/samsift/blob/master/LICENSE>`_


Author
------

`Karel Brinda <http://brinda.eu>`_ <karel.brinda@inria.fr>
