========
cutadapt
========

cutadapt removes adapter sequences from DNA high-throughput
sequencing data. This is usually necessary when the read length of the
machine is longer than the molecule that is sequenced, such as in
microRNA data.

cutadapt is implemented in Python, with an extension module,
written in C, that implements the alignment algorithm.


Project homepage
================

See http://code.google.com/p/cutadapt/


License
=======

(This is the MIT license.)

Copyright (c) 2010 Marcel Martin <marcel.martin@tu-dortmund.de>

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in
all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
THE SOFTWARE.


Dependencies
============

cutadapt needs Python 2.6 or later. It will most likely not work with Python 3.
For installation from sources, a C compiler needs to be installed.
The program has been developed and tested on Ubuntu and OpenSuSE.


Installation
============

$ python setup.py build
$ python setup.py install


Use without installation
========================

Build the C extension module:
$ python setup.py build_ext -i

Then simply run the script from where it is, similar to this:
$ /home/username/downloads/cutadapt-0.x/cutadapt --help


How to use, examples
====================

Please also see the command-line help:
$ cutadapt --help

The basic command-line for cutadapt looks like this:
$ cutadapt -a AACCGGTT input.fastq > output.fastq
The adapter sequence is given with -a option. Replace
AACCGGTT with your actual adapter sequence.

Multiple adapters can be provided using -a more than once.
The program will then search for all given adapters and remove
the one that matches best.

input.fastq is a file with reads. The result will be written
to standard output. Use redirection with '>' (or cutadapt's -o
option) to write the output to a file.

By default, the output file contains all reads, even those
that did not contain an adapter. (See also the --discard option.)


Illumina data
-------------

Assuming your sequencing data is available as a FASTQ file, use this
command line:
$ cutadapt -a ADAPTER-SEQUENCE input.fastq > output.fastq

gz-compressed input is supported:
$ cutadapt -a ADAPTER-SEQUENCE input.fastq.gz > output.fastq

gz-compressed output is also supported, but the -o parameter (output file) needs
to be used (gzip compression is auto-detected by looking at the file name):
$ cutadapt -a ADAPTER-SEQUENCE -o output.fastq.gz input.fastq.gz


SOLiD data
----------

With color space data, the option -c must be used and also the adapter must
be given in color space.

Cut an adapter from SOLiD data given in solid.csfasta and solid.qual.
Produce MAQ- and BWA-compatible output, allow 12% errors, write the
resulting FASTQ file to output.fastq. Add the prefix "abc:" to the read
names:

$ cutadapt -c -e 0.12 -a 330201030313112312 -x abc: --maq solid.csfasta solid.qual > output.fastq

Instead of redirecting standard output with ">", the "-o" option can be used:

$ cutadapt -c -e 0.12 -a 330201030313112312 -x abc: --maq -o output.fastq solid.csfasta solid.qual

Do the same, but produce BFAST-compatible output and strip the _F3 suffix from read names:

$ cutadapt -c -e 0.12 -a 330201030313112312 -x abc: --strip-f3 solid.csfasta solid.qual > output.fastq


FASTA file
----------

Cut an adapter from reads given in a FASTA file. Try to remove an adapter three times
(this is usually not needed), use the default error rate of 10%, write result to output.fa:

$ cutadapt -n 3 -a TGAGACACGCAACAGGGGAAAGGCAAGGCACACAGGGGATAGG input.fa > output.fa


Quality Trimming
----------------

The '-q' (or --trim-qualities) parameter can be used to trim low-quality ends
from reads before adapter removal. For this to work correctly, the quality
values must be encoded as ascii(phred quality + 33). If they are encoded as
ascii(phred quality + 64), you currently have to add 31 to the cutoff. For
example, if you actually mean "-q 10", you have to write "-q 41".

The trimming algorithm is the same as the one used by BWA. That is: Subtract
the given cutoff from all qualities; compute partial sums from all indices to
the end of the sequence; cut sequence at the index at which the sum is minimal.


Adapters
========

These are some 454 adapters:
A1:   5'- TCCATCTCATCCCTGCGTGTCCCATCTGTTCCCTCCCTGTCTCA
A2:   5'- TGAGACAGGGAGGGAACAGATGGGACACGCAGGGATGAGATGGA
B1:   5'- CCTATCCCCTGTGTGCCTTGCCTATCCCCTGTTGCGTGTCTCA
B2:   5'- TGAGACACGCAACAGGGGAAAGGCAAGGCACACAGGGGATAGG

This is an AB SOLiD adapter (in color space) used in the SREK protocol:
330201030313112312


Algorithm
=========

cutadapt uses a simple semi-global alignment algorithm, without any special optimizations.
For speed, the algorithm is implemented as a Python extension module in calignmodule.c.

The program is sufficiently fast for my purposes, but speedups should be simple to achieve.


Partial adapter matches
-----------------------

Cutadapt correctly deals with partial adapter matches. As an example, suppose
your adapter sequence is "ADAPTER" (specified via the -a or --adapter
command-line parameter). If you have these input sequences:

MYSEQUENCEADAPTER
MYSEQUENCEADAP
MYSEQUENCEADAPTERSOMETHINGELSE

All of them will be trimmed to "MYSEQUENCE". If the sequence starts with an
adapter, like this:

ADAPTERSOMETHING

It will be empty after trimming.

When the allowed error rate is sufficiently high (set with parameter -e), errors in
the adapter sequence are allowed. For example, ADABTER (1 mismatch), ADAPTR (1 deletion),
and ADAPPTER (1 insertion) will all be recognized if the error rate is set to 0.15.


Allowing adapters anywhere
--------------------------

Cutadapt assumes that any adapter specified via the -a (or --adapter) parameter
was ligated to the 3' end of the sequence. This is the correct assumption for
at least the SOLiD and Illumina small RNA protocols and probably others.

If, on the other hand, your adapter can also be ligated to the 5' end (on
purpose or by accident), you should tell cutadapt so by using the -b (or
--anywhere) parameter. It will then use a different alignment algorithm and
correctly trim adapters that appear in the beginning of a read. An adapter
specified via -b will also be found if it appears only partially in the
beginning of a read. For example, these sequences

ADAPTERMYSEQUENCE
PTERMYSEQUENCE

will be trimmed to "MYSEQUENCE". Note that the regular algorithm would trim
the first read to an empty sequence.

The -b parameter currently does not work with color space data.


Changes
=======

v0.9.2
------
* Install a single 'cutadapt' Python package instead of multiple Python
  modules. This avoids cluttering the global namespace and should lead to less
  problems with other Python modules. Thanks to Steve Lianoglou for
  pointing this out to me!
* ignore case (ACGT vs acgt) when comparing the adapter with the read sequence
* .FASTA/.QUAL files (not necessarily color space) can now be read (some
  454 software uses this format)
* Move some functions into their own modules
* lots of refactoring: replace the fasta module with a much nicer seqio module.
* allow to input FASTA/FASTQ on standard input (also FASTA/FASTQ is
  autodetected)

v0.9
----
* add --too-short-output and --untrimmed-output, based on patch by Paul Ryvkin (thanks!)
* add --maximum-length parameter: discard reads longer than a specified length
* group options by category in --help output
* add --length-tag option. allows to fix read length in FASTA/Q comment lines
  (e.g., 'length=123' becomes 'length=58' after trimming) (requested by Paul Ryvkin)
* add -q/--quality-cutoff option for trimming low-quality ends (uses the same algorithm
  as BWA)
* some refactoring
* the filename '-' is now interpreted as standard in or standard output

v0.8
----
* Change default behavior of searching for an adapter: The adapter is now assumed to
  be an adapter that has been ligated to the 3' end. This should be the correct behavior
  for at least the SOLiD small RNA protocol (SREK) and also for the Illumina protocol.
  To get the old behavior, which uses a heuristic to determine whether the adapter was
  ligated to the 5' or 3' end and then trimmed the read accordingly, use the new
  -b (--anywhere) option.
* Clear up how the statistics after processing all reads are printed.
* Fix incorrect statistics. Adapters starting at pos. 0 were correctly trimmed,
  but not counted.
* Modify scoring scheme: Improves trimming (some reads that should have been
  trimmed were not). Increases no. of trimmed reads in one of our SOLiD data sets
  from 36.5 to 37.6%.
* Speed improvements (20% less runtime on my test data set).

v0.7
----
* Useful exit codes
* Better error reporting when malformed files are encountered
* Add --minimum-length parameter for discarding reads that are shorter than
  a specified length after trimming.
* Generalize the alignment function a bit. This is preparation for
  supporting adapters that are specific to either the 5' or 3' end.
* pure Python fallback for alignment function for when the C module cannot
  be used.

v0.6
----
* Support gzipped input and output.
* Print timing information in statistics.

v0.5
----
* add --discard option which makes cutadapt discard reads in which an adapter occurs

v0.4
----
 * (more) correctly deal with multiple adapters: If a long adapter matches with lots of
   errors, then this could lead to a a shorter adapter matching with few errors getting ignored.

v0.3
----
 * fix huge memory usage (entire input file was unintentionally read into memory)

v0.2
----
* allow FASTQ input

v0.1
----
* initial release


To Do / Ideas
=============

  * add test cases to version control
  * allow to give adapter in nucleotide space, but find it in color space
  * show average error rate
  * show table of length vs. max errors
  * In color space and probably also for Illumina data, gapped alignment
    is not necessary
  * bzip2 support
  * use str.format instead of %
  * allow to change scores at runtime (perhaps using command-line parameters)
  * multi-threading
  * --progress
  * refactor
  * run pylint, pychecker
  * print adapter fragments in statistics
  * --discard-uncut (discard sequences in which no adapter was found)
  * giving a qual file should enable color space
  * no. of trimmed nucleotides
  * length histogram
  * refactor read_sequences (use classes)
  * put write_read into a Fast(a|q)Writer class?
  * allow .txt input/output
  * test on Windows
  * SRA color space FASTQ files include a quality for the initial nucleotide, this is not handled correctly
