# -*- coding: utf-8 -*-
from setuptools import setup

packages = \
['vembrane', 'vembrane.modules']

package_data = \
{'': ['*']}

install_requires = \
['asttokens>=2.0,<3.0',
 'intervaltree>=3.1,<4.0',
 'pysam>=0.19,<0.20',
 'pyyaml>=6.0,<7.0']

extras_require = \
{':python_version >= "3.8"': ['numpy>=1.23,<2.0']}

entry_points = \
{'console_scripts': ['vembrane = vembrane.cli:main']}

setup_kwargs = {
    'name': 'vembrane',
    'version': '0.12.1',
    'description': 'Filter VCF/BCF files with Python expressions.',
    'long_description': '[![CI](https://github.com/vembrane/vembrane/actions/workflows/main.yml/badge.svg)](https://github.com/vembrane/vembrane/actions/workflows/main.yml)\n[![DOI](https://zenodo.org/badge/276383670.svg)](https://zenodo.org/badge/latestdoi/276383670)\n[![Bioconda](https://anaconda.org/bioconda/misopy/badges/installer/conda.svg)](https://bioconda.github.io/recipes/vembrane/README.html)\n\n# vembrane: variant filtering using python expressions\n\nvembrane allows simultaneous filtering of variants based on any `INFO` or `FORMAT` field, `CHROM`, `POS`, `ID`, `REF`, `ALT`, `QUAL`, `FILTER`, and the annotation field `ANN`. When filtering based on `ANN`, annotation entries are filtered first. If no annotation entry remains, the entire variant is deleted.\n\nvembrane relies on [pysam](https://pysam.readthedocs.io/en/latest/) for reading/writing VCF/BCF files.\n\nFor a comparison with similar tools, have a look at the [vembrane benchmarks](https://github.com/vembrane/vembrane-benchmark).\n\n## Installation\nvembrane is available in [bioconda](https://bioconda.github.io/) and can either be installed into an existing conda environment with `mamba install -c bioconda vembrane` or into a new named environment with `mamba create -n environment_name -c bioconda vembrane`.\nAlternatively, if you are familiar with git and [poetry](https://python-poetry.org/), clone this repository and run `poetry install`.\nSee [docs/develop.md](docs/develop.md) for further details.\n\n## `vembrane filter`\n\n### Usage\nvembrane takes two positional arguments: the filter expression and the input file; the latter may be omitted to read from `stdin` instead, making it easy to use vembrane in pipe chains.\n```\nusage: vembrane filter [options] expression [input vcf/bcf]\n\noptions:\n  -h, --help            show this help message and exit\n  --output OUTPUT, -o OUTPUT\n                        Output file. 
If not specified, output is written to STDOUT.\n  --output-fmt {vcf,bcf,uncompressed-bcf}, -O {vcf,bcf,uncompressed-bcf}\n                        Output format.\n  --annotation-key FIELDNAME, -k FIELDNAME\n                        The INFO key for the annotation field. Defaults to "ANN".\n  --aux NAME PATH, -a NAME PATH\n                        Path to an auxiliary file containing a set of symbols.\n  --keep-unmatched      Keep all annotations of a variant if at least one of them\n                        passes the expression (mimics SnpSift behaviour).\n  --preserve-order      Ensures that the order of the output matches that of the input.\n                        This is only useful if the input contains breakends (BNDs)\n                        since the order of all other variants is preserved anyway.\n```\n\n\n### Filter expression\nThe filter expression can be any valid python expression that evaluates to `bool`. However, functions and symbols available have been restricted to the following:\n\n * `all`, `any`\n * `abs`, `len`, `max`, `min`, `round`, `sum`\n * `enumerate`, `filter`, `iter`, `map`, `next`, `range`, `reversed`, `sorted`, `zip`\n * `dict`, `list`, `set`, `tuple`\n * `bool`, `chr`, `float`, `int`, `ord`, `str`\n * Any function or symbol from [`math`](https://docs.python.org/3/library/math.html)\n * Any function from [`statistics`](https://docs.python.org/3/library/statistics.html)\n * Regular expressions via [`re`](https://docs.python.org/3/library/re.html)\n * custom functions:\n   * `without_na(values: Iterable[T]) -> Iterable[T]` (keep only values that are not `NA`)\n   * `replace_na(values: Iterable[T], replacement: T) -> Iterable[T]` (replace values that are `NA` with some other fixed value)\n   * genotype related:\n     * `count_hom`, `count_het` , `count_any_ref`, `count_any_var`, `count_hom_ref`, `count_hom_var`\n     * `is_hom`, `is_het`, `is_hom_ref` , `is_hom_var`\n     * `has_ref`, `has_var`\n\n### Available fields\nThe following VCF 
fields can be accessed in the filter expression:\n\n|Name|Type|Interpretation|Example expression|\n|---|---|---|---|\n|`INFO`|`Dict[str, Any¹]`| `INFO field -> Value`  | `INFO["DP"] > 0`|\n|`ANN`| `Dict[str, Any²]`| `ANN field -> Value` | `ANN["Gene_Name"] == "CDH2"`|\n|`CHROM`| `str` | Chromosome Name  |  `CHROM == "chr2"` |\n|`POS`| `int` | Chromosomal position  | `24 < POS < 42`|\n|`ID`| `str`  | Variant ID |  `ID == "rs11725853"` |\n|`REF`| `str` |  Reference allele  | `REF == "A"` |\n|`ALT`| `str` |  Alternative allele³  | `ALT == "C"`|\n|`QUAL`| `float`  | Quality |  `QUAL >= 60` |\n|`FILTER`| `List[str]` | Filter tags | `"PASS" in FILTER` |\n|`FORMAT`|`Dict[str, Dict[str, Any¹]]`| `Format -> (Sample -> Value)` | `FORMAT["DP"][SAMPLES[0]] > 0` |\n|`SAMPLES`|`List[str]`| `[Sample]`  |  `"Tumor" in SAMPLES` |\n|`INDEX`|`int`| Index of variant in the file  |  `INDEX < 10` |\n\n ¹ depends on type specified in VCF header\n\n ² for the usual snpeff and vep annotations, custom types have been specified; any unknown ANN field will simply be of type `str`. If something lacks a custom parser/type, please consider filing an issue in the [issue tracker](https://github.com/vembrane/vembrane/issues).\n\n ³ vembrane does not handle multi-allelic records itself. 
Instead, such files should be\n preprocessed by either of the following tools (preferably even before annotation):\n - [`bcftools norm -m-any […]`](http://samtools.github.io/bcftools/bcftools.html#norm)\n - [`gatk LeftAlignAndTrimVariants […] --split-multi-allelics`](https://gatk.broadinstitute.org/hc/en-us/articles/360037225872-LeftAlignAndTrimVariants)\n - [`vcfmulti2oneallele […]`](http://lindenb.github.io/jvarkit/VcfMultiToOneAllele.html)\n\n\n### Examples\n\n* Only keep annotations and variants where gene equals "CDH2" and its impact is "HIGH":\n  ```sh\n  vembrane filter \'ANN["Gene_Name"] == "CDH2" and ANN["Annotation_Impact"] == "HIGH"\' variants.bcf\n  ```\n* Only keep variants with quality at least 30:\n  ```sh\n  vembrane filter \'QUAL >= 30\' variants.vcf\n  ```\n* Only keep annotations and variants where feature (transcript) is ENST00000307301:\n  ```sh\n  vembrane filter \'ANN["Feature"] == "ENST00000307301"\' variants.bcf\n  ```\n* Only keep annotations and variants where protein position is less than 10:\n  ```sh\n  vembrane filter \'ANN["Protein"].start < 10\' variants.bcf\n  ```\n* Only keep variants where mapping quality is exactly 60:\n  ```sh\n  vembrane filter \'INFO["MQ"] == 60\' variants.bcf\n  ```\n* Only keep annotations and variants where consequence contains the word "stream" (matching "upstream" and "downstream"):\n  ```sh\n  vembrane filter \'re.search("(up|down)stream", ANN["Consequence"])\' variants.vcf\n  ```\n* Only keep annotations and variants where CLIN_SIG contains "pathogenic", "likely_pathogenic" or "drug_response":\n  ```sh\n  vembrane filter \\\n    \'any(entry in ANN["CLIN_SIG"]\n         for entry in ("pathogenic", "likely_pathogenic", "drug_response"))\' \\\n    variants.vcf\n  ```\n  Using set operations, the same may also be expressed as:\n  ```sh\n  vembrane filter \\\n    \'not {"pathogenic", "likely_pathogenic", "drug_response"}.isdisjoint(ANN["CLIN_SIG"])\' \\\n    variants.vcf\n  ```\n\n### Custom `ANN` 
types\n`vembrane` parses entries in the annotation field as outlined in [docs/ann_types.md](docs/ann_types.md).\n\n### Missing values in annotations\n\nIf a certain annotation field lacks a value, it will be replaced with the special value of `NA`. Comparing with this value will always result in `False`, e.g.\n`ANN["MOTIF_POS"] > 0` will always evaluate to `False` *if* there was no value in the "MOTIF_POS" field of ANN (otherwise the comparison will be carried out with the usual semantics).\n\nSince you may want to use the regex module to search for matches, `NA` also acts as an empty `str`, such that `re.search("nothing", NA)` returns `None` instead of raising an exception.\n\n*Explicitly* handling missing/optional values in INFO or FORMAT fields can be done by checking for NA, e.g. `INFO["DP"] is NA`.\n\nHandling missing/optional values in fields other than INFO or FORMAT can be done by checking for None, e.g. `ID is not None`.\n\nSometimes, multi-valued fields may contain missing values; in this case, the `without_na` function can be convenient, for example: `mean(without_na(FORMAT[\'DP\'][s] for s in SAMPLES)) > 2.3`. It is also possible to replace `NA` with some constant value with the `replace_na` function: `mean(replace_na((FORMAT[\'DP\'][s] for s in SAMPLES), 0.0)) > 2.3`\n\n### Auxiliary files\n`vembrane` supports additional files, such as lists of genes or IDs, with the `--aux NAME path/to/file` option. The file should contain one item per line and is parsed as a set. 
For example, `vembrane filter --aux genes genes.txt "ANN[\'SYMBOL\'] in AUX[\'genes\']" variants.vcf` will keep only records where the annotated symbol is in the set specified in `genes.txt`.\n\n## `vembrane table`\n\nIn addition to the `filter` subcommand, vembrane (`≥ 0.5`) also supports writing tabular data with the `table` subcommand.\nIn this case, an expression which evaluates to `tuple` is expected, for example:\n```sh\nvembrane table \'CHROM, POS, 10**(-QUAL/10), ANN["CLIN_SIG"]\' input.vcf > table.tsv\n```\n\nWhen handling **multi-sample VCFs**, you often want to iterate over all samples in a record by looking at a `FORMAT` field for all of them.\nHowever, if you use a standard Python list comprehension (something like `[FORMAT[\'DP\'][sample] for sample in SAMPLES]`), this would yield a single column with a list containing one entry per sample (something like `[25, 32, 22]` for three samples with the respective depths).\n\nIn order to have a separate column for each sample, you can use the **`for_each_sample()`** function in both the main `vembrane table` expression and the `--header` expression.\nIt should contain one [lambda expression](https://docs.python.org/3/reference/expressions.html#lambda) with exactly one argument, which will be substituted by the sample names in the lambda expression.\n\nFor example, you could specify expressions for the `--header` and the main VCF record evaluation like this:\n```sh\nvembrane table --header \'CHROM, POS, for_each_sample(lambda sample: f"{sample}_depth")\' \'CHROM, POS, for_each_sample(lambda s: FORMAT["DP"][s])\' input.vcf > table.tsv\n```\nGiven a VCF file with samples `Sample_1`, `Sample_2` and `Sample_3`, the header would expand to be printed as:\n```\nCHROM  POS   Sample_1_depth   Sample_2_depth   Sample_3_depth\n```\nand the expression to evaluate on each VCF record would become:\n```python\n(CHROM, POS, FORMAT[\'DP\'][\'Sample_1\'], FORMAT[\'DP\'][\'Sample_2\'], FORMAT[\'DP\'][\'Sample_3\'])\n```\n\nWhen 
not supplying a `--header` expression, the entries of the expanded main expression become the column names in the header.\nWhen supplying a header via `--header`,  its `for_each_sample()` expects an expression which can be evaluated to `str` and must have the same number of fields as the main expression.\n\nPlease note that, as anywhere in vembrane, you can use arbitrary Python expressions in `for_each_sample()` lambda expressions.\nSo you can for example perform computations on fields or combine multiple fields into one value:\n```sh\nvembrane table \'CHROM, POS, for_each_sample(lambda sample: FORMAT["AD"][sample] / FORMAT["DP"][sample] * QUAL)\' input.vcf > table.tsv\n```\n\nInstead of using the `for_each_sample` (wide format) machinery, it is also possible to generate the data in long format by specifying the `--long` flag.\nIn this case, the first column will always be called `SAMPLE` and there\'s an additional variable of the same name available for the expressions.\nFor example:\n```sh\nvembrane table --long \'CHROM, POS, FORMAT["AD"][SAMPLE] / FORMAT["DP"][SAMPLE] * QUAL\' input.vcf > long_table.tsv\n```\n\n## `vembrane annotate`\n\nvembrane is able to annotate vcf files with a given table-like file. 
In addition to the vcf and annotation file, the user has to provide a configuration file.\n\nConfiguration (Example):\n\n```yaml\n## example.yaml\nannotation:\n    file: "example.tsv" # the table-like annotation file with a header row\n    columns:\n      chrom: "chrom" # column name of the annotation file referring to the chromosome\n      start: "chromStart" # column name of the annotation file referring to the chromosome start\n      stop: "chromEnd" # column name of the annotation file referring to the chromosome end\n    delimiter: "\\t" # delimiter of the columns\n    values:\n    - value: # a new annotation entry in the info field of the vcf\n        vcf_name: "genehancer_score" # the name of the annotation entry\n        number: "1" # number of values for each entry\n        description: "Score from genehancer." # description of this entry in the header\n        type: "Float" # type of the values\n        expression: "DATA[\'score\'][0]" # any python expression to calculate the value(s)\n                                       # DATA[\'score\'] refers to the \'score\' column of the annotation field\n    - value: # a second annotation entry to annotate\n        vcf_name: "genehancer_score2"\n        number: "1"\n        description: "Score from genehancer."\n        type: "Float"\n        expression: "log(max(DATA[\'score\']) * 2)"\n```\n\nexample.tsv (Example):\n```\nchrom\tchromStart\tchromEnd\tname\tscore\nchr10\t76001\t77000\tHJSDHKD\t463\nchr10\t120054\t130024\tHJSJHKD\t463\nchr10\t432627\t492679\tIDASJLD\t327\nchr10\t540227\t872071\tSZAGHSD\t435\nchr10\t654480\t1000200\tHSJKJSD\t12\n```\n\nExample invocation: `vembrane annotate example.yaml example.bcf > annotated.vcf`.\n\nInternally, for each vcf record, the overlapping regions of the annotation file are determined and stored in `DATA`. The expression may then access the `DATA` object and its columns by the column names to generate a single or multiple values of cardinality `number` of type `type`. 
These values are stored in the new annotation entry under the name `vcf_name` and with header description `description`.\n\n## Authors\n\n* Marcel Bargull (@mbargull)\n* Jan Forster (@jafors)\n* Till Hartmann (@tedil)\n* Johannes Köster (@johanneskoester)\n* Elias Kuthe (@eqt)\n* David Lähnemann (@dlaehnemann)\n* Felix Mölder (@felixmoelder)\n* Christopher Schröder (@christopher-schroeder)\n',
    'author': 'Till Hartmann',
    'author_email': None,
    'maintainer': None,
    'maintainer_email': None,
    'url': 'https://github.com/vembrane/vembrane',
    'packages': packages,
    'package_data': package_data,
    'install_requires': install_requires,
    'extras_require': extras_require,
    'entry_points': entry_points,
    'python_requires': '>=3.8',
}


setup(**setup_kwargs)
