Metadata-Version: 2.1
Name: plinkformatter
Version: 0.1.46
Summary: 
Author: nick-sebasco
Author-email: nicksebasco.jax@gmail.com
Requires-Python: >=3.9,<4.0
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Requires-Dist: joblib (>=1.4.2,<2.0.0)
Requires-Dist: numpy (>=1.26.4,<2.0.0)
Requires-Dist: pandas (>=2.2.2,<3.0.0)
Requires-Dist: pyarrow (>=21.0.0,<22.0.0)
Requires-Dist: pytest (>=8.2.2,<9.0.0)
Requires-Dist: scipy (>=1.13.1,<2.0.0)
Description-Content-Type: text/markdown

# PLINKFORMATTER

This repository is designed to transform genotype data from the Muster SNPs/download endpoint into PLINK-compatible data, which can then be processed by PyLMM. The primary aim is to facilitate genomic data transformation for linear mixed models. This tool should, in theory, also work with GEMMA and other software that consumes standard PLINK file formats, though it has been primarily tested with PyLMM.

## Getting Started

### Prerequisites
To use this repository, you must have the following dependencies installed:

+ Python 3.9 or above
+ PLINK 2.0
+ Poetry

**Install dependencies**:
```
poetry install
```

**Activate virtual environment**:
```
poetry shell
```

**Install PLINK**: 

You can download PLINK from the [official website](https://www.cog-genomics.org/plink/2.0/). Ensure the PLINK executable is in your system's PATH, or you can specify the path to the PLINK binary in your environment settings.

To verify PLINK installation, run:
```
plink2 --version
```
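
The same check can be done from Python before a pipeline starts. This is a minimal sketch, not part of the package: the helper names `find_plink` and `plink_version` are hypothetical, and the binary name `plink2` is taken from the command above.

```python
import shutil
import subprocess
from typing import Optional


def find_plink(binary: str = "plink2") -> Optional[str]:
    """Return the absolute path of the PLINK executable, or None if it is not on PATH."""
    return shutil.which(binary)


def plink_version(binary: str = "plink2") -> Optional[str]:
    """Return the output of `<binary> --version`, or None if the binary is missing."""
    path = find_plink(binary)
    if path is None:
        return None
    # check=False: a non-zero exit code is reported as empty output, not an exception.
    result = subprocess.run(
        [path, "--version"], capture_output=True, text=True, check=False
    )
    return result.stdout.strip() or None


if __name__ == "__main__":
    version = plink_version()
    print(version if version else "plink2 not found on PATH")
```

If the executable lives outside PATH, passing its full path as `binary` should also work, since `shutil.which` accepts a path to an executable file.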

## Tests

To run the tests, use the following command:
```
pytest -s tests
```

## Publishing to PyPI

0. Update version

```
poetry version patch
```

1. Build any changes

```
poetry build
```

2. Set the correct PyPI repository URL

```
poetry config repositories.pypi https://upload.pypi.org/legacy/
```

3. Set API token

```
poetry config pypi-token.pypi pypi-YourActualTokenHere
```

4. Publish

```
poetry publish
```

## TODO

Software decisions
+ [ ] Operating on a measure directory is an inferior pattern to operating on a list of `MeasureInput`
    dataclass objects. Each `MeasureInput` has a `localfile` attribute, so the input files can live in any folder. This also removes the need to create an unnecessary `measure_id` folder.
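
A rough sketch of the list-based pattern, for discussion only. The `localfile` attribute comes from the note above; the `measure_id` field and the `existing_inputs` helper are assumptions made for illustration and may not match the real `MeasureInput` class.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Iterable, List


@dataclass(frozen=True)
class MeasureInput:
    measure_id: str  # hypothetical field, assumed for this sketch
    localfile: Path  # attribute named in the TODO; may point to any folder


def existing_inputs(inputs: Iterable[MeasureInput]) -> List[MeasureInput]:
    """Keep only the inputs whose local file is present on disk."""
    return [m for m in inputs if m.localfile.is_file()]
```

Because each object carries its own path, callers can pass inputs from different directories in one list, and no per-measure folder has to be created first.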

Differences from Hao's R code:
+ [x] Pheno ordering matters too
+ [x] Massive pheno file difference. Need to de-duplicate.
+ [x] Stop upper-casing strains; Hao mentioned this.
+ [x] Hao creates the kinship matrix from pedmap, not bfile.
+ [x] Hao uses the `--pedmap` PLINK flag, not `--ped` and `--map`; switched in an effort to keep everything the same.
+ [x] Confirm that the first phenotype column of the .pheno file is zscore:
    expected strain | strain | zscore | value -> matches actual
+ [x] `generate_pheno_plink_fast` still uses `create_sorted_pgen`; Hao's code does not leverage the pfiles.
    + [x] Sanity check: ensure no p-files are created.
      45911_plink_new_v1 has only bed, bim, and fam files, like Hao's.
+ [x] We have an autosome-only filter: `--chr 1-19`.
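
For reference, the effect of the `--chr 1-19` filter can be approximated in plain Python on the lines of a PLINK `.map` file, whose first whitespace-separated field is the chromosome code. This is only an illustration of what the filter keeps, not the package's implementation; the names `AUTOSOMES` and `keep_autosomes` are hypothetical.

```python
from typing import Iterable, List

# Chromosome codes "1" through "19", matching the range passed to --chr.
AUTOSOMES = {str(n) for n in range(1, 20)}


def keep_autosomes(map_lines: Iterable[str]) -> List[str]:
    """Return the .map lines whose chromosome code is in 1-19."""
    kept = []
    for line in map_lines:
        fields = line.split()
        # Skip blank lines and any chromosome code outside the range (X, Y, MT, ...).
        if fields and fields[0] in AUTOSOMES:
            kept.append(line)
    return kept
```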
