Metadata-Version: 2.1
Name: poola
Version: 0.0.17
Summary: Python package to analyze the results of pooled CRISPR screens
Home-page: https://github.com/broadinstitute/poola/tree/master/
Author: Peter DeWeirdt, Mudra Hegde
Author-email: petedeweirdt@gmail.com, mudrahegde@gmail.com
License: Apache Software License 2.0
Keywords: CRISPR,python
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Natural Language :: English
Classifier: Programming Language :: Python :: 3.6
Classifier: Programming Language :: Python :: 3.7
Classifier: Programming Language :: Python :: 3.8
Requires-Python: >=3.6
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: numpy>=1.19.5
Requires-Dist: pandas>=1.0.0
Requires-Dist: scikit-learn>=0.24.1
Requires-Dist: scipy>=1.5.2
Requires-Dist: statsmodels>=0.12.2

# poola
> Python package for pooled screen analysis


## Install

Install from github for the latest development release:

`pip install git+git://github.com/gpp-rnd/poola.git#egg=poola`

Or install the most recent distribution from PyPi:

[![PyPI version](https://badge.fury.io/py/poola.svg)](https://badge.fury.io/py/poola)

`pip install poola`

## How to use

Additional packages required for this tutorial can be install using `pip install -r requirements.txt`

```
from poola import core as pool
import pandas as pd
import seaborn as sns
import gpplot
import matplotlib.pyplot as plt
```

To demonstrate the functionality of this module we'll use read counts from [Sanson et al. 2018](https://doi.org/10.1038/s41467-018-07901-8).

```
supp_reads = 'https://static-content.springer.com/esm/art%3A10.1038%2Fs41467-018-07901-8/MediaObjects/41467_2018_7901_MOESM4_ESM.xlsx'
read_counts = pd.read_excel(supp_reads,
                            sheet_name = 'A375_orig_tracr raw reads', 
                            header = None,
                            skiprows = 3, 
                            names = ['sgRNA Sequence', 'pDNA', 'A375_RepA', 'A375_RepB'], 
                            engine='openpyxl')
guide_annotations = pd.read_excel(supp_reads,
                                  sheet_name='sgRNA annotations', 
                                  engine='openpyxl')
```

The input data has three columns with read counts and one column with sgRNA annotations

```
read_counts.head()
```




<div>
<style scoped>
    .dataframe tbody tr th:only-of-type {
        vertical-align: middle;
    }

    .dataframe tbody tr th {
        vertical-align: top;
    }

    .dataframe thead th {
        text-align: right;
    }
</style>
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>sgRNA Sequence</th>
      <th>pDNA</th>
      <th>A375_RepA</th>
      <th>A375_RepB</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>0</th>
      <td>AAAAAAAATCCGGACAATGG</td>
      <td>522</td>
      <td>729</td>
      <td>774</td>
    </tr>
    <tr>
      <th>1</th>
      <td>AAAAAAAGGATGGTGATCAA</td>
      <td>511</td>
      <td>1484</td>
      <td>1393</td>
    </tr>
    <tr>
      <th>2</th>
      <td>AAAAAAATGACATTACTGCA</td>
      <td>467</td>
      <td>375</td>
      <td>603</td>
    </tr>
    <tr>
      <th>3</th>
      <td>AAAAAAATGTCAGTCGAGTG</td>
      <td>200</td>
      <td>737</td>
      <td>506</td>
    </tr>
    <tr>
      <th>4</th>
      <td>AAAAAACACAAGCAAGACCG</td>
      <td>286</td>
      <td>672</td>
      <td>352</td>
    </tr>
  </tbody>
</table>
</div>



```
lognorms = pool.lognorm_columns(reads_df=read_counts, columns=['pDNA', 'A375_RepA', 'A375_RepB'])
filtered_lognorms = pool.filter_pdna(lognorm_df=lognorms, pdna_cols=['pDNA'], z_low=-3)
print('Filtered ' + str(lognorms.shape[0] - filtered_lognorms.shape[0]) + ' columns due to low pDNA abundance')
```

    Filtered 576 columns due to low pDNA abundance


Note that the column names for the lognorms remain the same

```
lognorms.head()
```




<div>
<style scoped>
    .dataframe tbody tr th:only-of-type {
        vertical-align: middle;
    }

    .dataframe tbody tr th {
        vertical-align: top;
    }

    .dataframe thead th {
        text-align: right;
    }
</style>
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>sgRNA Sequence</th>
      <th>pDNA</th>
      <th>A375_RepA</th>
      <th>A375_RepB</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>0</th>
      <td>AAAAAAAATCCGGACAATGG</td>
      <td>4.192756</td>
      <td>3.373924</td>
      <td>3.521755</td>
    </tr>
    <tr>
      <th>1</th>
      <td>AAAAAAAGGATGGTGATCAA</td>
      <td>4.163726</td>
      <td>4.326828</td>
      <td>4.312620</td>
    </tr>
    <tr>
      <th>2</th>
      <td>AAAAAAATGACATTACTGCA</td>
      <td>4.041390</td>
      <td>2.540624</td>
      <td>3.196767</td>
    </tr>
    <tr>
      <th>3</th>
      <td>AAAAAAATGTCAGTCGAGTG</td>
      <td>2.930437</td>
      <td>3.388159</td>
      <td>2.973599</td>
    </tr>
    <tr>
      <th>4</th>
      <td>AAAAAACACAAGCAAGACCG</td>
      <td>3.388394</td>
      <td>3.268222</td>
      <td>2.528233</td>
    </tr>
  </tbody>
</table>
</div>



```
lfc_df = pool.calculate_lfcs(lognorm_df=filtered_lognorms, ref_col='pDNA', target_cols=['A375_RepA', 'A375_RepB'])
```

We drop the pDNA column after calculating log-fold changes

```
lfc_df.head()
```




<div>
<style scoped>
    .dataframe tbody tr th:only-of-type {
        vertical-align: middle;
    }

    .dataframe tbody tr th {
        vertical-align: top;
    }

    .dataframe thead th {
        text-align: right;
    }
</style>
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>sgRNA Sequence</th>
      <th>A375_RepA</th>
      <th>A375_RepB</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>0</th>
      <td>AAAAAAAATCCGGACAATGG</td>
      <td>-0.818831</td>
      <td>-0.671000</td>
    </tr>
    <tr>
      <th>1</th>
      <td>AAAAAAAGGATGGTGATCAA</td>
      <td>0.163102</td>
      <td>0.148894</td>
    </tr>
    <tr>
      <th>2</th>
      <td>AAAAAAATGACATTACTGCA</td>
      <td>-1.500766</td>
      <td>-0.844622</td>
    </tr>
    <tr>
      <th>3</th>
      <td>AAAAAAATGTCAGTCGAGTG</td>
      <td>0.457721</td>
      <td>0.043161</td>
    </tr>
    <tr>
      <th>4</th>
      <td>AAAAAACACAAGCAAGACCG</td>
      <td>-0.120172</td>
      <td>-0.860161</td>
    </tr>
  </tbody>
</table>
</div>



Since we only have two conditions it's easy to visualize replicates as a point densityplot using [gpplot](https://github.com/gpp-rnd/gpplot)

```
plt.subplots(figsize=(4,4))
gpplot.point_densityplot(data=lfc_df, x='A375_RepA', y='A375_RepB')
gpplot.add_correlation(data=lfc_df, x='A375_RepA', y='A375_RepB')
sns.despine()
```


    
![png](docs/images/output_17_0.png)
    


Since we see a strong correlation, we'll average the log-fold change of each sgRNA across replicates

```
avg_replicate_lfc_df = pool.average_replicate_lfcs(lfcs=lfc_df, guide_col='sgRNA Sequence', condition_indices=[0],
                                                   sep='_')
```

After averaging log-fold changes our dataframe is melted, so the condition column specifies the experimental condition (A375 here) and the n_obs specifies the number of replicates

```
avg_replicate_lfc_df.head()
```




<div>
<style scoped>
    .dataframe tbody tr th:only-of-type {
        vertical-align: middle;
    }

    .dataframe tbody tr th {
        vertical-align: top;
    }

    .dataframe thead th {
        text-align: right;
    }
</style>
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>sgRNA Sequence</th>
      <th>condition</th>
      <th>avg_lfc</th>
      <th>n_obs</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>0</th>
      <td>AAAAAAAATCCGGACAATGG</td>
      <td>A375</td>
      <td>-0.744916</td>
      <td>2</td>
    </tr>
    <tr>
      <th>1</th>
      <td>AAAAAAAGGATGGTGATCAA</td>
      <td>A375</td>
      <td>0.155998</td>
      <td>2</td>
    </tr>
    <tr>
      <th>2</th>
      <td>AAAAAAATGACATTACTGCA</td>
      <td>A375</td>
      <td>-1.172694</td>
      <td>2</td>
    </tr>
    <tr>
      <th>3</th>
      <td>AAAAAAATGTCAGTCGAGTG</td>
      <td>A375</td>
      <td>0.250441</td>
      <td>2</td>
    </tr>
    <tr>
      <th>4</th>
      <td>AAAAAACACAAGCAAGACCG</td>
      <td>A375</td>
      <td>-0.490166</td>
      <td>2</td>
    </tr>
  </tbody>
</table>
</div>



It's sometimes helpful to group controls into pseudo-genes so they're easier to compare with target genes. Our annotation file maps from sgRNA sequences to gene symbols

```
remapped_annotations = pool.group_pseudogenes(annotations=guide_annotations, pseudogene_size=4, 
                                              gene_col='Annotated Gene Symbol', 
                                              control_regex=['NO_CURRENT'])
remapped_annotations.head()
```




<div>
<style scoped>
    .dataframe tbody tr th:only-of-type {
        vertical-align: middle;
    }

    .dataframe tbody tr th {
        vertical-align: top;
    }

    .dataframe thead th {
        text-align: right;
    }
</style>
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>sgRNA Sequence</th>
      <th>Annotated Gene Symbol</th>
      <th>Annotated Gene ID</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>0</th>
      <td>AAAAAAAATCCGGACAATGG</td>
      <td>SLC25A24</td>
      <td>29957</td>
    </tr>
    <tr>
      <th>1</th>
      <td>AAAAAAAGGATGGTGATCAA</td>
      <td>FASTKD3</td>
      <td>79072</td>
    </tr>
    <tr>
      <th>2</th>
      <td>AAAAAAATGACATTACTGCA</td>
      <td>BCAS2</td>
      <td>10286</td>
    </tr>
    <tr>
      <th>3</th>
      <td>AAAAAAATGTCAGTCGAGTG</td>
      <td>GPR18</td>
      <td>2841</td>
    </tr>
    <tr>
      <th>4</th>
      <td>AAAAAACACAAGCAAGACCG</td>
      <td>ZNF470</td>
      <td>388566</td>
    </tr>
  </tbody>
</table>
</div>



We provide two methods for scaling log-fold change values to controls:

1. Z-score from a set of negative controls
2. Scale scores between a set of negative and positive controls

For both scoring methods, you can input either a regex or a list of genes to define control sets

For our set of negative controls, we'll use [nonessential](https://doi.org/10.15252/msb.20145216) genes

```
nonessential_genes = (pd.read_table('https://raw.githubusercontent.com/gpp-rnd/genesets/master/human/non-essential-genes-Hart2014.txt',
                                    names=['gene'])
                      .gene)
annot_guide_lfcs = pool.annotate_guide_lfcs(avg_replicate_lfc_df, remapped_annotations, 'Annotated Gene Symbol',
                                            merge_on='sgRNA Sequence', z_score_neg_ctls=True,
                                            z_score_neg_ctl_genes=nonessential_genes)
annot_guide_lfcs.head()
```




<div>
<style scoped>
    .dataframe tbody tr th:only-of-type {
        vertical-align: middle;
    }

    .dataframe tbody tr th {
        vertical-align: top;
    }

    .dataframe thead th {
        text-align: right;
    }
</style>
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>sgRNA Sequence</th>
      <th>condition</th>
      <th>avg_lfc</th>
      <th>n_obs</th>
      <th>Annotated Gene Symbol</th>
      <th>Annotated Gene ID</th>
      <th>z_scored_avg_lfc</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>0</th>
      <td>AAAAAAAATCCGGACAATGG</td>
      <td>A375</td>
      <td>-0.744916</td>
      <td>2</td>
      <td>SLC25A24</td>
      <td>29957</td>
      <td>-1.651921</td>
    </tr>
    <tr>
      <th>1</th>
      <td>AAAAAAAGGATGGTGATCAA</td>
      <td>A375</td>
      <td>0.155998</td>
      <td>2</td>
      <td>FASTKD3</td>
      <td>79072</td>
      <td>0.166585</td>
    </tr>
    <tr>
      <th>2</th>
      <td>AAAAAAATGACATTACTGCA</td>
      <td>A375</td>
      <td>-1.172694</td>
      <td>2</td>
      <td>BCAS2</td>
      <td>10286</td>
      <td>-2.515396</td>
    </tr>
    <tr>
      <th>3</th>
      <td>AAAAAAATGTCAGTCGAGTG</td>
      <td>A375</td>
      <td>0.250441</td>
      <td>2</td>
      <td>GPR18</td>
      <td>2841</td>
      <td>0.357219</td>
    </tr>
    <tr>
      <th>4</th>
      <td>AAAAAACACAAGCAAGACCG</td>
      <td>A375</td>
      <td>-0.490166</td>
      <td>2</td>
      <td>ZNF470</td>
      <td>388566</td>
      <td>-1.137705</td>
    </tr>
  </tbody>
</table>
</div>



To aggregate scores at the gene level, we specify columns to average, and columns to z-score.
From our z-scores we calculate a p-value and FDR using the [Benjamini-Hochberg procedure](http://www.biostathandbook.com/multiplecomparisons.html)


```
gene_lfcs = pool.aggregate_gene_lfcs(annot_guide_lfcs, 'Annotated Gene Symbol',
                                     average_cols=['avg_lfc'],
                                     zscore_cols=['z_scored_avg_lfc'])
gene_lfcs.head()
```




<div>
<style scoped>
    .dataframe tbody tr th:only-of-type {
        vertical-align: middle;
    }

    .dataframe tbody tr th {
        vertical-align: top;
    }

    .dataframe thead th {
        text-align: right;
    }
</style>
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>condition</th>
      <th>Annotated Gene Symbol</th>
      <th>n_guides</th>
      <th>avg_lfc</th>
      <th>z_scored_avg_lfc</th>
      <th>z_scored_avg_lfc_p_value</th>
      <th>z_scored_avg_lfc_fdr</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>0</th>
      <td>A375</td>
      <td>A1BG</td>
      <td>3</td>
      <td>-0.084621</td>
      <td>-0.552711</td>
      <td>0.580461</td>
      <td>0.872833</td>
    </tr>
    <tr>
      <th>1</th>
      <td>A375</td>
      <td>A1CF</td>
      <td>4</td>
      <td>0.500314</td>
      <td>1.723181</td>
      <td>0.084856</td>
      <td>0.338055</td>
    </tr>
    <tr>
      <th>2</th>
      <td>A375</td>
      <td>A2M</td>
      <td>4</td>
      <td>0.288629</td>
      <td>0.868604</td>
      <td>0.385064</td>
      <td>0.759829</td>
    </tr>
    <tr>
      <th>3</th>
      <td>A375</td>
      <td>A2ML1</td>
      <td>4</td>
      <td>-0.481786</td>
      <td>-2.241580</td>
      <td>0.024989</td>
      <td>0.132164</td>
    </tr>
    <tr>
      <th>4</th>
      <td>A375</td>
      <td>A3GALT2</td>
      <td>4</td>
      <td>0.059362</td>
      <td>-0.056952</td>
      <td>0.954583</td>
      <td>0.990693</td>
    </tr>
  </tbody>
</table>
</div>



Finally, to evaluate the quality this screen, we'll calculate the ROC-AUC between [essential](https://doi.org/10.1016/j.cell.2015.11.015) and [nonessential](https://doi.org/10.15252/msb.20145216) genes for each condition

```
essential_genes = (pd.read_table('https://raw.githubusercontent.com/gpp-rnd/genesets/master/human/essential-genes-Hart2015.txt', 
                                 names=['gene'])
                   .gene)
roc_aucs, roc_df = pool.get_roc_aucs(lfcs=gene_lfcs, tp_genes=essential_genes, fp_genes=nonessential_genes, gene_col='Annotated Gene Symbol',
                                     condition_col='condition', score_col='avg_lfc')
print('ROC-AUC: ' + str(round(roc_aucs['ROC-AUC'].values[0], 3)))
```

    ROC-AUC: 0.976


Note that we can also use this function to calculate roc-aucs at the guide level

```
annotated_guide_lfcs = lfc_df.merge(guide_annotations, how='inner', on='sgRNA Sequence')
roc_aucs, roc_df = pool.get_roc_aucs(lfcs=annotated_guide_lfcs, tp_genes=essential_genes, fp_genes=nonessential_genes, gene_col='Annotated Gene Symbol',
                                     condition_list=['A375_RepA', 'A375_RepB'])
roc_aucs
```




<div>
<style scoped>
    .dataframe tbody tr th:only-of-type {
        vertical-align: middle;
    }

    .dataframe tbody tr th {
        vertical-align: top;
    }

    .dataframe thead th {
        text-align: right;
    }
</style>
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>condition</th>
      <th>ROC-AUC</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>0</th>
      <td>A375_RepA</td>
      <td>0.918505</td>
    </tr>
    <tr>
      <th>1</th>
      <td>A375_RepB</td>
      <td>0.917600</td>
    </tr>
  </tbody>
</table>
</div>



And plot ROC curves from the `roc_df`


```
plt.subplots(figsize=(4,4))
sns.lineplot(data=roc_df, x='fpr', y='tpr', hue='condition', ci=None)
gpplot.add_xy_line()
sns.despine()
```


    
![png](docs/images/output_33_0.png)
    

