# CoMBCR
## Introduction
CoMBCR is an innovative B-cell embedding method designed to integrate multi-modal data from B cells, particularly BCRs and gene expressions, within a co-learning framework. By accepting paired BCR sequences and gene expression profiles as input, CoMBCR effectively integrates these two modalities to produce joint representations for each B cell, focusing specifically on the heavy chain of BCRs. 
## Prerequisites
CoMBCR is implemented in Python and requires a GPU for acceleration. 

We recommend the following package versions:  
- Pytorch (2.4.1)  
- Transformers (4.41.2)  
- Numpy (1.26.4)  
- Pandas (2.2.3)  
- Scikit-learn (1.5.1)  
- huggingface_hub (install with ```python3 -m pip install huggingface_hub```)
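
To compare an existing environment against the versions listed above, a check along the following lines may help. This is a minimal sketch using only the standard library. The helper `check_versions` is hypothetical and not part of CoMBCR, and the keys are assumed to be the usual PyPI distribution names (for example `torch` for PyTorch).

```python
from importlib.metadata import PackageNotFoundError, version

# Recommended versions from the list above, keyed by the assumed PyPI distribution name.
RECOMMENDED = {
    "torch": "2.4.1",
    "transformers": "4.41.2",
    "numpy": "1.26.4",
    "pandas": "2.2.3",
    "scikit-learn": "1.5.1",
}


def check_versions(recommended=RECOMMENDED):
    """Return {package: (installed_version_or_None, recommended_version)}."""
    report = {}
    for name, wanted in recommended.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            installed = None  # the package is not installed in this environment
        report[name] = (installed, wanted)
    return report


if __name__ == "__main__":
    for name, (installed, wanted) in check_versions().items():
        # Ignore local build suffixes such as "+cu121" when comparing.
        same = installed is not None and installed.split("+")[0] == wanted
        status = "ok" if same else "differs"
        print(f"{name}: installed={installed}, recommended={wanted} ({status})")
```

A mismatch is not necessarily a problem; the versions above are recommendations rather than hard requirements.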

## Installation
Install CoMBCR using pip:

```
pip3 install CoMBCR
```
Then, download the default pre-trained encoder (this code only needs to be executed once, after installing CoMBCR):
```
from CoMBCR.utils import download_BCRencoder
download_BCRencoder()
```
## Tutorial
We provide a [tutorial](./tutorial.ipynb) for the usage of CoMBCR. The following usage section is for the current version of CoMBCR.

Please refer to [tutorial_pair](./tutorial_pair.ipynb) if you want to use paired chains. Please note that using paired chains doubles the computational cost and, according to our current tests, does not improve performance significantly. 

## Usage
> ### Prepare input data
> CoMBCR integrates BCRs and gene expression, but it requires three input files: a BCR sequences file, a gene expression file, and a file containing BCR embeddings generated by a BCR encoder (e.g., AntiBERTa, ESM2).  
> - Ensure each file includes an index column labeled "barcode," serving as a unique identifier for each cell.   
> - Verify that the cells are aligned in the same order across all three files.
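
The two requirements above (a "barcode" index column and identical cell order) can be checked before training. The following is a minimal sketch with pandas; `check_alignment` is a hypothetical helper, not part of CoMBCR, and it assumes each file has been loaded with `index_col="barcode"`.

```python
import pandas as pd


def check_alignment(bcr, rna, bcr_emb):
    """Raise ValueError unless all three tables share the same barcodes in the same order.

    Hypothetical helper (not part of CoMBCR). Each argument is a DataFrame
    whose index is the "barcode" column of the corresponding input file.
    """
    tables = [("BCR sequences", bcr), ("gene expression", rna), ("BCR embeddings", bcr_emb)]
    for name, df in tables:
        if df.index.name != "barcode":
            raise ValueError(f"{name}: the index column must be named 'barcode'")
        if not df.index.is_unique:
            raise ValueError(f"{name}: barcodes must be unique")
    # Index.equals compares both the elements and their order.
    if not (bcr.index.equals(rna.index) and bcr.index.equals(bcr_emb.index)):
        raise ValueError("barcodes differ or are ordered differently across the files")


# Toy demonstration with in-memory tables.
toy = pd.DataFrame({"x": [1, 2]}, index=pd.Index(["cell1", "cell2"], name="barcode"))
check_alignment(toy, toy, toy)  # aligned tables pass silently

# With real data, each file would be loaded like this before the check:
# bcr = pd.read_csv("exampledata/example_bcr.csv", index_col="barcode")
```
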
>> #### BCR sequences file
>> This CSV file should include an index column named "barcode" and columns labeled "fwr1", "cdr1", "fwr2", "cdr2", "fwr3", "cdr3" and "fwr4". The file should resemble the example shown below: ![](images/BCRs.png)
>> #### Gene expression file
>> Normalization and log-transformation are recommended. Batch effect removal is advisable if applicable. We suggest using the top 5,000 highly variable genes, though you can select input genes according to your criteria.
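
As an illustration of the recommended preprocessing, here is a minimal sketch using numpy and pandas. It assumes a raw count table with cells as rows (indexed by "barcode") and genes as columns. The function `preprocess_counts`, the `target_sum` of 1e4, and the plain variance ranking are assumptions made for this example: variance ranking is a simple stand-in for dedicated highly-variable-gene methods, and batch effect removal is not covered.

```python
import numpy as np
import pandas as pd


def preprocess_counts(counts, n_top_genes=5000, target_sum=1e4):
    """Library-size normalize, log1p-transform, and keep the most variable genes.

    Hypothetical helper (not part of CoMBCR).
    counts: DataFrame of raw counts, rows = cells, columns = genes.
    """
    totals = counts.sum(axis=1).replace(0, 1)  # avoid dividing by zero for empty cells
    normed = counts.div(totals, axis=0) * target_sum
    logged = np.log1p(normed)
    # Rank genes by variance of the log-normalized values (stand-in for HVG selection).
    k = min(n_top_genes, logged.shape[1])
    top = logged.var(axis=0).nlargest(k).index
    # Keep the selected genes in their original column order.
    return logged.loc[:, logged.columns.isin(top)]


# Toy demonstration: geneB is constant, so it is dropped when keeping two genes.
toy = pd.DataFrame(
    [[10, 0, 5], [0, 0, 20], [3, 0, 1]],
    index=pd.Index(["cell1", "cell2", "cell3"], name="barcode"),
    columns=["geneA", "geneB", "geneC"],
)
processed = preprocess_counts(toy, n_top_genes=2)
print(processed.shape)
```

If the number of genes kept differs from 5000, the `encoderprofile_in_dim` parameter should be set to match.
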
>> #### Original BCR embeddings file
>> Please clone or download "runberta.py" from this GitHub repository. This script generates the original BCR embeddings, which are used to measure the original distances between BCRs. We recommend using our default pre-trained encoder, though any encoder can be used to encode BCRs. 
>> ```
>> python3 runberta.py --datapath "exampledata/example_bcr.csv" --outdir "example_outdir" --outfilename "antiberta_embedding.csv"
>> ```
>> The code generates an original BCR embedding file named "antiberta_embedding.csv" under the outdir.
> ### Quick run
>> To quickly run CoMBCR, use the following code:  
>> ```
>> from CoMBCR.CoMBCR import CoMBCR_main
>> bcremb, gexemb = CoMBCR_main(bcrpath="exampledata/example_bcr.csv", 
>>            rnapath="exampledata/example_rna.csv", 
>>            bcroriginal="exampledata/example_bcrori.csv", 
>>            outdir="example_outdir",
>>            epochs=1,  # You can revise the epochs here. Default is 200.
>>            batch_size=32,
>>            encoderprofile_in_dim=5000)
>> ```
>> This code returns numpy arrays for BCR embeddings and gene expression embeddings, and outputs "bcrembedding.csv" and "gexembedding.csv" in the specified output directory.  
>> Please note that these CSV files store the numpy arrays directly, in the same order as your input cells, and therefore do not include a "barcode" column. When reading these files, do not specify an index column.
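
Because the output CSV files carry no "barcode" column, the barcodes can be reattached from the input file, relying on the row order being preserved. The following is a minimal sketch; `attach_barcodes` is a hypothetical helper, not part of CoMBCR. Whether the output files contain a header row is not stated here, so the `header` argument is an assumption to verify by inspecting the file.

```python
import pandas as pd


def attach_barcodes(embedding_csv, barcodes, header="infer"):
    """Load an embedding CSV without an index column and label its rows with barcodes.

    Hypothetical helper (not part of CoMBCR). `header` is passed to
    pandas.read_csv; use header=None if the file has no header row.
    """
    emb = pd.read_csv(embedding_csv, index_col=None, header=header)
    if len(emb) != len(barcodes):
        # A mismatch often means the header setting is wrong or the inputs differ.
        raise ValueError("row count does not match the number of barcodes")
    emb.index = pd.Index(barcodes, name="barcode")
    return emb


# With the Quick run example, usage might look like this:
# barcodes = pd.read_csv("exampledata/example_bcr.csv", index_col="barcode").index
# bcr_emb = attach_barcodes("example_outdir/bcrembedding.csv", barcodes)
```
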
> ### Parameters of CoMBCR
>> | Parameter | Description |
>> | ------------- | ------------- |
>> | **bcrpath** | (Required) The path to the BCR sequences file.|
>> | **rnapath** | (Required) The path to the gene expression file.|
>> | **bcroriginal**| (Required) The path to the BCR original embedding file.|
>> |**outdir**|(Required) The directory where the best checkpoint file and the output embeddings will be stored.|
>> |**checkpoint**|Default is "best_network.pth". This parameter specifies the name of the file where the best model checkpoint will be saved.|
>> |**lr**|Default is 1e-6.|
>> |**lam**| Default is 1e-1. This is the inner parameter (parameter alpha in the paper).|
>> |**batch_size** | Default is 256.|
>> |**epochs** | Default is 200.|
>> |**patience**| Default is 15, the patience for early stopping.|
>> |**lr_step** | Default is [30,100]. These are the milestones for the MultiStepLR setting, which adjusts the learning rate at specified epochs.|
>> |**encoderprofile_in_dim**| Default is 5000. Adjust this parameter if the number of input genes differs from 5000.|
>> |**separatebatch**|Default is False. If set to True, BCRs from different samples will be treated as distinct BCRs. Ensure that your BCR input file contains a "sample" column if you enable this option. |
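
To make the effect of **lr_step** concrete, the sketch below computes the learning rate that a MultiStepLR-style schedule yields at a given epoch, using the defaults from the table. It is written in plain Python rather than with PyTorch itself. The decay factor `gamma=0.1` is PyTorch's default for `MultiStepLR`; whether CoMBCR uses that value is an assumption, and `multistep_lr` is a hypothetical helper.

```python
from bisect import bisect_right


def multistep_lr(epoch, base_lr=1e-6, milestones=(30, 100), gamma=0.1):
    """Learning rate at `epoch` under a MultiStepLR-style schedule.

    The rate is multiplied by `gamma` once for every milestone that has been reached.
    gamma=0.1 is PyTorch's default; the value used by CoMBCR is an assumption here.
    """
    return base_lr * gamma ** bisect_right(sorted(milestones), epoch)


for epoch in (0, 29, 30, 99, 100, 199):
    print(epoch, multistep_lr(epoch))
```

Under these assumptions the rate stays at the base value before epoch 30, drops by a factor of ten from epoch 30, and drops by another factor of ten from epoch 100.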

## Acknowledgements
The code was based in part on the source code of [UniTCR](https://github.com/bm2-lab/UniTCR/tree/main).
## Questions
If you encounter issues installing or using CoMBCR, please feel free to open an issue or contact me via [email](mailto:yipingzou2-c@my.cityu.edu.hk).

