Metadata-Version: 2.4
Name: gpe-tokenizer
Version: 0.1.3
Summary: Grapheme Pair Encoding Tokenizer for Sinhala Language
Author: Schizo00
Author-email: naween.k@live.com
Requires-Python: >=3.13, <3.15
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.13
Classifier: Programming Language :: Python :: 3.14
Requires-Dist: datasets (>=4.2.0,<5.0.0)
Requires-Dist: grapheme (>=0.6.0,<0.7.0)
Requires-Dist: huggingface-hub (>=0.35.3,<0.36.0)
Requires-Dist: joblib (>=1.5.2,<2.0.0)
Requires-Dist: regex (>=2025.9.18,<2026.0.0)
Requires-Dist: tokenizers (>=0.22.1)
Requires-Dist: torch (>=2.9.0,<3.0.0)
Requires-Dist: transformers (>=4.57.1,<5.0.0)
Description-Content-Type: text/markdown

## Installation
```bash
pip install gpe-tokenizer
```

## Basic Usage
```python
from gpe_tokenizer import SinhalaGPETokenizer
```

### Model Compatibility

#### For BERT
```python
tokenizer = SinhalaGPETokenizer(model='bert')
```

#### For Llama
```python
tokenizer = SinhalaGPETokenizer(model='llama')
```

#### For GPT
```python
tokenizer = SinhalaGPETokenizer(model='gpt')
```

### Tokenize
```python
tokenizer.tokenize(text)
```
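Putting the steps above together, here is a minimal end-to-end sketch. The constructor and `tokenize` call follow the API shown above; the Sinhala sample sentence is purely illustrative, and the exact return type of `tokenize` is not specified in this README.

```python
from gpe_tokenizer import SinhalaGPETokenizer

# Build a BERT-compatible GPE tokenizer.
# model can be 'bert', 'llama', or 'gpt', as listed above.
tokenizer = SinhalaGPETokenizer(model='bert')

# Tokenize a Sinhala sentence (illustrative sample text).
text = "මම පොත කියවමි"
tokens = tokenizer.tokenize(text)
print(tokens)
```

Grapheme Pair Encoding operates on grapheme clusters rather than bytes, which keeps Sinhala combining marks intact inside tokens.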



## Tokenizer Training Details
- Corpus size: 10 million sentences
- Vocabulary size: 32,000
- Training time: 13h 29m

