Metadata-Version: 2.3
Name: mon-tokenizer
Version: 0.1.3
Summary: A simple tokenizer for Mon text
Keywords: mon,tokenizer,nlp,myanmar,text-processing
Author: Code-Yay-Mal
Author-email: Code-Yay-Mal <jnovaxer@gmail.com>
License: MIT
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Text Processing :: Linguistic
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Requires-Dist: sentencepiece>=0.1.99
Requires-Dist: click>=8.0.0
Requires-Dist: rich>=13.0.0
Requires-Dist: twine>=6.1.0
Requires-Dist: pytest>=7.0.0 ; extra == 'dev'
Requires-Dist: pytest-cov>=4.0.0 ; extra == 'dev'
Requires-Dist: black>=23.0.0 ; extra == 'dev'
Requires-Dist: isort>=5.12.0 ; extra == 'dev'
Requires-Dist: mypy>=1.0.0 ; extra == 'dev'
Requires-Dist: ruff>=0.1.0 ; extra == 'dev'
Requires-Dist: sphinx>=7.0.0 ; extra == 'docs'
Requires-Dist: sphinx-rtd-theme>=1.3.0 ; extra == 'docs'
Requires-Python: >=3.11
Project-URL: Changelog, https://github.com/Code-Yay-Mal/mon_tokenizer/blob/main/CHANGELOG.md
Project-URL: Documentation, https://github.com/Code-Yay-Mal/mon_tokenizer#readme
Project-URL: Homepage, https://github.com/Code-Yay-Mal/mon_tokenizer
Project-URL: Issues, https://github.com/Code-Yay-Mal/mon_tokenizer/issues
Project-URL: Repository, https://github.com/Code-Yay-Mal/mon_tokenizer
Provides-Extra: dev
Provides-Extra: docs
Description-Content-Type: text/markdown

# Mon Tokenizer

A SentencePiece-based tokenizer for Mon text, with a Python API and a CLI. No fancy stuff, just gets the job done.

## Quick Start

```bash
# Using pip
pip install mon-tokenizer

# Using uv (faster)
uv add mon-tokenizer
```

```python
from mon_tokenizer import MonTokenizer

tokenizer = MonTokenizer()
text = "ဂွံအခေါင်အရာမွဲသ္ဂောံဒုင်စသိုင်ကၠာကၠာရ။"

# Tokenize
result = tokenizer.encode(text)
print(result["pieces"])  # ['▁ဂွံ', 'အခေါင်', 'အရာ', 'မွဲ', 'သ္ဂောံ', 'ဒုင်စသိုင်', 'ကၠာ', 'ကၠာ', 'ရ', '။']

# Decode
decoded = tokenizer.decode(result["pieces"])
print(decoded)  # ဂွံအခေါင်အရာမွဲသ္ဂောံဒုင်စသိုင်ကၠာကၠာရ။
```
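
The `▁` on the first piece above is the marker SentencePiece uses for a word boundary (a space). As a rough sketch of what "gluing pieces back together" means under that convention, here is a standalone helper. `join_pieces` is a hypothetical name for illustration only; it is not part of `mon_tokenizer` and may differ from what `decode` actually does.

```python
# Hypothetical helper, not part of mon_tokenizer: it only illustrates the
# SentencePiece convention that "▁" (U+2581) stands for a space.
WORD_BOUNDARY = "\u2581"


def join_pieces(pieces: list[str]) -> str:
    """Concatenate pieces and turn boundary markers back into spaces."""
    text = "".join(pieces).replace(WORD_BOUNDARY, " ")
    # Drop the single leading space produced by the marker on the first piece.
    return text[1:] if text.startswith(" ") else text


pieces = ["▁ဂွံ", "အခေါင်", "အရာ", "မွဲ", "သ္ဂောံ", "ဒုင်စသိုင်", "ကၠာ", "ကၠာ", "ရ", "။"]
print(join_pieces(pieces))
```

For real work, stick with `tokenizer.decode(...)`; the sketch is only there to show why the pieces carry a `▁` prefix.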

## CLI

```bash
# Tokenize
mon-tokenizer "ဂွံအခေါင်အရာမွဲသ္ဂောံဒုင်စသိုင်ကၠာကၠာရ။"

# Verbose mode (shows all the details)
mon-tokenizer -v "ဂွံအခေါင်အရာမွဲသ္ဂောံဒုင်စသိုင်ကၠာကၠာရ။"

# Decode tokens back to text
mon-tokenizer -d -t "▁ဂွံ,အခေါင်,အရာ,မွဲ,သ္ဂောံ,ဒုင်စသိုင်,ကၠာ,ကၠာ,ရ,။"
```

## API

- `encode(text)` - Chop text into tokens; returns a dict with the token strings under `"pieces"`
- `decode(pieces)` - Glue tokens back together
- `decode_ids(ids)` - Convert IDs back to text
- `get_vocab_size()` - How many tokens we know
- `get_vocab()` - The whole vocabulary

## Dev Setup

```bash
git clone https://github.com/Code-Yay-Mal/mon_tokenizer.git
cd mon_tokenizer
uv sync --dev
uv run pytest

# Release workflow (swap X.Y.Z for the version the bump prints)
uv version --bump patch
git add pyproject.toml
git commit -m "vX.Y.Z"
git tag vX.Y.Z
git push origin main --tags
```

## License

MIT - do whatever you want with it.
