Metadata-Version: 2.4
Name: precious-nlp
Version: 0.1.0
Summary: A tokenizer-free NLP library with T-FREE, CANINE, and byte-level approaches
Home-page: https://github.com/bimri/precious
Author: bimri
Author-email: bimri <bimri@outlook.com>
Maintainer-email: bimri <bimri@outlook.com>
License: MIT
Project-URL: Homepage, https://github.com/bimri/precious
Project-URL: Repository, https://github.com/bimri/precious
Project-URL: Documentation, https://github.com/bimri/precious/blob/master/docs/API_REFERENCE.md
Project-URL: Bug Reports, https://github.com/bimri/precious/issues
Project-URL: Changelog, https://github.com/bimri/precious/blob/master/CHANGELOG.md
Keywords: tokenization,nlp,transformers,tokenizer-free,canine,tfree,byte-level,natural-language-processing,deep-learning,pytorch
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: Topic :: Text Processing :: Linguistic
Requires-Python: >=3.8
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: torch>=1.9.0
Requires-Dist: numpy>=1.19.0
Provides-Extra: dev
Requires-Dist: pytest>=6.0; extra == "dev"
Requires-Dist: pytest-cov>=2.0; extra == "dev"
Requires-Dist: black>=22.0; extra == "dev"
Requires-Dist: isort>=5.0; extra == "dev"
Requires-Dist: flake8>=4.0; extra == "dev"
Provides-Extra: benchmarks
Requires-Dist: psutil>=5.8.0; extra == "benchmarks"
Provides-Extra: docs
Requires-Dist: sphinx>=4.0; extra == "docs"
Requires-Dist: sphinx-rtd-theme>=1.0; extra == "docs"
Provides-Extra: all
Requires-Dist: pytest>=6.0; extra == "all"
Requires-Dist: pytest-cov>=2.0; extra == "all"
Requires-Dist: black>=22.0; extra == "all"
Requires-Dist: isort>=5.0; extra == "all"
Requires-Dist: flake8>=4.0; extra == "all"
Requires-Dist: psutil>=5.8.0; extra == "all"
Requires-Dist: sphinx>=4.0; extra == "all"
Requires-Dist: sphinx-rtd-theme>=1.0; extra == "all"
Dynamic: author
Dynamic: home-page
Dynamic: license-file
Dynamic: requires-python

# Precious Package

## Overview
The Precious package provides a minimal model showcasing three tokenizer-free approaches for natural language processing tasks. It includes implementations for T-FREE, CANINE, and byte-level embeddings, along with attention mechanisms for enhanced performance.

## Installation

### From PyPI (Recommended)
```bash
pip install precious-nlp
```

### From Source (Development)
```bash
git clone https://github.com/bimri/precious.git
cd precious
pip install -e .
```

### With Optional Dependencies
```bash
# For development tools
pip install "precious-nlp[dev]"

# For benchmarking
pip install "precious-nlp[benchmarks]"

# For documentation
pip install "precious-nlp[docs]"

# All optional dependencies
pip install "precious-nlp[all]"
```

## Usage
Here is a basic example of how to use the PreciousModel:

```python
# The package installs as precious-nlp but imports with an underscore
import precious_nlp as precious
from precious_nlp import PreciousModel, PreciousConfig

# Initialize the model with the desired configuration
config = PreciousConfig(mode="byte", d_model=256)  # or "tfree", "canine"
model = PreciousModel(config)

# Prepare your input data
inputs = ["Hello, tokenizer-free world!"]
outputs = model(inputs)

# Access the logits
logits = outputs["logits"]
print(f"Output shape: {logits.shape}")  # [batch_size, seq_len, vocab_size]

# Training with targets
targets = ["Hello, tokenizer-free universe!"]
outputs = model(inputs, targets=targets)
loss = outputs["loss"]
print(f"Training loss: {loss.item()}")
```

## Three Tokenizer-Free Approaches

### 1. Byte-Level Processing
```python
import precious_nlp as precious
config = precious.PreciousConfig(mode="byte", d_model=256)
model = precious.PreciousModel(config)
# Processes text at byte level - universal and memory efficient
```
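For intuition, here is what byte-level input preparation amounts to in plain Python (an illustration, not the library's internal code): the text is UTF-8 encoded and every byte becomes a token id, so the effective vocabulary is fixed at 256 for any language.

```python
def byte_ids(text):
    """UTF-8 encode the text; each byte (0-255) is one token id."""
    return list(text.encode("utf-8"))

ids = byte_ids("Hello, tokenizer-free world!")
print(len(ids))        # 28 (ASCII text: one byte per character)
print(max(ids) < 256)  # True: the id space never exceeds 256
```

Non-ASCII characters simply expand to multiple bytes, which is why this mode handles any text without an out-of-vocabulary case.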

### 2. CANINE Approach
```python
import precious_nlp as precious
config = precious.PreciousConfig(mode="canine", d_model=256)
model = precious.PreciousModel(config)
# Character-level processing with Unicode support
```
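As a rough illustration of the character-level idea (again plain Python, not this library's implementation): CANINE-style models consume one Unicode codepoint per character, and because the codepoint space is far too large for a dense embedding table, ids are typically folded into a fixed number of hash buckets. The bucket count below is made up for the example.

```python
NUM_BUCKETS = 16384  # illustrative only; not a precious-nlp setting

def codepoint_ids(text):
    """One Unicode codepoint per character."""
    return [ord(c) for c in text]

def bucketed_ids(text):
    """Fold the large codepoint space into a fixed-size embedding table."""
    return [cp % NUM_BUCKETS for cp in codepoint_ids(text)]

print(codepoint_ids("héllo"))  # [104, 233, 108, 108, 111]
```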

### 3. T-FREE Method
```python
import precious_nlp as precious
config = precious.PreciousConfig(mode="tfree", d_model=256, tfree_vocab_v=8192)
model = precious.PreciousModel(config)
# Vocabulary-aware with character-level fallback
```
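To sketch the vocabulary idea (a simplified illustration based on the published T-FREE scheme, not this library's code): each whitespace-delimited word is represented by its character trigrams, hashed into a fixed table of size v (here matching the `tfree_vocab_v=8192` setting above), giving a sparse multi-hot encoding per word.

```python
import hashlib

def trigram_ids(word, v=8192):
    """Hash a word's character trigrams (with boundary padding) into [0, v)."""
    padded = " " + word + " "
    trigrams = {padded[i:i + 3] for i in range(len(padded) - 2)}
    return sorted(int(hashlib.md5(t.encode()).hexdigest(), 16) % v
                  for t in trigrams)

print(trigram_ids("free"))  # a small, sparse set of ids for one word
```

Because hashing is deterministic, the same word always maps to the same id set, while unseen words still get a usable representation from their trigrams.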

## Key Features

- 🚀 **Three tokenizer-free approaches** in one unified library
- 🎯 **Production-ready** with comprehensive testing and documentation  
- 🌍 **Universal text support** - handles any Unicode text
- ⚡ **Efficient processing** with configurable model architectures
- 🧪 **Research-friendly** with benchmarking and comparison tools
- 📚 **Well-documented** with extensive examples and API reference

## Quick Performance Comparison

| Mode | Memory | Speed | Best For |
|------|--------|-------|----------|
| Byte | Lowest | Fastest | General purpose, production |
| CANINE | Medium | Medium | Multilingual, character-aware |
| T-FREE | Highest | Slowest | Vocabulary analysis, interpretability |
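One driver of these trade-offs is how long each representation makes the same input: byte mode sees one position per UTF-8 byte, CANINE one per character, and T-FREE roughly one per word (the memory column above also reflects embedding-table size, e.g. only 256 byte ids). A quick plain-Python comparison:

```python
text = "naïve café, tokenizer-free"
print(len(text.encode("utf-8")))  # 28 byte positions
print(len(text))                  # 26 character positions
print(len(text.split()))          # 3 word-level positions (T-FREE-style)
```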

## Documentation

- 📖 [API Reference](docs/API_REFERENCE.md) - Complete API documentation
- 📝 [Examples](docs/EXAMPLES.md) - From basic to advanced usage
- 🔧 [Implementation Details](docs/IMPLEMENTATION_SUMMARY.md) - Technical overview

## Requirements

- Python >= 3.8
- PyTorch >= 1.9.0
- NumPy >= 1.19.0
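To check that an environment satisfies these minimums, a small generic sketch using the standard library (it ignores pre-release tags when comparing):

```python
from importlib.metadata import PackageNotFoundError, version

def meets_minimum(installed, required):
    """Numerically compare dotted version strings, ignoring non-digit suffixes."""
    def parts(v):
        out = []
        for piece in v.split("."):
            digits = "".join(ch for ch in piece if ch.isdigit())
            out.append(int(digits) if digits else 0)
        return out
    a, b = parts(installed), parts(required)
    width = max(len(a), len(b))
    a += [0] * (width - len(a))
    b += [0] * (width - len(b))
    return a >= b

for pkg, minimum in [("torch", "1.9.0"), ("numpy", "1.19.0")]:
    try:
        installed = version(pkg)
        print(pkg, installed, "OK" if meets_minimum(installed, minimum) else "too old")
    except PackageNotFoundError:
        print(pkg, "not installed")
```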

## Contributing
Contributions are welcome! Please follow these steps to contribute:

1. Fork the repository.
2. Create a new branch for your feature or bug fix.
3. Make your changes and commit them.
4. Push your branch and create a pull request.

## License
This project is licensed under the MIT License. See the LICENSE file for more details.
