Metadata-Version: 2.4
Name: turkic-translit
Version: 0.3.4
Summary: Deterministic Latin and IPA transliteration for Kazakh, Kyrgyz, Turkish, Azerbaijani, Uyghur, Finnish, plus tokenizer/glue scripts.
Author-email: Austin Wagner <austinwagner@msn.com>
License-Expression: MIT
Keywords: kazakh,kyrgyz,transliteration,ipa
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Operating System :: OS Independent
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: Topic :: Software Development :: Libraries
Classifier: Topic :: Text Processing :: Linguistic
Requires-Python: >=3.9
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: epitran<1.27,>=1.0
Requires-Dist: fasttext-wheel==0.9.2
Requires-Dist: numpy<2
Requires-Dist: packaging>=23.0
Requires-Dist: panphon<0.22,>=0.20
Requires-Dist: PyICU<2.16,>=2.13; sys_platform != "win32"
Requires-Dist: pytest>=8.0
Requires-Dist: rapidfuzz>=3.5
Requires-Dist: rich>=13.7
Requires-Dist: python-json-logger>=2.0.4
Requires-Dist: sentencepiece>=0.2.0
Requires-Dist: click>=8.1
Requires-Dist: types-requests>=2.31
Requires-Dist: datasets<4.0.0,>=3.6.0
Requires-Dist: evaluate>=0.4
Requires-Dist: accelerate>=0.28
Requires-Dist: transformers<5,>=4.41
Requires-Dist: tqdm>=4.66
Requires-Dist: scikit-learn>=1.4
Requires-Dist: typing_extensions>=4.12
Requires-Dist: wikipedia<2.0.0,>=1.4.0
Requires-Dist: gradio>=4.0
Requires-Dist: pandas
Requires-Dist: matplotlib>=3.9
Requires-Dist: pycountry>=23.12
Requires-Dist: zstandard>=0.23
Provides-Extra: examples
Requires-Dist: flask; extra == "examples"
Requires-Dist: streamlit; extra == "examples"
Requires-Dist: jupyterlab; extra == "examples"
Provides-Extra: winlid
Provides-Extra: corpus
Requires-Dist: datasets>=3.0; extra == "corpus"
Requires-Dist: pyarrow>=14.0; extra == "corpus"
Requires-Dist: requests>=2.0; extra == "corpus"
Provides-Extra: sentry
Requires-Dist: sentry-sdk>=2.0; extra == "sentry"
Provides-Extra: dev
Requires-Dist: ruff>=0.2.0; extra == "dev"
Requires-Dist: mypy>=1.0; extra == "dev"
Requires-Dist: pytest>=8.0; extra == "dev"
Requires-Dist: build>=1.0; extra == "dev"
Requires-Dist: twine>=4.0; extra == "dev"
Requires-Dist: make>=0.1.6; extra == "dev"
Requires-Dist: types-requests>=2.31; extra == "dev"
Dynamic: license-file

---
title: Turkic Transliteration Demo
emoji: 🌖
colorFrom: green
colorTo: green
sdk: gradio
sdk_version: 5.29.0
app_file: app.py
pinned: false
license: apache-2.0
short_description: Transliteration of Kazakh & Kyrgyz into Latin and IPA
---

turkic\_transliterate
Deterministic Latin and IPA transliteration for Kazakh and Kyrgyz, plus helper utilities for tokenizer training and Russian-token filtering.

Quick install

1. Install Miniconda or Anaconda (recommended).
2. Clone the repo and create the environment:
   conda env create -f env.yml
3. Activate the environment:
   conda activate turkic
4. Run the verification tests; all of them should pass:
   python -m pytest

Python compatibility
• Works on CPython 3.10 and 3.11.
• CPython 3.12+ is supported on every platform except Windows, where support is pending official PyICU wheels; see “Windows & PyICU” below.

Package names
• Runtime import path:  turkic\_translit
• Distributable name on PyPI:  turkic-translit
• Command-line entry point:  turkic-translit

## Developer Setup

For the simplest developer setup experience, run the setup script:

```bash
python scripts/setup_dev.py
```

This script will:
1. Install the package with all development dependencies
2. Set up PyICU on Windows automatically
3. Verify that development tools are working properly

### Manual Installation

Alternatively, install with pip:

```bash
pip install -e ".[dev]"        # gradio and fasttext-wheel are already core dependencies
```

### Development Tools

#### Linux/macOS/Windows with GNU Make

If you have GNU Make installed, you can use the Makefile for common tasks:

```bash
make lint       # Run linting (ruff, black, mypy)
make format     # Auto-format code
make test       # Run tests
make web        # Launch the web UI
make help       # Show all available commands
```

#### Windows

**Option 1: Install GNU Make using Chocolatey (Recommended)**

Install GNU Make using Chocolatey (requires admin privileges):

```powershell
# In an Admin PowerShell window
choco install make
```

After installation, you can use the same `make` commands as on Linux/macOS.

**Option 2: Use the PowerShell Script Alternative**

If you prefer not to install Chocolatey or GNU Make, use the PowerShell script:

```powershell
./scripts/run.ps1 lint       # Run linting
./scripts/run.ps1 format     # Auto-format code
./scripts/run.ps1 test       # Run tests
./scripts/run.ps1 web        # Launch the web UI
./scripts/run.ps1 help       # Show all available commands
```

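Optional extras
• dev      → ruff, mypy, pytest, build, twine
• examples → flask, streamlit, jupyterlab
• corpus   → datasets, pyarrow, requests
• sentry   → sentry-sdk
• winlid   → declared for Windows language ID; it installs nothing extra, because fasttext-wheel is a core dependency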

Windows & PyICU

**Important:** Due to PyPI rules, the correct PyICU wheel for Windows cannot be installed automatically during pip install. After installing this package with pip, Windows users must run the helper script to install the appropriate PyICU wheel:

    turkic-pyicu-install

This script will download and install the correct PyICU wheel from Christoph Gohlke’s repository based on your Python version. See the script for details.
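
Before running the helper, it can be useful to check whether PyICU is already importable. The following is a minimal sketch, not part of the package: the function name `pyicu_status` is hypothetical, and the only facts it relies on are that PyICU's import name is `icu` and that this README names `turkic-pyicu-install` as the Windows helper.

```python
import importlib.util
import sys


def pyicu_status() -> str:
    """Return a short hint about whether PyICU (import name ``icu``) is importable."""
    if importlib.util.find_spec("icu") is not None:
        return "PyICU is available."
    if sys.platform == "win32":
        # Helper command named in this README.
        return "PyICU is missing; run `turkic-pyicu-install`."
    return "PyICU is missing; reinstall the package dependencies."


print(pyicu_status())
```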

Command-line usage
turkic-translit --lang kk --in text.txt --out\_latin kk\_lat.txt --ipa --out\_ipa kk\_ipa.txt --arabic --log-level debug
• --lang            kk or ky
• --ipa             emit IPA alongside Latin
• --arabic          also transliterate embedded Arabic script
• --benchmark       print throughput statistics
• --log-level       debug | info | warning | error | critical (default: info)
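
When the CLI is driven from a script, the flags above can be assembled programmatically. This is a sketch under stated assumptions: `build_translit_command` and `run_translit` are hypothetical helpers, not part of the package, and they use only the flags shown in the example command.

```python
import subprocess
from typing import List, Optional


def build_translit_command(
    lang: str,
    src: str,
    out_latin: str,
    out_ipa: Optional[str] = None,
    log_level: str = "info",
) -> List[str]:
    """Assemble a turkic-translit argument list from the flags shown above."""
    if lang not in ("kk", "ky"):
        raise ValueError(f"unsupported --lang value: {lang!r}")
    cmd = ["turkic-translit", "--lang", lang, "--in", src, "--out_latin", out_latin]
    if out_ipa is not None:
        # --ipa turns on IPA output; --out_ipa names the file that receives it.
        cmd += ["--ipa", "--out_ipa", out_ipa]
    cmd += ["--log-level", log_level]
    return cmd


def run_translit(cmd: List[str]) -> None:
    """Run the command; raises CalledProcessError on a non-zero exit code."""
    subprocess.run(cmd, check=True)


cmd = build_translit_command("kk", "text.txt", "kk_lat.txt", out_ipa="kk_ipa.txt")
print(" ".join(cmd))
# run_translit(cmd)  # uncomment once the package is installed and text.txt exists
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.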

Logging
Central logging supports structured JSON with correlation IDs and stack traces. Control verbosity with `TURKIC_LOG_LEVEL` (DEBUG, INFO, WARNING, ERROR). Format via `TURKIC_LOG_FORMAT=json|rich` (default json). Entry points configure logging; libraries can call `turkic_translit.logging_config.setup()` to adopt the same config.
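
A library that wants the same configuration might set the two environment variables before calling `setup()`. The sketch below assumes `setup()` reads them at call time, which this README does not state explicitly; `set_logging_env` is a hypothetical helper.

```python
import os

VALID_LEVELS = {"DEBUG", "INFO", "WARNING", "ERROR"}
VALID_FORMATS = {"json", "rich"}


def set_logging_env(level: str = "INFO", fmt: str = "json") -> None:
    """Validate and export the logging variables described above."""
    level = level.upper()
    if level not in VALID_LEVELS:
        raise ValueError(f"unknown log level: {level!r}")
    if fmt not in VALID_FORMATS:
        raise ValueError(f"unknown log format: {fmt!r}")
    os.environ["TURKIC_LOG_LEVEL"] = level
    os.environ["TURKIC_LOG_FORMAT"] = fmt


set_logging_env("debug", "rich")

try:
    # Import path taken from this README; skipped when the package is absent.
    from turkic_translit.logging_config import setup
except ImportError:
    setup = None

if setup is not None:
    setup()
```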

Error service
Optional Sentry integration via `TURKIC_SENTRY_DSN` (and `TURKIC_ENV`, `TURKIC_SENTRY_TRACES`). Install with `pip install turkic-translit[sentry]`. Correlation IDs are generated per request/command; you can also set a fixed one using `TURKIC_CORRELATION_ID`.
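
To make several CLI invocations in one batch share a correlation ID, the variable can be fixed in the environment passed to each child process. This is a sketch only: `env_with_correlation_id` is a hypothetical helper, and it assumes any non-empty string is accepted as an ID.

```python
import os
import uuid
from typing import Dict, Mapping, Optional


def env_with_correlation_id(
    base: Optional[Mapping[str, str]] = None,
    correlation_id: Optional[str] = None,
) -> Dict[str, str]:
    """Copy an environment and pin TURKIC_CORRELATION_ID in the copy."""
    env = dict(os.environ if base is None else base)
    # Assumption: the package treats the value as an opaque string.
    env["TURKIC_CORRELATION_ID"] = correlation_id or uuid.uuid4().hex
    return env


env = env_with_correlation_id()
print(env["TURKIC_CORRELATION_ID"])
# Pass env=env to subprocess.run(...) for every command in the batch.
```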

## Project Organization

The project is organized into the following directories:

- `src/turkic_translit/` - Core source code for the package
- `examples/` - Example scripts showing how to use the package
  - `examples/web/` - Web interface for demonstrating transliteration features
- `data/` - Sample data files and language resources
- `docs/` - Documentation and reference materials
- `scripts/` - Utility scripts for development and release
  - `scripts/release/` - Scripts for building and publishing packages
- `vendor/pyicu/` - Pre-built PyICU wheels for Windows
- `tests/` - Test suite for the package

## FastText Language Identification Model

This package uses the [FastText language identification model](https://dl.fbaipublicfiles.com/fasttext/supervised-models/lid.176.bin) (`lid.176.bin`) for Russian token filtering and language detection. **The model file is not included in the repository or pip package due to its large size.**

**Automatic Download:**
- When you use features that require language identification (such as Russian token filtering or the Gradio web demo), the package will automatically download `lid.176.bin` from the official Facebook AI public link if it is not already present.
- The file will be saved in the package directory on first use.

**No manual action is needed.** This ensures compatibility with pip installs, Hugging Face Spaces, and other cloud environments.

If you need to download the model manually, you can do so from:
https://dl.fbaipublicfiles.com/fasttext/supervised-models/lid.176.bin
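
The download-if-missing behaviour described above can be sketched with the standard library alone. This is an illustration rather than the package's own code: `ensure_model` is a hypothetical name, and the caller chooses the target path.

```python
import shutil
import urllib.request
from pathlib import Path

MODEL_URL = "https://dl.fbaipublicfiles.com/fasttext/supervised-models/lid.176.bin"


def ensure_model(target: Path, url: str = MODEL_URL) -> Path:
    """Download ``url`` to ``target`` unless the file already exists."""
    target = Path(target)
    if target.exists():
        return target
    target.parent.mkdir(parents=True, exist_ok=True)
    partial = target.with_name(target.name + ".part")
    with urllib.request.urlopen(url) as response, open(partial, "wb") as out:
        shutil.copyfileobj(response, out)
    # Rename only after the download has finished, so an interrupted
    # transfer never leaves a truncated file under the final name.
    partial.replace(target)
    return target


# Example (downloads a large file on first use):
# model_path = ensure_model(Path("models") / "lid.176.bin")
```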


## Using the Examples

Use the main entry point script to run examples:

```bash
python turkic_tools.py [command]
```

Available commands:
- `web` - Launch the Gradio web interface for real-time transliteration
- `demo` - Run the simple CLI demo
- `full-demo` - Run the comprehensive demo with multiple languages
- `help` - Display available commands

Tokenizer training example
turkic-build-spm --input corpora/kk\_lat.txt,corpora/ky\_lat.txt --model\_prefix spm/turkic12k --vocab\_size 12000
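
For comparison, a roughly equivalent call through the SentencePiece Python API might look like the sketch below. It assumes `turkic-build-spm` wraps SentencePiece training, which this README implies but does not confirm; `spm_train_kwargs` and `train_tokenizer` are hypothetical helpers.

```python
from typing import Dict, List, Union


def spm_train_kwargs(
    inputs: List[str], model_prefix: str, vocab_size: int
) -> Dict[str, Union[str, int]]:
    """Map the CLI options above onto SentencePieceTrainer keyword arguments."""
    return {
        "input": ",".join(inputs),  # SentencePiece accepts a comma-separated list
        "model_prefix": model_prefix,
        "vocab_size": vocab_size,
    }


def train_tokenizer(inputs: List[str], model_prefix: str, vocab_size: int) -> None:
    """Train a model; writes <model_prefix>.model and <model_prefix>.vocab."""
    import sentencepiece as spm  # imported lazily so the helper above has no dependency

    spm.SentencePieceTrainer.train(**spm_train_kwargs(inputs, model_prefix, vocab_size))


kwargs = spm_train_kwargs(
    ["corpora/kk_lat.txt", "corpora/ky_lat.txt"], "spm/turkic12k", 12000
)
print(kwargs)
```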

Filtering Russian tokens from Uzbek
cat uz\_raw.txt | turkic-filter-russian --mode drop > uz\_clean.txt
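
The idea behind drop mode can be sketched as a token filter with a pluggable language check. This is not the tool's implementation: `drop_russian_tokens` is a hypothetical name, and the toy word list merely stands in for the fastText language-ID model the package uses.

```python
from typing import Callable, Iterable, Iterator


def drop_russian_tokens(
    lines: Iterable[str], is_russian: Callable[[str], bool]
) -> Iterator[str]:
    """Yield each line with the tokens flagged by ``is_russian`` removed."""
    for line in lines:
        kept = [tok for tok in line.split() if not is_russian(tok)]
        yield " ".join(kept)


# Toy stand-in for a real language-ID model (assumption for illustration).
RUSSIAN_WORDS = {"привет", "спасибо"}


def toy_is_russian(token: str) -> bool:
    return token.lower() in RUSSIAN_WORDS


print(list(drop_russian_tokens(["salom привет dunyo"], toy_is_russian)))
# → ['salom dunyo']
```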

Developer checklist
black .
ruff check .
pytest -q

All code is UTF-8-only; on Windows a BOM is written when piping to files to avoid encoding issues.
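
A comparable effect for files written directly from Python can be had with the `utf-8-sig` codec, which prepends a BOM. This is a sketch of the general technique, not the package's code; `output_encoding` and `write_text` are hypothetical helpers.

```python
import sys
from pathlib import Path


def output_encoding(platform: str = sys.platform) -> str:
    """Pick UTF-8 with a BOM on Windows and plain UTF-8 elsewhere."""
    return "utf-8-sig" if platform == "win32" else "utf-8"


def write_text(path: Path, text: str, platform: str = sys.platform) -> None:
    """Write ``text`` using the platform-appropriate UTF-8 variant."""
    Path(path).write_text(text, encoding=output_encoding(platform))


print(output_encoding())
```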

License
Apache-2.0

### Type-checking

```bash
pip install mypy
mypy --strict .
```

The included mypy.ini restricts analysis to the src/ tree and skips
build/, dist/, virtual-env and egg directories so duplicate-module
errors do not occur even if you build wheels locally.
