Metadata-Version: 2.4
Name: olaph
Version: 0.1.5
Summary: A multilingual phonemizer combining lexica, NLP, and probabilistic scoring for improved phonemization accuracy.
Author-email: Johannes Wirth <johannes.wirth.3@iisys.de>
License: MIT
Project-URL: Homepage, https://github.com/iisys-hof/olaph
Project-URL: Documentation, https://github.com/iisys-hof/olaph#readme
Project-URL: Issues, https://github.com/iisys-hof/olaph/issues
Keywords: phonemizer,text-to-speech,linguistics,NLP,multilingual
Classifier: Intended Audience :: Developers
Classifier: Topic :: Text Processing :: Linguistic
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: inflect==7.5.0
Requires-Dist: spacy
Requires-Dist: lingua-language-detector==2.1.1
Requires-Dist: num2words==0.5.14
Requires-Dist: requests==2.32.5
Requires-Dist: annotated-types==0.7.0
Requires-Dist: blis==1.3.0
Requires-Dist: catalogue==2.0.10
Requires-Dist: certifi==2025.10.5
Requires-Dist: charset-normalizer==3.4.3
Requires-Dist: click==8.3.0
Requires-Dist: cloudpathlib==0.22.0
Requires-Dist: colorama==0.4.6
Requires-Dist: confection==0.1.5
Requires-Dist: cymem==2.0.11
Requires-Dist: docopt==0.6.2
Requires-Dist: idna==3.10
Requires-Dist: jinja2==3.1.6
Requires-Dist: langcodes==3.5.0
Requires-Dist: language-data==1.3.0
Requires-Dist: marisa-trie==1.3.1
Requires-Dist: markdown-it-py==4.0.0
Requires-Dist: markupsafe==3.0.3
Requires-Dist: mdurl==0.1.2
Requires-Dist: murmurhash==1.0.13
Requires-Dist: numpy==2.2.0
Requires-Dist: packaging==25.0
Requires-Dist: preshed==3.0.10
Requires-Dist: pydantic==2.11.10
Requires-Dist: pydantic-core==2.33.2
Requires-Dist: pygments==2.19.2
Requires-Dist: rich==14.1.0
Requires-Dist: setuptools==80.9.0
Requires-Dist: shellingham==1.5.4
Requires-Dist: smart-open==7.3.1
Requires-Dist: spacy-legacy==3.0.12
Requires-Dist: spacy-loggers==1.0.5
Requires-Dist: srsly==2.5.1
Requires-Dist: thinc==8.3.6
Requires-Dist: tqdm==4.67.1
Requires-Dist: typer==0.19.2
Requires-Dist: typing-inspection==0.4.2
Requires-Dist: typing-extensions==4.15.0
Requires-Dist: urllib3==2.5.0
Requires-Dist: wasabi==1.1.3
Requires-Dist: weasel==0.4.1
Requires-Dist: wrapt==1.17.3
Requires-Dist: pytest
Provides-Extra: dev
Requires-Dist: pytest; extra == "dev"
Requires-Dist: black; extra == "dev"
Requires-Dist: flake8; extra == "dev"
Dynamic: license-file

# OLaPh — Optimal Language Phonemizer

[![PyPI version](https://img.shields.io/pypi/v/olaph.svg?logo=pypi)](https://pypi.org/project/olaph/)
[![Python versions](https://img.shields.io/pypi/pyversions/olaph.svg)](https://pypi.org/project/olaph/)
[![License: MIT](https://img.shields.io/badge/License-MIT-blue.svg)](https://opensource.org/licenses/MIT)

**OLaPh (Optimal Language Phonemizer)** is a multilingual phonemization framework that converts text into phonemes. In evaluations on German and English, it achieved higher accuracy than comparable phonemizers.

---

## Overview

Traditional phonemizers rely on simple rule-based mappings or lexicon lookups.
Neural and hybrid approaches improve generalization but still struggle with:

- Names and foreign words
- Abbreviations and acronyms
- Loanwords and compounds
- Ambiguous homographs

**OLaPh** tackles these challenges by combining:

- Extensive **language-specific dictionaries**
- **Abbreviation, number, and letter normalization**
- **Compound resolution with probabilistic scoring**
- **Cross-language handling**
- **NLP-based preprocessing** via [spaCy](https://spacy.io) and [Lingua](https://github.com/pemistahl/lingua-py)

Evaluations in **German** and **English** show improved accuracy and robustness over existing phonemizers, including on challenging multilingual datasets.

---

## Features

- Multilingual phonemization (DE, EN, FR, ES)
- Abbreviation and letter pronunciation dictionaries
- Number normalization
- Cross-language acronym detection
- Compound splitting with probabilistic scoring
- Freely available lexica, derived from wiktionary.org, for research and development
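
To give a feel for what "compound splitting with probabilistic scoring" can mean, here is a generic toy sketch. It is not OLaPh's actual scoring function, which is described in the paper. It assumes a hypothetical lexicon with made-up relative frequencies and scores each segmentation by the sum of the log-frequencies of its parts.

```python
import math
from functools import lru_cache

# Hypothetical toy lexicon: entry -> relative frequency (made-up numbers,
# not taken from the OLaPh lexica).
TOY_LEXICON = {
    "hand": 0.03,
    "schuh": 0.02,
    "fach": 0.01,
    "handschuh": 0.004,
}

def best_split(word, lexicon):
    """Return (log_score, parts) for the best-scoring split, or None.

    A split is valid only if every part is a lexicon entry. Its score is
    the sum of the log-frequencies of its parts (a simple unigram model).
    """
    word = word.lower()

    @lru_cache(maxsize=None)
    def solve(start):
        # Best segmentation of word[start:], or None if none exists.
        if start == len(word):
            return (0.0, ())
        best = None
        for end in range(start + 1, len(word) + 1):
            piece = word[start:end]
            if piece not in lexicon:
                continue
            rest = solve(end)
            if rest is None:
                continue
            score = math.log(lexicon[piece]) + rest[0]
            if best is None or score > best[0]:
                best = (score, (piece,) + rest[1])
        return best

    return solve(0)
```

With this toy lexicon, `best_split("Handschuhfach", TOY_LEXICON)` prefers `("handschuh", "fach")` over `("hand", "schuh", "fach")`, because the two-part split has the higher combined probability. A word with no valid segmentation yields `None`.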

## Large Language Model

An LLM trained on OLaPh output is also available. It is a GemmaX 2B model trained on ~10M sentences from the FineWeb corpus, phonemized with the OLaPh framework.

It is available on [Hugging Face](https://huggingface.co/iisys-hof/olaph).

---

## Installation

### From PyPI

```bash
pip install olaph
python -m spacy download de_core_news_sm
python -m spacy download en_core_web_sm
python -m spacy download es_core_news_sm
python -m spacy download fr_core_news_sm
```

### From source

```bash
git clone https://github.com/iisys-hof/olaph.git
cd olaph
pip install -e .
python -m spacy download de_core_news_sm
python -m spacy download en_core_web_sm
python -m spacy download es_core_news_sm
python -m spacy download fr_core_news_sm
```
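
Both installation routes download four spaCy models. As a quick sanity check, the sketch below reports which of them cannot be found in the current environment. It assumes the models were installed with `python -m spacy download`, which installs each model as an importable Python package. `missing_models` is a hypothetical helper and not part of OLaPh.

```python
import importlib.util

# Model names taken from the install commands above.
REQUIRED_MODELS = [
    "de_core_news_sm",
    "en_core_web_sm",
    "es_core_news_sm",
    "fr_core_news_sm",
]

def missing_models(names=REQUIRED_MODELS):
    """Return the names that cannot be found as installed packages."""
    return [name for name in names if importlib.util.find_spec(name) is None]

if __name__ == "__main__":
    missing = missing_models()
    if missing:
        # Print the download command for each model that was not found.
        for name in missing:
            print(f"missing: python -m spacy download {name}")
    else:
        print("All spaCy models found.")
```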

## Example Usage

```python
from olaph import Olaph

phonemizer = Olaph()

output = phonemizer.phonemize_text("He ordered a Brezel and a beer in a tavern near München.", lang="en")

print(output)
```
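
To phonemize several sentences in different languages, the call above can be wrapped in a small loop. This is a minimal sketch: `phonemize_many` is a hypothetical helper, not part of OLaPh, and it relies only on the `phonemize_text(text, lang=...)` call shown above. The return type of `phonemize_text` is not documented here, so results are passed through unchanged. The language code `"de"` is assumed by analogy with `"en"`.

```python
from typing import Any, Iterable

def phonemize_many(phonemizer: Any, items: Iterable[tuple[str, str]]) -> list[Any]:
    """Phonemize (text, lang) pairs with any object exposing phonemize_text()."""
    return [phonemizer.phonemize_text(text, lang=lang) for text, lang in items]

# Usage with OLaPh (requires the package and the spaCy models to be installed):
#
#   from olaph import Olaph
#
#   results = phonemize_many(Olaph(), [
#       ("He ordered a beer in a tavern.", "en"),
#       ("Er bestellte ein Bier.", "de"),  # "de" is an assumed language code
#   ])
#   for result in results:
#       print(result)
```

Creating one `Olaph()` instance and reusing it for all sentences avoids repeating whatever setup the constructor performs.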

---

## Dependencies

- [spaCy](https://spacy.io)
- [Lingua](https://github.com/pemistahl/lingua-py)
- [num2words](https://github.com/savoirfairelinux/num2words)
- [inflect](https://github.com/jaraco/inflect)

---

## Research Summary

Phonemization, the conversion of text into phonemes, is a key step in text-to-speech. Traditional approaches use rule-based transformations and lexicon lookups, while more advanced methods apply preprocessing techniques or neural networks for improved accuracy on out-of-domain vocabulary. However, all systems struggle with names, loanwords, abbreviations, and homographs. This work presents OLaPh (Optimal Language Phonemizer), a framework that combines large lexica, multiple NLP techniques, and compound resolution with a probabilistic scoring function. Evaluations in German and English show improved accuracy over previous approaches, including on a challenging dataset. To further address unresolved cases, we train a large language model on OLaPh-generated data, which achieves even stronger generalization and performance. Together, the framework and LLM improve phonemization consistency and provide a freely available resource for future research.

---

## Citation

If you use OLaPh in academic work, please cite:

```bibtex
@misc{wirth2025olaphoptimallanguagephonemizer,
      title={OLaPh: Optimal Language Phonemizer},
      author={Johannes Wirth},
      year={2025},
      eprint={2509.20086},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2509.20086},
}
```
