# PyHanNom

## 📖 Introduction  
**PyHanNom** is a Python package dedicated to the **modernization and digitization of Han-Nom characters**, the classical script system of Vietnam. In the era of **artificial intelligence**, where Python has become the dominant language for computation and research, it is essential to provide robust tools that connect **modern Vietnamese Latin script (Quốc Ngữ)** with its **Han-Nom heritage**.  

This package offers a foundation for computational work with Han-Nom by enabling **bidirectional lookup and conversion ↔️** at the syllable, character, and word levels. Beyond simple character matching, PyHanNom is designed as part of a broader effort to make Han-Nom resources **machine-readable 💻, searchable 🔍, and ready for integration into AI-driven applications 🚀**.  

By bridging modern orthography with historical scripts, PyHanNom contributes to:  
- 🏛️ the **preservation and revival** of Han-Nom in digital form,  
- 📚 the **development of linguistic resources** for computational linguistics and digital humanities,  
- 🔮 and the **future of AI applications**, such as building parallel corpora between Latin Vietnamese and Han-Nom, or supporting natural language processing tasks.  

PyHanNom is therefore not only a practical tool for researchers and developers today, but also a step toward the **long-term goal of bringing Han-Nom into the digital and AI era 🌏**.

---

## 📂 Data Source
- This project includes data derived from the **"委班復生漢喃越南 Ủy ban Phục sinh Hán Nôm Việt Nam"** project:  
  - https://www.hannom-rcv.org/BCHNCTD.html  
  - https://www.hannom-rcv.org/Lookup-CHNC.html  
- All rights to the original data belong to the *委班復生漢喃越南 Ủy ban Phục sinh Hán Nôm Việt Nam*.  
- The data included in this package is provided strictly for **research and educational purposes**. Commercial use of the data is **NOT permitted**.  
- Since the mappings are directly derived from this source, **all case handling and annotations in PyHanNom remain exactly consistent with the original data**. For details, please refer to the original source above.  

---

## ⚙️ Installation
```bash
pip install pyhannom
```

---

## 🚀 Usage

### 1. Load the syllable–character table
Before using any lookup functions, you need to load the Han-Nom syllable–character mapping table:

```python
from pyhannom import load_syllable_char_table

hannom_syllable_char_table = load_syllable_char_table()
```

Pass this `hannom_syllable_char_table` handle as the first argument to the syllable- and character-level functions below, and to `load_word_table`.

#### Function signature
```python
load_syllable_char_table()
```

- **returns**: a syllable/character-level mapping table to be used with syllable/character-level functions.

---

### 2. Convert Latin syllable → Han-Nom character
Use `get_chuhannom_from_latin` to retrieve the Han-Nom character(s) corresponding to a given Vietnamese Latin syllable.

When a result entry contains a character in brackets, such as `鶯（𦾉）`, the bracketed character is the simplified form of the character that precedes it.

```python
from pyhannom import get_chuhannom_from_latin

result = get_chuhannom_from_latin(
    hannom_syllable_char_table,
    "buộc"
)
print(result)  # e.g. ['𫃚']
```

```python
from pyhannom import get_chuhannom_from_latin

result = get_chuhannom_from_latin(
    hannom_syllable_char_table,
    "anh"
)
print(result)  # e.g. ['英', '英', '映', '罌', '嚶', '櫻', '鶯（𦾉）', '鸚']
```

#### Function signature
```python
get_chuhannom_from_latin(
    handle,
    input_latin_syllable: str,
    normalize_input_case: bool = True,
    case_insensitive_match: bool = True
)
```

- **handle**: the syllable–character table loaded by `load_syllable_char_table`.
- **input_latin_syllable** *(str)*: a single Vietnamese syllable in Latin script.
- **normalize_input_case** *(bool, default=True)*: whether to normalize the input to lowercase before matching.
- **case_insensitive_match** *(bool, default=True)*: whether to perform case-insensitive matching.
- **returns**: a list of the matched Han-Nom characters corresponding to the input syllable.
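
Since an entry may carry a bracketed simplified form, downstream code often needs to separate the two characters. The following is a minimal sketch that works only on the strings shown in the example output above; `split_variant` is a hypothetical helper, not part of PyHanNom, and it assumes the bracket style seen in the examples (fullwidth or ASCII parentheses).

```python
import re
from typing import Optional, Tuple

# Matches "X（Y）" or "X (Y)": a main form, optional whitespace,
# then one bracketed form in fullwidth or ASCII parentheses.
_VARIANT = re.compile(r"^(?P<main>[^(（\s]+)\s*[(（](?P<variant>[^)）]+)[)）]$")


def split_variant(entry: str) -> Tuple[str, Optional[str]]:
    """Return (main form, bracketed form or None) for one result entry."""
    entry = entry.strip()
    match = _VARIANT.match(entry)
    if match is None:
        # No bracket found: the entry is a plain character.
        return entry, None
    return match.group("main"), match.group("variant")


# Sample entries copied from the example output above.
sample = ['英', '映', '鶯（𦾉）', '鸚']
pairs = [split_variant(e) for e in sample]
print(pairs)
```

Entries without a bracket come back with `None` in the second position, so callers can treat both shapes uniformly.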

---

### 3. Convert Latin syllable → Han-Nom Unicode code points
Use `get_chuhannom_unicode_from_latin` to retrieve the **Unicode code points** of the Han-Nom character(s) corresponding to a given Vietnamese Latin syllable.  

This function has the same input parameters as `get_chuhannom_from_latin`, but instead of returning the characters themselves, it returns their Unicode representations.

A code point in brackets belongs to the bracketed (simplified) character described in the previous section.

```python
from pyhannom import get_chuhannom_unicode_from_latin

result = get_chuhannom_unicode_from_latin(
    hannom_syllable_char_table,
    "buộc"
)
print(result)  # e.g. ['U+2B0DA']
```

```python
from pyhannom import get_chuhannom_unicode_from_latin

result = get_chuhannom_unicode_from_latin(
    hannom_syllable_char_table,
    "anh"
)
print(result)  
# e.g. ['U+82F1', 'U+82F1', 'U+6620', 'U+7F4C', 'U+56B6', 'U+6AFB', 'U+9DAF (U+26F89)', 'U+9E1A']
```

#### Function signature
```python
get_chuhannom_unicode_from_latin(
    handle,
    input_latin_syllable: str,
    normalize_input_case: bool = True,
    case_insensitive_match: bool = True
)
```

- **handle**: the syllable–character table loaded by `load_syllable_char_table`.
- **input_latin_syllable** *(str)*: a single Vietnamese syllable in Latin script.
- **normalize_input_case** *(bool, default=True)*: whether to normalize the input to lowercase before matching.
- **case_insensitive_match** *(bool, default=True)*: whether to perform case-insensitive matching.
- **returns**: a list of Unicode code points (as strings) corresponding to the matched Han-Nom characters.
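
The `U+XXXX` strings can be turned back into characters with the standard library alone. This is a small sketch under the assumption that every entry follows the format shown in the examples above; `code_points_to_chars` and `char_to_code_point` are hypothetical helper names, not PyHanNom functions.

```python
import re
from typing import List

# One "U+" prefix followed by 4 to 6 hexadecimal digits.
_CODE_POINT = re.compile(r"U\+([0-9A-Fa-f]{4,6})")


def code_points_to_chars(entry: str) -> List[str]:
    """Decode every 'U+XXXX' found in an entry, bracketed ones included."""
    return [chr(int(hex_digits, 16)) for hex_digits in _CODE_POINT.findall(entry)]


def char_to_code_point(ch: str) -> str:
    """Format a single character as 'U+XXXX' (at least four hex digits)."""
    return f"U+{ord(ch):04X}"


print(code_points_to_chars('U+82F1'))            # ['英']
print(code_points_to_chars('U+9DAF (U+26F89)'))  # main form, then bracketed form
print(char_to_code_point('英'))                  # U+82F1
```

For an entry with a bracketed code point, the decoded list holds the main character first and the bracketed one second.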

---

### 4. Convert Han-Nom character → Latin syllable
Use `get_latin_from_chuhannom` to retrieve the Vietnamese Latin syllable(s) corresponding to a given Han-Nom character.

```python
from pyhannom import get_latin_from_chuhannom

result = get_latin_from_chuhannom(
    hannom_syllable_char_table,
    "心"
)
print(result)  # e.g. ['TÂM', 'tim']
```

#### Function signature
```python
get_latin_from_chuhannom(
    handle,
    input_chuhannom: str,
    normalize_output_case: bool = False
)
```

- **handle**: the syllable–character table loaded by `load_syllable_char_table`.  
- **input_chuhannom** *(str)*: a single Han-Nom character.  
- **normalize_output_case** *(bool, default=False)*: whether to normalize all returned Latin syllables to lowercase.  
- **returns**: a list of corresponding Vietnamese Latin syllables.
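
If you keep the default `normalize_output_case=False` and decide later that case does not matter, the result can be lowercased after the fact. The sketch below is an illustration of that post-processing step, not the package's internal logic; `lowercase_unique` is a hypothetical helper, and dropping duplicates is a choice made here rather than documented PyHanNom behavior.

```python
from typing import Iterable, List


def lowercase_unique(syllables: Iterable[str]) -> List[str]:
    """Lowercase each syllable and drop duplicates, keeping first-seen order."""
    seen = set()
    result = []
    for syllable in syllables:
        lowered = syllable.lower()
        if lowered not in seen:
            seen.add(lowered)
            result.append(lowered)
    return result


readings = ['TÂM', 'tim']  # copied from the example output above
print(lowercase_unique(readings))  # ['tâm', 'tim']
```

Keeping the original list around is worthwhile, because the source data uses case deliberately (see the Data Source section).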

---

### 5. Load the word-level table
Before using word-level lookup functions, you need to load the Han-Nom **word-level mapping table**.  

This table is built on top of the syllable–character table:

```python
from pyhannom import load_word_table

hannom_word_table = load_word_table(hannom_syllable_char_table)
```

#### Function signature
```python
load_word_table(handle)
```

- **handle**: the syllable–character table loaded by `load_syllable_char_table`.  
- **returns**: a word-level mapping table to be used with word-level functions.

---

### 6. Convert Latin syllables → Han-Nom words
Use `get_chuhannom_word_from_latin` to retrieve Han-Nom word(s) corresponding to a given sequence of Vietnamese Latin syllables.  

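The function checks whether every provided Latin syllable occurs as a substring of a Latin word in the word-level dictionary, and returns all matching Han-Nom words together with their Latin equivalents (and annotations, where present). Because matching is by substring, partial syllables also match, and the order of the syllables does not matter: in the first example below, `ác tâ` matches both `cách tân` and `tân khách`.

```python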
from pyhannom import get_chuhannom_word_from_latin

result = get_chuhannom_word_from_latin(
    hannom_word_table,
    "ác tâ"
)
print(result)
# e.g. {('革新', 'cách tân'), ('賓客', 'tân khách'), ('惡心', 'ác tâm')}
```

```python
from pyhannom import get_chuhannom_word_from_latin

result = get_chuhannom_word_from_latin(
    hannom_word_table,
    "hưng hửng"
)
print(result)
# e.g. {('烝𬋙', 'chưng hửng', '[𠸨]'), ('𬋙𬋙', 'hưng hửng', '[𠸨]')}
```

#### Function signature
```python
get_chuhannom_word_from_latin(
    handle,
    input_latin_syllables: str,
    normalize_input_case: bool = False,
    case_insensitive_match: bool = True
)
```

- **handle**: the word-level table loaded by `load_word_table`.  
- **input_latin_syllables** *(str)*: one or more Vietnamese Latin syllables.  
- **normalize_input_case** *(bool, default=False)*: whether to normalize the input to lowercase before matching.  
- **case_insensitive_match** *(bool, default=True)*: whether to perform case-insensitive matching.  
- **returns**: a `set` of tuples. Each tuple contains:  
  1. Han-Nom word (string)  
  2. Corresponding Latin word (string)  
  3. Annotation (string), included only for entries that have one, so each tuple has two or three elements
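
The matching rule described above can be sketched on a toy dictionary. This is an approximation written from the description and the example output, not the package's actual implementation; `match_latin` and `toy_entries` are hypothetical names, and splitting the query on whitespace is an assumption.

```python
import unicodedata
from typing import Iterable, Set, Tuple


def _norm(text: str, case_insensitive: bool = True) -> str:
    # NFC keeps precomposed and decomposed Vietnamese letters comparable.
    text = unicodedata.normalize("NFC", text)
    return text.lower() if case_insensitive else text


def match_latin(
    entries: Iterable[Tuple[str, ...]],
    query: str,
    case_insensitive: bool = True,
) -> Set[Tuple[str, ...]]:
    """Keep entries whose Latin word contains every query token as a substring."""
    tokens = _norm(query, case_insensitive).split()
    if not tokens:
        return set()
    return {
        entry
        for entry in entries
        if all(token in _norm(entry[1], case_insensitive) for token in tokens)
    }


# Toy data: three entries from the example output plus one non-matching entry.
toy_entries = [
    ('革新', 'cách tân'),
    ('賓客', 'tân khách'),
    ('惡心', 'ác tâm'),
    ('稱雄', 'xưng hùng'),
]
print(match_latin(toy_entries, "ác tâ"))
```

On this toy data the query `ác tâ` selects the first three entries, which mirrors the example output shown earlier in this section.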

---

### 7. Convert Han-Nom words → Latin words
Use `get_latin_word_from_chuhannom` to retrieve the Vietnamese Latin word(s) corresponding to a given Han-Nom word or phrase.  
The function checks whether the provided Han-Nom string (one or more characters) occurs as a substring within any Han-Nom word in the word-level dictionary. If this condition is satisfied, it returns all matching words together with their Latin equivalents (and optional annotations).

```python
from pyhannom import get_latin_word_from_chuhannom

result = get_latin_word_from_chuhannom(
    hannom_word_table,
    "稱雄"
)
print(result)
# e.g. {('稱雄', 'xưng hùng'), ('稱雄稱霸', 'xưng hùng xưng bá')}
```

```python
from pyhannom import get_latin_word_from_chuhannom

result = get_latin_word_from_chuhannom(
    hannom_word_table,
    "汴𠲅"
)
print(result)
# e.g. {('汴𠲅', 'bin (pin) sạc', '[摱]')}
```

#### Function signature
```python
get_latin_word_from_chuhannom(
    handle,
    input_chuhannom: str,
    normalize_output_case: bool = False
)
```

- **handle**: the word-level table loaded by `load_word_table`.  
- **input_chuhannom** *(str)*: one or more Han-Nom characters forming a string.  
- **normalize_output_case** *(bool, default=False)*: whether to normalize all returned Latin words to lowercase.  
- **returns**: a `set` of tuples. Each tuple contains:  
  1. Han-Nom word (string)  
  2. Corresponding Latin word (string)  
  3. Annotation (string), included only for entries that have one, so each tuple has two or three elements
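
Because the returned tuples may have two or three elements, it can help to normalize them into one shape before further processing. The following sketch works on the tuples shown in the example output above; `WordEntry` and `to_entry` are hypothetical names introduced for illustration and are not provided by PyHanNom.

```python
from typing import NamedTuple, Optional, Sequence


class WordEntry(NamedTuple):
    hannom: str
    latin: str
    annotation: Optional[str] = None


def to_entry(raw: Sequence[str]) -> WordEntry:
    """Convert a 2- or 3-element result tuple into a fixed-shape record."""
    hannom, latin, *rest = raw
    return WordEntry(hannom, latin, rest[0] if rest else None)


# Sample tuples copied from the example output above.
results = {('稱雄', 'xưng hùng'), ('汴𠲅', 'bin (pin) sạc', '[摱]')}

# Sets are unordered, so sort by the Latin word for a stable listing.
entries = sorted((to_entry(r) for r in results), key=lambda e: e.latin)
for entry in entries:
    print(entry.hannom, entry.latin, entry.annotation)
```

The same helper applies to the output of `get_chuhannom_word_from_latin`, which returns tuples of the same shape.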

---

## 📜 License
- **Code**: Licensed under the [MIT License](LICENSE).  
- **Data**: Derived from the *委班復生漢喃越南 Ủy ban Phục sinh Hán Nôm Việt Nam* project. Redistribution or modification of the data must include proper attribution and comply with the requirements set by the original source.  

---

## 🤝 Contributing
- Pull requests are welcome.  
- For major changes, please open an issue first to discuss what you’d like to change.  
- Make sure to update tests as appropriate.

---

## 🌏 Acknowledgments
- This project makes use of data derived from the **"委班復生漢喃越南 Ủy ban Phục sinh Hán Nôm Việt Nam"** project:  
  - https://www.hannom-rcv.org/BCHNCTD.html  
  - https://www.hannom-rcv.org/Lookup-CHNC.html  
- All rights to the original data belong to the *委班復生漢喃越南 Ủy ban Phục sinh Hán Nôm Việt Nam*.  
- I would like to express my gratitude to the open-source Han-Nom community and the *委班復生漢喃越南 Ủy ban Phục sinh Hán Nôm Việt Nam* project for making these resources available for research and educational purposes.  
- If there is any infringement or concern regarding the use of this data, please contact me immediately. I will respond promptly to resolve the issue, including the possibility of removing this project if necessary.  

---
