API Reference

High-level interface

Yosina - Japanese text transliteration library.

class yosina.TransliteratorRecipe(kanji_old_new: bool = False, replace_suspicious_hyphens_to_prolonged_sound_marks: bool = False, replace_combined_characters: bool = False, replace_circled_or_squared_characters: bool | Literal['exclude-emojis'] = False, replace_ideographic_annotations: bool = False, replace_radicals: bool = False, replace_spaces: bool = False, replace_hyphens: bool | list[Literal['jisx0208_90_windows', 'jisx0201']] = False, replace_mathematical_alphanumerics: bool = False, combine_decomposed_hiraganas_and_katakanas: bool = False, to_fullwidth: bool | Literal['u005c-as-yen-sign'] = False, to_halfwidth: bool | Literal['hankaku-kana'] = False, remove_ivs_svs: bool | Literal['drop-all-selectors'] = False, charset: Literal['unijis_2004', 'adobe_japan1'] = 'unijis_2004')

Configuration recipe for building transliterator chains.

charset: Literal['unijis_2004', 'adobe_japan1'] = 'unijis_2004'

Charset assumed during IVS/SVS transliteration. Default is “unijis_2004”.

combine_decomposed_hiraganas_and_katakanas: bool = False

Combine decomposed hiragana and katakana sequences (a base kana followed by a combining voiced or semi-voiced sound mark) into their single-character counterparts.

Example:

Input: “が” (か + ゙)
Output: “が” (single character)
Input: “ペ” (ヘ + ゚)
Output: “ペ” (single character)
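The composition described above can be illustrated with the standard library alone. This is a minimal sketch, assuming Unicode NFC normalization as a stand-in; it is not Yosina's implementation, and NFC also affects characters other than kana. The helper name `compose_kana` is hypothetical.

```python
import unicodedata


def compose_kana(text: str) -> str:
    """Compose base kana + combining (han)dakuten using Unicode NFC.

    Hypothetical helper for illustration only; NFC also composes
    non-kana sequences, which the library option may not do.
    """
    return unicodedata.normalize("NFC", text)


decomposed = "\u304b\u3099"  # か followed by U+3099 (combining voiced sound mark)
composed = compose_kana(decomposed)
print(len(decomposed), len(composed))
```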

kanji_old_new: bool = False

Replace codepoints that correspond to old-style kanji glyphs (旧字体; kyu-ji-tai) with their modern equivalents (新字体; shin-ji-tai).

Example:

Input: “舊字體の變換”
Output: “旧字体の変換”

remove_ivs_svs: bool | Literal['drop-all-selectors'] = False

Replace CJK ideographs that are followed by an IVS or SVS with their counterparts without selectors, based on Adobe-Japan1 character mappings. Specify “drop-all-selectors” to remove every selector from the result.

Example:

Input: “葛󠄀” (葛 + IVS U+E0100)
Output: “葛” (without selector)
Input: “辻󠄀” (辻 + IVS)
Output: “辻”
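To make the “drop-all-selectors” behavior concrete, here is a rough standard-library sketch that strips every code point in the two main variation selector blocks. It is an illustration under that assumption only: the default mode relies on character mapping tables that are not reproduced here, and `drop_all_selectors` is a hypothetical name, not a Yosina function.

```python
import re

# Variation selectors: VS1-VS16 (U+FE00-U+FE0F) and VS17-VS256 (U+E0100-U+E01EF).
_SELECTORS = re.compile("[\uFE00-\uFE0F\U000E0100-\U000E01EF]")


def drop_all_selectors(text: str) -> str:
    """Remove every variation selector, keeping the base characters."""
    return _SELECTORS.sub("", text)


with_ivs = "\u845b\U000E0100"  # 葛 followed by the IVS selector U+E0100
print(len(with_ivs), len(drop_all_selectors(with_ivs)))
```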

replace_circled_or_squared_characters: bool | Literal['exclude-emojis'] = False

Replace circled or squared characters with plain-text templates, such as the enclosed character wrapped in parentheses.

Example:

Input: “①②③”
Output: “(1)(2)(3)”
Input: “㊙㊗”
Output: “(秘)(祝)”

replace_combined_characters: bool = False

Replace combined characters (single code points that stand for several characters) with the character sequences they represent.

Example:

Input: “㍻” (single character for Heisei era)
Output: “平成”
Input: “㈱”
Output: “(株)”

replace_hyphens: bool | list[Literal['jisx0208_90_windows', 'jisx0201']] = False

Replace various dash or hyphen symbols with those common in Japanese writing.

Example:

Input: “2019—2020” (em dash)
Output: “2019-2020” (hyphen-minus)
Input: “A–B” (en dash)
Output: “A-B”

replace_ideographic_annotations: bool = False

Replace ideographic annotations used in the traditional method of Chinese-to-Japanese translation devised in ancient Japan.

Example:

Input: “㆖㆘” (ideographic annotations)
Output: “上下”

replace_mathematical_alphanumerics: bool = False

Replace mathematical alphanumerics with their plain ASCII equivalents.

Example:

Input: “𝐀𝐁𝐂” (mathematical bold)
Output: “ABC”
Input: “𝟏𝟐𝟑” (mathematical bold digits)
Output: “123”
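A similar folding can be approximated with the standard library. The sketch below assumes NFKC compatibility normalization restricted to the Mathematical Alphanumeric Symbols block; it is not Yosina's implementation, and the block also contains Greek letters, which fold to Greek rather than ASCII. The name `fold_math_alphanumerics` is hypothetical.

```python
import unicodedata


def fold_math_alphanumerics(text: str) -> str:
    """Fold styled mathematical letters and digits to their plain forms."""
    out = []
    for ch in text:
        # Only touch the Mathematical Alphanumeric Symbols block (U+1D400-U+1D7FF).
        if 0x1D400 <= ord(ch) <= 0x1D7FF:
            out.append(unicodedata.normalize("NFKC", ch))
        else:
            out.append(ch)
    return "".join(out)


bold_abc = "\U0001D400\U0001D401\U0001D402"  # mathematical bold A, B, C
print(fold_math_alphanumerics(bold_abc))
```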

replace_radicals: bool = False

Replace codepoints for the Kangxi radicals whose glyphs resemble those of CJK ideographs with the CJK ideograph counterparts.

Example:

Input: “⾔⾨⾷” (Kangxi radicals)
Output: “言門食” (CJK ideographs)

replace_spaces: bool = False

Replace various space characters with a plain space or with an empty string.

Example:

Input: “A B” (ideographic space U+3000) Output: “A B” (half-width space) Input: “A B” (non-breaking space U+00A0) Output: “A B” (regular space)

replace_suspicious_hyphens_to_prolonged_sound_marks: bool = False

Replace “suspicious” hyphens with prolonged sound marks, and vice versa.

Example:

Input: “データーベース”
Output: “データーベース” (no change when followed by ー)
Input: “スーパ−” (with hyphen-minus)
Output: “スーパー” (becomes prolonged sound mark)

to_fullwidth: bool | Literal['u005c-as-yen-sign'] = False

Replace half-width characters with their full-width equivalents. Specify “u005c-as-yen-sign” to treat backslash (U+005C) as yen sign in JIS X 0201.

Example:

Input: “ABC123”
Output: “ABC123”
Input: “ｶﾀｶﾅ”
Output: “カタカナ”

to_halfwidth: bool | Literal['hankaku-kana'] = False

Replace full-width characters with their half-width equivalents. Specify “hankaku-kana” to also convert katakana to their half-width forms.

Example:

Input: “ABC123”
Output: “ABC123”
Input: “カタカナ” (with hankaku-kana)
Output: “ｶﾀｶﾅ”
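For the ASCII range, the two width conversions above come down to a fixed code point offset, which the following sketch shows. It covers only U+0021 to U+007E and their full-width mirrors; the space character, the yen sign option, and katakana need lookup tables that are not shown. The function names are hypothetical and this is not Yosina's implementation.

```python
# Full-width forms U+FF01-U+FF5E mirror ASCII U+0021-U+007E at a fixed offset.
FULLWIDTH_OFFSET = 0xFEE0


def ascii_to_fullwidth(text: str) -> str:
    """Shift printable ASCII (except space) up to the full-width block."""
    return "".join(
        chr(ord(ch) + FULLWIDTH_OFFSET) if "!" <= ch <= "~" else ch
        for ch in text
    )


def fullwidth_to_ascii(text: str) -> str:
    """Shift full-width forms back down to printable ASCII."""
    return "".join(
        chr(ord(ch) - FULLWIDTH_OFFSET) if "\uff01" <= ch <= "\uff5e" else ch
        for ch in text
    )


wide = ascii_to_fullwidth("ABC123")
print(wide, fullwidth_to_ascii(wide))
```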

yosina.make_transliterator(configs_or_recipe: list[tuple[Literal['circled-or-squared', 'combined', 'hira-kata-composition', 'hyphens', 'ideographic-annotations', 'ivs-svs-base', 'jisx0201-and-alike', 'kanji-old-new', 'mathematical-alphanumerics', 'prolonged-sound-marks', 'radicals', 'spaces'], dict[str, Any]] | Literal['circled-or-squared', 'combined', 'hira-kata-composition', 'hyphens', 'ideographic-annotations', 'ivs-svs-base', 'jisx0201-and-alike', 'kanji-old-new', 'mathematical-alphanumerics', 'prolonged-sound-marks', 'radicals', 'spaces']] | TransliteratorRecipe) → Callable[[str], str]

Frontend convenience function to create a string-to-string transliterator.

This is the main entry point for the library. It accepts either a recipe or a list of transliterator configs and returns a function that can transliterate strings.

Parameters:

configs_or_recipe – Either a TransliteratorRecipe object, or a list whose items are transliterator names or (name, options) tuples

Returns:

A function that takes a string and returns a transliterated string

Example:

Using a recipe:

>>> from yosina import make_transliterator, TransliteratorRecipe
>>> recipe = TransliteratorRecipe(
...     kanji_old_new=True,
...     replace_spaces=True
... )
>>> transliterator = make_transliterator(recipe)
>>> result = transliterator("some japanese text")

Using configs directly:

>>> configs = [("kanji-old-new", {}), ("spaces", {})]
>>> transliterator = make_transliterator(configs)
>>> result = transliterator("some japanese text")

Character object

Character array building and string conversion utilities.

class yosina.chars.Char(c: str, offset: int, source: Char | None = None)

Represents a character with metadata for transliteration.

c: str

The character string

offset: int

The offset position in the original text

source: Char | None = None

Optional reference to the original character

yosina.chars.build_char_list(input_str: str) → list[Char]

Build a list of characters from a string, handling IVS/SVS sequences.

This function properly handles Ideographic Variation Sequences (IVS) and Standardized Variation Sequences (SVS) by combining base characters with their variation selectors into single Char objects.

Parameters:

input_str – The input string to convert into a list of Char objects

Returns:

A list of Char objects representing the input string, with a sentinel empty character at the end

yosina.chars.from_chars(chars: Iterable[Char]) → str

Convert an iterable of characters back to a string.

This function filters out sentinel characters (empty strings) that are used internally by the transliteration system.

Parameters:

chars – An iterable of Char objects

Returns:

A string composed of the non-empty characters
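The round trip between strings and character lists described in this section can be sketched as follows. This is a simplified re-implementation of the documented behavior, not the library's code: the names `SketchChar`, `sketch_build_char_list`, and `sketch_from_chars` are hypothetical, and counting offsets in code points is an assumption.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class SketchChar:
    c: str                               # the character (base plus optional selector)
    offset: int                          # assumed: position in code points
    source: Optional[SketchChar] = None  # optional link to the original character


def _is_variation_selector(ch: str) -> bool:
    cp = ord(ch)
    return 0xFE00 <= cp <= 0xFE0F or 0xE0100 <= cp <= 0xE01EF


def sketch_build_char_list(input_str: str) -> list[SketchChar]:
    chars = []
    i = 0
    while i < len(input_str):
        start = i
        i += 1
        # Attach one trailing variation selector to its base character.
        if i < len(input_str) and _is_variation_selector(input_str[i]):
            i += 1
        chars.append(SketchChar(input_str[start:i], start))
    # Sentinel: an empty character marking the end of the input.
    chars.append(SketchChar("", len(input_str)))
    return chars


def sketch_from_chars(chars: Iterable[SketchChar]) -> str:
    # Skip sentinels (empty strings) when joining back into a string.
    return "".join(ch.c for ch in chars if ch.c)


text = "\u845b\U000E0100\u8fbb"  # 葛 + IVS selector, then 辻
chars = sketch_build_char_list(text)
print(len(chars), sketch_from_chars(chars) == text)
```

Keeping the selector inside the same element as its base character means a later step can match on the whole sequence instead of tracking state between neighboring elements.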