API Reference
High-level interface
Yosina - Japanese text transliteration library.
- class yosina.TransliteratorRecipe(kanji_old_new: bool = False, replace_suspicious_hyphens_to_prolonged_sound_marks: bool = False, replace_combined_characters: bool = False, replace_circled_or_squared_characters: bool | Literal['exclude-emojis'] = False, replace_ideographic_annotations: bool = False, replace_radicals: bool = False, replace_spaces: bool = False, replace_hyphens: bool | list[Literal['jisx0208_90_windows', 'jisx0201']] = False, replace_mathematical_alphanumerics: bool = False, combine_decomposed_hiraganas_and_katakanas: bool = False, to_fullwidth: bool | Literal['u005c-as-yen-sign'] = False, to_halfwidth: bool | Literal['hankaku-kana'] = False, remove_ivs_svs: bool | Literal['drop-all-selectors'] = False, charset: Literal['unijis_2004', 'adobe_japan1'] = 'unijis_2004')
Configuration recipe for building transliterator chains.
- charset: Literal['unijis_2004', 'adobe_japan1'] = 'unijis_2004'
Charset assumed during IVS/SVS transliteration. Default is “unijis_2004”.
- combine_decomposed_hiraganas_and_katakanas: bool = False
Combine decomposed hiragana and katakana (a base kana followed by a combining voiced or semi-voiced sound mark) into their single precomposed counterparts.
- Example:
Input: “が” (か + ゙) Output: “が” (single character)
Input: “ペ” (ヘ + ゚) Output: “ペ” (single character)
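The effect described above resembles Unicode canonical composition. The following is a minimal sketch using only the standard library, not yosina's implementation; `compose_kana` is a hypothetical helper name, and NFC normalization also affects characters other than kana, so it only approximates the option.

```python
import unicodedata


def compose_kana(text: str) -> str:
    """Recombine base kana with combining sound marks via NFC (approximation)."""
    return unicodedata.normalize("NFC", text)


decomposed = "\u304b\u3099"  # か followed by the combining voiced sound mark
composed = compose_kana(decomposed)

print(len(decomposed), len(composed))  # 2 1
print(composed == "\u304c")            # True: が as a single code point
```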
- kanji_old_new: bool = False
Replace codepoints that correspond to old-style kanji glyphs (旧字体; kyu-ji-tai) with their modern equivalents (新字体; shin-ji-tai).
- Example:
Input: “舊字體の變換” Output: “旧字体の変換”
- remove_ivs_svs: bool | Literal['drop-all-selectors'] = False
Replace CJK ideographs followed by an IVS or SVS with the corresponding ideographs without selectors, based on Adobe-Japan1 character mappings. Specify “drop-all-selectors” to remove all selectors from the result.
- Example:
Input: “葛󠄀” (葛 + IVS U+E0100) Output: “葛” (without selector) Input: “辻󠄀” (辻 + IVS) Output: “辻”
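To make the “drop-all-selectors” behaviour concrete, here is a rough standard-library sketch that strips every variation selector unconditionally. It is an illustration only: `drop_all_selectors` is a hypothetical function, and it does not perform the charset-aware mapping that the default mode describes.

```python
import re

# Variation selectors: VS1-VS16 (U+FE00-U+FE0F) and VS17-VS256 (U+E0100-U+E01EF).
_SELECTORS = re.compile("[\uFE00-\uFE0F\U000E0100-\U000E01EF]")


def drop_all_selectors(text: str) -> str:
    """Remove every variation selector, keeping only the base characters."""
    return _SELECTORS.sub("", text)


with_ivs = "\u845b\U000E0100"  # 葛 followed by IVS U+E0100
print(drop_all_selectors(with_ivs) == "\u845b")  # True
```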
- replace_circled_or_squared_characters: bool | Literal['exclude-emojis'] = False
Replace circled or squared characters with plain-text renderings generated from templates, such as the enclosed character wrapped in parentheses.
- Example:
Input: “①②③” Output: “(1)(2)(3)”
Input: “㊙㊗” Output: “(秘)(祝)”
- replace_combined_characters: bool = False
Replace combined characters (single code points that stand for a sequence of characters) with the equivalent character sequences.
- Example:
Input: “㍻” (single character for Heisei era) Output: “平成”
Input: “㈱” Output: “(株)”
- replace_hyphens: bool | list[Literal['jisx0208_90_windows', 'jisx0201']] = False
Replace various dash or hyphen symbols with those common in Japanese writing.
- Example:
Input: “2019—2020” (em dash) Output: “2019-2020” (hyphen-minus)
Input: “A–B” (en dash) Output: “A-B”
- replace_ideographic_annotations: bool = False
Replace ideographic annotations used in the traditional method of Chinese-to-Japanese translation devised in ancient Japan.
- Example:
Input: “㆖㆘” (ideographic annotations) Output: “上下”
- replace_mathematical_alphanumerics: bool = False
Replace mathematical alphanumerics with their plain ASCII equivalents.
- Example:
Input: “𝐀𝐁𝐂” (mathematical bold) Output: “ABC” Input: “𝟏𝟐𝟑” (mathematical bold digits) Output: “123”
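A comparable effect can be sketched with Unicode compatibility normalization restricted to the Mathematical Alphanumeric Symbols block. This is an assumption-laden illustration rather than yosina's implementation: `fold_math_alphanumerics` is a hypothetical name, and mathematical Greek letters in the same block fold to Greek letters, not ASCII.

```python
import unicodedata


def fold_math_alphanumerics(text: str) -> str:
    """Fold characters in U+1D400-U+1D7FF to their compatibility equivalents."""
    out = []
    for ch in text:
        if 0x1D400 <= ord(ch) <= 0x1D7FF:
            # NFKC maps e.g. MATHEMATICAL BOLD CAPITAL A to plain "A".
            out.append(unicodedata.normalize("NFKC", ch))
        else:
            # Characters outside the block are left untouched.
            out.append(ch)
    return "".join(out)


print(fold_math_alphanumerics("\U0001D400\U0001D401\U0001D402"))  # ABC
print(fold_math_alphanumerics("\U0001D7CF\U0001D7D0\U0001D7D1"))  # 123
```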
- replace_radicals: bool = False
Replace codepoints for the Kangxi radicals whose glyphs resemble those of CJK ideographs with the CJK ideograph counterparts.
- Example:
Input: “⾔⾨⾷” (Kangxi radicals) Output: “言門食” (CJK ideographs)
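One way to picture this replacement is a lookup table from the Kangxi Radicals block (U+2F00 to U+2FD5) to unified ideographs. The sketch below derives such a table from the compatibility decompositions in Python's `unicodedata`; the table and the `replace_kangxi_radicals` helper are hypothetical and may differ from the mapping yosina ships.

```python
import unicodedata


def build_kangxi_table() -> dict:
    """Map each Kangxi radical to the ideograph named in its <compat> decomposition."""
    table = {}
    for cp in range(0x2F00, 0x2FD6):
        parts = unicodedata.decomposition(chr(cp)).split()
        # Expected form: "<compat> 8A00"
        if len(parts) == 2 and parts[0] == "<compat>":
            table[cp] = chr(int(parts[1], 16))
    return table


KANGXI_TO_IDEOGRAPH = build_kangxi_table()


def replace_kangxi_radicals(text: str) -> str:
    return text.translate(KANGXI_TO_IDEOGRAPH)


# U+2F94, U+2FA8, U+2FB7 become U+8A00, U+9580, U+98DF.
print(replace_kangxi_radicals("\u2f94\u2fa8\u2fb7") == "\u8a00\u9580\u98df")  # True
```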
- replace_spaces: bool = False
Replace various space characters with plain spaces or empty strings.
- Example:
Input: “A B” (ideographic space U+3000) Output: “A B” (half-width space) Input: “A B” (non-breaking space U+00A0) Output: “A B” (regular space)
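As a rough illustration of replacing space-like characters with either a plain space or nothing, the sketch below uses a small translation table. Which characters map to a space and which are dropped is an assumption made for this example, not the library's actual table, and `normalize_spaces` is a hypothetical name.

```python
# Hypothetical table: the choice of entries is an assumption for this sketch.
SPACE_TABLE = str.maketrans({
    "\u00a0": " ",   # NO-BREAK SPACE
    "\u2003": " ",   # EM SPACE
    "\u3000": " ",   # IDEOGRAPHIC SPACE
    "\u200b": None,  # ZERO WIDTH SPACE, removed entirely
})


def normalize_spaces(text: str) -> str:
    """Replace space-like characters using the table above."""
    return text.translate(SPACE_TABLE)


print(normalize_spaces("A\u3000B") == "A B")  # True
print(normalize_spaces("A\u200bB") == "AB")   # True
```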
- replace_suspicious_hyphens_to_prolonged_sound_marks: bool = False
Replace “suspicious” hyphens with prolonged sound marks, and vice versa.
- Example:
Input: “データーベース” Output: “データーベース” (no change when followed by ー) Input: “スーパ−” (with hyphen-minus) Output: “スーパー” (becomes prolonged sound mark)
- to_fullwidth: bool | Literal['u005c-as-yen-sign'] = False
Replace half-width characters with their full-width equivalents. Specify “u005c-as-yen-sign” to treat the backslash (U+005C) as the yen sign, as in JIS X 0201.
- Example:
Input: “ABC123” Output: “ＡＢＣ１２３”
Input: “ｶﾀｶﾅ” Output: “カタカナ”
- to_halfwidth: bool | Literal['hankaku-kana'] = False
Replace full-width characters with their half-width equivalents. Specify “hankaku-kana” to also convert full-width katakana to half-width katakana.
- Example:
Input: “ＡＢＣ１２３” Output: “ABC123”
Input: “カタカナ” (with hankaku-kana) Output: “ｶﾀｶﾅ”
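For the ASCII range, full-width and half-width forms differ by a fixed code point offset, which the following sketch exploits. It covers only printable ASCII and the space character; kana conversion and the yen sign option are out of scope, and both function names are hypothetical rather than part of yosina.

```python
_FULLWIDTH_OFFSET = 0xFEE0  # distance from U+0021 to U+FF01


def ascii_to_fullwidth(text: str) -> str:
    """Convert printable ASCII to full-width forms (space becomes U+3000)."""
    out = []
    for ch in text:
        cp = ord(ch)
        if cp == 0x20:
            out.append("\u3000")
        elif 0x21 <= cp <= 0x7E:
            out.append(chr(cp + _FULLWIDTH_OFFSET))
        else:
            out.append(ch)
    return "".join(out)


def fullwidth_to_ascii(text: str) -> str:
    """Inverse of ascii_to_fullwidth for the same character range."""
    out = []
    for ch in text:
        cp = ord(ch)
        if cp == 0x3000:
            out.append(" ")
        elif 0xFF01 <= cp <= 0xFF5E:
            out.append(chr(cp - _FULLWIDTH_OFFSET))
        else:
            out.append(ch)
    return "".join(out)


print(ascii_to_fullwidth("ABC123") == "\uff21\uff22\uff23\uff11\uff12\uff13")  # True
print(fullwidth_to_ascii("\uff21\uff22\uff23\uff11\uff12\uff13"))              # ABC123
```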
- yosina.make_transliterator(configs_or_recipe: list[tuple[Literal['circled-or-squared', 'combined', 'hira-kata-composition', 'hyphens', 'ideographic-annotations', 'ivs-svs-base', 'jisx0201-and-alike', 'kanji-old-new', 'mathematical-alphanumerics', 'prolonged-sound-marks', 'radicals', 'spaces'], dict[str, Any]] | Literal['circled-or-squared', 'combined', 'hira-kata-composition', 'hyphens', 'ideographic-annotations', 'ivs-svs-base', 'jisx0201-and-alike', 'kanji-old-new', 'mathematical-alphanumerics', 'prolonged-sound-marks', 'radicals', 'spaces']] | TransliteratorRecipe) -> Callable[[str], str]
Frontend convenience function to create a string-to-string transliterator.
This is the main entry point for the library. It accepts either a recipe or a list of transliterator configs and returns a function that can transliterate strings.
- Parameters:
configs_or_recipe – Either a list of TransliteratorConfig/string names or a TransliteratorRecipe object
- Returns:
A function that takes a string and returns a transliterated string
- Example:
Using a recipe:
>>> from yosina import make_transliterator, TransliteratorRecipe
>>> recipe = TransliteratorRecipe(
...     kanji_old_new=True,
...     replace_spaces=True
... )
>>> transliterator = make_transliterator(recipe)
>>> result = transliterator("some japanese text")
Using configs directly:
>>> configs = [("kanji-old-new", {}), ("spaces", {})]
>>> transliterator = make_transliterator(configs)
>>> result = transliterator("some japanese text")
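The idea of a transliterator chain, where each stage feeds the next and the result is a single string-to-string function, can be sketched without the library. Everything below is hypothetical: `chain`, the two toy stages, and the three-entry kanji table are illustrative stand-ins and say nothing about how yosina wires its stages internally.

```python
from functools import reduce
from typing import Callable, List

Transliterate = Callable[[str], str]


def chain(stages: List[Transliterate]) -> Transliterate:
    """Return a function that applies each stage in order."""
    def run(text: str) -> str:
        return reduce(lambda acc, stage: stage(acc), stages, text)
    return run


# Toy stage 1: a three-entry old-to-new kanji table (illustrative only).
_OLD_TO_NEW = str.maketrans({"舊": "旧", "體": "体", "變": "変"})


def toy_kanji_old_new(text: str) -> str:
    return text.translate(_OLD_TO_NEW)


# Toy stage 2: ideographic space to plain space.
def toy_spaces(text: str) -> str:
    return text.replace("\u3000", " ")


transliterate = chain([toy_kanji_old_new, toy_spaces])
print(transliterate("舊字體\u3000の變換"))  # 旧字体 の変換
```

Ordering matters in such a chain: a stage only sees the output of the stages before it.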
Character object
Character array building and string conversion utilities.
- class yosina.chars.Char(c: str, offset: int, source: Char | None = None)
Represents a character with metadata for transliteration.
- c: str
The character string
- offset: int
The offset position in the original text
- yosina.chars.build_char_list(input_str: str) -> list[Char]
Build a list of characters from a string, handling IVS/SVS sequences.
This function properly handles Ideographic Variation Sequences (IVS) and Standardized Variation Sequences (SVS) by combining base characters with their variation selectors into single Char objects.
- Parameters:
input_str – The input string to convert to character array
- Returns:
A list of Char objects representing the input string, with a sentinel empty character at the end
- yosina.chars.from_chars(chars: Iterable[Char]) -> str
Convert an iterable of characters back to a string.
This function filters out sentinel characters (empty strings) that are used internally by the transliteration system.
- Parameters:
chars – An iterable of Char objects
- Returns:
A string composed of the non-empty characters
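The round trip described by build_char_list and from_chars can be sketched as follows. This is a simplified stand-in, not the library's code: `SketchChar`, `sketch_build_char_list`, and `sketch_from_chars` are hypothetical names, offsets are assumed to be Python string indices (code points), and only a single trailing variation selector is attached to each base character.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass
class SketchChar:
    c: str                                # the character, possibly with a selector
    offset: int                           # assumed: index into the original str
    source: Optional["SketchChar"] = None


def _is_selector(ch: str) -> bool:
    cp = ord(ch)
    return 0xFE00 <= cp <= 0xFE0F or 0xE0100 <= cp <= 0xE01EF


def sketch_build_char_list(input_str: str) -> List[SketchChar]:
    chars = []
    i = 0
    while i < len(input_str):
        j = i + 1
        # Keep a base character and its variation selector in one unit.
        if j < len(input_str) and _is_selector(input_str[j]):
            j += 1
        chars.append(SketchChar(input_str[i:j], i))
        i = j
    chars.append(SketchChar("", i))  # sentinel empty character at the end
    return chars


def sketch_from_chars(chars: Iterable[SketchChar]) -> str:
    # Sentinels (empty strings) are filtered out when joining.
    return "".join(ch.c for ch in chars if ch.c)


text = "\u845b\U000E0100\u3042"  # 葛 + IVS U+E0100, then あ
units = sketch_build_char_list(text)
print([u.offset for u in units])        # [0, 2, 3]
print(sketch_from_chars(units) == text)  # True
```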