Alphabetic 0.0.7

A Python module for retrieving script types of writing systems including alphabets, abjads, abugidas, syllabaries, logographs, featurals as well as Latin script codes.
Description / Background
Alphabetic is a small project that was born out of the need to find out the alphabet of different languages for a private NLP project. Determining the alphabet (or other script types) of a language plays an important role in a variety of NLP tasks and can be used, for example, to classify the language of a given text, normalize it by removing noisy/random strings, apply fine-grained regex pattern matching, and more.
Core functionality in a nutshell: given a specific language, Alphabetic first translates its name internally into a corresponding ISO code (either ISO 639-2/3 or ISO 15924) and then outputs the corresponding script, which is categorized according to the writing systems listed in the following table (adapted from an external source):

| Writing system | Each symbol represents | Example |
| --- | --- | --- |
| Abjad | Consonant | Arabic alphabet |
| Abugida | Consonant accompanied by a specific vowel; modifying symbols represent other vowels | Indian Devanagari |
| Alphabet | Consonant or vowel | Latin alphabet |
| Featural system | Distinctive feature of segment | Korean Hangul |
| Logographic | Word or morpheme as well as syllable | Chinese characters |
| Syllabary | Syllable | Japanese kana |

The distinction between script types matters in practice, because ignoring it can lead to unexpected behavior. If you have worked with Python's built-in string methods, you may have noticed the following questionable result:
print("伏伐休众优伙".isalpha())

# True

The answer True could be interpreted as meaning that the string, which is written in Chinese, is alphabetic. From a linguistic point of view, however, this is incorrect, as Chinese has no alphabet (its writing system is logographic). Conversely, the following string is written in Devanagari, which is an abugida rather than an alphabet, and here isalpha() returns False:
print("अमित".isalpha())

# False
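
The two isalpha() results above can be explained by looking at Unicode general categories. The following is a minimal sketch using only the standard library (the helper name categories is an assumption made for illustration and is not part of Alphabetic). According to the Python documentation, str.isalpha() only accepts characters whose category starts with "L" (letters), so a single combining vowel sign is enough to make the result False:

```python
import unicodedata


def categories(text):
    """Return the Unicode general category of each character in text."""
    return [unicodedata.category(ch) for ch in text]


# Chinese ideographs are all classified as "Lo" (Letter, other),
# which is why isalpha() accepts them.
print(categories("伏伐休众优伙"))

# The Devanagari string contains a dependent vowel sign with category "Mc"
# (Mark, spacing combining), which isalpha() rejects.
print(categories("अमित"))
```

In other words, isalpha() answers the question "are these all Unicode letters?", not "is this text written in an alphabet?".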

For this and other use cases Alphabetic can be employed.
Installation
The easiest way to install Alphabetic is to use pip, where you can choose between PyPI and this repository:

pip install alphabetic
pip install git+https://github.com/Halvani/alphabetic.git

The latter pulls and installs the latest commit from this repository along with the required Python dependencies. Note that the repository is updated regularly, while PyPI packages are released less frequently (primarily after major bug fixes, refactoring, etc.).

Usage
A simple lookup of a language's script (e.g., alphabet) can be performed as follows:
from alphabetic import WritingSystem

ws = WritingSystem()
ws.by_language(ws.Language.Hawaiian)

# {"Hawaiian": ["A", "E", "H", "I", "K", "L", "M", "N", "O", "P", "U", "W", "a", "e", "h", "i", "k", "l", "m", "n", "o", "p", "u", "w", "ʻ"]}

By default, the output of by_language is a dictionary containing the name and the corresponding script of the selected language. To retrieve only the latter, use ws.by_language(ws.Language.Hawaiian, as_list=True). However, some languages such as Japanese have not one but multiple writing systems. In such a case, the output would look like this:
ws.by_language(ws.Language.Japanese)

# {"Japanese": {"Hiragana": ["あ", "い", ...], "Kanji": ["万", "丁", ...], "Katakana": ["ア", "イ", ...]}}
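
Since the result can take two shapes (one list per language, or one list per script of a language), downstream code may want to normalize it. The following is a small sketch of such a helper. The name flatten_scripts is hypothetical and not part of Alphabetic, and the input shapes are assumed from the two example outputs above, using shortened sample data:

```python
def flatten_scripts(result):
    """Collect all characters of a by_language-style dictionary into one list."""
    characters = []
    for scripts in result.values():
        if isinstance(scripts, dict):
            # Multi-script language: {"Language": {"Script": [...], ...}}
            for chars in scripts.values():
                characters.extend(chars)
        else:
            # Single-script language: {"Language": [...]}
            characters.extend(scripts)
    return characters


# Shortened sample data in the two shapes shown above.
single = {"Hawaiian": ["A", "E", "a", "e"]}
multi = {"Japanese": {"Hiragana": ["あ", "い"], "Katakana": ["ア", "イ"]}}

print(flatten_scripts(single))
print(flatten_scripts(multi))
```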

In case you want a pretty print of the output, use:
ws.pretty_print(ws.by_language(ws.Language.Dutch))

# A B C D E F G H I J K L M N O P Q R S T U V W X Y Z a b c d e f g h i j k l m n o p q r s t u v w x y z

If the resulting script represents an alphabet, the result can be further filtered in terms of:

Letter Casing:

ws.pretty_print(ws.by_language(ws.Language.Bosnian))

# Ђ Ј Љ Њ Ћ Џ А Б В Г Д Е Ж З И К Л М Н О П Р С Т У Ф Х Ц Ч Ш а б в г д е ж з и к л м н о п р с т у ф х ц ч ш ђ ј љ њ ћ џ

ws.pretty_print(ws.by_language(ws.Language.Bosnian, letter_case=ws.LetterCase.Upper))

# Ђ Ј Љ Њ Ћ Џ А Б В Г Д Е Ж З И К Л М Н О П Р С Т У Ф Х Ц Ч Ш

Multigraphs:

ws.pretty_print(ws.by_language(ws.Language.Aleut))

# A B Ch D E F G H Hl Hm Hn Hng I J K L M N Ng O P Q R S T U Uu V W X X̂ Y Z a b ch d e f g h hl hm hn hng i j k l m n ng o p q r s t u uu v w x x̂ y z Á á Ĝ ĝ

ws.pretty_print(ws.by_language(ws.Language.Aleut, strip_multigraphs=True, multigraphs_size=ws.MultigraphSize.All))

# A B D E F G H I J K L M N O P Q R S T U V W X Y Z a b d e f g h i j k l m n o p q r s t u v w x y z Á á Ĝ ĝ

Diacritics:

ws.pretty_print(ws.by_language(ws.Language.Czech))

# A B C C h D E F G H I J K L M N O P Q R S T U V W X Y Z a b c c h d e f g h i j k l m n o p q r s t u v w x y z Á É Í Ó Ú Ý á é í ó ú ý Č č Ď ď Ě ě Ň ň Ř ř Š š Ť ť Ů ů Ž ž

ws.pretty_print(ws.by_language(ws.Language.Czech, strip_diacritics=True))

# A B C C h D E F G H I J K L M N O P Q R S T U V W X Y Z a b c c h d e f g h i j k l m n o p q r s t u v w x y z
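
For comparison, diacritics can also be removed from arbitrary strings with the standard library alone. The sketch below is a common Unicode-normalization technique and is not necessarily how Alphabetic implements strip_diacritics. The function name is an assumption chosen for illustration:

```python
import unicodedata


def strip_diacritics(text):
    """Remove combining marks after canonical decomposition (NFD)."""
    decomposed = unicodedata.normalize("NFD", text)
    # unicodedata.combining() returns a non-zero class for combining marks.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(strip_diacritics("dobré ráno"))
print(strip_diacritics("Č ě Š ů ž"))
```

Note that this approach only affects letters that have a canonical decomposition. Letters such as "ł" or "ø" are separate code points without a decomposition and are left unchanged.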

For certain languages such as Chinese (simplified), which have a language code but no alphabet, a fallback strategy maps the ISO 639-2 language code to an ISO 15924 script code (in this example: "chi" --> "Hans"). As a user, you do not have to handle this manually; simply pass the language as usual:
ws.pretty_print(ws.by_language(ws.Language.Chinese_Simplified))

# 㑇㑊㕮㘎㙍㙘㙦㛃㛚㛹㟃㠇㠓㤘㥄㧐 ...
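
The fallback described above can be pictured as a two-step lookup. The sketch below is illustrative only: the three dictionaries and the function lookup_script are hypothetical stand-ins with heavily truncated data and do not reflect Alphabetic's internal data structures:

```python
# Hypothetical, truncated stand-ins for the underlying data.
ALPHABETS_BY_LANGUAGE_CODE = {"haw": ["A", "E", "H"]}
SCRIPTS_BY_SCRIPT_CODE = {"Hans": ["㑇", "㑊", "㕮"]}
LANGUAGE_TO_SCRIPT_CODE = {"chi": "Hans"}  # ISO 639-2 -> ISO 15924


def lookup_script(language_code):
    """Return the characters for a language code, falling back to its script code."""
    if language_code in ALPHABETS_BY_LANGUAGE_CODE:
        return ALPHABETS_BY_LANGUAGE_CODE[language_code]
    # No alphabet stored under the language code: try the mapped script code.
    script_code = LANGUAGE_TO_SCRIPT_CODE.get(language_code)
    if script_code is None:
        raise KeyError(f"No script known for language code {language_code!r}")
    return SCRIPTS_BY_SCRIPT_CODE[script_code]


print(lookup_script("haw"))  # found directly
print(lookup_script("chi"))  # found via the "chi" -> "Hans" fallback
```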

Another important use case is to check whether a given sequence of characters represents a specific script of a writing system. This can be achieved as follows:
ws.is_abjad("גדולים או בינוניים") # True
ws.is_alphabet("גדולים או בינוניים") # False

ws.is_alphabet("dobré ráno") # True
ws.is_abjad("dobré ráno") # False

ws.is_logographic("早上好") # True
ws.is_syllabary("早上好") # False

ws.is_abugida("ምልካም እድል") # True
ws.is_abjad("ምልካም እድል") # False

ws.is_featural("좋은 아침") # True
ws.is_logographic("좋은 아침") # False

ws.is_alphabet("დილა მშვიდობისა") # True
ws.is_abjad("დილა მშვიდობისა") # False
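
To give an intuition for what such checks involve, here is a rough standard-library approximation. It is a sketch under stated assumptions, not Alphabetic's implementation: it derives the script from the first word of each character's Unicode name, and the SCRIPT_TYPES table as well as the function script_types are hypothetical and cover only the scripts used in the examples above:

```python
import unicodedata

# Hypothetical mapping: first word of the Unicode character name -> script type.
SCRIPT_TYPES = {
    "LATIN": "alphabet",
    "GEORGIAN": "alphabet",
    "HEBREW": "abjad",
    "ETHIOPIC": "abugida",
    "HANGUL": "featural",
    "CJK": "logographic",
}


def script_types(text):
    """Return the set of script types found among the letters and marks of text."""
    found = set()
    for ch in unicodedata.normalize("NFC", text):
        # Skip whitespace, digits, punctuation and other non-letter characters.
        if unicodedata.category(ch)[0] not in ("L", "M"):
            continue
        prefix = unicodedata.name(ch, "UNKNOWN").split()[0]
        found.add(SCRIPT_TYPES.get(prefix, "unknown"))
    return found


print(script_types("dobré ráno"))
print(script_types("גדולים או בינוניים"))
print(script_types("좋은 아침"))
```

Unlike this name-based heuristic, a curated character list per script can also distinguish languages that share a script.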

Furthermore, you can also use Alphabetic to remove all characters from a given string that do not occur within the supported script types (abjads, abugidas, alphabets, etc.):
ws.strip_non_script_characters("SΓpλrώaσcσhεeςn!", ws.Language.German)
# "Sprachen" (languages)

ws.strip_non_script_characters("SΓpλrώaσcσhεeςn!", ws.Language.Greek)
# "Γλώσσες" (languages)

If no language is given, all characters of all supported script types are considered:
ws.strip_non_script_characters("#jüste BAD/good tösté X4567Y ßÜ משהו действует?!")
# Result: 'jüste BADgood tösté XY ßÜ משהו действует'
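
Conceptually, this kind of filtering keeps only the characters that belong to an allowed set. The sketch below illustrates the idea with the standard library. The helper keep_only is hypothetical, and the Latin and Greek character sets are simple stand-ins for the lists Alphabetic would supply:

```python
import string


def keep_only(text, allowed):
    """Return text with every character removed that is not in allowed."""
    allowed = set(allowed)
    return "".join(ch for ch in text if ch in allowed)


# Stand-in character sets: ASCII letters and the letters of the
# Greek and Coptic Unicode block (U+0370 to U+03FF).
latin = string.ascii_letters
greek = [chr(cp) for cp in range(0x0370, 0x0400) if chr(cp).isalpha()]

mixed = "SΓpλrώaσcσhεeςn!"
print(keep_only(mixed, latin))
print(keep_only(mixed, greek))
```

In this simplified version, whitespace is removed as well unless it is explicitly added to the allowed set.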

If you wish, you can also list the characters of a language based on a specified Unicode range:
ws.generate_all_characters_in_range("\u0400-\u04FF") # Cyrillic Unicode block, used e.g. for Bulgarian

# ['Ѐ', 'Ё', 'Ђ', 'Ѓ', 'Є', ..., 'Ӽ', 'ӽ', 'Ӿ', 'ӿ']
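
Enumerating a Unicode range is straightforward with the built-ins ord and chr. The following sketch shows the idea. The function characters_in_range is a hypothetical helper that takes the two endpoints separately, and it is not the signature of generate_all_characters_in_range:

```python
def characters_in_range(first, last):
    """Return every character from code point first to last, inclusive."""
    return [chr(cp) for cp in range(ord(first), ord(last) + 1)]


# The Cyrillic block spans U+0400 to U+04FF.
cyrillic = characters_in_range("\u0400", "\u04FF")
print(len(cyrillic))
print(cyrillic[:5], cyrillic[-4:])
```

A block usually contains more characters than any single language uses, so the result is a superset of, for example, the Bulgarian alphabet.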

Features

Currently 151 languages and corresponding scripts are supported, with more to follow over time;

In total, Alphabetic covers six types of writing systems: abjads, abugidas, alphabets, syllabaries, logographic and featural systems;

Besides (true) writing systems, Alphabetic also offers Latin script representations (e.g., Morse or NATO Phonetic Alphabet);

Alphabetic includes a complete list of all ISO 639-1/2/3 as well as ISO 15924 codes and enables bidirectional translation between language names and codes;

At the heart of Alphabetic are JSON files that can be used independently of the respective programming language or application;

Consistently documented source code.
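
As an illustration of the bidirectional translation and the JSON-based design mentioned in the list above, the following sketch loads a tiny name-to-code mapping from JSON and inverts it. The JSON layout and the function translate are assumptions made for this example and do not describe the actual file format shipped with Alphabetic:

```python
import json

# Hypothetical JSON snippet with three language names and their ISO 639-2/3 codes.
RAW = '{"Czech": "ces", "Georgian": "kat", "Hawaiian": "haw"}'

NAME_TO_CODE = json.loads(RAW)
CODE_TO_NAME = {code: name for name, code in NAME_TO_CODE.items()}


def translate(value):
    """Translate a language name to its code, or a code to its language name."""
    if value in NAME_TO_CODE:
        return NAME_TO_CODE[value]
    if value in CODE_TO_NAME:
        return CODE_TO_NAME[value]
    raise KeyError(f"Unknown language name or code: {value!r}")


print(translate("Hawaiian"))
print(translate("kat"))
```

Because the data is plain JSON, the same mapping could be read from any other programming language.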

Supported Languages
| Language | ISO 639-2/3 code |
| --- | --- |
| Abkhazian | abk |
| Afar | aar |
| Afrikaans | afr |
| Albanian | sqi |
| Aleut | ale |
| Amharic | amh |
| Angika | anp |
| Arabic | ara |
| Arapaho | arp |
| Armenian | arm |
| Assamese | asm |
| Avar | ava |
| Avestan | ave |
| Balochi | bal |
| Bambara | bam |
| Bashkir | bak |
| Basque | baq |
| Bavarian | bar |
| Belarusian | bel |
| Bislama | bis |
| Boko | bqc |
| Boro | brx |
| Bosnian | bos |
| Breton | bre |
| Bulgarian | bul |
| Buryat | bua |
| Catalan | cat |
| Chamorro | cha |
| Chechen | che |
| Cherokee | chr |
| Chichewa | nya |
| Chinese_Simplified | chi |
| Chukchi | ckt |
| Chuvash | chv |
| Cimbrian | cim |
| Cornish | cor |
| Corsican | cos |
| Cree | cre |
| Croatian | hrv |
| Czech | ces |
| Danish | dan |
| Dungan | dng |
| Dutch | nld |
| Dzongkha | dzo |
| Elfdalian | ovd |
| English | eng |
| Esperanto | epo |
| Estonian | est |
| Ewe | ewe |
| Faroese | fao |
| Fijian | fij |
| Finnish | fin |
| Flemish | dut |
| French | fra |
| Georgian | kat |
| German | deu |
| Greek | gre |
| Guarani | grn |
| Haitian_Creole | hat |
| Hausa | hau |
| Hawaiian | haw |
| Hebrew | heb |
| Herero | her |
| Hindi | hin |
| Icelandic | isl |
| Igbo | ibo |
| Indonesian | ind |
| Irish | gle |
| Istro_Romanian | ruo |
| Italian | ita |
| Japanese | jpn |
| Javanese | jav |
| Jeju | jje |
| Kabardian | kbd |
| Kanuri | kau |
| Kashubian | csb |
| Kazakh | kaz |
| Kinyarwanda | kin |
| Kirghiz | kir |
| Komi | kpv |
| Korean | kor |
| Kumyk | kum |
| Kurmanji | kmr |
| Latin | lat |
| Latvian | lav |
| Lezghian | lez |
| Lingala | lin |
| Lithuanian | lit |
| Luganda | lug |
| Luxembourgish | ltz |
| Macedonian | mkd |
| Malagasy | mlg |
| Malay | may |
| Malayalam | mal |
| Maltese | mlt |
| Manx | glv |
| Maori | mao |
| Mari | chm |
| Marshallese | mah |
| Moksha | mdf |
| Moldovan | rum |
| Mongolian | mon |
| Mru | mro |
| Nepali | nep |
| Norwegian | nor |
| Occitan | oci |
| Oromo | orm |
| Osage | osa |
| Parthian | xpr |
| Pashto | pus |
| Persian | per |
| Phoenician | phn |
| Polish | pol |
| Portuguese | por |
| Punjabi_Gurmukhī | _pan |
| Punjabi_Shahmukhi | pan |
| Quechua | que |
| Rohingya | rhg |
| Russian | rus |
| Samaritan | smp |
| Samoan | smo |
| Sango | sag |
| Sanskrit | san |
| Scottish_Gaelic | gla |
| Serbian | srp |
| Slovak | slo |
| Slovenian | slv |
| Somali | som |
| Sorani | ckb |
| Spanish | spa |
| Sundanese | sun |
| Swedish | swe |
| Swiss_German | gsw |
| Tajik | tgk |
| Tatar | tat |
| Turkish | tur |
| Turkmen | tuk |
| Tuvan | tyv |
| Twi | twi |
| Ugaritic | uga |
| Ukrainian | ukr |
| Uzbek | uzb |
| Venda | ven |
| Vengo | bav |
| Volapük | vol |
| Welsh | wel |
| Wolof | wol |
| Yakut | sah |
| Yiddish | yid |
| Zeeuws | zea |
| Zulu | zul |

Supported Abjads
| Abjad | ISO code |
| --- | --- |
| Arabic | Arab |
| Balochi | bal |
| Hebrew | Hebr |
| Hebrew_Samaritan | Samr |
| Parthian | Prti |
| Pashto | pus |
| Persian | per |
| Phoenician | Phnx |
| Punjabi_Shahmukhi | pan |
| Sorani | ckb |
| Ugaritic | Ugar |
| Yiddish | yid |

Supported Abugidas
| Abugida | ISO code |
| --- | --- |
| Amharic | amh |
| Angika | anp |
| Assamese | asm |
| Boro | brx |
| Devanagari | Deva |
| Dzongkha | dzo |
| Ethiopic | Ethi |
| Hindi | hin |
| Malayalam | Mlym |
| Nepali | nep |
| Punjabi_Gurmukhī | Guru |
| Sanskrit | san |
| Sundanese | Sund |
| Thaana | Thaa |

Supported Syllabaries
| Syllabary | ISO code |
| --- | --- |
| Avestan | Avst |
| Carian | Cari |
| Cherokee | Cher |
| Hiragana | Hira |
| Katakana | Kana |
| Lydian | Lydi |

Supported Logographics
| Logographic | ISO code |
| --- | --- |
| Chinese_Simplified | Hans |
| Kanji | Hani |

Supported featural writing systems
| Featural | ISO code |
| --- | --- |
| Hangul | Hang |

Design Considerations / Limitations
Delving deeper into the world of writing systems, one quickly runs into numerous difficulties in organizing the various script types. This is particularly true for non-Latin scripts with their many variants and forms. Therefore, various design decisions were made to keep Alphabetic as simple and usable as possible.

For languages with several alphabet variants, the more modern or the most frequently encountered form was used, based on sources such as Omniglot, Wikipedia and Britannica.

For Arabic scripts where the character form depends on its position, the so-called isolated forms were used.

Multigraphs are considered part of the scripts. However, they can be suppressed if desired. The same applies to diacritical marks (e.g., acute, breve, cedilla, grave, etc.).

The function is_abugida is not fully functional because not all vowel forms are integrated yet.

For languages with so-called non-bicameral scripts such as Hebrew or Arabic, which do not distinguish between upper and lower case, the letter_case argument is ignored and the entire alphabet is returned instead:

ws.pretty_print(ws.by_language(ws.Language.Hebrew, letter_case=ws.LetterCase.Upper))

# א ב ג ד ה ו ז ח ט י כ ך ל מ ם נ ן ס ע פ ף צ ץ ק ר ש ת

ws.pretty_print(ws.by_language(ws.Language.Arabic, letter_case=ws.LetterCase.Lower))

# ا ب ة ت ث ج ح خ د ذ ر ز س ش ص ض ط ظ ع غ ف ق ك ل م ن ه و ي
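
The behavior for non-bicameral scripts can be approximated with Python's built-in case methods. The sketch below is a simplified illustration and not Alphabetic's actual logic. The helpers is_bicameral and filter_case are hypothetical names:

```python
def is_bicameral(alphabet):
    """Return True if at least one character has distinct upper and lower forms."""
    return any(ch.upper() != ch.lower() for ch in alphabet)


def filter_case(alphabet, upper=True):
    """Filter an alphabet by letter case, or return it unchanged if it has no case."""
    if not is_bicameral(alphabet):
        # No case distinction (e.g., Hebrew, Arabic): ignore the filter.
        return list(alphabet)
    keep = str.isupper if upper else str.islower
    return [ch for ch in alphabet if keep(ch)]


print(filter_case(["A", "a", "B", "b"], upper=True))
print(filter_case(["א", "ב", "ג"], upper=True))
```

This sketch treats multigraphs and caseless symbols within a bicameral alphabet only superficially, so a real implementation would need additional rules for them.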

Contribution
If you like this project, you are welcome to support it, e.g. by testing it or providing additional languages (there is a lot to do with regard to the remaining languages). Feel free to fork the repository and create a pull request to suggest and collaborate on changes.
Disclaimer
Although this project has been carried out with great care, no liability is accepted for the completeness and accuracy of all the underlying data. The use of Alphabetic for integration into production systems is at your own risk!
Furthermore, please note that this project is still in its initial phase. The code structure may therefore change over time.
Citation
If you find this repository helpful, feel free to cite it in your paper or project:
@software{Halvani_Constituent_Treelib:2024,
author = {Halvani, Oren},
title = {{Alphabetic - A Python module for retrieving script types of writing systems including alphabets, abjads, abugidas, syllabaries, logographs, featurals as well as Latin script codes}},
doi = {https://doi.org/10.5281/zenodo.11580510},
month = jun,
url = {https://github.com/Halvani/alphabetic},
version = {0.0.5},
year = {2024}
}

License
The Alphabetic package is released under the Apache-2.0 license. See LICENSE for further details.
Last Remarks
As is usual with open source projects, we developers do not earn any money with what we do, but are primarily interested in giving something back to the community with fun, passion and joy. Nevertheless, we would be very happy if you rewarded all the time that has gone into the project with just a small star 🤗
