alphabetic 0.0.7

Creator: bradpython12

Alphabetic
A Python module for retrieving script types of writing systems including alphabets, abjads, abugidas, syllabaries, logographs, featurals as well as Latin script codes.
Description / Background
Alphabetic is a small project that was born out of the need to find out the alphabet of different languages for a private NLP project. Determining the alphabet (or other script types) of a language plays an important role in a variety of NLP tasks and can be used, for example, to classify the language of a given text, normalize it by removing noisy/random strings, apply fine-grained regex pattern matching, and more.
Core functionality in a nutshell: given a specific language, Alphabetic first translates its name internally into a corresponding ISO code (either ISO 639-2/3 or ISO 15924) and then outputs the corresponding script, which is categorized according to the writing systems listed in the following table (adapted from here):



| Writing system | Each symbol represents | Example |
| --- | --- | --- |
| Abjad | Consonant | Arabic alphabet |
| Abugida | Consonant accompanied by a specific vowel; modifying symbols represent other vowels | Indian Devanagari |
| Alphabet | Consonant or vowel | Latin alphabet |
| Featural system | Distinctive feature of segment | Korean Hangul |
| Logographic | Word or morpheme as well as syllable | Chinese characters |
| Syllabary | Syllable | Japanese kana |



The distinction between script types matters in practice: ignoring it can lead to unexpected behavior in certain application scenarios. If you have worked with Python's built-in string methods, you may have noticed the following questionable result:
print("伏伐休众优伙".isalpha())

# True

The answer True could be interpreted as meaning that the string, which is written in Chinese, is alphabetic. From a linguistic point of view, however, this is incorrect, as Chinese has no alphabet (the Chinese writing system is logographic). Conversely, the following string is written in Devanagari, which is not an alphabet but an abugida, and here isalpha returns False:
print("अमित".isalpha())

# False

For this and other use cases Alphabetic can be employed.
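Both results above follow from how str.isalpha is defined: it looks at Unicode general categories (the letter categories Lu, Ll, Lt, Lm, Lo), not at script types. The following sketch uses only the standard library to make this visible; the helper name char_categories is illustrative and not part of Alphabetic.

```python
import unicodedata


def char_categories(text):
    """Return (code point, general category) pairs for each character."""
    return [(f"U+{ord(ch):04X}", unicodedata.category(ch)) for ch in text]


# Every Chinese character here has category "Lo" (Letter, other),
# so isalpha() returns True although the script is logographic.
print(char_categories("伏伐休众优伙"))

# The Devanagari vowel sign U+093F has category "Mc" (spacing combining mark),
# which isalpha() does not count as a letter, so the whole string fails.
print(char_categories("अमित"))
```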
Installation
The easiest way to install Alphabetic is to use pip, where you can choose between PyPI and this repository:

pip install alphabetic
pip install git+https://github.com/Halvani/alphabetic.git

The latter will pull and install the latest commit from this repository as well as the required Python dependencies. Note that the repository is updated regularly, while PyPI packages are released less frequently (primarily after major bug fixes, refactoring, etc.).
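Because the repository and the PyPI release can differ, it may help to check which version is actually installed. This is a minimal sketch using the standard library's importlib.metadata (Python 3.8+); the helper name installed_version is hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of `package`, or None if absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Prints a version string if the package is installed, otherwise None.
print(installed_version("alphabetic"))
```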

Usage
A simple lookup of a language's script (e.g., alphabet) can be performed as follows:
from alphabetic import WritingSystem

ws = WritingSystem()
ws.by_language(ws.Language.Hawaiian)

# {"Hawaiian": ["A", "E", "H", "I", "K", "L", "M", "N", "O", "P", "U", "W", "a", "e", "h", "i", "k", "l", "m", "n", "o", "p", "u", "w", "ʻ"]}

By default, the output of by_language is a dictionary containing the name and the corresponding script of the selected language. To retrieve only the latter, use ws.by_language(ws.Language.Hawaiian, as_list=True). However, some languages such as Japanese have not one but multiple writing systems. In such a case, the output would look like this:
ws.by_language(ws.Language.Japanese)

# {"Japanese": {"Hiragana": ["あ", "い", ...], "Kanji": ["万", "丁", ...], "Katakana": ["ア", "イ", ...]}}
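Code that consumes these results has to cope with both shapes, a flat list per language or a nested dictionary of scripts. The sketch below works on plain dictionaries shaped like the outputs shown above; flatten_scripts is a hypothetical helper, not a function of the package, and the sample data is abbreviated.

```python
def flatten_scripts(result):
    """Collect all characters from a by_language-style dictionary.

    Handles both shapes: {language: [chars]} and
    {language: {script_name: [chars]}}.
    """
    chars = []
    for value in result.values():
        if isinstance(value, dict):
            # Several scripts for one language: concatenate them in order.
            for script_chars in value.values():
                chars.extend(script_chars)
        else:
            chars.extend(value)
    return chars


single = {"Hawaiian": ["A", "E", "H"]}
multi = {"Japanese": {"Hiragana": ["あ", "い"],
                      "Kanji": ["万", "丁"],
                      "Katakana": ["ア", "イ"]}}

print(flatten_scripts(single))  # ['A', 'E', 'H']
print(flatten_scripts(multi))   # ['あ', 'い', '万', '丁', 'ア', 'イ']
```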

In case you want a pretty print of the output, use:
ws.pretty_print(ws.by_language(ws.Language.Dutch))

# A B C D E F G H I J K L M N O P Q R S T U V W X Y Z a b c d e f g h i j k l m n o p q r s t u v w x y z
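One of the use cases mentioned in the background section, regex matching restricted to the letters of one language, can be sketched from such a letter list. The example hard-codes the Hawaiian letters from the lookup output shown earlier so that it runs without the package; build_word_pattern is a hypothetical helper and assumes every entry is a single character.

```python
import re

# Hawaiian letters, copied from the by_language output shown earlier.
HAWAIIAN = ["A", "E", "H", "I", "K", "L", "M", "N", "O", "P", "U", "W",
            "a", "e", "h", "i", "k", "l", "m", "n", "o", "p", "u", "w", "ʻ"]


def build_word_pattern(letters):
    """Compile a regex matching strings made only of the given letters."""
    char_class = "".join(re.escape(ch) for ch in letters)
    return re.compile(f"[{char_class}]+")


pattern = build_word_pattern(HAWAIIAN)
print(bool(pattern.fullmatch("Hawaiʻi")))  # True
print(bool(pattern.fullmatch("zebra")))    # False
```

For alphabets that contain multigraphs, a character class is not enough; an alternation of the entries, longest first, would be needed instead.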

If the resulting script represents an alphabet, the result can be further filtered in terms of:

Letter Casing:

ws.pretty_print(ws.by_language(ws.Language.Bosnian))

# Ђ Ј Љ Њ Ћ Џ А Б В Г Д Е Ж З И К Л М Н О П Р С Т У Ф Х Ц Ч Ш а б в г д е ж з и к л м н о п р с т у ф х ц ч ш ђ ј љ њ ћ џ

ws.pretty_print(ws.by_language(ws.Language.Bosnian, letter_case=ws.LetterCase.Upper))

# Ђ Ј Љ Њ Ћ Џ А Б В Г Д Е Ж З И К Л М Н О П Р С Т У Ф Х Ц Ч Ш


Multigraphs:

ws.pretty_print(ws.by_language(ws.Language.Aleut))

# A B Ch D E F G H Hl Hm Hn Hng I J K L M N Ng O P Q R S T U Uu V W X X̂ Y Z a b ch d e f g h hl hm hn hng i j k l m n ng o p q r s t u uu v w x x̂ y z Á á Ĝ ĝ

ws.pretty_print(ws.by_language(ws.Language.Aleut, strip_multigraphs=True, multigraphs_size=ws.MultigraphSize.All))

# A B D E F G H I J K L M N O P Q R S T U V W X Y Z a b d e f g h i j k l m n o p q r s t u v w x y z Á á Ĝ ĝ
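Conceptually, stripping multigraphs amounts to dropping every entry that consists of more than one code point. The following is a sketch of that idea only, under the assumption that entries are compared by code point length; it is not the package's implementation, and the size parameter is a hypothetical stand-in for ws.MultigraphSize.

```python
def strip_multigraphs(letters, size=None):
    """Drop multi-character entries from a list of letters.

    size=None removes every entry longer than one code point;
    size=2 removes only digraphs, size=3 only trigraphs, and so on.
    """
    if size is None:
        return [entry for entry in letters if len(entry) == 1]
    return [entry for entry in letters if len(entry) != size]


# "\u00C1" is the precomposed letter Á (a single code point).
sample = ["A", "Ch", "Hng", "X", "\u00C1"]
print(strip_multigraphs(sample))          # ['A', 'X', 'Á']
print(strip_multigraphs(sample, size=2))  # ['A', 'Hng', 'X', 'Á']
```

This view is consistent with the Aleut output above, where X̂ (X followed by a combining circumflex, two code points) disappears while the precomposed Á stays.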


Diacritics:

ws.pretty_print(ws.by_language(ws.Language.Czech))

# A B C C h D E F G H I J K L M N O P Q R S T U V W X Y Z a b c c h d e f g h i j k l m n o p q r s t u v w x y z Á É Í Ó Ú Ý á é í ó ú ý Č č Ď ď Ě ě Ň ň Ř ř Š š Ť ť Ů ů Ž ž

ws.pretty_print(ws.by_language(ws.Language.Czech, strip_diacritics=True))

# A B C C h D E F G H I J K L M N O P Q R S T U V W X Y Z a b c c h d e f g h i j k l m n o p q r s t u v w x y z
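A common way to detect letters that carry diacritics is Unicode canonical decomposition (NFD): a letter such as Č decomposes into a base letter plus a combining mark. The sketch below applies this standard technique to a letter list; it is an assumption about how such a filter could work, not the package's own code, and both helper names are hypothetical.

```python
import unicodedata


def has_diacritic(letter):
    """True if the canonical decomposition contains a combining mark."""
    decomposed = unicodedata.normalize("NFD", letter)
    return any(unicodedata.combining(ch) for ch in decomposed)


def strip_diacritic_letters(letters):
    """Keep only the letters that carry no combining mark."""
    return [letter for letter in letters if not has_diacritic(letter)]


# A, Á, C, Č, U, Ů written with escapes to guarantee precomposed forms.
sample = ["A", "\u00C1", "C", "\u010C", "U", "\u016E"]
print(strip_diacritic_letters(sample))  # ['A', 'C', 'U']
```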

For certain languages such as Chinese (simplified), which have a language code but no alphabet, a fallback strategy is used that maps the ISO 639-2 language code to an ISO 15924 script code (here, for example: "chi" --> "Hans"). As a user, you do not have to handle this manually; simply request the language as usual:
ws.pretty_print(ws.by_language(ws.Language.Chinese_Simplified))

# 㑇 㑊 㕮 㘎 㙍 㙘 㙦 㛃 㛚 㛹 㟃 㠇 㠓 㤘 㥄 㧐 ...
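The fallback described above can be pictured as a two-step dictionary lookup. All three dictionaries below are hypothetical miniature stand-ins for the package's data, and lookup is an illustrative name; only the "chi" to "Hans" mapping is taken from the text.

```python
# Hypothetical miniature data; the real package keeps its data in JSON files.
ALPHABETS_BY_LANGUAGE = {"haw": ["A", "E", "H"]}   # keyed by ISO 639-2/3 code
SCRIPTS_BY_CODE = {"Hans": ["㑇", "㑊", "㕮"]}       # keyed by ISO 15924 code
LANGUAGE_TO_SCRIPT = {"chi": "Hans"}               # fallback mapping


def lookup(language_code):
    """Return the characters for a language, falling back to its script."""
    if language_code in ALPHABETS_BY_LANGUAGE:
        return ALPHABETS_BY_LANGUAGE[language_code]
    script_code = LANGUAGE_TO_SCRIPT.get(language_code)
    if script_code is None:
        raise KeyError(f"no alphabet or script known for {language_code!r}")
    return SCRIPTS_BY_CODE[script_code]


print(lookup("haw"))  # ['A', 'E', 'H']
print(lookup("chi"))  # ['㑇', '㑊', '㕮']
```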

Another important use case is to check whether a given sequence of characters represents a specific script of a writing system. This can be achieved as follows:
ws.is_abjad("גדולים או בינוניים") # True
ws.is_alphabet("גדולים או בינוניים") # False

ws.is_alphabet("dobré ráno") # True
ws.is_abjad("dobré ráno") # False

ws.is_logographic("早上好") # True
ws.is_syllabary("早上好") # False

ws.is_abugida("ምልካም እድል") # True
ws.is_abjad("ምልካም እድል") # False

ws.is_featural("좋은 아침") # True
ws.is_logographic("좋은 아침") # False

ws.is_alphabet("დილა მშვიდობისა") # True
ws.is_abjad("დილა მშვიდობისა") # False
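For comparison, a rough script-type guess is also possible with the standard library alone, by reading the script name from each character's Unicode name. This is a different technique from the one Alphabetic uses, the prefix table is a small hand-written assumption covering only the examples above, and script_types is a hypothetical helper.

```python
import unicodedata

# Hand-written, deliberately incomplete mapping from the first word of a
# Unicode character name to a script type (assumption for illustration).
PREFIX_TO_TYPE = {
    "LATIN": "alphabet", "GREEK": "alphabet",
    "CYRILLIC": "alphabet", "GEORGIAN": "alphabet",
    "HEBREW": "abjad", "ARABIC": "abjad",
    "DEVANAGARI": "abugida", "ETHIOPIC": "abugida",
    "HANGUL": "featural",
    "CJK": "logographic",
    "HIRAGANA": "syllabary", "KATAKANA": "syllabary",
}


def script_types(text):
    """Return the set of script types found in `text`."""
    found = set()
    for ch in text:
        # Skip whitespace and combining marks; they carry no script prefix.
        if ch.isspace() or unicodedata.combining(ch):
            continue
        prefix = unicodedata.name(ch, "").split(" ")[0]
        found.add(PREFIX_TO_TYPE.get(prefix, "unknown"))
    return found


print(script_types("dobré ráno"))  # {'alphabet'}
print(script_types("早上好"))       # {'logographic'}
print(script_types("좋은 아침"))    # {'featural'}
```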

Furthermore, you can also use Alphabetic to remove all characters from a given string that do not occur within the supported script types (abjads, abugidas, alphabets, etc.):
ws.strip_non_script_characters("SΓpλrώaσcσhεeςn!", ws.Language.German)
# "Sprachen" (languages)

ws.strip_non_script_characters("SΓpλrώaσcσhεeςn!", ws.Language.Greek)
# "Γλώσσες" (languages)

If no language is given, all characters of all supported script types are considered:
ws.strip_non_script_characters("#jüste BAD/good tösté X4567Y ßÜ משהו действует?!")
# Result: 'jüste BADgood tösté XY ßÜ משהו действует'
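The underlying idea is a simple membership filter: keep a character only if it belongs to the allowed inventory. The sketch below reproduces the German example with a hand-written letter set; GERMAN_LETTERS is an assumed inventory and keep_only is a hypothetical helper, not the package's implementation.

```python
import string

# Assumed German inventory: ASCII letters plus umlauts and eszett.
GERMAN_LETTERS = set(string.ascii_letters) | set("ÄÖÜäöüß")


def keep_only(text, allowed, keep_whitespace=True):
    """Remove every character that is neither allowed nor (optionally) whitespace."""
    return "".join(
        ch for ch in text
        if ch in allowed or (keep_whitespace and ch.isspace())
    )


print(keep_only("SΓpλrώaσcσhεeςn!", GERMAN_LETTERS))  # Sprachen
```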

If you wish, you can also list the characters of a language based on a specified Unicode range:
ws.generate_all_characters_in_range("\u0400-\u04FF") # Bulgarian

# ['Ѐ', 'Ё', 'Ђ', 'Ѓ', 'Є', ..., 'Ӽ', 'ӽ', 'Ӿ', 'ӿ']
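Expanding such a range needs nothing beyond ord and chr. The sketch below parses a "start-end" specification like the one above; characters_in_range is a hypothetical helper, not the package's function. Note that U+0400 to U+04FF is the entire Unicode Cyrillic block, so it contains more characters than the Bulgarian alphabet alone.

```python
def characters_in_range(spec):
    """Expand a 'X-Y' range spec into the characters from X to Y inclusive."""
    start, end = spec.split("-")
    return [chr(cp) for cp in range(ord(start), ord(end) + 1)]


cyrillic = characters_in_range("\u0400-\u04FF")
print(len(cyrillic))               # 256
print(cyrillic[:2], cyrillic[-1])  # ['Ѐ', 'Ё'] ӿ
```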

Features


Currently 151 languages and corresponding scripts are supported, with more to follow over time;


In total, Alphabetic covers six script types of writing systems: abjads, abugidas, alphabets, syllabaries, logographics as well as featurals;


Besides (true) writing systems, Alphabetic also offers Latin script representations (e.g., Morse or NATO Phonetic Alphabet);


Alphabetic includes a complete list of all ISO 639-1/2/3 as well as ISO 15924 codes and enables bidirectional translation between language names and codes;


At the heart of Alphabetic are JSON files that can be used independently of any particular programming language or application;


Consistently documented source code.
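The bidirectional translation between language names and codes, and the language-independent JSON data, can be illustrated with a few lines of standard-library Python. The JSON layout below is a hypothetical example (the package's actual file structure may differ); the three codes are taken from the table of supported languages.

```python
import json

# Hypothetical JSON content mapping language names to ISO 639-2/3 codes.
RAW = '{"Hawaiian": "haw", "Czech": "ces", "Japanese": "jpn"}'

name_to_code = json.loads(RAW)
# Invert the mapping to translate in the other direction.
code_to_name = {code: name for name, code in name_to_code.items()}

print(name_to_code["Czech"])  # ces
print(code_to_name["jpn"])    # Japanese
```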



Supported Languages
| Language | ISO 639-2/3 code |
| --- | --- |
| Abkhazian | abk |
| Afar | aar |
| Afrikaans | afr |
| Albanian | sqi |
| Aleut | ale |
| Amharic | amh |
| Angika | anp |
| Arabic | ara |
| Arapaho | arp |
| Armenian | arm |
| Assamese | asm |
| Avar | ava |
| Avestan | ave |
| Balochi | bal |
| Bambara | bam |
| Bashkir | bak |
| Basque | baq |
| Bavarian | bar |
| Belarusian | bel |
| Bislama | bis |
| Boko | bqc |
| Boro | brx |
| Bosnian | bos |
| Breton | bre |
| Bulgarian | bul |
| Buryat | bua |
| Catalan | cat |
| Chamorro | cha |
| Chechen | che |
| Cherokee | chr |
| Chichewa | nya |
| Chinese_Simplified | chi |
| Chukchi | ckt |
| Chuvash | chv |
| Cimbrian | cim |
| Cornish | cor |
| Corsican | cos |
| Cree | cre |
| Croatian | hrv |
| Czech | ces |
| Danish | dan |
| Dungan | dng |
| Dutch | nld |
| Dzongkha | dzo |
| Elfdalian | ovd |
| English | eng |
| Esperanto | epo |
| Estonian | est |
| Ewe | ewe |
| Faroese | fao |
| Fijian | fij |
| Finnish | fin |
| Flemish | dut |
| French | fra |
| Georgian | kat |
| German | deu |
| Greek | gre |
| Guarani | grn |
| Haitian_Creole | hat |
| Hausa | hau |
| Hawaiian | haw |
| Hebrew | heb |
| Herero | her |
| Hindi | hin |
| Icelandic | isl |
| Igbo | ibo |
| Indonesian | ind |
| Irish | gle |
| Istro_Romanian | ruo |
| Italian | ita |
| Japanese | jpn |
| Javanese | jav |
| Jeju | jje |
| Kabardian | kbd |
| Kanuri | kau |
| Kashubian | csb |
| Kazakh | kaz |
| Kinyarwanda | kin |
| Kirghiz | kir |
| Komi | kpv |
| Korean | kor |
| Kumyk | kum |
| Kurmanji | kmr |
| Latin | lat |
| Latvian | lav |
| Lezghian | lez |
| Lingala | lin |
| Lithuanian | lit |
| Luganda | lug |
| Luxembourgish | ltz |
| Macedonian | mkd |
| Malagasy | mlg |
| Malay | may |
| Malayalam | mal |
| Maltese | mlt |
| Manx | glv |
| Maori | mao |
| Mari | chm |
| Marshallese | mah |
| Moksha | mdf |
| Moldovan | rum |
| Mongolian | mon |
| Mru | mro |
| Nepali | nep |
| Norwegian | nor |
| Occitan | oci |
| Oromo | orm |
| Osage | osa |
| Parthian | xpr |
| Pashto | pus |
| Persian | per |
| Phoenician | phn |
| Polish | pol |
| Portuguese | por |
| Punjabi_Gurmukhī | _pan |
| Punjabi_Shahmukhi | pan |
| Quechua | que |
| Rohingya | rhg |
| Russian | rus |
| Samaritan | smp |
| Samoan | smo |
| Sango | sag |
| Sanskrit | san |
| Scottish_Gaelic | gla |
| Serbian | srp |
| Slovak | slo |
| Slovenian | slv |
| Somali | som |
| Sorani | ckb |
| Spanish | spa |
| Sundanese | sun |
| Swedish | swe |
| Swiss_German | gsw |
| Tajik | tgk |
| Tatar | tat |
| Turkish | tur |
| Turkmen | tuk |
| Tuvan | tyv |
| Twi | twi |
| Ugaritic | uga |
| Ukrainian | ukr |
| Uzbek | uzb |
| Venda | ven |
| Vengo | bav |
| Volapük | vol |
| Welsh | wel |
| Wolof | wol |
| Yakut | sah |
| Yiddish | yid |
| Zeeuws | zea |
| Zulu | zul |





Supported Abjads
| Abjad | ISO code |
| --- | --- |
| Arabic | Arab |
| Balochi | bal |
| Hebrew | Hebr |
| Hebrew_Samaritan | Samr |
| Parthian | Prti |
| Pashto | pus |
| Persian | per |
| Phoenician | Phnx |
| Punjabi_Shahmukhi | pan |
| Sorani | ckb |
| Ugaritic | Ugar |
| Yiddish | yid |





Supported Abugidas
| Abugida | ISO code |
| --- | --- |
| Amharic | amh |
| Angika | anp |
| Assamese | asm |
| Boro | brx |
| Devanagari | Deva |
| Dzongkha | dzo |
| Ethiopic | Ethi |
| Hindi | hin |
| Malayalam | Mlym |
| Nepali | nep |
| Punjabi_Gurmukhī | Guru |
| Sanskrit | san |
| Sundanese | Sund |
| Thaana | Thaa |





Supported Syllabaries
| Syllabary | ISO code |
| --- | --- |
| Avestan | Avst |
| Carian | Cari |
| Cherokee | Cher |
| Hiragana | Hira |
| Katakana | Kana |
| Lydian | Lydi |





Supported Logographics
| Logographic | ISO code |
| --- | --- |
| Chinese_Simplified | Hans |
| Kanji | Hani |





Supported featural writing systems
| Featural | ISO code |
| --- | --- |
| Hangul | Hang |





Design Considerations / Limitations
Once you delve deeper into the world of writing systems, the numerous difficulties of organizing the various script types quickly become overwhelming. This is particularly true of non-Latin scripts with their many variants and forms. Therefore, several design decisions were made to keep Alphabetic as simple and usable as possible.


For languages that exhibit several variants of alphabets, the more modern or the most frequently encountered form was used. References to sources such as Omniglot, Wikipedia and Britannica were used for this purpose.


For Arabic scripts where the character form depends on its position, the so-called isolated forms were used.


Multigraphs are considered part of the scripts. However, they can be suppressed if desired. The same applies to diacritical marks (e.g., acute, breve, cedilla, grave, etc.).


The function is_abugida is not fully functional because not all vowel forms are integrated yet.


For so-called non-bicameral scripts such as Hebrew or Arabic, where there is no distinction between upper and lower case, the letter_case argument is ignored and the entire alphabet is returned instead:


ws.pretty_print(ws.by_language(ws.Language.Hebrew, letter_case=ws.LetterCase.Upper))

# א ב ג ד ה ו ז ח ט י כ ך ל מ ם נ ן ס ע פ ף צ ץ ק ר ש ת

ws.pretty_print(ws.by_language(ws.Language.Arabic, letter_case=ws.LetterCase.Lower))

# ا ب ة ت ث ج ح خ د ذ ر ز س ش ص ض ط ظ ع غ ف ق ك ل م ن ه و ي
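Whether a character belongs to a script with case distinction can be checked with Python's built-in string methods, which illustrates why a case filter has nothing to select in Hebrew or Arabic. The helper is_cased below is a hypothetical illustration, not part of Alphabetic; the characters are written as escapes to avoid mixing text directions in the code.

```python
def is_cased(ch):
    """True if the character has distinct upper- and lowercase forms."""
    return ch.upper() != ch.lower()


print(is_cased("a"))       # True  (Latin)
print(is_cased("\u0402"))  # True  (Cyrillic capital letter Dje)
print(is_cased("\u05D0"))  # False (Hebrew letter Alef)
print(is_cased("\u0628"))  # False (Arabic letter Beh)
```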

Contribution
If you like this project, you are welcome to support it, e.g. by testing it or providing additional languages (there is a lot to do with regard to the remaining languages). Feel free to fork the repository and create a pull request to suggest and collaborate on changes.
Disclaimer
Although this project has been carried out with great care, no liability is accepted for the completeness and accuracy of all the underlying data. The use of Alphabetic for integration into production systems is at your own risk!
Furthermore, please note that this project is still in its initial phase. The code structure may therefore change over time.
Citation
If you find this repository helpful, feel free to cite it in your paper or project:
@software{Halvani_Constituent_Treelib:2024,
author = {Halvani, Oren},
title = {{Alphabetic - A Python module for retrieving script types of writing systems including alphabets, abjads, abugidas, syllabaries, logographs, featurals as well as Latin script codes}},
doi = {https://doi.org/10.5281/zenodo.11580510},
month = jun,
url = {https://github.com/Halvani/alphabetic},
version = {0.0.5},
year = {2024}
}

License
The Alphabetic package is released under the Apache-2.0 license. See LICENSE for further details.
Last Remarks
As is usual with open source projects, we developers do not earn any money with what we do, but are primarily interested in giving something back to the community with fun, passion and joy. Nevertheless, we would be very happy if you rewarded all the time that has gone into the project with just a small star 🤗

