texthr 0.20
Morphological/Inflection/Lemmatization Engine for Croatian language
“text-hr” is a morphological/inflectional/lemmatization engine for the
Croatian language, written in Python. It includes a list of stopwords and a
part-of-speech (POS) tagging engine that detects word types with an inverse
inflection algorithm.
Since the API is not frozen yet, this project is still in alpha.
TAGS
Croatian language, lemmatization, stemming, inflection, python, natural
language processing (NLP), Part-of-speech (POS) tagging, stopwords, inverse
inflection, morphological lexicon
OZNAKE
Hrvatski jezik, lematizacija, Python biblioteka, morfologija, infleksija,
obrnuta infleksija, prepoznavanje vrsta riječi, računalna obrada govornog
jezika, zaustavne riječi, morfološki leksikon
AUTHOR
Robert Lujo, Zagreb, Croatia (the e-mail address is in LICENCE)
FEATURES
To name the most important:
inflection system - for producing all forms of one word
detection of word types (POS tagging) - from existing list of word forms
list of stopwords
The system is based on unicode strings; the default codepage for converting
to and from byte strings is cp-1250.
See Getting started.
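As a small illustration of the codepage remark above, the following sketch round-trips two Croatian word forms through Python's built-in `cp1250` codec. It uses only the standard library and does not call text-hr itself; how text-hr performs the conversion internally is not shown here.

```python
# Two Croatian word forms containing the letters ć and š.
text = "brojeći brojiš"

# str -> bytes using the Windows Central European codepage (cp1250).
raw = text.encode("cp1250")

# bytes -> str; the round trip is lossless for Croatian letters.
restored = raw.decode("cp1250")

print(raw)
print(restored == text)
```

The bytes `\xe6` and `\x9a` printed here are the cp-1250 codes for ć and š, which is why they show up in the sample detection output later in this document.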
INSTALLATION
If you have pip installed (http://pypi.python.org/pypi/pip):
pip install text-hr
If not, install it the old-fashioned way:
download zip from http://pypi.python.org/pypi/text-hr/
unzip
open shell
go to distribution directory
python setup.py install
GETTING STARTED
There are three important parts that this project provides:
Inflection system - for producing all forms of one word
Detection of word types (POS tagging) - from existing list of word forms
List of stopwords
Inflection system
Usage example - start python shell:
>>> from text_hr import Verb
>>> v = Verb("platiti")
>>> for k in sorted(v.forms.keys()):
...     print(k, v.forms[k])
...
AOR/P/1 [u'platismo']
AOR/P/2 [u'platiste']
AOR/P/3 [u'plati\u0161e']
AOR/S/1 [u'platih']
AOR/S/2 [u'plati']
AOR/S/3 [u'plati']
IMP/P/1 [u'platasmo', u'pla\u0107asmo', u'platijasmo']
IMP/P/2 [u'plataste', u'pla\u0107aste', u'platijaste']
IMP/P/3 [u'platahu', u'pla\u0107ahu', u'platijahu']
...
VA_PA//P_O+S+V+N [u'pla\u0107eno']
X_INF// [u'platiti']
X_VAD_PAS// [u'plativ\u0161i']
X_VAD_PRE// [u'plate\u0107i']
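The output above suggests that `forms` maps a form key (such as `AOR/S/2`) to a list of word forms. Assuming that shape, a form-to-lemma lookup can be sketched by inverting the mapping. The helper `invert_forms` is hypothetical and not part of text-hr; the dictionary below is a hand-copied subset of the output, so the example runs without the package installed.

```python
from collections import defaultdict

# A few entries copied by hand from the output above (as Python 3 strings).
forms = {
    "AOR/S/1": ["platih"],
    "AOR/S/2": ["plati"],
    "AOR/S/3": ["plati"],
    "X_INF//": ["platiti"],
}

def invert_forms(lemma, forms):
    """Map each word form to the (lemma, form key) pairs that produce it.

    Hypothetical helper for illustration; not a text-hr function.
    """
    index = defaultdict(list)
    for key in sorted(forms):
        for form in forms[key]:
            index[form].append((lemma, key))
    return dict(index)

index = invert_forms("platiti", forms)
print(index["plati"])
```

One form can map to several keys (here `plati` is both the 2nd and 3rd person singular aorist), so the index stores a list per form rather than a single value.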
Detection of word types (POS tagging)
TODO: this section is not written yet. See test_detect.txt for samples and detect.py for the logic.
First example in test_detect.txt:
>>> from text_hr.detect import WordTypeRecognizerExample
>>> def test_it(word_list, wt_filter=None, level=2):
...     wdh = WordTypeRecognizerExample(word_list, silent=True)
...     if wt_filter is not None:
...         wdh.detect(wt_filter=wt_filter, level=level) # e.g. wt_filter=["N"]
...     else:
...         wdh.detect(level=level) # all word types
...     lines_file = LinesFile()
...     wdh.dump_result(lines_file) # doctest: +NORMALIZE_WHITESPACE +ELLIPSIS
...     print("\n".join(lines_file.lines))
...     return wdh
>>> class LinesFile(object):
...     def __init__(self):
...         self.lines = []
...     def write(self, s):
...         self.lines.append(repr(s.rstrip()))
>>> word_list = [
...       "Broj 84"
...     , "broji 34"
...     , "Brojila 28"
...     , "broje 23"
...     , "brojeći 22"
...     , "brojim 7"
...     , "brojimo 5"
...     , "brojiš 4"
...     , "brojahu 2"
...     , "brojaše 1"
...     , "brojite 1"
...     , "-brijestovu 1"
...     , "brijestovi 1" # the only one checked with endswith, but all others will be checked with get_freq
...     , "-brijestove 1"
...     , "-brijestova 1"
...     ]
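Each entry of `word_list` appears to be a word form followed by its frequency. As a rough sketch of that input format (not of the detection algorithm itself), the snippet below parses such entries and sums the frequencies of the ten `broj*` forms that the sample result line in this section groups under `brojati`. Two things are assumptions here: that forms are compared in lowercase, and that the leading `10/ 183` on that result line means "number of matched forms / sum of their frequencies". The helper `parse_entry` is hypothetical.

```python
# Entries copied from the word_list above: "<word form> <frequency>".
word_list = [
    "Broj 84", "broji 34", "Brojila 28", "broje 23", "brojeći 22",
    "brojim 7", "brojimo 5", "brojiš 4", "brojahu 2", "brojaše 1",
    "brojite 1",
]

def parse_entry(entry):
    """Split one "<form> <count>" string into (lowercased form, int count).

    Hypothetical helper; lowercasing is an assumption, not text-hr behaviour.
    """
    form, count = entry.rsplit(" ", 1)
    return form.lower(), int(count)

freq = dict(parse_entry(entry) for entry in word_list)

# Forms listed on the sample result line for the verb "brojati".
matched = ["broj", "broji", "broje", "brojeći", "brojim",
           "brojimo", "brojiš", "brojahu", "brojite", "brojaše"]

total = sum(freq[form] for form in matched)
print(len(matched), total)
```

The totals come out as 10 forms and a frequency sum of 183, which is consistent with the assumed reading of the result line; `Brojila 28` is not among the matched forms and is not counted.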
Lowest quality, but fastest
>>> wdh = test_it(word_list, level=4) # doctest: +ELLIPSIS
" 10/ 183 -> brojati (u'V-XX_-_JATI-je\\u0107i-0') 84/broj,34/broji,23/broje,22/broje\xe6i,7/brojim,5/brojimo,4/broji\x9a,2/brojahu,1/brojite,1/broja\x9ae"
List of stopwords
The list is located in std_words.txt; you can read it directly here:
http://bitbucket.org/trebor74hr/text-hr/src/tip/text_hr/std_words.txt
The list can be updated like this:
>>> import text_hr
>>> text_hr.dump_all_std_words()
Totaly 2904 word forms dumped to r:\hg-clones\python\text-hr\text_hr\std_words.txt in codepage utf8
Iteration over all words goes like this:
from text_hr import get_all_std_words
for word_base, l_key, cnt, _suff_id, wform_key, wform in get_all_std_words():
    print(word_base, l_key, cnt, _suff_id, wform_key, wform)
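A typical use of such a list is filtering stopwords out of tokenized text. The sketch below assumes only what the loop above shows: that each record has six fields and that the last one (`wform`) is the word form. To stay runnable without text-hr installed, it uses placeholder records with three hand-picked Croatian words; these are illustrative and were not read from std_words.txt. Both helper functions are hypothetical.

```python
def build_stopword_set(records):
    """Collect the last field (wform) of each 6-field record into a set."""
    return {record[-1] for record in records}

def remove_stopwords(tokens, stopwords):
    """Keep only the tokens whose lowercase form is not a stopword."""
    return [token for token in tokens if token.lower() not in stopwords]

# Placeholder records: only the last field is meaningful in this sketch.
# With text-hr installed, get_all_std_words() would be passed in instead.
sample_records = [
    (None, None, None, None, None, "i"),
    (None, None, None, None, None, "u"),
    (None, None, None, None, None, "je"),
]

stopwords = build_stopword_set(sample_records)
print(remove_stopwords(["Platio", "je", "i", "otišao"], stopwords))
```

Building the set once and testing membership per token keeps the filter fast even for a list with a few thousand word forms.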
Further
Since there is currently no good documentation, the best sources of
further information are the tests inside the modules and the
tests in the tests directory (dev version). See Running tests for more.
You can always read the source.
DOCUMENTATION
Currently there is no documentation. In progress …
SUPPORT
Since this project is limited by my free time, support is limited.
REPORT BUG OR REQUEST FEATURE
If you encounter a bug, it is best to report it on the bitbucket web page
http://bitbucket.org/trebor74hr/text-hr.
If there is interest in development for other inflection-rich
languages, I’d be glad to decouple the language-specific code and create a new
project capable of dealing with multiple languages.
The best way to contact me is by mail (find the address in LICENCE).
The TODO list is in readme.txt (dev version).
CONTRIBUTION
Since the API of this project is not stable yet, contributions
should wait for a while.
RUNNING TESTS
All tests are doctests (not unittests). There are three types of tests in the
package:
1. doctests in each module - e.g. in verbs.py
2. doctests in tests/test_*.txt - only in the development version
3. tests which are not automatically compared - i.e. in a special call mode
detect.py can produce an output file which needs to be compared
manually with some existing file. Such tests are very slow. This needs
to be changed to be automatic.
Running a module directly runs tests of types 1 and 2 when running from the
development version.
To use the development version (http://bitbucket.org/trebor74hr/text-hr):
hg clone https://bitbucket.org/trebor74hr/text-hr
create text_hr.pth in python site-packages directory with path to text-hr e.g.:
r:\hg-clones\python\text-hr
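The `.pth` step can also be scripted. A minimal sketch, assuming the clone lives at the example path above: the hypothetical helper `write_pth` writes a one-line `text_hr.pth` file into a given directory. The demo writes into a temporary directory so that nothing in a real site-packages directory is touched; for real use, pass one of the directories returned by `site.getsitepackages()`.

```python
import tempfile
from pathlib import Path

def write_pth(site_dir, repo_path, name="text_hr.pth"):
    """Write a .pth file containing one line: the path to the clone.

    Hypothetical helper for illustration; not part of text-hr.
    """
    pth_file = Path(site_dir) / name
    pth_file.write_text(str(repo_path) + "\n")
    return pth_file

# Demo in a throwaway directory instead of the real site-packages.
with tempfile.TemporaryDirectory() as demo_dir:
    created = write_pth(demo_dir, r"r:\hg-clones\python\text-hr")
    content = created.read_text()

print(created.name)
print(content)
```

Python adds every path listed in a `.pth` file to `sys.path` at startup, which is what makes `import text_hr` resolve to the clone.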
To run all tests:
go to tests directory
run tests.py like this (sample output included):
> python tests.py
testing module __init__
testing module adjectives
...
testing textfile R:\hg-clones\python\text-hr\tests\test_adj.txt
...
testing textfile R:\hg-clones\python\text-hr\tests\test_verbs_type.txt
To run tests for just one module:
go to the text_hr directory
run tests by running module, e.g.:
> py pronouns.py
__main__: running doctests
..\tests\test_pronouns.txt: running doctests
If you’re not running from the dev version, you’ll get output like
this:
> py pronouns.py
__main__: running doctests
..\tests\test_pronouns.txt: Not found, skipping
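The "running doctests" and "Not found, skipping" messages above can be reproduced with the standard `doctest` module. The sketch below is an assumption about how such a runner might look, not the code the text-hr modules actually use: the hypothetical `run_textfile_doctests` runs a doctest text file when it exists and skips it otherwise. A module's own docstring tests would be run separately with `doctest.testmod()`.

```python
import doctest
import os
import tempfile

def run_textfile_doctests(path):
    """Run doctests from a text file, or skip it when the file is missing.

    Hypothetical helper; returns doctest's TestResults, or None if skipped.
    """
    if not os.path.exists(path):
        print("%s: Not found, skipping" % path)
        return None
    print("%s: running doctests" % path)
    return doctest.testfile(path, module_relative=False)

# Case 1: the file is absent, as when not running from the dev version.
missing = run_textfile_doctests(os.path.join("..", "tests", "no_such_file.txt"))

# Case 2: a tiny doctest file created on the fly for the demo.
with tempfile.TemporaryDirectory() as tmp:
    sample = os.path.join(tmp, "test_sample.txt")
    with open(sample, "w", encoding="utf-8") as handle:
        handle.write(">>> 1 + 1\n2\n")
    result = run_textfile_doctests(sample)

print(result)
```

Passing `module_relative=False` makes `doctest.testfile` treat the path as an ordinary file system path rather than one relative to the calling module.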
ADDITIONAL
Master's thesis PDF in Croatian (134 pages), titled:
Lociranje sličnih logičkih cjelina u tekstualnim
dokumentima na hrvatskome jeziku
can be found at:
http://bitbucket.org/trebor74hr/text-hr/downloads/magistarski-konacni.pdf
TODO
various things, see readme.txt for details.
CHANGES
0.20
RL 200507
migration to python 3+, tested on python 3.7, all tests pass
0.18
RL 121210
fixed wrong readme on bitbucket homepage
0.17
RL 100617
utf-8 setup
0.16
RL 100617
master thesis pdf added to repository (in Croatian, 134 pages)
0.15
RL 100617
minor changes
0.14
RL 100617
beta release
tags: lemmatization, stemming
0.13
RL 100610:
text_hr package reorganized (__init__.py with __all__ and imports …)
word_types.py removed
std_words.txt
0.12
RL 100608 :
README
enabled tests from tests.py for all modules
enabled tests directly from each module
0.11
RL 100607:
recreated repo at bitbucket
no .suff_registry.pickle and testing_*.out put in zip
0.10
RL 100605:
first installable release