shoten 0.1.0

Helper functions to find word trends (i.e. extract tokens, lemmatize, and filter them).

Installation
pip/pip3 install -U git+https://github.com/adbar/shoten.git


Usage

Input
Two possibilities for input data:


XML-TEI files as generated by trafilatura:

from shoten import gen_wordlist
myvocab = gen_wordlist(mydir, ['de', 'en'])

A TSV file containing a word list: word form + TAB + date (YYYY-MM-DD format) + optional third column (source):

from shoten import load_wordlist
myvocab = load_wordlist(myfile, ['de', 'en'])

Language codes: optional list of languages to be considered for lemmatization, ordered by relevance (ISO 639-1 codes; see the list of supported languages).
Optional argument maxdiff: maximum number of days to consider (default: 1000, i.e. going back up to 1000 days from today).
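For example, both options can be combined when building a vocabulary from XML-TEI files (a minimal sketch, assuming maxdiff is accepted as a keyword argument; the value 365 is only illustrative):

from shoten import gen_wordlist
# consider German and English for lemmatization, go back at most one year
myvocab = gen_wordlist(mydir, ['de', 'en'], maxdiff=365)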


Filters
from shoten.filters import *

hapax_filter(myvocab, freqcount=2): remove words at or below the frequency threshold (default: <= 2)
shortness_filter(myvocab, threshold=20): discard the shortest words (length threshold as a percentage of word lengths)
frequency_filter(myvocab, max_perc=50, min_perc=.001): maximum and minimum frequencies in percent
oldest_filter(myvocab, threshold=50): discard the oldest words (threshold in percent)
freshness_filter(myvocab, percentage=10): keep the X% freshest words
ngram_filter(myvocab, threshold=90, verbose=False): retain the top X% of words based on character n-gram frequencies; can run out of memory if the vocabulary is too large (8 GB RAM recommended)
sources_freqfilter(myvocab, threshold=2): remove words that are present in fewer than x sources
sources_filter(myvocab, myset): only keep words whose source contains a string listed in the input set
wordlist_filter(myvocab, mylist, keep_words=False): keep or discard the words present in the input list

Reduce vocabulary size with a filter:
myvocab = oldest_filter(myvocab)
Filters can also be chained:
myvocab = oldest_filter(shortness_filter(myvocab))
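A longer pipeline with explicit parameters could look as follows (a sketch; the parameter values are merely the illustrative defaults from the list above):

from shoten.filters import frequency_filter, shortness_filter, sources_freqfilter
# discard very frequent and very rare words
myvocab = frequency_filter(myvocab, max_perc=50, min_perc=.001)
# discard the shortest words
myvocab = shortness_filter(myvocab, threshold=20)
# keep words attested in at least two sources
myvocab = sources_freqfilter(myvocab, threshold=2)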


Output
# print the words one by one
for word in sorted(myvocab):
    print(word)
# transfer the words to a list
results = [w for w in myvocab]
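Building on the iteration above, the results can also be written to disk (a minimal sketch; the file name is arbitrary):

# write the sorted words to a plain-text file, one per line
with open('wordlist.txt', 'w', encoding='utf-8') as outfile:
    for word in sorted(myvocab):
        outfile.write(word + '\n')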


CLI
shoten --help



Additional information
Shoten means "focal point" in Japanese (焦点).
Project webpage: Webmonitor.
