shoten 0.1.0
Helper functions to find word trends (i.e. extract tokens, lemmatize and filter).
Installation
pip/pip3 install -U git+https://github.com/adbar/shoten.git
Usage
Input
Two possibilities for input data:
XML-TEI files as generated by trafilatura:
from shoten import gen_wordlist
myvocab = gen_wordlist(mydir, ['de', 'en'])
TSV file containing a word list: word form + TAB + date (YYYY-MM-DD format) + optional 3rd column (source)
from shoten import load_wordlist
myvocab = load_wordlist(myfile, ['de', 'en'])
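For illustration, a file in the TSV layout described above could be produced with the standard library alone. This is a minimal sketch: the words, dates, source names and the file name wordlist.tsv are made up for the example and are not part of shoten.

```python
import csv
import os
import tempfile

# Hypothetical sample rows: word form, date (YYYY-MM-DD), optional source.
rows = [
    ("Impfpass", "2021-03-14", "example-news.de"),
    ("lockdown", "2021-03-15", "example-blog.org"),
    ("Inzidenz", "2021-03-15"),  # the 3rd column (source) is left out here
]

# Write to a temporary directory; any writable path would do.
tsv_path = os.path.join(tempfile.mkdtemp(), "wordlist.tsv")
with open(tsv_path, "w", encoding="utf-8", newline="") as f:
    writer = csv.writer(f, delimiter="\t", lineterminator="\n")
    writer.writerows(rows)

# Read the file back to inspect the raw lines.
with open(tsv_path, encoding="utf-8") as f:
    tsv_lines = f.read().splitlines()

print(tsv_lines[0])
```

The resulting path can then be passed as myfile to load_wordlist.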
Language codes: optional list of languages to be considered for lemmatization, ordered by relevance. Use ISO 639-1 codes; see the list of supported languages.
Optional argument maxdiff: maximum number of days to consider (default: 1000, i.e. going back up to 1000 days from today).
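To make the time window concrete, here is a rough sketch of the date arithmetic behind a maxdiff of 1000 days. It is not shoten's own code: the helper name is_recent, its today parameter and the inclusive boundary are assumptions made for this example.

```python
from datetime import date, timedelta


def is_recent(date_string, maxdiff=1000, today=None):
    """Return True if the date lies at most `maxdiff` days before `today`.

    Hypothetical helper; the inclusive boundary is an assumption.
    """
    day = date.fromisoformat(date_string)  # expects YYYY-MM-DD
    today = today or date.today()
    return today - day <= timedelta(days=maxdiff)


reference_day = date(2021, 6, 1)
print(is_recent("2021-01-01", today=reference_day))  # a few months back
print(is_recent("2015-01-01", today=reference_day))  # several years back
```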
Filters
from shoten.filters import *
hapax_filter(myvocab, freqcount=2): discard rare words (default: frequency <= 2)
shortness_filter(myvocab, threshold=20): length threshold in percent of word lengths
frequency_filter(myvocab, max_perc=50, min_perc=.001): maximum and minimum frequencies in percent
oldest_filter(myvocab, threshold=50): discard the oldest words (threshold in percent)
freshness_filter(myvocab, percentage=10): keep the X% freshest words
ngram_filter(myvocab, threshold=90, verbose=False): retains X% of the words based on character n-gram frequencies; runs out of memory if the vocabulary is too large (8 GB RAM recommended)
sources_freqfilter(myvocab, threshold=2): remove words which are present in fewer than x sources
sources_filter(myvocab, myset): only keep the words for which the source contains a string listed in the input set
wordlist_filter(myvocab, mylist, keep_words=False): keep or discard words present in the input list
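As a rough idea of what a frequency-based filter such as hapax_filter does, the sketch below drops rare words from a plain dictionary of counts. It does not use shoten: the internal vocabulary structure is not described here, so the dict of counts and the function name drop_rare are stand-ins assumed for this example.

```python
from collections import Counter


def drop_rare(counts, freqcount=2):
    """Keep only the words seen more than `freqcount` times (hypothetical helper)."""
    return {word: n for word, n in counts.items() if n > freqcount}


# Made-up token stream: 5, 3, 2 and 1 occurrences respectively.
tokens = ["Impfpass"] * 5 + ["lockdown"] * 3 + ["Tippfehler"] * 2 + ["xyzzy"]
counts = Counter(tokens)

filtered = drop_rare(counts)
print(sorted(filtered))  # ['Impfpass', 'lockdown']
```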
Reduce vocabulary size with a filter:
myvocab = oldest_filter(myvocab)
They can be chained:
myvocab = oldest_filter(shortness_filter(myvocab))
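With longer chains, nested calls become hard to read. Since each filter takes the vocabulary as its first argument and returns the reduced vocabulary, a small helper can apply them in sequence. The sketch below is not part of shoten: apply_filters, drop_short and drop_rare are hypothetical names, and the two stand-in filters operate on a plain dict so that the example runs without the package.

```python
from functools import partial, reduce


def apply_filters(vocab, *filters):
    """Apply each filter in turn, feeding the result to the next one."""
    return reduce(lambda current, func: func(current), filters, vocab)


# Stand-in filters on a plain dict of counts (not shoten's own filters).
def drop_short(vocab, min_len=4):
    return {w: n for w, n in vocab.items() if len(w) >= min_len}


def drop_rare(vocab, freqcount=2):
    return {w: n for w, n in vocab.items() if n > freqcount}


vocab = {"ab": 10, "lockdown": 3, "Inzidenz": 1}

# partial() fixes non-default arguments before the filter is chained.
result = apply_filters(vocab, drop_short, partial(drop_rare, freqcount=1))
print(result)  # {'lockdown': 3}
```

Under the same assumption, apply_filters(myvocab, shortness_filter, oldest_filter) would be equivalent to the nested call shown above.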
Output
# print one-by-one
for word in sorted(myvocab):
    print(word)
# transfer to a list
results = list(myvocab)
CLI
shoten --help
Additional information
Shoten = focal point in Japanese (焦点).
Project webpage: Webmonitor.