shoten 0.1.0

Creator: bradpython12

Last updated: September 16, 2024

0 purchases

Free

Languages

Python

Description:

Helper functions to find word trends (i.e. extract tokens, lemmatize and filter).

Installation
pip/pip3 install -U git+https://github.com/adbar/shoten.git

Usage

Input
Two possibilities for input data:

XML-TEI files as generated by trafilatura:

from shoten import gen_wordlist
myvocab = gen_wordlist(mydir, ['de', 'en'])

TSV file containing a word list: word form + TAB + date (YYYY-MM-DD format) + optional third column (source)

from shoten import load_wordlist
myvocab = load_wordlist(myfile, ['de', 'en'])
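
To make the TSV layout concrete, here is a minimal sketch that writes such a file with the standard library only. The sample words, source labels, file name and the helper write_wordlist are made up for illustration and are not part of shoten.

```python
import csv
import tempfile
from pathlib import Path

# Hypothetical sample rows: word form, date (YYYY-MM-DD), source.
# The third column is optional according to the description above.
rows = [
    ("Impfpass", "2021-03-14", "example-source-a"),
    ("lockdown", "2021-03-15", "example-source-b"),
    ("Inzidenz", "2021-03-15", "example-source-a"),
]


def write_wordlist(path, rows):
    """Write the rows as tab-separated lines, one word occurrence per line."""
    with open(path, "w", encoding="utf-8", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        writer.writerows(rows)


# Illustrative location in the system's temporary directory
sample_path = Path(tempfile.gettempdir()) / "shoten_sample_wordlist.tsv"
write_wordlist(sample_path, rows)
sample_lines = sample_path.read_text(encoding="utf-8").splitlines()
```

The resulting path could then be passed as myfile to load_wordlist as shown above.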

Language codes: optional list of languages to be considered for lemmatization, ordered by relevance. ISO 639-1 codes, see the list of supported languages.
Optional argument maxdiff: maximum number of days to consider (default: 1000, i.e. going back up to 1000 days from today).
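
The idea behind maxdiff can be sketched as a simple date-window check. This is only an illustration, not shoten's code: the function within_window is hypothetical, and both the inclusive boundary and the exclusion of future dates are assumptions.

```python
from datetime import date, timedelta


def within_window(day_string, maxdiff=1000, today=None):
    """Return True if the ISO date lies at most maxdiff days before today.

    Assumptions: the boundary is inclusive and future dates are rejected.
    """
    today = today or date.today()
    day = date.fromisoformat(day_string)
    return timedelta(0) <= today - day <= timedelta(days=maxdiff)


# Fixed reference day so that the example is reproducible
reference = date(2024, 9, 16)
recent = within_window("2024-01-01", today=reference)
too_old = within_window("2020-01-01", today=reference)
```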

Filters
from shoten.filters import *

hapax_filter(myvocab, freqcount=2): discard rare words (default frequency threshold: <= 2)
shortness_filter(myvocab, threshold=20): length threshold in percent of word lengths
frequency_filter(myvocab, max_perc=50, min_perc=.001): maximum and minimum frequencies in percent
oldest_filter(myvocab, threshold=50): discard the oldest words (threshold in percent)
freshness_filter(myvocab, percentage=10): keep the X% freshest words
ngram_filter(myvocab, threshold=90, verbose=False): retains X% words based on character n-gram frequencies; runs out of memory if the vocabulary is too large (8 GB RAM recommended)
sources_freqfilter(myvocab, threshold=2): remove words which are present in fewer than x sources
sources_filter(myvocab, myset): only keep the words for which the source contains a string listed in the input set
wordlist_filter(myvocab, mylist, keep_words=False): keep or discard words present in the input list
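
As a rough illustration of what a frequency-based filter such as hapax_filter does, the sketch below drops rare words from a plain word-count mapping. The mapping is a stand-in and not shoten's internal vocabulary structure; the function keep_frequent and the sample tokens are hypothetical.

```python
from collections import Counter


def keep_frequent(counts, freqcount=2):
    """Keep only the words whose count is strictly greater than freqcount."""
    return {word: n for word, n in counts.items() if n > freqcount}


# Made-up tokens: "virus" occurs 3 times, "maske" twice, "zoom" once
tokens = ["virus", "virus", "virus", "maske", "maske", "zoom"]
counts = Counter(tokens)
kept = keep_frequent(counts)
```

With the default threshold of 2, only "virus" remains in kept.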

Reduce vocabulary size with a filter:
myvocab = oldest_filter(myvocab)
They can be chained:
myvocab = oldest_filter(shortness_filter(myvocab))
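
For longer chains, nesting calls becomes hard to read. One possible alternative is a small helper that applies a list of filters in order. The helper apply_filters and the two stand-in filters below are hypothetical and operate on a plain dictionary. The sketch assumes that every filter takes a vocabulary and returns the reduced vocabulary, as in the examples above.

```python
from functools import reduce


def apply_filters(vocab, filters):
    """Apply each filter in order, passing the result on to the next one."""
    return reduce(lambda current, func: func(current), filters, vocab)


# Stand-in filters on a plain dict (not shoten functions)
def drop_short(vocab):
    """Discard words of three characters or fewer."""
    return {word: n for word, n in vocab.items() if len(word) > 3}


def drop_single(vocab):
    """Discard words seen only once."""
    return {word: n for word, n in vocab.items() if n > 1}


vocab = {"ab": 5, "corona": 3, "lockdown": 1}
result = apply_filters(vocab, [drop_short, drop_single])
```

Under the same assumption, apply_filters(myvocab, [shortness_filter, oldest_filter]) would correspond to oldest_filter(shortness_filter(myvocab)).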

Output
# print one-by-one
for word in sorted(myvocab):
    print(word)
# transfer to a list
results = [w for w in myvocab]

CLI
shoten --help

Additional information
Shoten = focal point in Japanese (焦点).
Project webpage: Webmonitor.

License

For personal and professional use. You cannot resell or redistribute these repositories in their original state.
